I love LLMs
Pradip Wasre

NLP Explorer

W3: Day 4 – Hugging Face Model Class: Running Inference on Open-Source AI Models

January 10, 2025 Week 3

Welcome to Day 4 of our deep dive into Hugging Face models! Today we explore how to run inference with the Transformers model class, how quantization shrinks a model's memory footprint, and how to stream output as it is generated. Along the way you'll work with powerful open-source models such as LLaMA 3.1, Phi 3, Mixtral, Qwen2, and Gemma.


4.1 Hugging Face Model Class: Running Inference on Open-Source AI Models

Inference is the process of using a trained AI model to generate predictions from new, unseen data. Running inference on Hugging Face models means feeding input text into these large models, which process it and generate relevant output.

We will work with several high-performance models:

  • LLaMA 3.1 from Meta: Known for its versatility across various NLP tasks, this model is capable of understanding and generating natural language with high accuracy.
  • Phi 3 from Microsoft: A family of compact yet capable language models optimized for nuanced understanding and context-aware generation.
  • Mixtral from Mistral AI: A sparse mixture-of-experts model that balances efficiency and performance across diverse NLP tasks.
  • Qwen2 from Alibaba: A robust multi-lingual model designed to work across different languages and scenarios.
  • Gemma from Google: A recent entrant with impressive capabilities for a wide range of language tasks.

In this section, we will also discuss quantization, model internals, and streaming results from models. These features are important for making large models usable on devices with limited resources and generating responses in real time.


4.2 Hugging Face Transformers: Loading & Quantizing Models with Bits and Bytes

Hugging Face offers the ability to load models in ways that reduce their memory footprint and make them more efficient for real-time applications. Quantization plays a crucial role in this process, especially when working with large models that could otherwise overwhelm your available system memory.

What is Quantization?

Quantization is a technique that reduces the precision of the model’s internal parameters (such as weights), typically from 32-bit floating-point numbers to lower bit widths, like 4-bit precision. This reduction helps achieve the following:

  1. Memory Efficiency: Lowering the precision of model weights reduces the amount of memory needed to store them. This is crucial when working with models that would otherwise exceed your available GPU or CPU memory.
  2. Faster Computation: Quantization leads to smaller model sizes, which, in turn, allows for faster computation, speeding up inference times, especially for real-time applications like chatbots.
  3. Deployment on Edge Devices: Quantization enables you to run models on devices with limited hardware resources, such as mobile phones, edge servers, and smaller GPUs.

Quantization Parameters:

Hugging Face’s BitsAndBytesConfig provides several options to control the quantization process. The following key parameters help you fine-tune the behavior:

  • load_in_4bit=True: This tells the model to load in 4-bit precision, significantly reducing its memory footprint.
  • bnb_4bit_use_double_quant=True: This enables double quantization, which quantizes the quantization constants themselves, squeezing out additional memory savings with negligible impact on the model’s performance.
  • bnb_4bit_compute_dtype=torch.bfloat16: Specifies the datatype used for computations during inference, typically bfloat16, which balances performance and accuracy.
  • bnb_4bit_quant_type="nf4": Selects NF4 (4-bit NormalFloat), a quantization data type designed for normally distributed weights, which typically preserves accuracy better than plain 4-bit integers.

The goal of applying quantization is to allow large models, such as LLaMA 3.1, to be loaded into memory and run on systems with lower available resources without sacrificing much in terms of model performance.
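
As a minimal sketch of what this looks like in code (assuming bitsandbytes and a CUDA GPU are available, and using meta-llama/Meta-Llama-3.1-8B-Instruct purely as an illustrative checkpoint), the quantized load might be set up like this:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Illustrative checkpoint; any causal LM on the Hub loads the same way.
    MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # The 4-bit quantization settings described above.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        quantization_config=quant_config,
        device_map="auto",  # place layers on the available GPU(s) automatically
    )

With this configuration, an 8B-parameter model that would need roughly 16 GB in 16-bit precision fits in well under 8 GB of GPU memory.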

Why Use Quantization?

Applying this quantization configuration lets you run models that are many gigabytes in size on hardware that would otherwise lack the necessary memory, so you can leverage large-scale models without needing an extremely high-memory GPU.


4.3 Hugging Face Transformers: Generating Text with Open-Source AI Models

Now that we’ve discussed the technical details of quantization and how it makes models more efficient, let’s explore text generation. For many AI applications, such as chatbots, generating meaningful text outputs is key.

Streaming Results:

One exciting feature of Hugging Face is the ability to stream results in real time. This is particularly useful when creating chatbots or applications that require instant feedback from the model. With streaming, each token is displayed as soon as it is generated, so the response appears progressively rather than all at once.

In this setup, you pass a TextStreamer to the model’s generate() call, and the output is streamed back token by token as it is produced. This allows for real-time interaction, a crucial feature when building applications such as conversational AI systems.
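
A hedged sketch of how this fits together (reusing the tokenizer and model loaded above; the prompt text is purely illustrative):

    from transformers import TextStreamer

    # The streamer prints tokens to stdout as soon as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=80, streamer=streamer)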

Generation Prompts:

Another important feature is the generation prompt. Conversational AI models are often fine-tuned to expect a specific input format, such as a prompt that begins with a “system” message followed by alternating “user” and “assistant” turns.

The add_generation_prompt=True option appends the tokens that mark the start of an assistant reply, ensuring that the model responds to the user’s query rather than simply predicting a continuation of the conversation text.

In practical terms, when building AI-powered chatbots, you need to provide system messages (telling the AI how to behave), user messages (what the user asks), and assistant messages (the model’s responses). The apply_chat_template() function is used to format these appropriately.
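
For example (a sketch; the message content is illustrative, and the exact template applied depends on the tokenizer of the model you load):

    messages = [
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Tell me a light-hearted joke about data scientists."},
    ]

    # Formats the conversation the way the model was fine-tuned to expect;
    # add_generation_prompt=True appends the marker for the assistant's reply.
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)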

Example of Real-Time Generation:

For generating content, such as jokes or responses to user queries, you can stream the results progressively. This approach ensures the user sees the output in real-time rather than waiting for the entire response to be generated before displaying it.
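
Putting the pieces together, a minimal end-to-end sketch (assuming the quantized model, tokenizer, chat-formatted input_ids, and streamer from the earlier snippets):

    # Generate the reply; tokens are printed by the streamer as they arrive.
    outputs = model.generate(
        input_ids,
        max_new_tokens=120,
        streamer=streamer,
    )

    # The full decoded reply is also available once generation finishes.
    reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)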


4.4 Mastering Hugging Face Transformers: Models, Pipelines, and Tokenizers

As we’ve seen, Hugging Face provides a powerful suite of tools for AI development. After today’s session, you’ll be proficient in the following:

  • Working with Tokenizers: Tokenizers split text into tokens, the numeric units that models actually consume, allowing the model to interpret natural-language input. Hugging Face makes it easy to work with the tokenizer that matches each model.
  • Utilizing Models: Hugging Face models like LLaMA, Phi 3, Mixtral, and Gemma are pre-trained, and you can use them for a wide range of tasks, including text generation, summarization, and question answering.
  • Building Pipelines: Pipelines are an abstraction that simplifies using models for specific tasks. For example, Hugging Face provides text-generation pipelines that let you feed in text and get a generated output seamlessly, as in the sketch below.
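
As a small illustration of the tokenizer and pipeline pieces (using distilgpt2 purely as a lightweight example checkpoint; any text-generation model on the Hub works the same way):

    from transformers import AutoTokenizer, pipeline

    # Tokenizer: text in, token IDs out (and back again).
    tok = AutoTokenizer.from_pretrained("distilgpt2")
    ids = tok.encode("Tokenizers turn text into numbers")
    print(ids, tok.decode(ids))

    # Pipeline: the model plus its tokenizer behind a single task interface.
    generator = pipeline("text-generation", model="distilgpt2")
    print(generator("Hugging Face pipelines make it easy to", max_new_tokens=20)[0]["generated_text"])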

These tools, combined with your understanding of model internals, quantization, and streaming, will allow you to efficiently use open-source models and integrate them into applications with real-time feedback.


Conclusion:

By the end of Day 4, you will have mastered how to run inference on Hugging Face models, implement quantization to save on memory, stream real-time outputs, and utilize the best practices for generating text from models. You will be well-equipped to use these models effectively and incorporate them into real-world applications like multi-modal assistants, chatbots, and more. These skills will help you build AI systems that are both high-performing and resource-efficient.
