
ExLlamaV2: A New Era of Fast, Efficient LLM Inference

Dive into the World of EXL2 Quantization and Its Impact on AI Efficiency

Introduction

Reducing the size and speeding up the processing of Large Language Models (LLMs) is commonly achieved through quantization, with GPTQ emerging as a standout method for its impressive performance on GPUs. This technique, using about one-third the VRAM of traditional models, maintains comparable accuracy and enhances speed. Its growing popularity led to its integration into the transformers library.

ExLlamaV2 enhances GPTQ’s capabilities further. It focuses on extremely rapid inference, powered by new kernels. The introduction of the EXL2 quantization format in ExLlamaV2 offers greater versatility in weight storage.


Installing ExLlamaV2

To begin with EXL2 model quantization, the first step is setting up the ExLlamaV2 library. This involves cloning the repository (which also provides the conversion and inference scripts used later) and installing the library.

Here’s how to do it:

  1. Clone the ExLlamaV2 repository:

git clone https://github.com/turboderp/exllamav2

  2. Install the ExLlamaV2 library, either from PyPI or from the cloned source:

pip install exllamav2

# or, to install from source:
cd exllamav2
python setup.py install --user

Once ExLlamaV2 is installed, the next task is to acquire the model to quantize. We'll focus on openchat_3.5, a fine-tuned version of Mistral-7B that is reported to surpass the much larger Llama-2 70B Chat model on MT-Bench.

Download the LLM:

You can download any LLM you want to quantize and run inference on; here we'll go with openchat_3.5. To acquire the openchat_3.5 model, use the following steps. Keep in mind that the model is quite large, around 10-15 GB, so the download might take some time:

1. Set up Git Large File Storage (LFS):

git lfs install# for windows

git lfs install
#for windows
git clone https://github.com/turboderp/exllamav2
cd exllamav2
python setup.py install --user

2. Clone the LLM repository (or use the Python alternative sketched below):

git clone https://huggingface.co/openchat/openchat_3.5
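Alternatively, if you prefer downloading from Python rather than through git, here is a minimal sketch using the huggingface_hub library. This is an assumption on my part (the article itself uses git), so you would need huggingface_hub installed; the repo id matches the clone URL above, and base_model is the directory name the convert.py command below expects:

from huggingface_hub import snapshot_download

# Download the full openchat_3.5 repository into a local folder.
snapshot_download(repo_id="openchat/openchat_3.5", local_dir="base_model")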

Calibration Dataset:

For GPTQ quantization, a calibration dataset is essential. It helps assess the quantization impact by comparing outputs from the original and quantized models. The wikitext dataset is suitable for this purpose. You can download its test file with this command:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
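Before running the conversion, it can be worth sanity-checking the downloaded calibration file. A minimal sketch, assuming pandas and a Parquet engine such as pyarrow are installed (neither is required by ExLlamaV2 itself; this is only a convenience check):

import pandas as pd

# Load the calibration file and inspect its shape, columns, and first rows.
df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.head(3))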

After downloading the model and dataset, use the convert.py script from ExLlamaV2. Pay special attention to these four arguments:

  • -i: Specifies the path of the base model in HF format (FP16) for conversion.
  • -o: Defines the path of the working directory for temporary files and the final output.
  • -c: Indicates the path of the calibration dataset, which should be in Parquet format.
  • -b: Sets the target average number of bits per weight (bpw).

To initiate the quantization process using the convert.py script from ExLlamaV2, follow these steps (below, base_model refers to the directory containing the downloaded FP16 model, i.e. the openchat_3.5 folder from the previous step):

  1. First, create a new directory for the quantized model:

mkdir quant

  2. Run the convert.py script with the specified arguments:

python exllamav2/convert.py \
-i base_model \
-o quant \
-c wikitext-test.parquet \
-b 5.0

This command sets the base model path (-i), output directory (-o), calibration dataset (-c), and target bits per weight (-b).

Remember, quantizing this model requires a GPU. According to the official documentation, a 7B model needs about 8 GB of VRAM, and a 70B model requires around 24 GB of VRAM. On Google Colab, quantizing a 7B model such as zephyr-7b-beta with a T4 GPU takes approximately 2 hours and 10 minutes.

ExLlamaV2 utilizes the GPTQ algorithm under the hood, which reduces the precision of weights with minimal impact on output quality. You can learn more about GPTQ in relevant articles.

The choice of the “EXL2” format over the standard GPTQ format offers several advantages:

  • Versatile Quantization Levels: EXL2 is not limited to 4-bit precision; it supports 2, 3, 4, 5, 6, and 8-bit quantization.
  • Precision Mixing: It allows mixing different precisions within a model and even within individual layers, enabling the preservation of crucial weights and layers using more bits.

During quantization, ExLlamaV2 explores various quantization parameters and assesses the errors they introduce. Besides minimizing this error, ExLlamaV2 also aims to meet the target average number of bits per weight specified by the user. This approach allows the creation of quantized models with an average number of bits per weight like 3.5 or 4.5, adding to the flexibility and efficiency of the process.
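To see what a target bpw means in practice, here is a rough back-of-the-envelope sketch. It is illustrative only and ignores the extra storage needed for quantization metadata and the lm_head layer:

# Rough size estimate for quantized weights at a given average bits per weight.
def estimated_weight_size_gb(num_params: float, bpw: float) -> float:
    return num_params * bpw / 8 / 1e9  # bits -> bytes -> GB

for bpw in (2.0, 3.0, 4.5, 5.0, 8.0):
    print(f"7B model at {bpw} bpw ~ {estimated_weight_size_gb(7e9, bpw):.2f} GB")

# At 5.0 bpw, a 7B model's weights come to roughly 4.4 GB, versus about 14 GB in FP16.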

ExLlamaV2 offers a range of arguments for its convert.py script to facilitate the quantization process. Here’s a summary of these arguments and their functionalities:

  1. -i / --in_dir directory: The source model directory in HF format (FP16), containing at least a config.json, tokenizer.model, and .safetensors files. Sharded models with multiple weights files are supported.
  2. -o / --out_dir directory: The working directory that stores temporary files and the final output. The converter resumes interrupted jobs in this directory, using parameters from the job.json file present there. To prevent resuming, you need to edit this file.
  3. -nr / --no_resume: If set, the converter won't resume interrupted jobs and will instead start a new job, deleting all files in the non-empty working directory.
  4. -cf / --compile_full directory: Redirects the quantized weights to the specified directory. It also copies all files from the input directory except the original .safetensors files, creating a complete model directory for ExLlamaV2 inference.
  5. -om / --output_measurement file: Saves the measurement.json file to the specified path after the first (measurement) pass, and then exits the script.
  6. -m / --measurement file: Skips the measurement pass and uses the results from a provided file, useful for multiple quantizations of the same model.
  7. -c / --cal_dataset file: A required argument specifying the calibration dataset in Parquet format used for quantization.
  8. -l / --length int: Sets the token length of each calibration row, defaulting to 2048.
  9. -r / --dataset_rows int: Determines the number of rows in the calibration batch, with a default of 100.
  10. -ml / --measurement_length int: Defines the token length of each row in the measurement pass, defaulting to 2048.
  11. -mr / --measurement_rows int: Specifies the number of rows for the measurement-pass calibration batch, defaulting to 16.
  12. -gr / --gpu_rows int: A threshold for swapping calibration state to system RAM, which saves VRAM at some speed cost.
  13. -b / --bits float: Sets the target average number of bits per weight.
  14. -hb / --head_bits int: Specifies the number of bits for the lm_head layer, with options 2, 3, 4, 5, 6, and 8.
  15. -ss / --shard_size float: Determines the output shard size in megabytes, with a default of 8192. Setting it to 0 disables sharding.
  16. -ra / --rope_alpha float and -rs / --rope_scale float: RoPE (NTK) parameters applied during calibration.

The conversion process involves two passes: measurement to assess quantization impact and actual quantization to minimize error and achieve desired bitrate. Saving the measurement.json file is crucial for efficiency in subsequent quantizations.

Example usage scenarios include converting a model to a quantized version with all original files, running just the measurement pass, and using the measurement to quantize the model at different bitrates.
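For example, here is a minimal Python sketch of that last scenario, reusing the flags described above. The directory names such as quant_measure and quant_3.0bpw are illustrative assumptions; adjust the paths to match your setup:

import os
import subprocess

def run_convert(args):
    # Helper: call ExLlamaV2's convert.py with the given argument list.
    subprocess.run(["python", "exllamav2/convert.py"] + args, check=True)

# Pass 1: measurement only; save measurement.json and exit (-om).
os.makedirs("quant_measure", exist_ok=True)
run_convert([
    "-i", "base_model",
    "-o", "quant_measure",
    "-c", "wikitext-test.parquet",
    "-om", "measurement.json",
])

# Pass 2: reuse the measurement (-m) to quantize at several target bitrates (-b).
for bpw in ("3.0", "4.5", "6.0"):
    out_dir = f"quant_{bpw}bpw"
    os.makedirs(out_dir, exist_ok=True)
    run_convert([
        "-i", "base_model",
        "-o", out_dir,
        "-nr",                      # start a fresh job in this directory
        "-m", "measurement.json",
        "-c", "wikitext-test.parquet",
        "-b", bpw,
    ])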

Hardware Requirements:

  • For a 70B model, you’ll need about 24 GB of VRAM.
  • A 7B model requires approximately 8 GB of VRAM (a quick way to check your available VRAM is sketched below).
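Before kicking off a long quantization job, a quick check like the one below, assuming PyTorch with CUDA support is installed, can confirm that your GPU has enough VRAM:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)}, VRAM: {vram_gb:.1f} GB")
    # ~8 GB is enough for a 7B model, ~24 GB for a 70B model (see above).
else:
    print("No CUDA device detected; quantization requires a GPU.")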

The converter only works with safetensors weight files, so if your model ships only .bin files, I would recommend converting them to the safetensors format first using the code below:

import json
import os
import torch
from safetensors.torch import save_file

def convert_part_to_safetensors(part_filename, output_filename):
    # Load one .bin shard on the CPU and re-save its tensors in safetensors format.
    state_dict = torch.load(part_filename, map_location="cpu")
    if "state_dict" in state_dict:
        state_dict = state_dict["state_dict"]
    save_file(state_dict, output_filename)
    return list(state_dict.keys())

def create_index_file(weight_map, index_filename):
    # Write model.safetensors.index.json, mapping each tensor name to its shard file.
    index = {"metadata": {}, "weight_map": weight_map}
    with open(index_filename, "w") as f:
        json.dump(index, f, indent=4)

if __name__ == "__main__":
    model_parts = [
        r"C:\Users\339970\openchat_3.5\pytorch_model-00001-of-00002.bin",
        r"C:\Users\339970\openchat_3.5\pytorch_model-00002-of-00002.bin",
    ]
    output_files = [
        r"C:\Users\339970\openchat_3.5\model-00001-of-00002.safetensors",
        r"C:\Users\339970\openchat_3.5\model-00002-of-00002.safetensors",
    ]
    index_file = r"C:\Users\339970\openchat_3.5\model.safetensors.index.json"

    weight_map = {}
    for part_filename, output_filename in zip(model_parts, output_files):
        tensor_names = convert_part_to_safetensors(part_filename, output_filename)
        shard_name = os.path.basename(output_filename)
        for name in tensor_names:
            weight_map[name] = shard_name
        print(f"Converted {part_filename} to SafeTensors format: {output_filename}")

    create_index_file(weight_map, index_file)
    print(f"Index file created at {index_file}")

Inference with ExLlamaV2:

After quantizing the model, the next step is to prepare it for inference testing. This involves copying the essential configuration files from the base_model directory to the quant directory: copy every file except the hidden files (those starting with .) and the .safetensors files. The out_tensor directory created by ExLlamaV2 during quantization is also not required. You can then run a quick generation test with the test_inference.py script:

python exllamav2/test_inference.py -m quant/ -p "I have a dream"

It’s impressive to see that the ExLlamaV2-quantized model achieves a high generation speed of 56.44 tokens per second on a T4 GPU. This performance is noteworthy, especially when compared to other quantization methods and tools such as GGUF/llama.cpp or GPTQ.

Here is the output the model generated during this test run:

Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

-- Response generated
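If you prefer to drive the quantized model from your own Python code instead of the test script, the sketch below follows the pattern of the example scripts shipped with ExLlamaV2 at the time of writing; class and method names may differ between versions, so treat it as an outline rather than a definitive API reference:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the quantized model directory produced above.
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)          # split the model across available GPU memory
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

generator.warmup()
print(generator.generate_simple("I have a dream", settings, 150))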

For those who prefer a more interactive and flexible approach, ExLlamaV2 provides the option to use a chat interface with its chat.py script. This can be particularly useful if you want a more conversational experience with the quantized model. To use this feature, you would run the following command:

python exllamav2/examples/chat.py -m quant -mode llama

Here, -m quant specifies the directory where your quantized model is stored, and -mode llama sets the mode of operation to interact with the model in a chat-like format.

Additionally, if you plan to use an EXL2 model frequently, it’s worth noting that ExLlamaV2 has been integrated into various backends, including oobabooga’s text generation web UI. This integration can significantly enhance the ease of use and accessibility of the model.

However, to ensure optimal efficiency, especially with these integrations, it's important to have FlashAttention 2 installed. FlashAttention 2 is designed to maximize performance, but on Windows it requires CUDA 12.1. Keep this requirement in mind when setting up or configuring your system for ExLlamaV2 and EXL2 models.

Conclusion:

ExLlamaV2 is a potent tool for quantizing Large Language Models (LLMs). Its standout feature is the high throughput it offers in tokens generated per second, surpassing other quantization methods and runtimes such as GPTQ or llama.cpp. We applied ExLlamaV2 to openchat_3.5, creating a version quantized to 5.0 bits per weight (bpw) using the EXL2 format, and then ran performance tests on the quantized model to evaluate its effectiveness. This demonstrates ExLlamaV2's capability not only in quantization but also in improving the operational speed and efficiency of LLMs.

If you are interested, explore these related topics:

#Exllamav2

#Large Language Models

#Fast Inference

#Flash Attention

#Artificial Intelligence
