Vector Post-Training Quantization (VPTQ) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.
- Better accuracy at 1-2 bits (405B @ <2-bit, 70B @ 2-bit)
- Lightweight quantization algorithm: only ~17 hours to quantize the 405B Llama-3.1 model
- Agile quantization inference: low decode overhead, best throughput, and TTFT
- [2024-10-21] 🌐 Open source community contributes Meta Llama 3.1 405B @ 3/4 bits models
- [2024-10-18] 🌐 Open source community contributes Mistral Large Instruct 2407 (123B) models
- [2024-10-14] 🚀 Add early ROCm support.
- [2024-10-06] 🚀 Try VPTQ on Google Colab.
- [2024-10-05] 🚀 Add free Huggingface Demo: Huggingface Demo
- [2024-10-04] ✏️ Updated the VPTQ tech report and fixed typos.
- [2024-09-20] 🌐 Inference code is now open-sourced on GitHub; join us and contribute!
- [2024-09-20] 🎉 VPTQ paper has been accepted for the main track at EMNLP 2024.
- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
- Accelerate >= 0.33.0
- latest datasets
Recommended: To save time building the package, please install VPTQ directly from the latest release:
https://github.com/microsoft/VPTQ/releases
[Not required if installing from the release package]
Preparation steps that might be needed: set up the CUDA PATH.
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust depending on your environment
Compiling the CUDA kernels will take several minutes, so please be patient. The current compilation builds for SM 7.0, 7.5, 8.0, 8.6, and 9.0 to reduce compilation time. You can set TORCH_CUDA_ARCH_LIST
to your specific architecture.
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
Example: Run Llama 3.1 70B on an RTX 4090 (24 GB @ ~2 bits) in real time
VPTQ is an ongoing project. If the open-source community is interested in optimizing and expanding VPTQ, please feel free to open an issue or DM.
Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):
- Model Naming Convention: The model's name includes the vector length $v$, codebook (lookup table) size, and residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is "Meta-Llama-3.1-70B-Instruct" quantized with:
  - Vector Length: 8
  - Number of Centroids: 65536 (2^16)
  - Number of Residual Centroids: 256 (2^8)
- Equivalent Bitwidth Calculation:
  - Index: log2(65536) / vector length = 16 / 8 = 2 bits
  - Residual Index: log2(256) / vector length = 8 / 8 = 1 bit
  - Total Bitwidth: 2 + 1 = 3 bits
- Model Size Estimation: 70B * 3 bits / 8 bits per byte = 26.25 GB
- Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, or the padding overhead for storing indices. For the detailed calculation method, please refer to Tech Report Appendix C.2.
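The arithmetic above can be scripted. Below is a minimal sketch of that estimate; the helper name and the regex over the naming convention are illustrative assumptions, not part of the VPTQ package.

```python
import math
import re

def estimate_bitwidth(model_name: str, num_params_billion: float) -> float:
    """Rough equivalent bitwidth and size estimate, excluding codebook overhead.

    Illustrative helper only (not part of the VPTQ package); it assumes the
    "-v{length}-k{centroids}-{residual}-" naming convention described above.
    """
    v, k, res = map(int, re.search(r"-v(\d+)-k(\d+)-(\d+)-", model_name).groups())
    index_bits = math.log2(k) / v                            # 16 / 8 = 2 bits
    residual_bits = math.log2(res) / v if res > 1 else 0.0   # 8 / 8 = 1 bit
    total_bits = index_bits + residual_bits                  # 2 + 1 = 3 bits
    size_gb = num_params_billion * total_bits / 8            # 70 * 3 / 8 = 26.25 GB
    print(f"{total_bits:.2f} bits/weight, ~{size_gb:.2f} GB (codebook excluded)")
    return total_bits

estimate_bitwidth("Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft", 70)
```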
To generate text using the pre-trained model, you can use the following code snippet:
The model VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (~2 bit) is provided by the open source community. The repository cannot guarantee the performance of those models.
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
Launching a chatbot:
Note that you must use a chat model for this to work.
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat
Using the Python API:
import vptq
import transformers

# Load the tokenizer and the VPTQ-quantized model onto available GPUs
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft", device_map='auto')

inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
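For the chat model used by the `--chat` command above, an equivalent Python flow can reuse the `tokenizer` and `m` objects from the snippet above. The sketch below relies on the standard transformers chat-template API rather than any VPTQ-specific interface.

```python
# Sketch: chat-style generation, reusing `tokenizer` and `m` from the snippet above.
# Uses the standard transformers chat template; not a VPTQ-specific API.
messages = [{"role": "user", "content": "Explain: Do Not Go Gentle into That Good Night"}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
chat_out = m.generate(chat_ids, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(chat_out[0], skip_special_tokens=True))
```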
An environment variable is available to control whether a public share link is created.
export SHARE_LINK=1
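How the flag is consumed is an implementation detail of the demo app; purely as an illustration (not the actual VPTQ app code), a Gradio-style launcher would typically read it like this:

```python
import os

# Illustrative only (not the actual VPTQ app code): read a SHARE_LINK-style
# flag and forward it to gradio's launch(), which accepts a `share` argument.
share = os.environ.get("SHARE_LINK", "0") == "1"
# demo.launch(share=share)
```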
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit levels. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
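As a toy illustration of the lookup-table idea (this is only the basic index-to-centroid dequantization that VQ builds on, not the VPTQ algorithm itself):

```python
import numpy as np

# Toy VQ dequantization: each 16-bit index selects an 8-element centroid,
# so 8 fp16 weights are stored as 16 bits -> 2 bits per weight.
vector_length = 8
num_centroids = 65536                                   # 2^16
codebook = np.random.randn(num_centroids, vector_length).astype(np.float16)
indices = np.random.randint(0, num_centroids,
                            size=(4096 * 4096) // vector_length, dtype=np.uint16)

weights = codebook[indices].reshape(4096, 4096)         # reconstructed weight matrix
print(weights.shape, weights.dtype)                     # (4096, 4096) float16
```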
Read the tech report at Tech Report and arXiv Paper.
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.
| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
| | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2 |
| | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |
- Yifei Liu (@lyf-00)
- Jicheng Wen (@wejoncy)
- Yang Wang (@YangWang92)
- We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation are invaluable to this research.
- We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.
EMNLP 2024 Main
@inproceedings{
vptq,
title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
author={Yifei Liu and
Jicheng Wen and
Yang Wang and
Shengyu Ye and
Li Lyna Zhang and
Ting Cao and
Cheng Li and
Mao Yang},
booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
year={2024},
}
⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before you use it.
⚠️ The repository only provides a model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm by themselves, but the repository cannot guarantee the performance of those models.
⚠️ VPTQ is not capable of testing all potential applications and domains, and VPTQ cannot guarantee the accuracy and effectiveness of VPTQ across other tasks or scenarios.
⚠️ Our tests are all based on English texts; other languages are not included in the current testing.
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.
For more information, see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
Microsoft's Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.