awq-quantization

Solid

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

- Stars (20%): 100
- Recency (20%): 75
- Frontmatter (20%): 70
- Documentation (15%): 100
- Issue Health (10%): 50
- License (10%): 100
- Description (5%): 100

Skill Content

# AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

## When to use AWQ

**Use AWQ when:**
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

**Use GPTQ instead when:**
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support

**Use bitsandbytes instead when:**
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration

## Quick start

### Installation

```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
# (extras are quoted so shells such as zsh do not expand the brackets)
pip install "autoawq[kernels]"

# Intel CPU/XPU optimization
pip install "autoawq[cpu]"
```

**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

### Load pre-quantized model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum com...
```
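The phrase "preserves salient weights based on activation patterns" can be made concrete with a toy example. The following is a minimal pure-Python sketch of the general idea, not the AutoAWQ implementation: input channels that see large activations are scaled up before round-to-nearest quantization, and a small grid search picks the scaling exponent that minimizes output error on calibration data. All function names (`quantize_rtn`, `awq_quantize`, `search_alpha`), the tiny group size, and the synthetic data are assumptions made for illustration.

```python
import random


def quantize_rtn(row, n_bits=4, group_size=4):
    """Group-wise asymmetric round-to-nearest; returns the dequantized row."""
    q_max = 2 ** n_bits - 1
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / q_max or 1.0  # constant group: any non-zero step works
        out.extend(round((w - lo) / step) * step + lo for w in group)
    return out


def output_error(weights, quantized, calib):
    """Mean squared difference between W @ x and W_q @ x over calibration inputs."""
    total = 0.0
    for x in calib:
        for w_row, q_row in zip(weights, quantized):
            diff = sum((w - q) * xi for w, q, xi in zip(w_row, q_row, x))
            total += diff * diff
    return total / (len(calib) * len(weights))


def awq_quantize(weights, calib, alpha, n_bits=4, group_size=4):
    """Scale input channels by (mean |activation|) ** alpha, quantize, unscale."""
    n_in = len(weights[0])
    act = [sum(abs(x[j]) for x in calib) / len(calib) for j in range(n_in)]
    scales = [max(a, 1e-4) ** alpha for a in act]  # alpha = 0 gives plain RTN
    result = []
    for row in weights:
        scaled = [w * s for w, s in zip(row, scales)]
        deq = quantize_rtn(scaled, n_bits, group_size)
        # Dividing by s here is equivalent to dividing the activations by s.
        result.append([q / s for q, s in zip(deq, scales)])
    return result


def search_alpha(weights, calib, grid=11):
    """Grid-search alpha in [0, 1]; returns (best_alpha, best_error)."""
    best = None
    for i in range(grid):
        alpha = i / (grid - 1)
        err = output_error(weights, awq_quantize(weights, calib, alpha), calib)
        if best is None or err < best[1]:
            best = (alpha, err)
    return best


# Synthetic data (assumption): channel 0 carries 10x larger activations.
random.seed(0)
n_out, n_in = 4, 8
weights = [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
channel_gain = [10.0] + [1.0] * (n_in - 1)
calib = [[random.gauss(0.0, 1.0) * g for g in channel_gain] for _ in range(16)]

rtn_error = output_error(weights, awq_quantize(weights, calib, alpha=0.0), calib)
best_alpha, best_error = search_alpha(weights, calib)
print(f"plain RTN error: {rtn_error:.6f}")
print(f"best alpha: {best_alpha:.1f}, error: {best_error:.6f}")
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration set. How much it improves depends on the data. Real implementations add steps this sketch omits, such as weight clipping and fusing the scales into the preceding layer.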

Details

- Author: Orchestra-Research
- Repository: Orchestra-Research/AI-Research-SKILLs
- Created: 7 months ago
- Last Updated: 1 month ago
- Language: TeX
- License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

27,705 Updated today
davila7
AI & Automation Featured

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

27,705 Updated today
davila7
AI & Automation Solid

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Featured

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

27,705 Updated today
davila7
AI & Automation Solid

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

9,182 Updated 1 month ago
Orchestra-Research