awq-quantization


Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
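As a rough guide to why 4-bit weights matter for 7B-70B models on limited GPU memory, the sketch below estimates weight storage only. The group size of 128 and the per-group FP16 scale plus 4-bit zero point are assumptions chosen for illustration, not values stated on this page, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(w_bits: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight including per-group metadata.

    The layout (one scale and one zero point per group) is an assumption
    made for this estimate.
    """
    return w_bits + (scale_bits + zero_bits) / group_size


if __name__ == "__main__":
    for n_params in (7e9, 70e9):
        fp16_gb = weight_memory_gb(n_params, 16)
        int4_gb = weight_memory_gb(n_params, effective_bits_per_weight())
        print(f"{n_params / 1e9:.0f}B params: "
              f"FP16 ~{fp16_gb:.1f} GB, 4-bit ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink to a little over a quarter of their FP16 size, which is what moves a model from multi-GPU to single-GPU territory in many setups.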

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT


Quality Score: 94/100


Skill Content

# AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

## When to use AWQ

**Use AWQ when:**
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

**Use GPTQ instead when:**
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support

**Use bitsandbytes instead when:**
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration

## Quick start

### Installation

```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]
```

**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

### Load pre-quantized model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum com...
```
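To illustrate the "preserves salient weights based on activation patterns" idea, here is a minimal pure-Python sketch, not the AutoAWQ implementation. It scales each input channel by its mean activation magnitude raised to a power `alpha`, quantizes the scaled weights, and picks the `alpha` that minimizes output error on calibration samples. The function names, the symmetric round-to-nearest quantizer, the single weight row, and the toy data are all assumptions made for illustration; the sketch also assumes every channel has a non-zero mean activation.

```python
from typing import List, Sequence, Tuple


def quantize_dequantize(weights: Sequence[float], n_bits: int = 4) -> List[float]:
    """Symmetric round-to-nearest quantization of one weight group, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    step = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def mean_abs_per_channel(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average activation magnitude of each input channel over the samples."""
    return [sum(abs(x[j]) for x in samples) / len(samples)
            for j in range(len(samples[0]))]


def scaled_quantize(weights: Sequence[float], act_mag: Sequence[float],
                    alpha: float, n_bits: int = 4) -> List[float]:
    """Quantize w * s with s = act_mag ** alpha, then map back to the original scale.

    Dividing by s here is equivalent to dividing the activations by s at run time.
    """
    scales = [m ** alpha for m in act_mag]
    deq = quantize_dequantize([w * s for w, s in zip(weights, scales)], n_bits)
    return [d / s for d, s in zip(deq, scales)]


def output_mse(weights: Sequence[float], approx: Sequence[float],
               samples: Sequence[Sequence[float]]) -> float:
    """Mean squared error of the layer output (a dot product) over the samples."""
    errs = [sum((w - a) * xj for w, a, xj in zip(weights, approx, x)) ** 2
            for x in samples]
    return sum(errs) / len(errs)


def search_alpha(weights: Sequence[float], samples: Sequence[Sequence[float]],
                 grid: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0)
                 ) -> Tuple[float, List[float]]:
    """Grid-search alpha; alpha = 0 reduces to plain round-to-nearest."""
    act_mag = mean_abs_per_channel(samples)
    best = min(grid, key=lambda a: output_mse(
        weights, scaled_quantize(weights, act_mag, a), samples))
    return best, scaled_quantize(weights, act_mag, best)


# Toy data: channel 0 carries much larger activations than the others.
weights = [0.5, 0.12, -0.9, 0.33]
samples = [[10.0, 1.0, -1.0, 0.5], [-8.0, 0.5, 1.0, -1.0]]

alpha, w_scaled = search_alpha(weights, samples)
w_plain = quantize_dequantize(weights)
print("chosen alpha:", alpha)
print("plain RTN output MSE:", output_mse(weights, w_plain, samples))
print("scaled output MSE:   ", output_mse(weights, w_scaled, samples))
```

Because `alpha = 0` is in the search grid, the selected result is never worse than plain round-to-nearest on the calibration samples. How much it helps depends on how unevenly activation magnitudes are spread across channels.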

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI
