gptq

Solid

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use it to deploy large models (70B, 405B) on consumer GPUs, when you need a 4× memory reduction with <2% perplexity degradation, or when you want faster inference (3-4× speedup vs FP16). Integrates with transformers and PEFT for QLoRA fine-tuning.

AI & Automation | 9,182 stars | 697 forks | Updated 1 month ago | MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%): 75
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

## When to use GPTQ

**Use GPTQ when:**
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)

**Use AWQ instead when:**
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)

**Use bitsandbytes instead when:**
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files

## Quick start

### Installation

```bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate
```

### Load pre-quantized model

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
...
```
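As a rough check on the "4× memory reduction" figure quoted under "When to use GPTQ", the sketch below estimates weight-only memory for FP16 and for 4-bit group-wise quantization. The helper names (`bits_per_weight`, `weight_memory_gb`) are hypothetical, and the storage layout is an assumption rather than something stated in this skill: one 16-bit scale and one 4-bit zero point per group of 128 weights. Activations, the KV cache, and framework overhead are ignored.

```python
def bits_per_weight(weight_bits: int = 4, group_size: int = 128,
                    scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Effective bits per weight for group-wise quantization.

    Assumption (not from the skill text): each group of `group_size`
    weights stores one scale and one zero point alongside the weights.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


def weight_memory_gb(num_params: float, bits: float) -> float:
    """Weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


# Compare FP16 against 4-bit, group size 128, for two model sizes.
for name, params in [("7B", 7e9), ("70B", 70e9)]:
    fp16_gb = weight_memory_gb(params, 16)
    int4_gb = weight_memory_gb(params, bits_per_weight())
    print(f"{name}: FP16 {fp16_gb:.2f} GB, 4-bit {int4_gb:.2f} GB, "
          f"ratio {fp16_gb / int4_gb:.2f}x")
```

Under these assumptions, a 70B model needs about 140 GB for its weights in FP16 and about 36 GB at 4 bits. The ratio comes out near 3.85× rather than exactly 4× because the per-group scales and zero points add a small overhead.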

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

27,705 Updated today
davila7
AI & Automation Featured

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

27,705 Updated today
davila7
AI & Automation Solid

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Featured

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

27,705 Updated today
davila7
AI & Automation Solid

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

9,182 Updated 1 month ago
Orchestra-Research