gptq

Solid

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Skill Content

# GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

## When to use GPTQ

**Use GPTQ when:**
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)

**Use AWQ instead when:**
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)

**Use bitsandbytes instead when:**
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files

## Quick start

### Installation

```bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate
```

### Load pre-quantized model

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
...
```
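
The skill text mentions group-wise quantization and a roughly 4× memory reduction without showing what either means in practice. The sketch below illustrates only the storage side: plain round-to-nearest 4-bit quantization with one scale and one minimum per group. It is not the GPTQ algorithm itself, which additionally uses second-order (Hessian-based) error compensation when choosing the rounded values. The function names, the group size of 128, and the assumption that each group stores 32 bits of metadata (an FP16 scale plus an FP16 minimum) are illustrative choices, not taken from AutoGPTQ.

```python
import math


def quantize_groupwise(weights, group_size=128, bits=4):
    """Round-to-nearest quantization with one (scale, minimum) pair per group.

    Illustrative only: GPTQ proper also compensates rounding error using
    second-order information, which this sketch omits.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    codes, params = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Fall back to a scale of 1.0 when every weight in the group is equal.
        scale = (hi - lo) / qmax or 1.0
        params.append((scale, lo))
        for w in group:
            q = round((w - lo) / scale)
            codes.append(max(0, min(qmax, q)))  # clamp to the valid code range
    return codes, params


def dequantize_groupwise(codes, params, group_size=128):
    """Map integer codes back to approximate floating-point weights."""
    restored = []
    for i, q in enumerate(codes):
        scale, lo = params[i // group_size]
        restored.append(lo + q * scale)
    return restored


def effective_bits(bits=4, group_size=128, meta_bits=32):
    """Bits stored per weight, including per-group metadata.

    meta_bits=32 assumes an FP16 scale and an FP16 minimum per group
    (an assumption of this sketch, not a property of any specific format).
    """
    return bits + meta_bits / group_size


if __name__ == "__main__":
    weights = [math.sin(0.1 * i) for i in range(256)]
    codes, params = quantize_groupwise(weights)
    restored = dequantize_groupwise(codes, params)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs reconstruction error: {worst:.4f}")
    print(f"bits per weight: {effective_bits():.2f}")
    print(f"compression vs FP16: {16 / effective_bits():.2f}x")
```

Under these assumptions each weight costs 4.25 bits, so the weights shrink by about 3.76× relative to FP16, slightly under the nominal 4×. Smaller groups reduce quantization error but raise the metadata overhead, which is the trade-off the group size controls.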

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI
