quantizing-models-bitsandbytes

Featured

Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, when you need to fit larger models, or when you want faster inference. Supports INT8, NF4, and FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

AI & Automation · 27,984 stars · 2,901 forks · Updated today · MIT

Quality Score: 99/100

Stars · 20% · 100
Recency · 20% · 100
Frontmatter · 20%
Documentation · 15% · 100
Issue Health · 10%
License · 10% · 100
Description · 5% · 100

Skill Content

# bitsandbytes - LLM Quantization

## Quick start

bitsandbytes reduces LLM weight memory by about 50% (8-bit) or 75% (4-bit) relative to FP16, typically with <1% accuracy loss.

**Installation**:

```bash
pip install bitsandbytes transformers accelerate
```

**8-bit quantization** (50% memory reduction):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 8-bit instead of FP16
config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 7GB
```

**4-bit quantization** (75% memory reduction):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store weights in 4-bit; run matrix multiplications in FP16
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 3.5GB
```

## Common workflows

### Workflow 1: Load large model in limited GPU memory

Copy this checklist:

```
Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model
```

**Step 1: Calculate memory requirements**

Estimate model memory (weights only):

```
FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9

Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
```

**Step 2: C...
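The Step 1 estimate above (parameters × bytes per parameter / 1e9) can be sketched in a few lines of Python. This is a minimal illustration, not part of bitsandbytes: the function names `estimate_weight_memory_gb` and `pick_precision`, and the `headroom` fraction reserved for activations and the KV cache, are assumptions introduced here. The estimate covers weights only, so real usage will be somewhat higher.

```python
from typing import Optional

# Bytes per parameter, taken from the Step 1 formulas above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_gb(num_params: float, precision: str) -> float:
    """Weights-only estimate: parameters × bytes per parameter / 1e9."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def pick_precision(num_params: float, gpu_memory_gb: float,
                   headroom: float = 0.8) -> Optional[str]:
    """Hypothetical helper: return the highest precision whose weights fit.

    `headroom` is an assumed fraction of GPU memory available for weights;
    the rest is left for activations, KV cache, and framework overhead.
    """
    budget_gb = gpu_memory_gb * headroom
    for precision in ("fp16", "int8", "int4"):  # highest precision first
        if estimate_weight_memory_gb(num_params, precision) <= budget_gb:
            return precision
    return None  # does not fit even at 4-bit


if __name__ == "__main__":
    params = 7e9  # Llama 2 7B, as in the example above
    for precision in BYTES_PER_PARAM:
        print(precision, estimate_weight_memory_gb(params, precision), "GB")
    print("8 GB GPU:", pick_precision(params, gpu_memory_gb=8))
```

With an 80% headroom assumption, a 7B model does not fit in 8-bit on an 8 GB card (7 GB of weights against a 6.4 GB budget), so the sketch falls back to 4-bit. Adjust `headroom` to match your context length and batch size.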

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic (AI) · Hugging Face (AI)

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

quantizing-models-bitsandbytes

9,609 Updated 1 month ago

Orchestra-Research

AI & Automation Solid

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

9,609 Updated 1 month ago

Orchestra-Research

AI & Automation Featured

gptq

27,984 Updated today

davila7