LLM Quantization Guide for Apple Silicon
Q4, Q5, Q6, Q8 — what does each level mean for speed, RAM, and quality? Measured data on M4 Max from real benchmark runs, not estimates.
What quantization means
Quantization compresses model weights by reducing the precision of each number. A Q4 model stores each weight in 4 bits instead of 16 bits (BF16) — roughly 4× smaller. The tradeoff is a small quality loss, which is often imperceptible for most tasks.
On Apple Silicon, quantization affects two things at once: RAM usage (a lower quant means a smaller model that fits in memory) and speed (a lower quant is faster because less data moves across the memory bus). Both matter for local inference.
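The size arithmetic can be sketched in a few lines. This is a weights-only estimate; the ~0.5 extra bit per weight for per-group scales is an assumption, and real GGUF/MLX files add metadata and keep some tensors at higher precision:

```python
def weight_size_gb(n_params: float, effective_bits: float) -> float:
    """Approximate weights-only storage in GB for a model."""
    return n_params * effective_bits / 8 / 1e9

# Effective bits per weight = nominal width + an assumed ~0.5 bit of
# per-group scale overhead (BF16 has no such overhead).
for label, bits in [("BF16", 16.0), ("Q8", 8.5), ("Q6", 6.5), ("Q4", 4.5)]:
    print(f"4B model at {label}: {weight_size_gb(4e9, bits):.2f} GB")
```

The Q4 estimate (2.25 GB) lands in the same ballpark as the 2.54 GB measured below; the gap is embeddings, metadata, and tensors kept at higher precision.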
Why lower quant is faster on Apple Silicon
- Apple Silicon LLM inference is memory-bandwidth-bound, not compute-bound
- The bottleneck is moving weights from memory to the compute units — smaller weights mean faster transfers
- Q4 moves ~4× less data per token than BF16, so throughput is higher
- In the measured runs below, Q8 generates ~25–45% slower than Q4 on the same chip
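A rough roofline check makes the bandwidth argument concrete. The 546 GB/s figure is Apple's published M4 Max memory bandwidth, and the weights-only sizes assume ~4.5 and ~8.5 effective bits per weight for a 4B model; both are assumptions, not measurements:

```python
BANDWIDTH_GBS = 546.0  # published M4 Max unified-memory bandwidth

def decode_ceiling_toks(weight_gb: float,
                        bandwidth_gbs: float = BANDWIDTH_GBS) -> float:
    """Upper bound on tok/s if every weight streams once per token."""
    return bandwidth_gbs / weight_gb

# Weights-only sizes for a 4B model at assumed effective bit widths.
for quant, weight_gb in [("Q4", 2.25), ("Q8", 4.25)]:
    print(f"{quant}: ceiling ~{decode_ceiling_toks(weight_gb):.0f} tok/s")
```

The measured numbers below (148.1 tok/s at Q4, 111.5 at Q8) sit under these ceilings (~243 and ~128 tok/s), consistent with a bandwidth-bound decode.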
Quality tradeoffs
- Q8_0: Near-identical to BF16 for most tasks. Perplexity difference is negligible.
- Q6_K: Excellent quality. Hard to distinguish from Q8 in practice.
- Q4_K_M: Good quality. Slight degradation on complex reasoning but excellent for chat and coding.
- Q3_K_L / IQ3: Noticeable quality loss. Use only when RAM is severely constrained.
- IQ2 / Q2: Significant degradation. Avoid for reasoning tasks.
Measured quantization ladder — Qwen 3 4B on M4 Max (40-core GPU, 64 GB)
All runs with MLX on the same chip. Same model, different quantization levels. Shows the speed vs RAM tradeoff directly.
| Quantization | RAM usage | Generation tok/s |
|---|---|---|
| Q4 | 2.54 GB | 148.1 tok/s |
| Q4_G32 | 2.78 GB | 149.1 tok/s |
| Q5 | 3.26 GB | 143.2 tok/s |
| Q5_G32 | 3.5 GB | 143.0 tok/s |
| Q6 | 3.98 GB | 136.6 tok/s |
| Q8 | 5.06 GB | 111.5 tok/s |
Source: MLX benchmark gist by awni on M4 Max.
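The relative cost of each step up can be computed straight from the measured rows above:

```python
# (quant, RAM GB, tok/s) rows copied from the measured Qwen 3 4B ladder.
ladder = [("Q4", 2.54, 148.1), ("Q5", 3.26, 143.2),
          ("Q6", 3.98, 136.6), ("Q8", 5.06, 111.5)]

base_ram, base_tps = ladder[0][1], ladder[0][2]
for quant, ram, tps in ladder[1:]:
    ram_pct = (ram / base_ram - 1) * 100   # extra RAM vs Q4
    slow_pct = (1 - tps / base_tps) * 100  # generation slowdown vs Q4
    print(f"{quant} vs Q4: +{ram_pct:.0f}% RAM, -{slow_pct:.0f}% tok/s")
```

On this 4B model, Q8 costs roughly double the RAM of Q4 for about a quarter less speed, while Q5 and Q6 sit in between.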
Measured quantization ladder — Qwen 3 30B A3B on M4 Max (40-core GPU, 64 GB)
Same chip, same model family, larger MoE model. Shows how quantization affects a memory-constrained 30B model.
| Quantization | RAM usage | Generation tok/s |
|---|---|---|
| Q4 | 16.12 GB | 92.1 tok/s |
| Q5 | 18.09 GB | 84.9 tok/s |
| Q6 | 21.87 GB | 76.7 tok/s |
| Q8 | 29.78 GB | 52.6 tok/s |
At Q4, this MoE 30B model fits comfortably (16 GB) and runs at 92 tok/s. At Q8 it uses 29.8 GB and drops to 52.6 tok/s — still usable, but RAM becomes the constraint before quality becomes the reason to step up.
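A simple headroom check against the measured 30B footprints above shows where RAM becomes the constraint. The 8 GB reserved for the OS and other apps is an assumption; adjust it to your setup:

```python
# Measured RAM footprints of Qwen 3 30B A3B from the table above.
MEASURED_30B = {"Q4": 16.12, "Q5": 18.09, "Q6": 21.87, "Q8": 29.78}

def quants_that_fit(total_ram_gb: float, reserved_gb: float = 8.0) -> list[str]:
    """Return quant levels whose measured footprint fits in the budget."""
    budget = total_ram_gb - reserved_gb
    return [q for q, gb in MEASURED_30B.items() if gb <= budget]

print(quants_that_fit(64))  # all four levels fit on a 64 GB machine
print(quants_that_fit(32))  # Q8 (29.78 GB) no longer fits with headroom
print(quants_that_fit(26))  # only Q4 fits
```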
Which quantization should you use?
Q4_K_M — the general recommendation
- Best balance of quality, speed, and RAM efficiency
- Supported by Ollama, llama.cpp, LM Studio, MLX
- Good for: chat, coding assistance, Q&A, summarization
- Start here. Step up to Q6_K if you have RAM headroom.
Q8_0 — near full precision
- Perceptibly better on complex reasoning and math
- Doubles RAM usage vs Q4 — same model needs 2× the memory
- Good for: research, complex reasoning, long-form generation where quality matters
- Use when you have RAM headroom and can accept ~25–45% lower generation speed
Q5_K_M / Q6_K — the sweet spot upgrade
- Minimal quality improvement over Q4 for most tasks
- ~25–55% more RAM than Q4 and ~3–17% slower, depending on model and level (see the measured ladders above)
- Good for: users who want "slightly better" without doubling RAM
- Niche: Q5_K_M is often the max you can fit when RAM is the constraint
IQ2 / Q2 / Q3 — constrained RAM only
- Significant quality degradation at IQ2/Q2
- Q3_K_L is borderline acceptable for chat, not for reasoning
- Use case: fitting a model that wouldn't otherwise load at all
- Example: Q3_K_S 70B in 36 GB when you don't have 64 GB
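The guidance above condenses into a small rule of thumb. The thresholds are this guide's recommendations rather than hard limits, and the ~1.5×/~2× size multipliers for Q6/Q8 follow the measured ladders:

```python
def pick_quant(q4_size_gb: float, free_ram_gb: float) -> str:
    """Rule-of-thumb quant choice based on the recommendations above.

    q4_size_gb: model footprint at Q4 (measured or estimated);
                Q6 is roughly 1.5x that, Q8 roughly 2x.
    free_ram_gb: RAM you can spare for the model.
    """
    if free_ram_gb >= q4_size_gb * 2:
        return "Q8_0"      # headroom for near-full precision
    if free_ram_gb >= q4_size_gb * 1.5:
        return "Q6_K"      # the sweet-spot upgrade
    if free_ram_gb >= q4_size_gb:
        return "Q4_K_M"    # the general recommendation
    return "Q3_K_L"        # constrained RAM: quality loss, but it loads

print(pick_quant(q4_size_gb=16.12, free_ram_gb=56))  # 30B MoE on 64 GB
print(pick_quant(q4_size_gb=40.0, free_ram_gb=36))   # large 70B on 48 GB
```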
Quantization formats in the dataset
SiliconBench's benchmark data covers several quantization formats. Note that the "Q4_K - Medium" label is LocalScore's normalized name for Q4_K_M — the two are equivalent; different tools use slightly different naming conventions for the same quantization.
Data
benchmarks.json — full dataset · models.json — model summaries