
LLM Quantization Guide for Apple Silicon

Q4, Q5, Q6, Q8 — what does each level mean for speed, RAM, and quality? Measured data on M4 Max from real benchmark runs, not estimates.

~25%: typical speed gain of Q4 over Q8
~2×: RAM difference between Q4 and Q8 for the same model
Q4_K_M: best general starting point
Q8_0: near full-precision quality at ~2× the RAM

What quantization means

Quantization compresses model weights by reducing the precision of each number. A Q4 model stores each weight in 4 bits instead of 16 (BF16), making it roughly 4× smaller. The tradeoff is a small quality loss, often imperceptible in practice.

On Apple Silicon, quantization affects two things at once: RAM usage (a lower quant produces a smaller model that fits in memory) and speed (a lower quant runs faster because less data moves across the memory bus). Both matter for local inference.
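The size arithmetic can be sketched in a few lines. The effective bits-per-weight values and the ~10% overhead factor below are assumptions (real quant formats also store per-group scales, and runtimes add buffers), so treat the results as rough estimates:

```python
# Rough size estimator: parameters x effective bits per weight / 8.
# ASSUMPTIONS: the effective-bit values (quant formats also store per-group
# scales) and the 10% overhead for embeddings and runtime buffers.

BITS_PER_WEIGHT = {"BF16": 16.0, "Q8": 8.5, "Q6": 6.5, "Q5": 5.5, "Q4": 4.5}

def estimated_gb(params_billions: float, quant: str, overhead: float = 1.10) -> float:
    bits = BITS_PER_WEIGHT[quant]
    # billions of params x (bits / 8) bytes each = GB directly
    return params_billions * bits / 8 * overhead

print(f"Q4: {estimated_gb(4, 'Q4'):.2f} GB")  # close to the measured 2.54 GB
print(f"Q8: {estimated_gb(4, 'Q8'):.2f} GB")  # measured: 5.06 GB
```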

Why lower quant is faster on Apple Silicon

  • Apple Silicon LLM inference is memory-bandwidth-bound, not compute-bound
  • The bottleneck is moving weights from memory to compute — smaller weights means faster moves
  • Q4 moves ~4× less data per token than BF16 — faster throughput
  • Q8 is roughly 25–45% slower than Q4 for pure generation on the same chip, depending on model size

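The bandwidth argument above can be turned into a back-of-the-envelope ceiling. A minimal sketch, assuming the M4 Max's published ~546 GB/s peak memory bandwidth and that every weight byte is streamed once per token:

```python
# Bandwidth roofline: each generated token streams every weight once, so
# generation speed is bounded by (memory bandwidth) / (model bytes).
# ASSUMPTION: ~546 GB/s peak bandwidth for this M4 Max configuration;
# real runs reach only a fraction of peak.

def roofline_tok_s(model_gb: float, bandwidth_gb_s: float = 546.0) -> float:
    return bandwidth_gb_s / model_gb

# Qwen 3 4B at Q4 occupies 2.54 GB, giving a ~215 tok/s ceiling;
# the measured 148.1 tok/s is about 69% of that.
print(round(roofline_tok_s(2.54)))  # 215
```

The same formula explains why prompt processing is so much faster than generation: prompt tokens are batched, so the weights are streamed once for many tokens at a time.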
Quality tradeoffs

  • Q8_0: Near-identical to BF16 for most tasks. Perplexity difference is negligible.
  • Q6_K: Excellent quality. Hard to distinguish from Q8 in practice.
  • Q4_K_M: Good quality. Slight degradation on complex reasoning but excellent for chat and coding.
  • Q3_K_L / IQ3: Noticeable quality loss. Use only when RAM is severely constrained.
  • IQ2 / Q2: Significant degradation. Avoid for reasoning tasks.

Measured quantization ladder — Qwen 3 4B on M4 Max (40-core GPU, 64 GB)

All runs with MLX on the same chip. Same model, different quantization levels. Shows the speed vs RAM tradeoff directly.

Quantization   RAM usage   Generation tok/s   Prompt tok/s
Q4             2.54 GB     148.1              2977
Q4_G32         2.78 GB     149.1              2838
Q5             3.26 GB     143.2              2736
Q5_G32         3.50 GB     143.0              2755
Q6             3.98 GB     136.6              2736
Q8             5.06 GB     111.5              1781

Source: MLX benchmark gist by awni on M4 Max.

Measured quantization ladder — Qwen 3 30B A3B on M4 Max (40-core GPU, 64 GB)

Same chip, same model family, larger MoE model. Shows how quantization affects a memory-constrained 30B model.

Quantization   RAM usage   Generation tok/s   Prompt tok/s
Q4             16.12 GB    92.1               823
Q5             18.09 GB    84.9               820
Q6             21.87 GB    76.7               818
Q8             29.78 GB    52.6               773

At Q4, this MoE 30B model fits comfortably (16 GB) and runs at 92 tok/s. At Q8 it uses 29.8 GB and drops to 52.6 tok/s — still usable, but RAM becomes the constraint before quality becomes the reason to step up.
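As a quick sanity check, the Q4-to-Q8 tradeoff in this table works out to:

```python
# Measured numbers from the 30B A3B table above.
q4 = {"ram_gb": 16.12, "gen_tok_s": 92.1}
q8 = {"ram_gb": 29.78, "gen_tok_s": 52.6}

slowdown = 1 - q8["gen_tok_s"] / q4["gen_tok_s"]   # fraction of speed lost
ram_ratio = q8["ram_gb"] / q4["ram_gb"]            # RAM multiplier

print(f"Q8 is {slowdown:.0%} slower and uses {ram_ratio:.2f}x the RAM")
# Q8 is 43% slower and uses 1.85x the RAM
```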

Which quantization should you use?

Q4_K_M — the general recommendation

  • Best balance of quality, speed, and RAM efficiency
  • Supported by Ollama, llama.cpp, LM Studio, MLX
  • Good for: chat, coding assistance, Q&A, summarization
  • Start here. Step up to Q6_K if you have RAM headroom.

Q8_0 — near full precision

  • Perceptibly better on complex reasoning and math
  • Doubles RAM usage vs Q4 — same model needs 2× the memory
  • Good for: research, complex reasoning, long-form generation where quality matters
  • Use when you have RAM headroom and can accept roughly 25–45% lower generation speed

Q5_K_M / Q6_K — the sweet spot upgrade

  • Minimal quality improvement over Q4 for most tasks
  • ~25% more RAM than Q4, ~10–15% slower than Q4
  • Good for: users who want "slightly better" without doubling RAM
  • Niche: Q5_K_M is often the max you can fit when RAM is the constraint

IQ2 / Q2 / Q3 — constrained RAM only

  • Significant quality degradation at IQ2/Q2
  • Q3_K_L is borderline acceptable for chat, not for reasoning
  • Use case: fitting a model that wouldn't otherwise load at all
  • Example: Q3_K_S 70B in 36 GB when you don't have 64 GB

Quantization formats in the dataset

Quantizations with the most benchmark data in SiliconBench:

Q4_K - Medium: 164 rows  ·  Q4_0: 6 rows  ·  Q4_K_M: 4 rows  ·  Q8_0: 3 rows  ·  Q4: 2 rows  ·  Q5: 2 rows  ·  Q6: 2 rows  ·  Q8: 2 rows  ·  Q4_G32: 1 row  ·  Q5_G32: 1 row

The "Q4_K - Medium" label is LocalScore's normalized name for Q4_K_M. They are equivalent. Different tools use slightly different naming conventions for the same quantization.

benchmarks.json — full dataset  ·  models.json — model summaries
