LLM Quantization Guide for Apple Silicon
Q4, Q5, Q6, Q8 — what does each level mean for speed, RAM, and quality? Measured data on M4 Max from real benchmark runs, not estimates.
What quantization means
Quantization compresses model weights by reducing the precision of each number. A Q4 model stores each weight in 4 bits instead of 16 bits (BF16) — roughly 4× smaller. The tradeoff is a small quality loss, which is often imperceptible for most tasks.
On Apple Silicon, quantization affects two things at once: RAM usage (a lower quant means a smaller model that fits in memory) and speed (a lower quant is faster because less data moves across the memory bus). Both matter for local inference.
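The size arithmetic can be sketched in a few lines. This is a weights-only estimate; the ~0.5 extra bit per weight for per-group scales is an assumption, and real GGUF/MLX files add metadata and keep some tensors at higher precision:

```python
def weight_size_gb(n_params: float, effective_bits: float) -> float:
    """Approximate weights-only storage in GB for a model."""
    return n_params * effective_bits / 8 / 1e9

# Effective bits per weight = nominal width + an assumed ~0.5 bit of
# per-group scale overhead (BF16 has no such overhead).
for label, bits in [("BF16", 16.0), ("Q8", 8.5), ("Q6", 6.5), ("Q4", 4.5)]:
    print(f"4B model at {label}: {weight_size_gb(4e9, bits):.2f} GB")
```

The Q4 estimate (2.25 GB) lands in the same ballpark as the 2.54 GB measured below; the gap is embeddings, metadata, and tensors kept at higher precision.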
Why lower quant is faster on Apple Silicon
- Apple Silicon LLM inference is memory-bandwidth-bound, not compute-bound
- The bottleneck is moving weights from memory to the compute units — smaller weights mean faster transfers
- Q4 moves ~4× less data per token than BF16, so throughput is higher
- In the measured runs below, Q8 generates ~25–45% slower than Q4 on the same chip
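A rough roofline check makes the bandwidth argument concrete. The 546 GB/s figure is Apple's published M4 Max memory bandwidth, and the weights-only sizes assume ~4.5 and ~8.5 effective bits per weight for a 4B model; both are assumptions, not measurements:

```python
BANDWIDTH_GBS = 546.0  # published M4 Max unified-memory bandwidth

def decode_ceiling_toks(weight_gb: float,
                        bandwidth_gbs: float = BANDWIDTH_GBS) -> float:
    """Upper bound on tok/s if every weight streams once per token."""
    return bandwidth_gbs / weight_gb

# Weights-only sizes for a 4B model at assumed effective bit widths.
for quant, weight_gb in [("Q4", 2.25), ("Q8", 4.25)]:
    print(f"{quant}: ceiling ~{decode_ceiling_toks(weight_gb):.0f} tok/s")
```

The measured numbers below (148.1 tok/s at Q4, 111.5 at Q8) sit under these ceilings (~243 and ~128 tok/s), consistent with a bandwidth-bound decode.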
Quality tradeoffs
- Q8_0: Near-identical to BF16 for most tasks. Perplexity difference is negligible.
- Q6_K: Excellent quality. Hard to distinguish from Q8 in practice.
- Q4_K_M: Good quality. Slight degradation on complex reasoning but excellent for chat and coding.
- Q3_K_L / IQ3: Noticeable quality loss. Use only when RAM is severely constrained.
- IQ2 / Q2: Significant degradation. Avoid for reasoning tasks.
Measured quantization ladder — Qwen 3 4B on M4 Max (40-core GPU, 64 GB)
All runs with MLX on the same chip. Same model, different quantization levels. Shows the speed vs RAM tradeoff directly.
| Quantization | RAM usage | Generation tok/s |
|---|---|---|
| Q4 | 2.54 GB | 148.1 tok/s |
| Q4_G32 | 2.78 GB | 149.1 tok/s |
| Q5 | 3.26 GB | 143.2 tok/s |
| Q5_G32 | 3.5 GB | 143.0 tok/s |
| Q6 | 3.98 GB | 136.6 tok/s |
| Q8 | 5.06 GB | 111.5 tok/s |
Source: MLX benchmark gist by awni on M4 Max.
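The relative cost of each step up can be computed straight from the measured rows above:

```python
# (quant, RAM GB, tok/s) rows copied from the measured Qwen 3 4B ladder.
ladder = [("Q4", 2.54, 148.1), ("Q5", 3.26, 143.2),
          ("Q6", 3.98, 136.6), ("Q8", 5.06, 111.5)]

base_ram, base_tps = ladder[0][1], ladder[0][2]
for quant, ram, tps in ladder[1:]:
    ram_pct = (ram / base_ram - 1) * 100   # extra RAM vs Q4
    slow_pct = (1 - tps / base_tps) * 100  # generation slowdown vs Q4
    print(f"{quant} vs Q4: +{ram_pct:.0f}% RAM, -{slow_pct:.0f}% tok/s")
```

On this 4B model, Q8 costs roughly double the RAM of Q4 for about a quarter less speed, while Q5 and Q6 sit in between.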
Measured quantization ladder — Qwen 3 30B A3B on M4 Max (40-core GPU, 64 GB)
Same chip, same model family, larger MoE model. Shows how quantization affects a memory-constrained 30B model.
| Quantization | RAM usage | Generation tok/s |
|---|---|---|
| Q4 | 16.12 GB | 92.1 tok/s |
| Q5 | 18.09 GB | 84.9 tok/s |
| Q6 | 21.87 GB | 76.7 tok/s |
| Q8 | 29.78 GB | 52.6 tok/s |
At Q4, this MoE 30B model fits comfortably (16 GB) and runs at 92 tok/s. At Q8 it uses 29.8 GB and drops to 52.6 tok/s — still usable, but RAM becomes the constraint before quality becomes the reason to step up.
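A simple headroom check against the measured 30B footprints above shows where RAM becomes the constraint. The 8 GB reserved for the OS and other apps is an assumption; adjust it to your setup:

```python
# Measured RAM footprints of Qwen 3 30B A3B from the table above.
MEASURED_30B = {"Q4": 16.12, "Q5": 18.09, "Q6": 21.87, "Q8": 29.78}

def quants_that_fit(total_ram_gb: float, reserved_gb: float = 8.0) -> list[str]:
    """Return quant levels whose measured footprint fits in the budget."""
    budget = total_ram_gb - reserved_gb
    return [q for q, gb in MEASURED_30B.items() if gb <= budget]

print(quants_that_fit(64))  # all four levels fit on a 64 GB machine
print(quants_that_fit(32))  # Q8 (29.78 GB) no longer fits with headroom
print(quants_that_fit(26))  # only Q4 fits
```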
Which quantization should you use?
Q4_K_M — the general recommendation
- Best balance of quality, speed, and RAM efficiency
- Supported by Ollama, llama.cpp, LM Studio, MLX
- Good for: chat, coding assistance, Q&A, summarization
- Start here. Step up to Q6_K if you have RAM headroom.
Q8_0 — near full precision
- Perceptibly better on complex reasoning and math
- Doubles RAM usage vs Q4 — same model needs 2× the memory
- Good for: research, complex reasoning, long-form generation where quality matters
- Use when you have RAM headroom and can accept ~25–45% lower generation speed
Q5_K_M / Q6_K — the sweet spot upgrade
- Minimal quality improvement over Q4 for most tasks
- ~25–55% more RAM than Q4 and ~3–17% slower, depending on model and level (see the measured ladders above)
- Good for: users who want "slightly better" without doubling RAM
- Niche: Q5_K_M is often the max you can fit when RAM is the constraint
IQ2 / Q2 / Q3 — constrained RAM only
- Significant quality degradation at IQ2/Q2
- Q3_K_L is borderline acceptable for chat, not for reasoning
- Use case: fitting a model that wouldn't otherwise load at all
- Example: Q3_K_S 70B in 36 GB when you don't have 64 GB
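The guidance above condenses into a small rule of thumb. The thresholds are this guide's recommendations rather than hard limits, and the ~1.5×/~2× size multipliers for Q6/Q8 follow the measured ladders:

```python
def pick_quant(q4_size_gb: float, free_ram_gb: float) -> str:
    """Rule-of-thumb quant choice based on the recommendations above.

    q4_size_gb: model footprint at Q4 (measured or estimated);
                Q6 is roughly 1.5x that, Q8 roughly 2x.
    free_ram_gb: RAM you can spare for the model.
    """
    if free_ram_gb >= q4_size_gb * 2:
        return "Q8_0"      # headroom for near-full precision
    if free_ram_gb >= q4_size_gb * 1.5:
        return "Q6_K"      # the sweet-spot upgrade
    if free_ram_gb >= q4_size_gb:
        return "Q4_K_M"    # the general recommendation
    return "Q3_K_L"        # constrained RAM: quality loss, but it loads

print(pick_quant(q4_size_gb=16.12, free_ram_gb=56))  # 30B MoE on 64 GB
print(pick_quant(q4_size_gb=40.0, free_ram_gb=36))   # large 70B on 48 GB
```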
Quantization formats in the dataset
SiliconBench's benchmark data covers several quantization formats. Note that the "Q4_K - Medium" label is LocalScore's normalized name for Q4_K_M — the two are equivalent; different tools use slightly different naming conventions for the same quantization.
Data
benchmarks.json — full dataset · models.json — model summaries