M1 vs M2 vs M3 vs M4 — LLM Inference by Generation
How much faster is each Apple Silicon generation for local LLM inference? Measured tokens/second across every chip tier, from the base M1 to the M4 Max, running the same models.
Generation comparison — Llama 3.1 8B Instruct (Q4_K_M)
Best published result per chip generation. The bars show relative throughput for Llama 3.1 8B. Data from LocalScore community benchmarks (Q4_K - Medium quantization). Outliers removed.
[Bar chart: best published Llama 3.1 8B (Q4_K_M) tok/s per chip, M1 through M4 generations; per-chip values are in benchmarks.json]

| Chip | Configuration benchmarked |
|---|---|
| M1 | 8-core GPU, 8 GB |
| M1 Pro | 16-core GPU, 16 GB |
| M1 Max | 32-core GPU, 64 GB |
| M1 Ultra | 64-core GPU, 128 GB |
| M2 | 8-core GPU, 8 GB |
| M2 Pro | 19-core GPU, 32 GB |
| M2 Max | 38-core GPU, 96 GB |
| M2 Ultra | 60-core GPU, 64 GB |
| M3 | 10-core GPU, 16 GB |
| M3 Pro | 18-core GPU, 36 GB |
| M3 Max | 40-core GPU, 128 GB |
| M3 Ultra | 80-core GPU, 256 GB |
| M4 | 10-core GPU, 32 GB |
| M4 Pro | 20-core GPU, 64 GB |
| M4 Max | 40-core GPU, 48 GB |
Source: benchmarks.json. All rows from LocalScore community benchmark aggregation.
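If you want a comparable number for your own machine, a minimal end-to-end sketch using llama-cpp-python looks like the following. This is an assumption on my part, not the LocalScore harness: LocalScore measures prompt processing and generation separately, so this crude wall-clock figure will not match its reported numbers exactly. The model path is a placeholder for a locally downloaded Q4_K_M GGUF.

```python
# Rough tok/s measurement with llama-cpp-python (pip install llama-cpp-python).
# Not the LocalScore harness; wall-clock time includes prompt processing.
import time
from llama_cpp import Llama

MODEL_PATH = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"  # placeholder path

# n_gpu_layers=-1 offloads all layers to the Metal backend on Apple Silicon
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, verbose=False)

start = time.perf_counter()
out = llm("Explain unified memory in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```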
Key findings
Base chip (M1/M2/M3/M4)
- M4 base is ~20–30% faster than M3 base at the same RAM
- M2 base is faster than M3 base in some configurations (both ship with 100 GB/s memory bandwidth, so memory-bound generation speed barely moved from M2 to M3)
- The top base-M4 configuration (10-core GPU, 32 GB) significantly outperforms earlier base chips
- RAM ceiling at base tier is 24–32 GB — limits 14B+ model fit
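The RAM ceilings quoted throughout these findings translate directly into which model sizes fit. Here is a back-of-the-envelope sketch, using rule-of-thumb figures of roughly 4.8 bits per weight for Q4_K_M and 8.5 for Q8_0 plus a fixed headroom allowance; these are approximations, not measured GGUF sizes.

```python
# Rough fit check: does a quantized model fit in a given unified-memory config?
# Bits-per-weight values and the 6 GB headroom (KV cache, OS, other apps) are
# rule-of-thumb assumptions, not exact figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given GGUF quantization."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def fits(params_billion: float, quant: str, ram_gb: int, headroom_gb: float = 6.0) -> bool:
    return weights_gb(params_billion, quant) + headroom_gb <= ram_gb

for params, quant, ram in [(8, "Q4_K_M", 16), (14, "Q4_K_M", 24),
                           (70, "Q4_K_M", 36), (70, "Q4_K_M", 64), (70, "Q8_0", 128)]:
    print(f"{params}B {quant}: ~{weights_gb(params, quant):.0f} GB weights"
          f" -> fits in {ram} GB? {fits(params, quant, ram)}")
```

Under these assumptions a 70B Q4_K_M needs roughly 42 GB of weights, which lines up with the 36 GB vs 64 GB distinction in the Pro-tier findings below.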
Pro tier (M1 Pro → M4 Pro)
- M4 Pro delivers the largest generation-over-generation jump (~55–60%)
- M3 Pro and M2 Pro are close; M1 Pro is noticeably slower
- GPU core count matters within a generation: 20-core M4 Pro vs 16-core M4 Pro shows 7–10% gain
- M4 Pro 64 GB now enables 70B inference — not possible on M3 Pro (36 GB max)
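Whether a 70B quant actually runs on a 64 GB machine also depends on how much unified memory macOS lets the GPU wire down. By default this is roughly two-thirds to three-quarters of total RAM; the current limit can be read (and, with sudo, raised) through the iogpu.wired_limit_mb sysctl on Apple Silicon. A hedged sketch, with the default fraction and the 42 GB model size taken as assumptions:

```python
# Check how much unified memory the GPU may wire down on this Mac (Apple Silicon).
# iogpu.wired_limit_mb reports 0 when the default limit is in effect; the default
# is roughly 2/3-3/4 of total RAM (an approximation, not a documented constant).
# Older macOS versions expose debug.iogpu.wired_limit instead.
import subprocess

def sysctl_int(name: str) -> int:
    out = subprocess.run(["sysctl", "-n", name], capture_output=True, text=True)
    return int(out.stdout.strip())

total_gb = sysctl_int("hw.memsize") / 1024**3
limit_mb = sysctl_int("iogpu.wired_limit_mb")
gpu_budget_gb = limit_mb / 1024 if limit_mb else total_gb * 0.75  # assumed default fraction

model_gb = 42  # approx. 70B Q4_K_M weight footprint (rule of thumb from the sketch above)
print(f"RAM: {total_gb:.0f} GB, GPU budget: ~{gpu_budget_gb:.0f} GB, "
      f"70B Q4_K_M fits fully on GPU: {model_gb < gpu_budget_gb}")
```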
Max tier (M1 Max → M4 Max)
- M4 Max is ~40–50% faster than M3 Max on generation speed
- M2 Max and M3 Max are similar — minimal generation gap in Max tier
- M1 Max is still competitive for 7B–14B models but shows its age on larger models and higher-precision quants
- RAM ceiling raised to 128 GB in M4 Max — enables Q8 70B inference
Ultra tier (M1 Ultra → M3 Ultra)
- M3 Ultra (80-core GPU, 256 GB) tops published throughput charts for most models
- M2 Ultra (76-core) is close to M3 Ultra on raw 8B generation speed
- Ultra chips can be configured with up to 192 GB (M2 Ultra) or 512 GB (M3 Ultra) of unified memory, the only Mac-based path to large MoE models and other weights that exceed the M4 Max's 128 GB ceiling
- M4 Ultra not yet in dataset (not yet released at time of writing)
Should you upgrade from M3 to M4?
The M4 generation is measurably faster, but not so dramatically that it justifies the cost of replacing a working M3 machine. The exception is the Pro tier: the M4 Pro raises the RAM ceiling from 36 GB to 64 GB, which enables 70B inference that the M3 Pro simply cannot do. If that model size matters to you, the jump is worth it.
Detailed per-chip comparisons
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv
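To slice the raw data yourself, something along these lines works. The column names here ("chip", "model", "tokens_per_second") are guesses at the schema, so check the header of benchmarks.csv and adjust before running.

```python
# Sketch: best Llama 3.1 8B tok/s per chip from the raw CSV.
# Column names are assumptions about the schema -- inspect benchmarks.csv first.
import csv
from collections import defaultdict

best = defaultdict(float)
with open("benchmarks.csv", newline="") as f:
    for row in csv.DictReader(f):
        if "Llama 3.1 8B" in row["model"]:
            best[row["chip"]] = max(best[row["chip"]], float(row["tokens_per_second"]))

for chip, tps in sorted(best.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{chip:12s} {tps:6.1f} tok/s")
```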