# Research & Reasoning
Research tasks benefit from larger models — better reasoning, more nuanced outputs. This means 32B+ models where quality matters more than raw speed. Here is the hardware data.
## Why these models for this use case
Research and reasoning tasks need model quality that only comes at 32B+ parameter counts. Qwen 3 32B at Q4_K_M runs at ~22 tok/s on an M4 Max with 64 GB — fast enough for interactive research, and the quality gap versus 7B is dramatic. For frontier-scale local inference (Qwen 3 235B A22B at ~8 tok/s on an M4 Max with 128 GB), you need 128 GB+ unified memory. That speed is acceptable for batch tasks even if it feels slow for real-time chat.
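The memory requirements above follow from a simple rule of thumb: file size ≈ parameter count × bits per weight, plus some overhead. This is a rough sketch, not an exact formula — the 1.1 overhead factor and the ~4.8 effective bits/weight for Q4_K_M are assumptions that vary by model and quantizer.

```python
def est_gguf_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough quantized-model file-size estimate in GB:
    parameters * bits/weight / 8, plus ~10% for embeddings, metadata,
    and layers kept at higher precision (an assumed fudge factor)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9 * overhead

# Q4_K_M averages roughly 4.8 bits/weight, so a 32B model needs about:
print(f"{est_gguf_size_gb(32, 4.8):.1f} GB")  # ~21 GB, close to the 20 GB row in the table
```

The same arithmetic explains why 235B at Q4_K_M only fits on 128 GB machines: the weights alone land in the 130–155 GB range before quantization drops them under the memory ceiling.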
## Benchmark results — fastest rows first
Filtered to models commonly used for research & reasoning. Sorted by avg tok/s descending.
| Chip | Model | Quant | Size | Avg tok/s | Engine | Source |
|---|---|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q4 | 16.12 GB | 92.1 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q5 | 18.09 GB | 84.9 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q6 | 21.87 GB | 76.7 | MLX | ref |
| M4 Max (128 GB) | Qwen 3 30B A3B | Q4_K_M | — | 70.2 | LM Studio | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q8 | 29.78 GB | 52.6 | MLX | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 32B | Q4_K_M | 20 GB | 22.0 | factory harness | factory lab |
| M4 Max (128 GB) | Gemma 3 27B | Q8_0 | — | 14.5 | LM Studio | ref |
| M4 Max (32-core GPU) | Qwen 3 32B | iQ2_K_S | 11 GB | 13.2 | — | ref |
| M4 Max (128 GB) | Qwen 3 235B A22B | Q4_K_M | — | 8.1 | LM Studio | ref |
| M4 Max (24-core GPU) | Llama 3.3 70B | Q5_K_M | 50 GB | 7.1 | — | ref |
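The ordering above can be reproduced from the raw data files linked below. A minimal sketch, assuming a CSV with `model` and `avg_tok_s` columns — the real `benchmarks.csv` header names may differ, so check them first:

```python
import csv
from io import StringIO

# Inline sample standing in for benchmarks.csv; the column names are assumptions.
SAMPLE = """model,quant,avg_tok_s
Qwen 3 32B,Q4_K_M,22.0
Qwen 3 30B A3B,Q4,92.1
Llama 3.3 70B,Q5_K_M,7.1
"""

def fastest_first(csv_text: str) -> list[dict]:
    """Parse benchmark rows and sort by average tok/s, descending."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["avg_tok_s"]), reverse=True)
    return rows

for r in fastest_first(SAMPLE):
    print(f"{r['model']:16s} {r['avg_tok_s']} tok/s")
```

Swapping `StringIO(csv_text)` for `open("benchmarks.csv")` applies the same sort to the full dataset.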
## Recommended chips for this use case

## Other use cases

## Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv