Phi-4 on Apple Silicon
Microsoft's 14B reasoning model. Fits comfortably on a 24 GB Mac. Strong performance per watt and per RAM-GB.
Phi-4 is a 14B parameter model from Microsoft Research, released in late 2024. It focuses on reasoning quality, achieved largely through high-quality synthetic training data. Phi-4 consistently outperforms significantly larger models on reasoning benchmarks — making it an excellent choice for developers who want strong reasoning without the RAM overhead of 32B+ models.
Estimated inference speed by chip
Phi-4 is a 14B dense transformer. Speed should be close to Qwen 2.5 14B Instruct, which has measured data in the SiliconBench dataset. The estimates below use those measurements as a proxy.
| Chip | RAM | Qwen 2.5 14B (measured) | Phi-4 (estimated) | Source |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | 64 GB | 30.1 tok/s | ~28–32 tok/s | estimated |
| M4 Pro (20-core GPU, 48 GB) | 48 GB | 18.0 tok/s | ~17–20 tok/s | estimated |
| M4 Pro (16-core GPU, 24 GB) | 24 GB | 15.2 tok/s | ~14–17 tok/s | estimated |
| M3 Max (40-core GPU, 128 GB) | 128 GB | 25.5 tok/s | ~24–28 tok/s | estimated |
| M3 Max (30-core GPU, 36 GB) | 36 GB | 19.8 tok/s | ~18–22 tok/s | estimated |
| M3 Pro (18-core GPU, 36 GB) | 36 GB | 12.1 tok/s | ~11–14 tok/s | estimated |
| M2 Ultra (76-core GPU, 128 GB) | 128 GB | 36.6 tok/s | ~34–40 tok/s | estimated |
See full Qwen 2.5 14B Instruct benchmark data → (proxy data for Phi-4 estimates)
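The proxy ranges above can be approximated with a simple helper. This is a sketch, not SiliconBench methodology: the ±10% band is an assumption reflecting that two dense 14B transformers with similar architectures should decode within a few percent of each other on the same chip.

```python
# Sketch: derive a Phi-4 speed range from a measured Qwen 2.5 14B figure.
# The +/-10% band is an assumption, not a SiliconBench methodology.

def estimate_range(measured_tok_s: float, band: float = 0.10) -> tuple[int, int]:
    """Return a (low, high) tok/s range around a measured proxy speed."""
    low = round(measured_tok_s * (1 - band))
    high = round(measured_tok_s * (1 + band))
    return low, high

# M4 Pro (16-core GPU): Qwen 2.5 14B measured at 15.2 tok/s
print(estimate_range(15.2))  # -> (14, 17), matching the ~14-17 tok/s row
```

The published ranges are hand-rounded, so not every row matches this formula exactly, but it conveys how loose the proxy estimates are.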
Phi-4 vs Qwen 2.5 14B and Llama 3.1 8B
Phi-4's key advantage is reasoning quality per parameter: it significantly outperforms 7B and 8B models on reasoning tasks, at the RAM footprint and inference speed of other 14B models.
| Model | Params | RAM at Q4_K_M | Speed on M4 Pro 24 GB | Reasoning quality |
|---|---|---|---|---|
| Phi-4 | 14B | ~9 GB | ~14–17 tok/s | Excellent for size |
| Qwen 2.5 14B Instruct | 14B | ~9 GB | 15.2 tok/s | Very good |
| Llama 3.1 8B Instruct | 8B | ~4.7 GB | 32 tok/s | Good |
At similar RAM requirements and inference speed to Qwen 2.5 14B, Phi-4 delivers notably stronger reasoning — particularly on math, coding, and multi-step problems. For developers who primarily use local LLMs for analytical tasks, code review, or research, Phi-4 is a strong choice. For conversational chat and summarization, Qwen 2.5 14B is competitive at the same speed, and Llama 3.1 8B is competitive while running roughly 2× faster.
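The ~9 GB figure in the table is consistent with a bits-per-weight back-of-envelope calculation. A sketch, assuming Q4_K_M averages roughly 4.85 bits per weight (a commonly cited average for llama.cpp K-quants, not an exact spec); real usage adds KV cache and runtime overhead on top.

```python
# Back-of-envelope weight memory at a given quantization level.
# 4.85 bits/weight for Q4_K_M is an assumed average; it varies by model.

def model_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (decimal) for a dense model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

phi4_q4 = model_weights_gb(14, 4.85)  # ~8.5 GB of weights
print(f"Phi-4 Q4_K_M weights: ~{phi4_q4:.1f} GB (+ KV cache -> ~9 GB in practice)")
```

The same arithmetic puts an 8B model near the ~4.7 GB the table lists, which is why 8B models leave so much more headroom on a 24 GB machine.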
Running Phi-4 with Ollama
```bash
# Run Phi-4 (14B, auto-selects Q4_K_M, ~9 GB)
ollama run phi4

# Run Phi-4 at Q8_0 for maximum quality (~15 GB)
ollama run phi4:q8_0

# Run Phi-4 Mini (smaller, faster variant)
ollama run phi4-mini
```
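Beyond the CLI, the same local model can be queried over Ollama's HTTP API, which listens on `localhost:11434` by default. A minimal standard-library sketch; the prompt and timeout here are illustrative.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = build_generate_request("phi4", "Summarize big-O notation in two sentences.")
try:
    with urllib.request.urlopen(req, timeout=120) as resp:  # needs `ollama serve` running
        print(json.loads(resp.read())["response"])
except urllib.error.URLError:
    print("Ollama is not running; start it with `ollama serve`.")
```

With `"stream": False` the server returns one JSON object whose `response` field holds the full completion, which keeps scripting simple at the cost of time-to-first-token.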
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export