DeepSeek R1 on Apple Silicon
Hardware requirements, RAM estimates, and expected inference speed for running DeepSeek R1 variants locally on a Mac.
DeepSeek R1 is a reasoning model with explicit chain-of-thought. It comes in distilled variants (1.5B, 7B, 8B, 14B, 32B, 70B) and a full 671B MoE — the distilled versions are the practical choice for local inference. The distillations preserve most of R1's reasoning capability at a fraction of the compute cost.
DeepSeek R1 variants for local inference
All variants below are distillations from the full R1 671B model. They use Llama or Qwen base architectures with R1 reasoning training.
| Variant | Base arch | RAM at Q4_K_M | RAM at Q8_0 | Minimum Mac | Quality |
|---|---|---|---|---|---|
| DeepSeek R1 1.5B | Qwen 2.5 | ~1.2 GB | ~2.2 GB | 8 GB (any Mac) | Basic reasoning |
| DeepSeek R1 7B | Qwen 2.5 7B | ~4.7 GB | ~8.0 GB | 16 GB Mac | Good for simple tasks |
| DeepSeek R1 8B | Llama 3 8B | ~5.2 GB | ~9.0 GB | 16 GB Mac | Good for simple tasks |
| DeepSeek R1 14B | Qwen 2.5 14B | ~9.3 GB | ~15.0 GB | 24 GB Mac (M Pro) | Good reasoning |
| DeepSeek R1 32B | Qwen 2.5 32B | ~20 GB | ~34 GB | 48 GB Mac (M Max) | Strong reasoning |
| DeepSeek R1 70B | Llama 3 70B | ~43 GB | ~78 GB | 64 GB Mac (M Max 64 GB) | Near-full R1 quality |
| DeepSeek R1 671B | DeepSeek V3 (MoE) | ~400 GB | — | Not practical locally | Full R1 capability |
RAM figures are approximate. Add ~3 GB for OS and runtime overhead. Q4_K_M is the recommended starting quantization — it preserves reasoning quality well.
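The RAM figures above follow roughly from parameter count × bytes per weight, plus a small runtime overhead. A back-of-envelope estimator is sketched below; the bits-per-weight values are approximate effective averages for GGUF quantizations, not exact format constants:

```python
# Rough model-RAM estimate: params * effective bits-per-weight / 8, plus overhead.
# Bits-per-weight here are approximations for GGUF quants, not exact averages.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5}

def est_ram_gb(params_billion: float, quant: str, overhead_gb: float = 0.5) -> float:
    """Approximate resident RAM in GB for a quantized model's weights."""
    weight_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes / 1e9 + overhead_gb

print(round(est_ram_gb(7, "Q4_K_M"), 1))   # close to the ~4.7 GB in the table
print(round(est_ram_gb(70, "Q4_K_M"), 1))  # close to the ~43 GB in the table
```

This ignores KV cache, which grows with context length — long-context sessions need a few extra GB on top.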
Expected inference speed on Apple Silicon
Estimated generation tok/s at Q4_K_M based on similar-architecture models (Llama 3.1 8B Instruct, Qwen 2.5 14B Instruct) measured in the SiliconBench dataset. DeepSeek R1 distillations use these same base architectures, so performance should be comparable.
| Chip | R1 7B (Q4_K_M) | R1 14B (Q4_K_M) | R1 32B (Q4_K_M) | R1 70B (Q4_K_M) |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | ~55–60 tok/s | ~30 tok/s | ~18–20 tok/s | ~8–10 tok/s |
| M4 Max (32-core GPU, 36 GB) | ~45–50 tok/s | ~25 tok/s | ~14 tok/s | Won't fit |
| M4 Pro (20-core GPU, 48 GB) | ~30–35 tok/s | ~18 tok/s | ~11 tok/s | Won't fit |
| M3 Max (40-core GPU, 48 GB) | ~40–45 tok/s | ~22 tok/s | ~12–14 tok/s | Won't fit |
| M3 Pro (18-core GPU, 36 GB) | ~20–22 tok/s | ~12 tok/s | ~7 tok/s | Won't fit |
| M2 Ultra (192 GB) | ~55–65 tok/s | ~36 tok/s | ~22 tok/s | ~12 tok/s |
Speed estimates based on Llama 3.1 8B Instruct and Qwen 2.5 14B Instruct benchmark data for equivalent chip/size combos. See full benchmark data →
Which DeepSeek R1 variant should you run?
| Your Mac | Best R1 variant | Why |
|---|---|---|
| 16 GB Mac (M1/M2/M3/M4 base) | R1 7B or 8B | Only variants that fit with headroom |
| 24–36 GB Mac (M Pro, M Max base) | R1 14B | Best quality that comfortably fits |
| 48 GB Mac (M Max, M Pro 48 GB) | R1 32B | Strong reasoning, ~11–14 tok/s |
| 64 GB Mac (M Max 64 GB) | R1 32B at Q8 or R1 70B at Q4 | R1 70B Q4 needs ~43 GB — fits with headroom |
| 128 GB Mac (M Max 128 GB) | R1 70B at Q8_0 | Near-lossless 70B quality, ~5–8 tok/s |
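The recommendations above reduce to a lookup on installed RAM. A hypothetical helper (thresholds and tag names taken from the table; the function name is illustrative):

```python
def pick_r1_variant(ram_gb: int) -> str:
    """Suggest a DeepSeek R1 distillation for a given Mac RAM size,
    following the recommendation table above."""
    if ram_gb >= 128:
        return "deepseek-r1:70b (Q8_0)"
    if ram_gb >= 64:
        return "deepseek-r1:70b (Q4_K_M) or 32b (Q8_0)"
    if ram_gb >= 48:
        return "deepseek-r1:32b (Q4_K_M)"
    if ram_gb >= 24:
        return "deepseek-r1:14b (Q4_K_M)"
    if ram_gb >= 16:
        return "deepseek-r1:7b or 8b (Q4_K_M)"
    return "deepseek-r1:1.5b (Q4_K_M)"

print(pick_r1_variant(48))  # deepseek-r1:32b (Q4_K_M)
```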
About reasoning models and token generation speed
DeepSeek R1 generates explicit reasoning traces ("thinking tokens") before answering. This means token generation counts are higher per response compared to a standard chat model. A 200-token response might involve 500–2000 reasoning tokens first.
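In practice, R1 distillations served through runtimes like Ollama typically emit the reasoning trace between `<think>` and `</think>` tags before the final answer. A minimal sketch for separating the two, assuming that tag convention:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, answer), assuming the
    <think>...</think> convention used by the distilled models."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not m:
        return "", text.strip()  # no trace found; treat everything as answer
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(a)  # The answer is 4.
```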
Practical implication: time to answer is what matters for R1, not just raw tok/s. At 15 tok/s on R1 32B, a response with 1000 reasoning tokens takes ~67 seconds before the visible answer begins. If that's too slow, consider a smaller distillation or, where the runtime supports it, disable the thinking mode.
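The latency arithmetic is worth making explicit: time until the visible answer ≈ reasoning tokens ÷ tok/s, and total time adds the answer tokens on top. A quick sketch:

```python
def time_to_answer_s(reasoning_tokens: int, answer_tokens: int,
                     tok_per_s: float) -> tuple[float, float]:
    """Return (seconds until visible answer starts, total seconds)."""
    wait = reasoning_tokens / tok_per_s
    total = (reasoning_tokens + answer_tokens) / tok_per_s
    return wait, total

wait, total = time_to_answer_s(1000, 200, 15)
print(f"{wait:.0f}s before answer, {total:.0f}s total")  # 67s before answer, 80s total
```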
At Q4_K_M on a 48 GB M Max, R1 32B delivers strong reasoning capability at ~12–14 tok/s generation. That is slow enough that you notice the thinking tokens, but fast enough for productive research use. For developers who need a capable reasoning model and have 48 GB or more, R1 32B is the recommended starting point. R1 70B requires 64 GB minimum and delivers ~8–10 tok/s — worthwhile for quality-critical tasks where latency is acceptable.
Running DeepSeek R1 with Ollama
The easiest way to run R1 locally:
```shell
# Install Ollama (if not installed)
# Download from https://ollama.ai or:
brew install ollama

# Run DeepSeek R1 7B
ollama run deepseek-r1:7b

# Run DeepSeek R1 14B (needs 24 GB+ Mac)
ollama run deepseek-r1:14b

# Run DeepSeek R1 32B (needs 48 GB+ Mac)
ollama run deepseek-r1:32b
```
Ollama pulls Q4_K_M quantization by default. To request a specific quantization, append it to the tag: `ollama run deepseek-r1:32b-q8_0`
Related pages
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export