
DeepSeek R1 on Apple Silicon

Hardware requirements, RAM estimates, and expected inference speed for running DeepSeek R1 variants locally on a Mac.

DeepSeek R1 is a reasoning model with explicit chain-of-thought. It is available as distilled variants (1.5B, 7B, 8B, 14B, 32B, 70B) and as the full 671B MoE model; the distilled versions are the practical choice for local inference. The distillations preserve most of R1's reasoning capability at a fraction of the compute cost.

5 practical local variants (7B–70B)
32B sweet spot for quality vs speed
~20 GB RAM for 32B at Q4_K_M
128 GB needed for 70B at Q8_0

DeepSeek R1 variants for local inference

All variants below are distillations from the full R1 671B model. They use Llama or Qwen base architectures with R1 reasoning training.

| Variant | Base arch | RAM at Q4_K_M | RAM at Q8_0 | Minimum Mac | Quality |
|---|---|---|---|---|---|
| DeepSeek R1 1.5B | Qwen 2.5 | ~1.2 GB | ~2.2 GB | 8 GB (any Mac) | Basic reasoning |
| DeepSeek R1 7B | Qwen 2.5 7B | ~4.7 GB | ~8.0 GB | 16 GB Mac | Good for simple tasks |
| DeepSeek R1 8B | Llama 3.1 8B | ~5.2 GB | ~9.0 GB | 16 GB Mac | Good for simple tasks |
| DeepSeek R1 14B | Qwen 2.5 14B | ~9.3 GB | ~15.0 GB | 24 GB Mac (M Pro) | Good reasoning |
| DeepSeek R1 32B | Qwen 2.5 32B | ~20 GB | ~34 GB | 48 GB Mac (M Max) | Strong reasoning |
| DeepSeek R1 70B | Llama 3.3 70B | ~43 GB | ~78 GB | 64 GB Mac (M Max 64 GB) | Near-full R1 quality |
| DeepSeek R1 671B | DeepSeek V3 (MoE) | ~400 GB | n/a | Not practical locally | Full R1 capability |

RAM figures are approximate. Add ~3 GB for OS and runtime overhead. Q4_K_M is the recommended starting quantization — it preserves reasoning quality well.
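The table's RAM figures follow a simple rule of thumb: weight footprint is parameter count times bytes per weight. A quick sketch; the bytes-per-weight values are assumed averages for GGUF quantizations (Q4_K_M lands near ~5 bits/weight, Q8_0 near ~8.5 bits), not exact figures:

```shell
# Approximate weight footprint in GB: params (billions) x bytes per weight.
# Assumed averages: Q4_K_M ~0.625 B/weight, Q8_0 ~1.06 B/weight.
# KV cache plus the ~3 GB OS/runtime overhead come on top of this.
est_ram() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b }'
}

est_ram 32 0.625   # 32B at Q4_K_M -> 20.0 GB (table: ~20 GB)
est_ram 70 0.625   # 70B at Q4_K_M -> 43.8 GB (table: ~43 GB)
```

The same function with 1.06 bytes/weight reproduces the Q8_0 column to within a few GB.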

Expected inference speed on Apple Silicon

Estimated generation speed (tok/s) at Q4_K_M, based on similar-architecture models (Llama 3.1 8B Instruct, Qwen 2.5 14B Instruct) measured in the SiliconBench dataset. The DeepSeek R1 distillations use these same base architectures, so performance should be comparable.

Note: These are estimates based on base-architecture performance. SiliconBench does not yet have first-party DeepSeek R1 benchmark measurements. We are collecting data and will add verified rows when available.
| Chip | R1 7B (Q4_K_M) | R1 14B (Q4_K_M) | R1 32B (Q4_K_M) | R1 70B (Q4_K_M) |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | ~55–60 tok/s | ~30 tok/s | ~18–20 tok/s | ~8–10 tok/s |
| M4 Max (32-core GPU, 36 GB) | ~45–50 tok/s | ~25 tok/s | ~14 tok/s | Won't fit |
| M4 Pro (20-core GPU, 48 GB) | ~30–35 tok/s | ~18 tok/s | ~11 tok/s | Won't fit |
| M3 Max (40-core GPU, 48 GB) | ~40–45 tok/s | ~22 tok/s | ~12–14 tok/s | Won't fit |
| M3 Pro (18-core GPU, 36 GB) | ~20–22 tok/s | ~12 tok/s | ~7 tok/s | Won't fit |
| M2 Ultra (192 GB) | ~55–65 tok/s | ~36 tok/s | ~22 tok/s | ~12 tok/s |

Speed estimates based on Llama 3.1 8B Instruct and Qwen 2.5 14B Instruct benchmark data for equivalent chip/size combos. See full benchmark data →
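These numbers track memory bandwidth: single-stream decoding reads every weight once per generated token, so bandwidth divided by model size gives a hard ceiling on tok/s. A sketch of that ceiling; the 546 GB/s figure is Apple's published bandwidth for the 40-core M4 Max, and real throughput lands below the theoretical limit:

```shell
# Theoretical decode ceiling for memory-bandwidth-bound generation:
# tok/s <= memory bandwidth (GB/s) / weight bytes read per token (GB)
decode_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.0f tok/s\n", bw / gb }'
}

decode_ceiling 546 20   # M4 Max (546 GB/s), 32B at Q4_K_M (~20 GB) -> 27 tok/s
```

The measured ~18–20 tok/s for that combination is roughly 70% of the ceiling, which is typical for llama.cpp-style runtimes on Apple Silicon.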

Which DeepSeek R1 variant should you run?

| Your Mac | Best R1 variant | Why |
|---|---|---|
| 16 GB Mac (M1/M2/M3/M4 base) | R1 7B or 8B | Only variants that fit with headroom |
| 24–36 GB Mac (M Pro, M Max base) | R1 14B | Best quality that comfortably fits |
| 48 GB Mac (M Max, M Pro 48 GB) | R1 32B | Strong reasoning, ~11–18 tok/s |
| 64 GB Mac (M Max 64 GB) | R1 32B at Q8_0 or R1 70B at Q4_K_M | R1 70B at Q4 needs ~43 GB, which fits with headroom |
| 128 GB Mac (M Max 128 GB) | R1 70B at Q8_0 | Near-lossless 70B quality, ~5–8 tok/s |

About reasoning models and token generation speed

DeepSeek R1 generates explicit reasoning traces ("thinking tokens") before answering, so far more tokens are generated per response than with a standard chat model. A 200-token response might involve 500–2000 reasoning tokens first.

Practical implication: time to answer is what matters for R1, not just raw tok/s. At 15 tok/s on R1 32B, a response with 1000 reasoning tokens takes ~67 seconds before the visible answer begins. If that's too slow, consider a smaller distillation, or a standard (non-reasoning) chat model for tasks that don't need chain-of-thought.
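The arithmetic above generalizes: seconds of "thinking" before the visible answer is simply reasoning tokens divided by generation speed. A one-liner for sanity-checking a configuration:

```shell
# Seconds of reasoning-token generation before the visible answer starts.
think_delay() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.0f s\n", tokens / tps }'
}

think_delay 1000 15   # R1 32B at 15 tok/s -> 67 s before the answer
think_delay 1000 55   # same prompt on R1 7B at ~55 tok/s -> 18 s
```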

DeepSeek R1 32B is the best local reasoning model for most M Max configurations.

At Q4_K_M on a 48 GB M Max, R1 32B delivers strong reasoning capability at roughly 12–20 tok/s generation, depending on chip generation. This is slow enough that you notice the thinking tokens, but fast enough for productive research use. For developers who need a capable reasoning model and have 48 GB or more, R1 32B is the recommended starting point. R1 70B requires 64 GB minimum and delivers ~8–10 tok/s, which is worthwhile for quality-critical tasks where latency is acceptable.

Running DeepSeek R1 with Ollama

The easiest way to run R1 locally:

# Install Ollama (if not installed)
# Download from https://ollama.ai or:
brew install ollama

# Run DeepSeek R1 7B
ollama run deepseek-r1:7b

# Run DeepSeek R1 14B (needs 24 GB+ Mac)
ollama run deepseek-r1:14b

# Run DeepSeek R1 32B (needs 48 GB+ Mac)
ollama run deepseek-r1:32b

Ollama pulls a Q4_K_M build by default. Other quantizations are published as separate model tags; check the deepseek-r1 page in the Ollama model library for the exact tag names (e.g. a q8_0 build of the 32B distill).
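Ollama also exposes a local HTTP API (default port 11434), which is useful for scripting R1 rather than using the interactive REPL. A sketch against a locally running server; it assumes you have already pulled deepseek-r1:7b:

```shell
# Non-streaming generation request to a local Ollama server.
# The response JSON includes the reply plus timing stats:
# eval_count / (eval_duration / 1e9) gives your measured tok/s.
req='{
  "model": "deepseek-r1:7b",
  "prompt": "How many primes are there below 30?",
  "stream": false
}'
curl -s http://localhost:11434/api/generate -d "$req" \
  || echo "Ollama server not running"
```

With R1 models the reply contains the reasoning trace as well as the final answer, so expect large responses relative to the prompt.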

Related pages

benchmarks.json — full dataset  ·  chips.json — chip summaries  ·  benchmarks.csv — CSV export

See all chips →