
Qwen 3 on Apple Silicon

Alibaba's third-generation open-weight family. The MoE variants (30B A3B, 235B A22B) deliver exceptional throughput — 30B A3B runs faster than Llama 3.1 8B on M4 Max.

Qwen 3 (released April 2025) is Alibaba's most capable open-weight series. It introduces Mixture-of-Experts (MoE) models alongside dense variants. The breakthrough for local use is the 30B A3B model: although it has 30B total parameters, it activates only ~3B per forward pass, making it roughly as fast as a 3B dense model at inference while delivering the quality of a much larger model. SiliconBench has measured data for most Qwen 3 variants on M4 Max hardware.

92 tok/s Qwen 3 30B A3B at Q4 on M4 Max (MLX)
148 tok/s Qwen 3 4B at Q4 on M4 Max (MLX)
22 tok/s Qwen 3 32B at Q4_K_M on M4 Max (factory)
8 tok/s Qwen 3 235B A22B at Q4_K_M on M4 Max 128 GB

Qwen 3 variants and what makes them different

| Model | Type | Active params | RAM at Q4 | Best for |
| --- | --- | --- | --- | --- |
| Qwen 3 0.6B | Dense | 0.6B | ~0.4 GB | Edge / embedded use, very fast |
| Qwen 3 4B | Dense | 4B | ~2.5 GB | Fast assistant, fits in 8 GB Mac |
| Qwen 3 8B | Dense | 8B | ~4.8 GB | Balanced, like Llama 3.1 8B class |
| Qwen 3 30B A3B | MoE | 3B | ~16 GB | Best quality-per-speed tradeoff |
| Qwen 3 32B | Dense | 32B | ~20 GB | High quality, needs 24 GB+ Mac |
| Qwen 3 235B A22B | MoE | 22B | ~130 GB | Maximum quality, M3/M4 Ultra only |
Why MoE matters for local inference: a MoE model like Qwen 3 30B A3B keeps all 30B weights resident in memory but routes each token through only ~3B of them. The memory footprint therefore scales with the total parameter count (so you still need ~16 GB of RAM at Q4), while the compute cost per token scales with the active parameters (3B). The result is 30B-class quality at roughly 3B-class inference speed.
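The sizing arithmetic above can be sketched directly. This is a back-of-envelope estimate, assuming Q4 averages roughly 4.3 bits per weight once quantization scales are included — an illustrative figure chosen because it matches the measured 16.12 GB for 30B A3B; small models such as the 4B deviate upward because embeddings and some layers are kept at higher precision:

```python
# Back-of-envelope sizing for a MoE model like Qwen 3 30B A3B.
# Assumption: Q4 averages ~4.3 bits/weight including scales (illustrative).

def q4_ram_gb(total_params_b: float, bits_per_weight: float = 4.3) -> float:
    """RAM needed to hold the weights: scales with TOTAL parameters."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Memory: all 30B weights must be resident -> ~16 GB at Q4.
print(f"Qwen 3 30B A3B weights: ~{q4_ram_gb(30):.1f} GB")

# Compute: only ~3B parameters participate per generated token, so
# per-token decode FLOPs are roughly 1/10th of a dense 30B model.
dense_flops = 2 * 30e9  # ~2 FLOPs per parameter per token (rule of thumb)
moe_flops = 2 * 3e9
print(f"Per-token compute vs dense 30B: {moe_flops / dense_flops:.1f}x")
```

The two lines of the tradeoff in one place: RAM follows total parameters, speed follows active parameters.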

Measured benchmark data — M4 Max (64 GB)

All rows measured on M4 Max (40-core GPU, 64 GB). MLX runtime via factory harness and community reference runs.

| Model | Quantization | RAM | Avg tok/s | Prompt tok/s | Source |
| --- | --- | --- | --- | --- | --- |
| Qwen 3 4B | Q4 | 2.54 GB | 148.1 | 2,977 | measured |
| Qwen 3 4B | Q4_G32 | 2.78 GB | 149.1 | 2,838 | measured |
| Qwen 3 4B | Q5 | 3.26 GB | 143.2 | 2,736 | measured |
| Qwen 3 4B | Q8 | 5.06 GB | 111.6 | 1,781 | measured |
| Qwen 3 30B A3B | Q4 | 16.12 GB | 92.1 | 823 | measured |
| Qwen 3 30B A3B | Q5 | 18.09 GB | 84.9 | 820 | measured |
| Qwen 3 30B A3B | Q6 | 21.87 GB | 76.7 | 818 | measured |
| Qwen 3 30B A3B | Q8 | 29.78 GB | 52.6 | 773 | measured |
| Qwen 3 32B | Q4_K_M | ~20 GB | 22.0 | — | factory lab |

Data source: benchmarks.json. MLX measurements via community reference runs (awni/mlx). Factory lab: M4 Max Mac Studio (40-core GPU, 64 GB) using factory harness.

Measured benchmark data — M4 Max (128 GB)

LM Studio reference runs on M4 Max (128 GB). These use a different runtime (LM Studio's llama.cpp backend), which is generally slower than MLX on Apple Silicon.

| Model | Quantization | Avg tok/s | Source |
| --- | --- | --- | --- |
| Qwen 3 0.6B | Q8_0 | 184.5 | measured |
| Qwen 3 8B | Q4_K_M | 63.2 | measured |
| Qwen 3 30B A3B | Q4_K_M | 70.2 | measured |
| Qwen 3 235B A22B | Q4_K_M | 8.1 | measured |

Note: Qwen 3 30B A3B at 70 tok/s on M4 Max 128 GB (LM Studio) vs 92 tok/s on M4 Max 64 GB (MLX). The gap reflects the runtime advantage of MLX over llama.cpp on Apple Silicon — MLX uses Metal natively while llama.cpp uses Metal as a secondary backend.

Qwen 3 30B A3B vs other models — the MoE advantage

Qwen 3 30B A3B compared to dense alternatives at similar RAM requirements.

| Model | Type | RAM at Q4 | Speed on M4 Max | Quality tier |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B Instruct | Dense | ~4.7 GB | 55.1 tok/s | Good |
| Qwen 2.5 14B Instruct | Dense | ~9 GB | 30.1 tok/s | Very good |
| Qwen 3 30B A3B | MoE | ~16 GB | 92.1 tok/s | Excellent (30B-class) |
| Qwen 3 32B | Dense | ~20 GB | 22.0 tok/s | Excellent |
Qwen 3 30B A3B is the most important model in the lineup for most Apple Silicon users.

At 92 tok/s on M4 Max, Qwen 3 30B A3B is faster than Llama 3.1 8B (55 tok/s) while delivering 30B-class reasoning quality. The only catch: it needs ~16 GB RAM, so a 16 GB Mac won't run it (you need 24 GB+). If you have a 24 GB or larger Mac with MLX support, Qwen 3 30B A3B is arguably the best local LLM available today. The 32B dense model is slower (22 tok/s) but preferred when quality is paramount and speed is acceptable.
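To make these throughput numbers concrete, here is a quick conversion to wall-clock time for a typical 500-token response, using the measured decode speeds from the tables above (decode only; prompt processing and model load time excluded):

```python
# Wall-clock time to generate a 500-token response at measured decode speeds.
response_tokens = 500
for model, tps in [("Qwen 3 30B A3B (MLX, Q4)", 92.1),
                   ("Llama 3.1 8B (MLX, Q4)", 55.1),
                   ("Qwen 3 32B (Q4_K_M)", 22.0)]:
    print(f"{model}: {response_tokens / tps:.1f} s")
```

The spread is the practical story: ~5 seconds for the MoE model versus ~23 seconds for the dense 32B, for a response of the same length.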

Running Qwen 3 with Ollama

```shell
# Qwen 3 4B — fast, fits in an 8 GB Mac
ollama run qwen3:4b

# Qwen 3 8B — balanced choice
ollama run qwen3:8b

# Qwen 3 30B A3B — MoE, fast for its quality, needs 24 GB+
ollama run qwen3:30b-a3b

# Qwen 3 32B — dense, highest quality at this RAM level
ollama run qwen3:32b

# Qwen 3 235B A22B — MoE flagship, needs a 128 GB+ Mac
ollama run qwen3:235b-a22b

# For MLX (faster on Apple Silicon):
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit --prompt "Hello"
```

The MLX runtime delivers significantly higher throughput than Ollama (which uses llama.cpp) on Apple Silicon. If you use Qwen 3 30B A3B regularly and care about speed, MLX is worth the setup.
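Beyond the CLI shown above, mlx-lm also exposes a Python API for programmatic use. A minimal sketch, assuming the same mlx-community 4-bit checkpoint, `pip install mlx-lm`, and enough free RAM (~16 GB) for the weights; the prompt text and `max_tokens` value are illustrative:

```python
# Minimal sketch: driving Qwen 3 30B A3B from Python via mlx-lm.
# First run downloads the ~16 GB 4-bit checkpoint from Hugging Face.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

# Qwen 3 is a chat model, so apply its chat template to the prompt.
messages = [{"role": "user",
             "content": "Explain mixture-of-experts in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

This is handy for scripting benchmarks or batch jobs where the CLI's one-shot invocation is awkward.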

Related model and hardware pages

benchmarks.json — full dataset  ·  chips.json — chip summaries  ·  benchmarks.csv — CSV export
