Qwen 3 on Apple Silicon
Alibaba's third-generation open-weight family. The MoE variants (30B A3B, 235B A22B) deliver exceptional throughput — 30B A3B runs faster than Llama 3.1 8B on M4 Max.
Qwen 3 (released April 2025) is Alibaba's most capable open-weight series, spanning dense models and Mixture-of-Experts (MoE) variants. The standout for local use is the 30B A3B model: it has 30B total parameters but activates only about 3B per token, so it decodes roughly as fast as a small dense model while delivering quality much closer to its full 30B class. SiliconBench has measured data for most Qwen 3 variants on M4 Max hardware.
Qwen 3 variants and what makes them different
| Model | Type | Active params | RAM at Q4 | Best for |
|---|---|---|---|---|
| Qwen 3 0.6B | Dense | 0.6B | ~0.4 GB | Edge / embedded use, very fast |
| Qwen 3 4B | Dense | 4B | ~2.5 GB | Fast assistant, fits in 8 GB Mac |
| Qwen 3 8B | Dense | 8B | ~4.8 GB | Balanced, like Llama 3.1 8B class |
| Qwen 3 30B A3B | MoE | 3B active | ~16 GB | Best quality-per-speed tradeoff |
| Qwen 3 32B | Dense | 32B | ~20 GB | High quality, needs 24 GB+ Mac |
| Qwen 3 235B A22B | MoE | 22B active | ~130 GB | Maximum quality, M3/M4 Ultra only |
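The Q4 RAM column above follows almost directly from parameter count: at 4-bit quantization each weight takes about half a byte, plus overhead for activations and the KV cache. A rough sketch (the ~10% overhead factor is an assumption for illustration, not a measured constant):

```shell
# Rough RAM estimate for a quantized model:
# total params (billions) x bits per weight / 8, plus ~10% overhead (assumed).
est_ram() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f GB\n", p * bits / 8 * 1.1 }'
}

est_ram 30 4    # Qwen 3 30B at Q4  -> ~16.5 GB
est_ram 235 4   # Qwen 3 235B at Q4 -> ~129 GB
```

Note the estimate uses *total* parameters, not active ones: every expert must sit in memory even though only a few fire per token, which is why 30B A3B needs 30B-class RAM despite 3B-class speed.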
Measured benchmark data — M4 Max (64 GB)
All rows measured on M4 Max (40-core GPU, 64 GB). MLX runtime via factory harness and community reference runs.
| Model | Quantization | RAM | Avg tok/s | Prompt tok/s | Source |
|---|---|---|---|---|---|
| Qwen 3 4B | Q4 | 2.54 GB | 148.1 tok/s | 2,977 tok/s | measured |
| Qwen 3 4B | Q4_G32 | 2.78 GB | 149.1 tok/s | 2,838 tok/s | measured |
| Qwen 3 4B | Q5 | 3.26 GB | 143.2 tok/s | 2,736 tok/s | measured |
| Qwen 3 4B | Q8 | 5.06 GB | 111.6 tok/s | 1,781 tok/s | measured |
| Qwen 3 30B A3B | Q4 | 16.12 GB | 92.1 tok/s | 823 tok/s | measured |
| Qwen 3 30B A3B | Q5 | 18.09 GB | 84.9 tok/s | 820 tok/s | measured |
| Qwen 3 30B A3B | Q6 | 21.87 GB | 76.7 tok/s | 818 tok/s | measured |
| Qwen 3 30B A3B | Q8 | 29.78 GB | 52.6 tok/s | 773 tok/s | measured |
| Qwen 3 32B | Q4_K_M | ~20 GB | 22.0 tok/s | — | factory lab |
Data source: benchmarks.json. MLX measurements via community reference runs (awni/mlx). Factory lab: M4 Max Mac Studio (40-core GPU, 64 GB) using factory harness.
Measured benchmark data — M4 Max (128 GB)
LM Studio reference runs on M4 Max (128 GB). These use a different runtime: LM Studio's llama.cpp backend, which is generally slower than MLX on Apple Silicon.
| Model | Quantization | Avg tok/s | Source |
|---|---|---|---|
| Qwen 3 0.6B | Q8_0 | 184.5 tok/s | measured |
| Qwen 3 8B | Q4_K_M | 63.2 tok/s | measured |
| Qwen 3 30B A3B | Q4_K_M | 70.2 tok/s | measured |
| Qwen 3 235B A22B | Q4_K_M | 8.1 tok/s | measured |
Note: Qwen 3 30B A3B at 70 tok/s on M4 Max 128 GB (LM Studio) vs 92 tok/s on M4 Max 64 GB (MLX). The gap reflects the runtime difference rather than the hardware: MLX is built natively on Metal for Apple Silicon, while llama.cpp treats Metal as one backend among several.
Qwen 3 30B A3B vs other models — the MoE advantage
Qwen 3 30B A3B compared to dense alternatives at similar RAM requirements.
| Model | Type | RAM at Q4 | Speed on M4 Max (MLX) | Quality tier |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | Dense | ~4.7 GB | 55.1 tok/s | Good |
| Qwen 2.5 14B Instruct | Dense | ~9 GB | 30.1 tok/s | Very good |
| Qwen 3 30B A3B | MoE | ~16 GB | 92.1 tok/s | Excellent (30B-class) |
| Qwen 3 32B | Dense | ~20 GB | 22.0 tok/s | Excellent |
At 92 tok/s on M4 Max, Qwen 3 30B A3B generates faster than Llama 3.1 8B (55 tok/s) while delivering 30B-class reasoning quality. The one catch is memory: at ~16 GB for the Q4 weights alone, it won't fit on a 16 GB Mac once macOS and the KV cache take their share, so plan on 24 GB or more. If you have a 24 GB+ Mac with MLX support, Qwen 3 30B A3B is arguably the best local LLM available today. The 32B dense model is slower (22 tok/s) but preferred when quality matters more than speed.
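The speed ordering in the table is largely a memory-bandwidth story: token generation is bandwidth-bound, and each decoded token must read every active weight once. A back-of-the-envelope ceiling, assuming ~546 GB/s unified-memory bandwidth for M4 Max and ~0.5 bytes per weight at Q4 (both figures are assumptions for illustration):

```shell
# Decode-speed ceiling ~= memory bandwidth / bytes of weights read per token.
# MoE reads only the *active* experts per token, which is why 30B A3B
# outruns 32B dense despite similar total size.
awk 'BEGIN {
  bw = 546    # GB/s unified-memory bandwidth, M4 Max (assumed)
  q4 = 0.5    # bytes per weight at 4-bit
  printf "30B A3B (3B active): <= %.0f tok/s\n", bw / (3  * q4)
  printf "32B dense:           <= %.0f tok/s\n", bw / (32 * q4)
}'
```

Measured numbers (92 and 22 tok/s) sit well below these ceilings, as expected, since attention, KV-cache reads, and kernel overhead all cost extra; but the ordering holds: fewer bytes touched per token means more tokens per second.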
Running Qwen 3 with Ollama
```shell
# Qwen 3 4B — fast, fits in 8 GB Mac
ollama run qwen3:4b

# Qwen 3 8B — balanced choice
ollama run qwen3:8b

# Qwen 3 30B A3B — MoE, fast for its quality, needs 24 GB+
ollama run qwen3:30b-a3b

# Qwen 3 32B — dense, highest quality at this RAM level
ollama run qwen3:32b

# Qwen 3 235B A22B — MoE flagship, needs 128 GB+ Mac
ollama run qwen3:235b-a22b
```

For MLX (faster on Apple Silicon):

```shell
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit --prompt "Hello"
```
The MLX runtime delivers significantly higher throughput than Ollama (which uses llama.cpp) on Apple Silicon. If you use Qwen 3 30B A3B regularly and care about speed, MLX is worth the setup.
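The RAM guidance scattered through this page can be collapsed into one rule of thumb. A hypothetical helper (the function name and thresholds are ours, taken from the recommendations above: 30B A3B wants 24 GB+, the 235B flagship wants 128 GB+) that maps a Mac's unified memory to an Ollama tag:

```shell
# Hypothetical helper: pick a Qwen 3 Ollama tag from unified memory (GB).
# Thresholds follow this page's guidance, favoring 30B A3B for speed.
suggest_qwen3() {
  if   [ "$1" -ge 128 ]; then echo "qwen3:235b-a22b"
  elif [ "$1" -ge 24  ]; then echo "qwen3:30b-a3b"
  elif [ "$1" -ge 16  ]; then echo "qwen3:8b"
  else                        echo "qwen3:4b"
  fi
}

suggest_qwen3 64    # -> qwen3:30b-a3b
suggest_qwen3 8     # -> qwen3:4b
```

On a 24 GB machine this deliberately picks the MoE over the 32B dense model; swap that branch if quality matters more to you than speed.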
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export