
Qwen 3 on Apple Silicon

Alibaba's third-generation open-weight family. The MoE variants (30B A3B, 235B A22B) deliver exceptional throughput — 30B A3B runs faster than Llama 3.1 8B on M4 Max.

Qwen 3 (released April 2025) is Alibaba's most capable open-weight series. It introduces Mixture-of-Experts (MoE) models alongside dense variants. The breakthrough for local use is the 30B A3B model: although it has 30B total parameters, it activates only ~3B per forward pass, making it roughly as fast as a 3B dense model at inference while delivering the quality of a much larger model. SiliconBench has measured data for most Qwen 3 variants on M4 Max hardware.

92 tok/s Qwen 3 30B A3B at Q4 on M4 Max (MLX)
148 tok/s Qwen 3 4B at Q4 on M4 Max (MLX)
22 tok/s Qwen 3 32B at Q4_K_M on M4 Max (factory)
8 tok/s Qwen 3 235B A22B at Q4_K_M on M4 Max 128 GB

Qwen 3 variants and what makes them different

| Model | Type | Active params | RAM at Q4 | Best for |
| --- | --- | --- | --- | --- |
| Qwen 3 0.6B | Dense | 0.6B | ~0.4 GB | Edge / embedded use, very fast |
| Qwen 3 4B | Dense | 4B | ~2.5 GB | Fast assistant, fits in 8 GB Mac |
| Qwen 3 8B | Dense | 8B | ~4.8 GB | Balanced, like Llama 3.1 8B class |
| Qwen 3 30B A3B | MoE | 3B | ~16 GB | Best quality-per-speed tradeoff |
| Qwen 3 32B | Dense | 32B | ~20 GB | High quality, needs 24 GB+ Mac |
| Qwen 3 235B A22B | MoE | 22B | ~130 GB | Maximum quality, M3/M4 Ultra only |
Why MoE matters for local inference: a MoE model like Qwen 3 30B A3B keeps all 30B weights resident in memory but routes each token through only ~3B of them. The memory footprint therefore scales with the total parameter count (so you still need ~16 GB of RAM at Q4), while the compute cost per token scales with the active parameters (3B). The result is 30B-class quality at roughly 3B-class inference speed.
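The sizing arithmetic above can be sketched directly. This is a back-of-envelope estimate, assuming Q4 averages roughly 4.3 bits per weight once quantization scales are included — an illustrative figure chosen because it matches the measured 16.12 GB for 30B A3B; small models such as the 4B deviate upward because embeddings and some layers are kept at higher precision:

```python
# Back-of-envelope sizing for a MoE model like Qwen 3 30B A3B.
# Assumption: Q4 averages ~4.3 bits/weight including scales (illustrative).

def q4_ram_gb(total_params_b: float, bits_per_weight: float = 4.3) -> float:
    """RAM needed to hold the weights: scales with TOTAL parameters."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Memory: all 30B weights must be resident -> ~16 GB at Q4.
print(f"Qwen 3 30B A3B weights: ~{q4_ram_gb(30):.1f} GB")

# Compute: only ~3B parameters participate per generated token, so
# per-token decode FLOPs are roughly 1/10th of a dense 30B model.
dense_flops = 2 * 30e9  # ~2 FLOPs per parameter per token (rule of thumb)
moe_flops = 2 * 3e9
print(f"Per-token compute vs dense 30B: {moe_flops / dense_flops:.1f}x")
```

The two lines of the tradeoff in one place: RAM follows total parameters, speed follows active parameters.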

Measured benchmark data — M4 Max (64 GB)

All rows measured on M4 Max (40-core GPU, 64 GB). MLX runtime via factory harness and community reference runs.

| Model | Quantization | RAM | Avg tok/s | Prompt tok/s | Source |
| --- | --- | --- | --- | --- | --- |
| Qwen 3 4B | Q4 | 2.54 GB | 148.1 | 2,977 | measured |
| Qwen 3 4B | Q4_G32 | 2.78 GB | 149.1 | 2,838 | measured |
| Qwen 3 4B | Q5 | 3.26 GB | 143.2 | 2,736 | measured |
| Qwen 3 4B | Q8 | 5.06 GB | 111.6 | 1,781 | measured |
| Qwen 3 30B A3B | Q4 | 16.12 GB | 92.1 | 823 | measured |
| Qwen 3 30B A3B | Q5 | 18.09 GB | 84.9 | 820 | measured |
| Qwen 3 30B A3B | Q6 | 21.87 GB | 76.7 | 818 | measured |
| Qwen 3 30B A3B | Q8 | 29.78 GB | 52.6 | 773 | measured |
| Qwen 3 32B | Q4_K_M | ~20 GB | 22.0 | — | factory lab |

Data source: benchmarks.json. MLX measurements via community reference runs (awni/mlx). Factory lab: M4 Max Mac Studio (40-core GPU, 64 GB) using factory harness.

Measured benchmark data — M4 Max (128 GB)

LM Studio reference runs on M4 Max (128 GB). These use a different runtime (LM Studio's llama.cpp backend), which is generally slower than MLX on Apple Silicon.

| Model | Quantization | Avg tok/s | Source |
| --- | --- | --- | --- |
| Qwen 3 0.6B | Q8_0 | 184.5 | measured |
| Qwen 3 8B | Q4_K_M | 63.2 | measured |
| Qwen 3 30B A3B | Q4_K_M | 70.2 | measured |
| Qwen 3 235B A22B | Q4_K_M | 8.1 | measured |

Note: Qwen 3 30B A3B at 70 tok/s on M4 Max 128 GB (LM Studio) vs 92 tok/s on M4 Max 64 GB (MLX). The gap reflects the runtime advantage of MLX over llama.cpp on Apple Silicon — MLX uses Metal natively while llama.cpp uses Metal as a secondary backend.

Qwen 3 30B A3B vs other models — the MoE advantage

Qwen 3 30B A3B compared to dense alternatives at similar RAM requirements.

| Model | Type | RAM at Q4 | Speed on M4 Max | Quality tier |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B Instruct | Dense | ~4.7 GB | 55.1 tok/s | Good |
| Qwen 2.5 14B Instruct | Dense | ~9 GB | 30.1 tok/s | Very good |
| Qwen 3 30B A3B | MoE | ~16 GB | 92.1 tok/s | Excellent (30B-class) |
| Qwen 3 32B | Dense | ~20 GB | 22.0 tok/s | Excellent |
Qwen 3 30B A3B is the most important model in the lineup for most Apple Silicon users.

At 92 tok/s on M4 Max, Qwen 3 30B A3B is faster than Llama 3.1 8B (55 tok/s) while delivering 30B-class reasoning quality. The only catch: it needs ~16 GB RAM, so a 16 GB Mac won't run it (you need 24 GB+). If you have a 24 GB or larger Mac with MLX support, Qwen 3 30B A3B is arguably the best local LLM available today. The 32B dense model is slower (22 tok/s) but preferred when quality is paramount and speed is acceptable.
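To make these throughput numbers concrete, here is a quick conversion to wall-clock time for a typical 500-token response, using the measured decode speeds from the tables above (decode only; prompt processing and model load time excluded):

```python
# Wall-clock time to generate a 500-token response at measured decode speeds.
response_tokens = 500
for model, tps in [("Qwen 3 30B A3B (MLX, Q4)", 92.1),
                   ("Llama 3.1 8B (MLX, Q4)", 55.1),
                   ("Qwen 3 32B (Q4_K_M)", 22.0)]:
    print(f"{model}: {response_tokens / tps:.1f} s")
```

The spread is the practical story: ~5 seconds for the MoE model versus ~23 seconds for the dense 32B, for a response of the same length.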

Running Qwen 3 with Ollama

```shell
# Qwen 3 4B — fast, fits in an 8 GB Mac
ollama run qwen3:4b

# Qwen 3 8B — balanced choice
ollama run qwen3:8b

# Qwen 3 30B A3B — MoE, fast for its quality, needs 24 GB+
ollama run qwen3:30b-a3b

# Qwen 3 32B — dense, highest quality at this RAM level
ollama run qwen3:32b

# Qwen 3 235B A22B — MoE flagship, needs a 128 GB+ Mac
ollama run qwen3:235b-a22b

# For MLX (faster on Apple Silicon):
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit --prompt "Hello"
```

The MLX runtime delivers significantly higher throughput than Ollama (which uses llama.cpp) on Apple Silicon. If you use Qwen 3 30B A3B regularly and care about speed, MLX is worth the setup.
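Beyond the CLI shown above, mlx-lm also exposes a Python API for programmatic use. A minimal sketch, assuming the same mlx-community 4-bit checkpoint, `pip install mlx-lm`, and enough free RAM (~16 GB) for the weights; the prompt text and `max_tokens` value are illustrative:

```python
# Minimal sketch: driving Qwen 3 30B A3B from Python via mlx-lm.
# First run downloads the ~16 GB 4-bit checkpoint from Hugging Face.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

# Qwen 3 is a chat model, so apply its chat template to the prompt.
messages = [{"role": "user",
             "content": "Explain mixture-of-experts in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

This is handy for scripting benchmarks or batch jobs where the CLI's one-shot invocation is awkward.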

Related model and hardware pages

benchmarks.json — full dataset  ·  chips.json — chip summaries  ·  benchmarks.csv — CSV export
