Gemma 3 on Apple Silicon
Google's open-weight model family with multimodal support and 128K context. The 4B variant runs well on any Mac; the 27B needs 24 GB or more.
Gemma 3 is Google DeepMind's third generation of open-weight models, released in early 2025. Unlike prior Gemma generations, Gemma 3 supports vision (images as input), has a 128K context window, and covers a wider parameter range — 1B, 4B, 12B, and 27B dense models. On Apple Silicon, Gemma 3 benefits from the same unified memory architecture that makes all dense LLMs fast on M-series chips. The 4B model is particularly compelling: it runs at 100+ tok/s on M4-era hardware, making it an excellent fast local assistant.
Gemma 3 variants at a glance
Gemma 3 4B
- ~2.5 GB RAM at Q4_K_M
- Fits on any Mac with 8 GB+ RAM
- 100+ tok/s on M4-era chips
- Best choice for speed-first use
- Multimodal (vision) supported
Gemma 3 27B
- ~16 GB RAM at Q4_K_M
- ~27 GB RAM at Q8_0
- Needs 24 GB+ Mac (M Pro or better)
- ~14.5 tok/s on M4 Max 128 GB (measured)
- Strongest reasoning in the family
Gemma 3 1B and 12B are intermediate options — 1B is extremely fast (200+ tok/s) but limited in quality; 12B sits between 4B speed and 27B reasoning quality, requiring ~7 GB at Q4_K_M.
Gemma 3 27B — measured and estimated speed by chip
One measured data point from the SiliconBench dataset; the remaining figures are estimates extrapolated from architecture and memory-bandwidth scaling.
| Chip | RAM | Gemma 3 27B Q4_K_M (est.) | Gemma 3 27B Q8_0 | Source |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | 64 GB | ~22–26 tok/s | ~13–16 tok/s | estimated |
| M4 Max (128 GB) | 128 GB | ~24–28 tok/s | 14.5 tok/s | measured |
| M4 Pro (20-core GPU, 48 GB) | 48 GB | ~13–16 tok/s | ~8–11 tok/s | estimated |
| M4 Pro (16-core GPU, 24 GB) | 24 GB | ~11–14 tok/s | needs 36 GB+ | estimated |
| M3 Max (40-core GPU, 128 GB) | 128 GB | ~20–24 tok/s | ~12–15 tok/s | estimated |
| M3 Max (30-core GPU, 36 GB) | 36 GB | ~16–20 tok/s | ~9–12 tok/s | estimated |
| M3 Pro (18-core GPU, 36 GB) | 36 GB | ~9–12 tok/s | ~6–8 tok/s | estimated |
| M2 Ultra (76-core GPU, 128 GB) | 128 GB | ~28–34 tok/s | ~17–21 tok/s | estimated |
The measured Q8_0 result of 14.5 tok/s on the M4 Max (128 GB) is expected to trail Q4_K_M: Q8_0 weights occupy ~2× the RAM, and decode speed is memory-bandwidth-bound, so Q4_K_M should reach roughly 22–26 tok/s on the same chip.
Gemma 3 4B — speed by chip
One measured data point from SiliconBench: M4 Max (128 GB) / Q4_0 via LM Studio = 100.54 tok/s. Estimates for other chips extrapolated from memory bandwidth.
| Chip | RAM | Gemma 3 4B Q4_K_M (est.) | Source |
|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | 64 GB | ~90–110 tok/s | estimated |
| M4 Max (128 GB) | 128 GB | ~95–105 tok/s | ~100 tok/s measured (Q4_0) |
| M4 Pro (20-core GPU, 48 GB) | 48 GB | ~60–75 tok/s | estimated |
| M4 Pro (16-core GPU, 24 GB) | 24 GB | ~50–65 tok/s | estimated |
| M3 Max (40-core GPU, 128 GB) | 128 GB | ~75–90 tok/s | estimated |
| M3 Pro (18-core GPU, 36 GB) | 36 GB | ~40–55 tok/s | estimated |
| M2 Max (38-core GPU, 96 GB) | 96 GB | ~65–80 tok/s | estimated |
| M1 Max (32-core GPU, 64 GB) | 64 GB | ~50–65 tok/s | estimated |
Gemma 3 4B is fast enough that even older M1/M2 Max hardware delivers a snappy conversational experience. At 50+ tok/s, responses feel nearly instantaneous.
Gemma 3 vs other 4B and 27B models on Apple Silicon
| Model | Params | RAM at Q4_K_M | Speed on M4 Pro 24 GB | Context | Multimodal |
|---|---|---|---|---|---|
| Gemma 3 4B | 4B | ~2.5 GB | ~50–65 tok/s | 128K | Yes (vision) |
| Llama 3.2 3B | 3B | ~2 GB | ~65–80 tok/s | 128K | No |
| Qwen 2.5 7B | 7B | ~4.4 GB | ~35–45 tok/s | 128K | No |
| Gemma 3 27B | 27B | ~16 GB | ~11–14 tok/s | 128K | Yes (vision) |
| Qwen 2.5 14B | 14B | ~9 GB | 15 tok/s | 128K | No |
| Llama 3.1 8B | 8B | ~4.7 GB | 32 tok/s | 128K | No |
If you need vision (analyzing images, screenshots, documents) on a base MacBook Air or MacBook Pro, Gemma 3 4B is the only capable multimodal model in this comparison that fits in 8 GB. The 27B variant is a strong general-purpose model for 24 GB+ Macs; compared with Qwen 2.5 14B it trades roughly half the speed and ~2× the RAM for stronger reasoning. For pure text tasks without vision, Qwen 2.5 14B or Llama 3.1 8B will feel faster at similar or better quality.
Running Gemma 3 with Ollama
# Gemma 3 4B — fast, multimodal, fits in 8 GB
ollama run gemma3:4b
# Gemma 3 12B — balanced choice for 16 GB Macs
ollama run gemma3:12b
# Gemma 3 27B — strongest reasoning, needs 24 GB+
ollama run gemma3:27b
# Gemma 3 27B at Q8_0 for maximum quality (~27 GB)
ollama run gemma3:27b-q8_0
Gemma 3 models accept image inputs in Ollama. Over the REST API, pass base64-encoded images in the `images` field of the request; in the interactive CLI, include the image's file path directly in your prompt and Ollama attaches it automatically.
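A minimal sketch of the API path, assuming Ollama is running locally on its default port (11434) and that `screenshot.png` is a stand-in for your own image file:

```shell
# Send an image to Gemma 3 4B through the Ollama REST API.
# base64 -i is the macOS invocation; GNU base64 takes the file as a plain argument.
IMG=$(base64 -i screenshot.png)
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma3:4b\",
  \"prompt\": \"Describe this screenshot.\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
```

With `"stream": false` the response arrives as a single JSON object whose `response` field holds the model's description.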
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export