
Gemma 3 on Apple Silicon

Google's open-weight model family with multimodal support and 128K context. The 4B variant runs well on any Mac; the 27B needs 24 GB or more.

Gemma 3 is Google DeepMind's third generation of open-weight models, released in early 2025. Unlike prior Gemma generations, Gemma 3 supports vision (images as input), has a 128K context window, and covers a wider parameter range — 1B, 4B, 12B, and 27B dense models. On Apple Silicon, Gemma 3 benefits from the same unified memory architecture that makes all dense LLMs fast on M-series chips. The 4B model is particularly compelling: it runs at 100+ tok/s on M4-era hardware, making it an excellent fast local assistant.

~2.5 GB RAM: Gemma 3 4B at Q4_K_M
~16 GB RAM: Gemma 3 27B at Q4_K_M
128K context window (all variants)
4 variants: 1B, 4B, 12B, 27B
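The RAM figures above follow directly from parameter count times bits per weight. A minimal sketch of that arithmetic (the bits-per-weight values are approximate effective sizes for llama.cpp quant formats, and real usage adds KV cache and runtime overhead on top):

```python
# Rough weight-memory estimate for quantized GGUF models.
# Effective bits/weight are approximate: Q4_K_M ~4.85, Q8_0 ~8.5.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB for a given quant format."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

print(f"Gemma 3 4B  @ Q4_K_M: ~{weight_gb(4, 'Q4_K_M'):.1f} GB")
print(f"Gemma 3 27B @ Q4_K_M: ~{weight_gb(27, 'Q4_K_M'):.1f} GB")
print(f"Gemma 3 27B @ Q8_0:   ~{weight_gb(27, 'Q8_0'):.1f} GB")
```

The same formula applies to the 1B and 12B variants; budget an extra 1–2 GB beyond the weight size for context and overhead.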

Gemma 3 variants at a glance

Gemma 3 4B

  • ~2.5 GB RAM at Q4_K_M
  • Fits on any Mac with 8 GB+ RAM
  • 100+ tok/s on M4-era chips
  • Best choice for speed-first use
  • Multimodal (vision) supported

Gemma 3 27B

  • ~16 GB RAM at Q4_K_M
  • ~27 GB RAM at Q8_0
  • Needs 24 GB+ Mac (M Pro or better)
  • ~14.5 tok/s at Q8_0 on M4 Max 128 GB (measured)
  • Strongest reasoning in the family

Gemma 3 1B and 12B are intermediate options — 1B is extremely fast (200+ tok/s) but limited in quality; 12B sits between 4B speed and 27B reasoning quality, requiring ~7 GB at Q4_K_M.

Gemma 3 27B — measured and estimated speed by chip

One measured data point from the SiliconBench dataset: M4 Max (128 GB) at Q8_0 via LM Studio = 14.49 tok/s. All other rows are estimates extrapolated from memory-bandwidth ratios between chips.
Chip | RAM | Gemma 3 27B Q4_K_M (est.) | Gemma 3 27B Q8_0 | Source
M4 Max (40-core GPU, 64 GB) | 64 GB | ~22–26 tok/s | ~13–16 tok/s | estimated
M4 Max (128 GB) | 128 GB | ~24–28 tok/s | 14.5 tok/s | measured
M4 Pro (20-core GPU, 48 GB) | 48 GB | ~13–16 tok/s | ~8–11 tok/s | estimated
M4 Pro (16-core GPU, 24 GB) | 24 GB | ~11–14 tok/s | needs 36 GB+ | estimated
M3 Max (40-core GPU, 128 GB) | 128 GB | ~20–24 tok/s | ~12–15 tok/s | estimated
M3 Max (30-core GPU, 36 GB) | 36 GB | ~16–20 tok/s | ~9–12 tok/s | estimated
M3 Pro (18-core GPU, 36 GB) | 36 GB | ~9–12 tok/s | ~6–8 tok/s | estimated
M2 Ultra (76-core GPU, 128 GB) | 128 GB | ~28–34 tok/s | ~17–21 tok/s | estimated
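The bandwidth-ratio extrapolation behind the estimated rows looks roughly like this. The bandwidth figures below are Apple's published specs for each configuration (an assumption of this sketch, not part of the SiliconBench data), and real results also depend on GPU core count and software stack, which is why the table's ranges are wider than a pure ratio:

```python
# Scale one measured result to other chips by memory-bandwidth ratio.
# Bandwidths (GB/s) are Apple's published figures per configuration.
BANDWIDTH = {
    "M4 Max (40-core GPU)":  546,
    "M4 Pro (20-core GPU)":  273,
    "M3 Max (40-core GPU)":  400,
    "M2 Ultra (76-core GPU)": 800,
}

MEASURED_CHIP = "M4 Max (40-core GPU)"
MEASURED_TOKS = 14.49  # Gemma 3 27B Q8_0 via LM Studio (measured)

for chip, bw in BANDWIDTH.items():
    est = MEASURED_TOKS * bw / BANDWIDTH[MEASURED_CHIP]
    print(f"{chip}: ~{est:.1f} tok/s (Q8_0, estimated)")
```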

The measured Q8_0 result of 14.49 tok/s on M4 Max 128 GB is slower than Q4_K_M would be: Q8_0 weights take roughly 1.75× as many bytes, and decode speed is bounded by how fast those bytes stream through memory. Q4_K_M should reach roughly 22–26 tok/s on the same chip.
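Since each generated token streams essentially all weights through memory, tok/s on a fixed chip scales roughly inversely with model size. A sketch of that cross-quant estimate (the weight sizes are approximations):

```python
# On the same chip, decode tok/s scales roughly with 1 / model size,
# because every generated token reads all weights from memory.
measured_q8_toks = 14.49   # M4 Max 128 GB, Gemma 3 27B Q8_0 (measured)
size_q8_gb = 28.7          # approx Q8_0 weight size for 27B
size_q4_gb = 16.4          # approx Q4_K_M weight size for 27B

est_q4_toks = measured_q8_toks * size_q8_gb / size_q4_gb
print(f"Estimated Q4_K_M speed on the same chip: ~{est_q4_toks:.1f} tok/s")
```

The result lands inside the ~22–26 tok/s range quoted in the table above.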

Gemma 3 4B — speed by chip

One measured data point from SiliconBench: M4 Max (128 GB) / Q4_0 via LM Studio = 100.54 tok/s. Estimates for other chips extrapolated from memory bandwidth.

Chip | RAM | Gemma 3 4B Q4_K_M (est.) | Source
M4 Max (40-core GPU, 64 GB) | 64 GB | ~90–110 tok/s | estimated
M4 Max (128 GB) | 128 GB | ~95–105 tok/s | measured (100.54 tok/s at Q4_0)
M4 Pro (20-core GPU, 48 GB) | 48 GB | ~60–75 tok/s | estimated
M4 Pro (16-core GPU, 24 GB) | 24 GB | ~50–65 tok/s | estimated
M3 Max (40-core GPU, 128 GB) | 128 GB | ~75–90 tok/s | estimated
M3 Pro (18-core GPU, 36 GB) | 36 GB | ~40–55 tok/s | estimated
M2 Max (38-core GPU, 96 GB) | 96 GB | ~65–80 tok/s | estimated
M1 Max (32-core GPU, 64 GB) | 64 GB | ~50–65 tok/s | estimated

Gemma 3 4B is fast enough that even older M1/M2 Max hardware delivers a snappy conversational experience. At 50+ tok/s, responses feel nearly instantaneous.

Gemma 3 vs other 4B and 27B models on Apple Silicon

Model | Params | RAM at Q4_K_M | Speed on M4 Pro 24 GB | Context | Multimodal
Gemma 3 4B | 4B | ~2.5 GB | ~50–65 tok/s | 128K | Yes (vision)
Llama 3.2 3B | 3B | ~2 GB | ~65–80 tok/s | 128K | No
Qwen 2.5 7B | 7B | ~4.4 GB | ~35–45 tok/s | 128K | No
Gemma 3 27B | 27B | ~16 GB | ~11–14 tok/s | 128K | Yes (vision)
Qwen 2.5 14B | 14B | ~9 GB | ~15 tok/s | 128K | No
Llama 3.1 8B | 8B | ~4.7 GB | ~32 tok/s | 128K | No
Gemma 3 4B is the best multimodal model for Macs with 8–16 GB RAM.

If you need vision (analyzing images, screenshots, documents) on a base MacBook Air or MacBook Pro, Gemma 3 4B is the only capable multimodal model that fits in 8 GB. The 27B variant is a strong general-purpose model for 24 GB+ Macs; compared with Qwen 2.5 14B it trades speed (and roughly 2× the RAM) for stronger reasoning. For pure text tasks without vision, Qwen 2.5 14B or Llama 3.1 8B will feel faster at similar or better quality.
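As a toy illustration of the guidance above, here is a small picker encoding the comparison table. The two-thirds-of-RAM budget heuristic and the model shortlist are assumptions for illustration, not a recommendation engine:

```python
# Toy model picker based on the comparison table above (illustrative).
MODELS = [
    # (name, approx RAM in GB at Q4_K_M, multimodal)
    ("Gemma 3 4B", 2.5, True),
    ("Llama 3.1 8B", 4.7, False),
    ("Qwen 2.5 14B", 9.0, False),
    ("Gemma 3 27B", 16.0, True),
]

def pick(ram_gb: float, need_vision: bool) -> str:
    """Largest listed model fitting in roughly 2/3 of system RAM."""
    budget = ram_gb * 2 / 3  # leave headroom for OS, KV cache, apps
    fits = [m for m in MODELS if m[1] <= budget and (m[2] or not need_vision)]
    return max(fits, key=lambda m: m[1])[0] if fits else "none"

print(pick(8, need_vision=True))     # base MacBook Air with vision
print(pick(24, need_vision=False))   # M4 Pro 24 GB, text only
```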

Running Gemma 3 with Ollama

# Gemma 3 4B — fast, multimodal, fits in 8 GB
ollama run gemma3:4b

# Gemma 3 12B — balanced choice for 16 GB Macs
ollama run gemma3:12b

# Gemma 3 27B — strongest reasoning, needs 24 GB+
ollama run gemma3:27b

# Gemma 3 27B at Q8_0 for maximum quality (~27 GB)
ollama run gemma3:27b-q8_0

Gemma 3 models accept image inputs in Ollama. In the API, pass base64-encoded images in the images field of the request; in the interactive CLI, include the image's file path directly in your prompt.
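For the API route, the request body for Ollama's POST /api/generate endpoint looks like this; the images field takes a list of raw base64 strings. The helper name, prompt, and the placeholder image bytes below are illustrative:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for Ollama's POST /api/generate with one image."""
    payload = {
        "model": model,
        "prompt": prompt,
        # Ollama expects plain base64 strings (no data: URI prefix).
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

body = build_vision_request("gemma3:4b", "Describe this screenshot.", b"\x89PNG...")
# POST the body to http://localhost:11434/api/generate
```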

Related model and hardware pages

benchmarks.json — full dataset  ·  chips.json — chip summaries  ·  benchmarks.csv — CSV export
