SiliconBench — Apple Silicon LLM Benchmarks

Tokens per second by model, chip, and quantization. Real runs on owned hardware, not estimates.

Use this table to compare model throughput, runtime differences, and RAM requirements across chips before you buy hardware or upgrade a model.


Use the data programmatically

SiliconBench is meant to be queried, not just skimmed. The row-level dataset remains the canonical source, but stable summary endpoints now expose the pre-aggregated planning views that readers and tools actually need.

Full benchmark rows

Every published measurement with chip, model, quantization, runtime, context, and source metadata.

Hardware planners

Pre-aggregated summaries for chip tiers and model targets, useful for buying guides and fit checks.

Source coverage

Track which upstream source pages and provider families contribute prompt-ingest, latency, and cross-chip signal to the corpus.

Research queue

Machine-readable editorial priorities derived from the current dataset instead of a hidden internal TODO.
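As a sketch of what "queried, not skimmed" can look like, here is a minimal fit check over benchmark rows. The field names mirror the table columns, but the endpoint shape, keys, and every value below are illustrative placeholders, not published SiliconBench rows.

```python
import json

# Illustrative rows only -- field names mirror the table columns
# (chip, model, quant, runtime, ram_gb, context, avg_tok_s), but the
# real endpoint paths, keys, and values may differ.
SAMPLE_ROWS = json.loads("""
[
  {"chip": "M4 Max 64 GB", "model": "example-32b", "quant": "Q4_K_M",
   "runtime": "MLX", "ram_gb": 19.0, "context": 8192, "avg_tok_s": 17.0},
  {"chip": "M4 Pro 24 GB", "model": "example-14b", "quant": "Q4_K_M",
   "runtime": "llama.cpp", "ram_gb": 9.0, "context": 4096, "avg_tok_s": 21.0}
]
""")

def rows_fitting_ram(rows, ram_budget_gb, overhead_gb=4.0):
    """Fit check: keep rows whose weight footprint plus OS/runtime
    overhead fits a unified-memory budget."""
    return [r for r in rows if r["ram_gb"] + overhead_gb <= ram_budget_gb]

print([r["model"] for r in rows_fitting_ram(SAMPLE_ROWS, 16)])  # -> ['example-14b']
```

The same filter works against any export of the full row-level dataset once it is loaded as a list of dicts.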

Filter the benchmark table

Showing all published rows.

Current view exports the full published dataset.
Chip Model Quant Runtime RAM required Context Avg tok/s Source

Coverage snapshot

SiliconBench is still early, but it now exposes the benchmark table as machine-readable data so you can sort, diff, and join it with your own hardware planning spreadsheets. Current coverage centers on practical local-model buying questions:

  • a verified 32B run on M4 Max 64 GB, plus a compressed 32B reference run
  • a 70B high-RAM reference run and a mid-tier 14B reference run on M4 Pro
  • a sourced LM Studio batch on M4 Max 128 GB spanning 0.6B through 235B A22B-class models
  • a same-machine MLX quantization ladder on M4 Max 64 GB for Qwen 3 4B and Qwen 3 30B A3B

This pass also adds a cross-generation `llama.cpp` baseline batch so you can compare chip families on one shared workload instead of mixing unrelated blog screenshots.

The biggest visible gaps right now are M3 Max vs M4 Max, Ollama vs MLX runtime deltas, and same-model quantization ladders for larger planning targets like Qwen 3 32B and Llama 3.3 70B. Those are the next rows that matter because they answer actual purchase and deployment decisions instead of adding random one-off numbers.

Source-linked reference batches

Reference rows now carry public source links when the original benchmark includes enough measurement detail to normalize. The current sourced batch comes from estsauver’s Apple M4 Max 128 GB LM Studio benchmark gist, published on April 30, 2025, with tokens per second, time-to-first-token, context size, and runtime notes. That source now anchors a wider model ladder on SiliconBench, from a tiny 0.6B Qwen row through 30B and 32B entries up to a 235B A22B MoE reference point on the same M4 Max 128 GB hardware class.

SiliconBench also now includes a comparable cross-chip baseline from the official `llama.cpp` Apple Silicon benchmark discussion, started on November 22, 2023. Those rows use the same `Llama 2 7B` `Q4_0` workload and 512-token prompt-processing setup across M1 Pro, M2 Ultra, M3 Pro, M3 Max, and M4, which makes them useful for generation-to-generation hardware sanity checks.

A second source-linked batch now comes from Awni Hannun’s MLX LM benchmarks gist, last updated on October 14, 2025. Those rows add a full M4 Max 64 GB quantization ladder for Qwen 3 4B and Qwen 3 30B A3B with prompt throughput, generation throughput, and measured memory footprint from the same `mlx_lm.benchmark --model ... -p 2048 -g 128` harness.

A third source-linked batch family now comes from the public LocalScore accelerator corpus. Those official rows add shared prompt-eval, generation, and TTFT measurements for Llama 3.2 1B Instruct, Llama 3.1 8B Instruct, and Qwen 2.5 14B Instruct across dozens of Apple Silicon variants spanning M1, M2, M3, M4, and early M5 tiers. LocalScore still does not publish runtime or context for that suite, so SiliconBench keeps those fields marked as not published instead of guessing.

When a source does not publish one of SiliconBench’s core fields, the table marks that field as not published instead of inferring a number. When a source does publish extra detail, SiliconBench keeps it in the downloadable dataset too, including prompt-processing speed for the shared `llama.cpp` baseline batch.
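One way to keep that "not published" distinction explicit in code is to model missing fields as `None` rather than filling them with guesses. This is an illustration of the idea, not SiliconBench's actual schema, and the values are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch: unpublished fields stay explicit instead of being inferred.
# Field names mirror the table columns; values are placeholders.
@dataclass
class BenchmarkRow:
    chip: str
    model: str
    quant: str
    avg_tok_s: float
    runtime: Optional[str] = None         # None means "not published"
    context: Optional[int] = None
    prompt_tok_s: Optional[float] = None  # extra source detail, kept when present

def shown(value, unit=""):
    """Render a field while preserving the 'not published' distinction."""
    return f"{value}{unit}" if value is not None else "not published"

row = BenchmarkRow(chip="M4 Pro", model="example-8b", quant="Q4_K_M",
                   avg_tok_s=40.0)
print(shown(row.runtime))              # -> not published
print(shown(row.avg_tok_s, " tok/s"))  # -> 40.0 tok/s
```

Consumers of the dataset can then decide per field whether `None` means "skip this row" or "display the gap".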

Browse by benchmark source

Trust starts with provenance. These cards regroup the current corpus by source family so you can see which providers publish prompt-ingest or TTFT data, which ones are broad but shallow, and where the cleanest apples-to-apples slices actually come from.

Research queue

Coverage gaps should be visible. This queue turns the current benchmark corpus into a public measurement backlog: owned-hardware rows that still need first-party verification, large-model targets that only exist on one or two chip tiers, and published slices that still need RAM or latency instrumentation.

Comparable Apple Silicon chip baseline

One recurring problem in local-LLM hardware research is that most published numbers compare different models, different quantizations, and different runtimes at the same time. That makes the hardware signal noisy. SiliconBench now has two cleaner comparison families: a cross-generation `llama.cpp` baseline and a shared LocalScore M4-family suite that pins model and quantization across multiple memory tiers.

In this comparable batch, M2 Ultra reaches 94.27 tok/s, M3 Max 40-core reaches 65.85 tok/s, M1 Pro reaches 36.41 tok/s, M3 Pro reaches 30.74 tok/s, and the base M4 reaches 24.11 tok/s. Use this slice to answer "which Apple Silicon tier buys me more headroom?" before you drill into model-specific rows like Qwen 3 32B or Llama 3.3 70B.

The LocalScore M4-family suite now adds a second apples-to-apples lens: on shared `Q4_K - Medium` slices, Llama 3.2 1B Instruct spans 111.94 to 182.56 tok/s, Llama 3.1 8B Instruct spans 32.53 to 52.41 tok/s, and Qwen 2.5 14B Instruct spans 16.13 to 28.75 tok/s across current M4 Pro and M4 Max configurations. That makes it much easier to answer whether a cheaper M4 Pro still delivers enough headroom for a specific local model target.

Run an apples-to-apples chip comparison

Pick any benchmark slice that SiliconBench can hold constant across multiple chips, then rank the resulting rows by generation speed, prompt-ingest speed, or time to first token. This is the fastest way to compare M4 Max versus M4 Pro without mixing unrelated models or quantizations.
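The ranking step itself is simple once the workload is held constant. The sketch below uses the `Llama 2 7B` `Q4_0` generation speeds quoted in the baseline section; the dict keys are illustrative names, not SiliconBench's actual export format.

```python
# Rank one fixed workload across chips. Speeds are the Llama 2 7B
# Q4_0 generation numbers from the llama.cpp baseline slice.
rows = [
    {"chip": "M2 Ultra",       "gen_tok_s": 94.27},
    {"chip": "M3 Max 40-core", "gen_tok_s": 65.85},
    {"chip": "M1 Pro",         "gen_tok_s": 36.41},
    {"chip": "M3 Pro",         "gen_tok_s": 30.74},
    {"chip": "M4",             "gen_tok_s": 24.11},
]

def rank(rows, key="gen_tok_s"):
    """Sort a constant-workload slice so chip is the only variable."""
    return sorted(rows, key=lambda r: r[key], reverse=True)

for r in rank(rows):
    print(f'{r["chip"]}: {r["gen_tok_s"]} tok/s')
```

Swapping `key` for a prompt-ingest or TTFT column gives the other two rankings described above.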

SiliconBench is assembling comparable chip slices.

# Chip Prompt tok/s Generation tok/s TTFT Gap Source

Quantization ladders on one M4 Max 64 GB

Same-chip quantization ladders are more useful than isolated peak numbers because they reveal the real trade: how much memory each step saves, and how much generation speed it buys back. SiliconBench now exposes that directly for two modern Qwen 3 families on the same M4 Max 64 GB MLX stack.
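The trade a ladder reveals can be summarized as per-step deltas: memory saved versus speed gained. The values in this sketch are placeholders, not published SiliconBench rows.

```python
# Sketch: express a same-chip quantization ladder as per-step deltas.
# (quant, weight_gb, gen_tok_s) tuples, largest quant first; values
# are illustrative placeholders.
ladder = [
    ("Q8_0",   4.3, 60.0),
    ("Q6_K",   3.3, 68.0),
    ("Q4_K_M", 2.5, 75.0),
]

for (q_hi, gb_hi, tps_hi), (q_lo, gb_lo, tps_lo) in zip(ladder, ladder[1:]):
    print(f"{q_hi} -> {q_lo}: saves {gb_hi - gb_lo:.1f} GB, "
          f"gains {tps_lo - tps_hi:.1f} tok/s")
```

Reading a ladder this way makes diminishing returns obvious: when a step stops buying meaningful speed or memory back, stay at the higher-quality quant.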

Plan by RAM budget

Most Apple Silicon purchase decisions are memory-budget decisions first. These planning cards are derived from the published rows with explicit RAM footprints, so they answer a simple question fast: which model sizes already fit at each unified-memory tier, and what speed range has actually been observed there?

Pick hardware by model target

This is the buyer view: for each model family, SiliconBench surfaces the smallest published Apple Silicon fit and the fastest published row so far. That compresses the table into the decision readers actually need to make: what is the minimum Mac tier that can run this model, and how much speed headroom is visible if you spend up?

Plan by context window

Context length changes the buying decision as much as parameter count. A short 2k prompt can feel fast on a much smaller Mac than a 128k coding or research session, so these cards regroup the published rows by context band instead of by model name alone.

Coverage matrix by chip tier

This matrix compresses the corpus into the editorial view SiliconBench needs every day: which Apple Silicon chip tiers already have published rows in each model-size band, and how much verification, prompt-ingest, and TTFT signal each tier carries today.

Chip tier 0.5B-4B 7B-8B 14B 27B-32B 70B 200B+ Signal

Coverage gaps by chip

These cards turn the dataset into an editorial queue. Each one shows which model-size bands a chip tier already covers, which benchmark bands are still missing, and whether prompt-ingest, TTFT, and verified-lab instrumentation exist yet.

Prompt ingest and warm-start latency

Generation tok/s is only half the local-LLM experience. Prompt-processing speed determines how quickly long coding contexts load, and time to first token determines whether the model feels immediate or sluggish. SiliconBench now exposes every published prompt-eval and TTFT slice already present in the dataset.

Published leaderboards

Need a fast answer before buying hardware? These ranked slices surface the fastest published rows, the best tokens-per-second per GB, the deepest published model fits, and the longest context windows currently in the dataset.

Browse by chip tier

The fastest way to use SiliconBench is to start from the Mac you already own, then compare the model sizes and quantization levels that fit its unified memory ceiling.

These cards are derived from the live dataset, so each chip family expands automatically as new Apple Silicon variants and memory tiers land.

Browse by model family

Model-first views are where hardware buying questions become concrete. Each card below is derived directly from the published dataset, so it updates automatically as new rows land.

Browse by quantization

Compare memory footprint, throughput, and long-context coverage by quant level without relying on fuzzy text search.

Browse by runtime

Apple Silicon performance depends on the runtime as much as the chip. These runtime cards group the current dataset into comparable slices so you can see where SiliconBench already has real MLX, LM Studio, llama.cpp, and first-party lab coverage.

Method and scope

Each row is a representative local measurement captured with short, standardized prompts in a single benchmark harness. Runs warm up for 30 seconds before measurement, then average tokens per second across a fixed prompt block for apples-to-apples comparison.
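The warm-up-then-average pattern described above can be sketched in a few lines. `generate` here is a stand-in callable assumed to run one inference and return the number of tokens produced; this shows the measurement shape, not SiliconBench's actual harness.

```python
import time

def measure_tok_s(generate, prompt, warmup_s=30.0, runs=3):
    """Warm up for `warmup_s` seconds, then average tokens per second
    across a fixed number of measured runs.

    `generate(prompt)` is an assumed stand-in that performs one
    inference and returns its token count.
    """
    deadline = time.monotonic() + warmup_s
    while time.monotonic() < deadline:   # warm-up runs, results discarded
        generate(prompt)
    speeds = []
    for _ in range(runs):                # measured runs
        start = time.monotonic()
        tokens = generate(prompt)
        speeds.append(tokens / (time.monotonic() - start))
    return sum(speeds) / len(speeds)     # average tok/s across the block
```

Using `time.monotonic()` rather than wall-clock time keeps the intervals immune to system clock adjustments during long runs.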

RAM required is the model weight footprint at the given quantization, not including OS and runtime overhead. Plan for an additional 2–4 GB of system baseline. For KV cache-heavy workloads (long context), budget extra headroom that grows linearly with context length, on the order of hundreds of KB per token for mid-size models depending on architecture and KV precision. Rows marked verified are from direct factory lab measurements on owned hardware. Reference rows are from reproducible runs with explicit environment notes.

Future additions: memory pressure under load, time-to-first-token, thermal throttle behavior, runtime-specific comparisons, and deeper same-model quantization ladders. To request a specific model–chip combination, email hello@siliconbench.radicchio.page.

Next measurement batches

Factory lab priority hardware: M4 Max Mac Studio 40-core GPU with 64 GB RAM is active now. M5 MacBook 128 GB arrives on March 18, 2026, and an M4 Max Mac Studio 256 GB configuration is expected in May 2026. Those incoming machines matter because first-party Apple Silicon LLM measurements are still scarce, especially above 64 GB unified memory.

Planned benchmark sets: Qwen 3 32B at Q4 and Q8 on M4 Max, Llama 3.3 70B at Q4 with longer context windows, and head-to-head runtime comparisons once the same prompt harness is captured across Ollama and MLX. If you need a broader hardware cost view, pair this table with AI Data Center Index for external inference options.

How much RAM do you need for local LLM inference?

Apple Silicon uses unified memory — GPU and CPU share the same pool. This means every gigabyte of installed RAM is available for model weights, KV cache, and OS overhead simultaneously. The rough rule for model weight footprint by quantization:

  • 7B model at Q4_K_M: ~4.5 GB weights. Fits in 8 GB with room for context.
  • 13B model at Q4_K_M: ~8 GB weights. 16 GB recommended for comfortable inference.
  • 32B model at Q4_K_M: ~18–20 GB weights. 24 GB is tight; 36 GB is comfortable.
  • 70B model at Q4_K_M: ~38–40 GB weights. Requires 48 GB minimum; 64 GB for long context.
  • 70B model at Q8_0: ~70 GB weights. Needs 96 GB or higher.

As a practical guide: 16 GB handles 7B–13B models well. 36 GB handles 32B at full quality. 64 GB handles 32B at Q8 or 70B at Q4 with generous context windows. 128 GB (M2/M3/M4 Ultra) handles multiple large models or 70B at Q8.

KV cache grows with context window depth. At 128K context, a 32B model at Q8_0 KV cache consumes roughly 40 GB on top of weight RAM — a 64 GB machine becomes fully saturated. Budget RAM for both weights and the context you actually use.
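The rules of thumb above fold into a small estimator. The effective bits-per-parameter values below are approximations for llama.cpp-style quants (block overhead included), not exact file sizes, and the 4 GB overhead default is the system baseline discussed earlier.

```python
# Rough footprint estimator matching the rules of thumb above.
# Effective bits per parameter are approximations, not exact sizes.
BITS_PER_PARAM = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(params_b, quant):
    """Weight footprint in GB for `params_b` billion parameters."""
    return params_b * BITS_PER_PARAM[quant] / 8

def fits(params_b, quant, ram_gb, overhead_gb=4.0):
    """True if weights plus OS/runtime baseline fit the unified-memory
    budget. KV cache is NOT included; budget it separately for long
    context as described above."""
    return weight_gb(params_b, quant) + overhead_gb <= ram_gb

print(f"{weight_gb(7, 'Q4_K_M'):.1f} GB")     # close to the ~4.5 GB rule above
print(fits(70, "Q4_K_M", 48), fits(70, "Q8_0", 64))
```

The estimator reproduces the guide's thresholds: a 70B model at Q4_K_M squeezes into 48 GB, while the same model at Q8_0 overflows 64 GB.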

FAQ

Which Mac should I buy for running local LLMs?

If you want to run 7B–13B models for daily coding assistance, a MacBook Pro M4 Pro with 24 GB RAM is the practical starting point. If you want to run 32B models at full quality or 70B models at all, target 64 GB RAM — that means M4 Max, M3 Max, or M2/M3/M4 Ultra territory. The GPU core count matters less than the RAM ceiling for most LLM workloads.

Is M4 worth upgrading from M3 for LLM work?

The M4 generation offers roughly 20–30% higher memory bandwidth than M3, which translates directly to higher tokens per second for memory-bandwidth-bound inference. If you are buying new hardware, M4 Max is the clear choice in its tier. If you already own an M3 Max, the throughput gain alone is unlikely to justify the upgrade cost — the RAM ceiling matters more.

Can I request a specific benchmark?

Yes. Send a request and new model-chip combinations will be prioritized for the next measurement batch.

What makes these rows different from random blog charts?

The rows are maintained as reproducible entries with versioned run metadata and explicit environment notes. The goal is one place to resolve model-vs-chip confusion, not a one-off speed screenshot.

Can I trust these numbers for production planning?

Treat these as directional until you reproduce the workload in your exact stack. Hardware cooling, context windows, and prompt style can change real-world throughput significantly. Verified rows have a higher confidence floor.

What quantization should I use on Apple Silicon?

Q4_K_M is the best general starting point — good quality-to-size ratio, runs fully on-GPU across all Apple Silicon chips. Q8_0 delivers near-full-precision quality but roughly doubles RAM requirements. IQ2 and IQ3 variants reach extreme compression but may show quality degradation on reasoning tasks. For most use cases: start at Q4_K_M, step up to Q6_K if RAM allows.

Know your RAM budget? The LLM Hardware Calculator shows which models fit at each quantization — pair it with this table to go from "does it fit?" to "how fast?"

Other tools for local model builders: EnvLint (validate your .env files), Cronfig (test cron expressions), LinkScrub (strip tracking from URLs), JWTchop (decode JWTs in-browser), and support.