SiliconBench is meant to be queried, not just skimmed. The row-level dataset is still the canonical source,
but stable summary endpoints now expose the faster planning surfaces readers and tools actually need.
Full benchmark rows
Every published measurement with chip, model, quantization, runtime, context, and source metadata.
No rows match this filter yet. That gap is useful signal: it marks a benchmark combination SiliconBench still needs to publish.
Chip
Model
Quant
Runtime
RAM required
Context
Avg tok/s
Source
Coverage snapshot
SiliconBench is still early, but it now exposes the benchmark table as machine-readable data so you can
sort, diff, and join it with your own hardware planning spreadsheets. Current coverage centers on
practical local-model buying questions: a verified 32B run on M4 Max 64 GB, a compressed 32B reference run,
a 70B high-RAM reference run, and a mid-tier 14B reference run on M4 Pro. It also includes a sourced LM Studio batch
on M4 Max 128 GB spanning 0.6B through 235B A22B-class models, plus a same-machine MLX quantization ladder
on M4 Max 64 GB for Qwen 3 4B and Qwen 3 30B A3B. This pass also adds a cross-generation `llama.cpp`
baseline batch so you can compare chip families on one shared workload instead of mixing unrelated blog screenshots.
The biggest visible gaps right now are M3 Max vs M4 Max, Ollama vs MLX runtime deltas,
and same-model quantization ladders for larger planning targets like Qwen 3 32B and Llama 3.3 70B.
Those are the next rows that matter because they answer actual purchase and deployment decisions instead of
adding random one-off numbers.
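As a sketch of what machine-readable rows buy you, here is a minimal filter over a hypothetical CSV export of the table. The column names mirror the published schema above, but the file layout and the example rows are illustrative assumptions, not actual SiliconBench data:

```python
import csv
import io

# Hypothetical export of the SiliconBench table; column names follow the
# published schema, but these rows are illustrative placeholders.
CSV_DATA = """\
chip,model,quant,runtime,ram_required_gb,context,avg_tok_s,source
M4 Max 64GB,Qwen 3 32B,Q4_K_M,MLX,19.5,8192,25.0,reference
M4 Max 64GB,Qwen 3 4B,Q4,MLX,2.4,8192,90.0,reference
M4 Pro 24GB,Qwen 3 14B,Q4_K_M,MLX,9.0,8192,22.0,reference
"""

def rows_fitting_budget(csv_text, ram_budget_gb, headroom_gb=4.0):
    """Return rows whose weight footprint fits the budget minus OS headroom."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fits = [r for r in reader
            if float(r["ram_required_gb"]) <= ram_budget_gb - headroom_gb]
    # Rank the surviving rows fastest-first, like the published leaderboards.
    return sorted(fits, key=lambda r: float(r["avg_tok_s"]), reverse=True)

for row in rows_fitting_budget(CSV_DATA, ram_budget_gb=24):
    print(row["chip"], row["model"], row["avg_tok_s"])
```

The same pattern joins cleanly against a personal hardware spreadsheet keyed on the chip column.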
Source-linked reference batches
Reference rows now carry public source links when the original benchmark includes enough measurement detail to
normalize. The current sourced batch comes from
estsauver’s Apple M4 Max 128 GB LM Studio benchmark gist,
published on April 30, 2025, with tokens per second, time-to-first-token, context size, and runtime notes.
That source now anchors a wider model ladder on SiliconBench, from a tiny 0.6B Qwen row through 30B and 32B entries
up to a 235B A22B MoE reference point on the same M4 Max 128 GB hardware class.
SiliconBench also now includes a comparable cross-chip baseline from the
official `llama.cpp` Apple Silicon benchmark discussion,
started on November 22, 2023. Those rows use the same `Llama 2 7B` `Q4_0` workload and
512-token prompt-processing setup across M1 Pro, M2 Ultra, M3 Pro, M3 Max, and M4, which makes them useful
for generation-to-generation hardware sanity checks.
A second source-linked batch now comes from
Awni Hannun’s MLX LM benchmarks gist,
last updated on October 14, 2025. Those rows add a full M4 Max 64 GB quantization ladder for
Qwen 3 4B and Qwen 3 30B A3B with prompt throughput, generation throughput, and measured
memory footprint from the same `mlx_lm.benchmark --model ... -p 2048 -g 128` harness.
A third source-linked batch family now comes from the public
LocalScore accelerator corpus.
Those official rows add shared prompt-eval, generation, and TTFT measurements for
Llama 3.2 1B Instruct, Llama 3.1 8B Instruct, and
Qwen 2.5 14B Instruct across dozens of Apple Silicon variants spanning M1, M2, M3, M4,
and early M5 tiers. LocalScore still does not publish runtime or context for that suite, so SiliconBench
keeps those fields marked as not published instead of guessing.
When a source does not publish one of SiliconBench’s core fields, the table marks that field as not published
instead of inferring a number. When a source does publish extra detail, SiliconBench keeps it in the downloadable
dataset too, including prompt-processing speed for the shared `llama.cpp` baseline batch.
Browse by benchmark source
Trust starts with provenance. These cards regroup the current corpus by source family so you can see which
providers publish prompt-ingest or TTFT data, which ones are broad but shallow, and where the cleanest
apples-to-apples slices actually come from.
Research queue
Coverage gaps should be visible. This queue turns the current benchmark corpus into a public measurement backlog:
owned-hardware rows that still need first-party verification, large-model targets that only exist on one or two
chip tiers, and published slices that still need RAM or latency instrumentation.
Comparable Apple Silicon chip baseline
One recurring problem in local-LLM hardware research is that most published numbers compare different models,
different quantizations, and different runtimes at the same time. That makes the hardware signal noisy.
SiliconBench now has two cleaner comparison families: a cross-generation `llama.cpp` baseline and a shared
LocalScore M4-family suite that pins model and quantization across multiple memory tiers.
In this comparable batch, M2 Ultra reaches 94.27 tok/s, M3 Max 40-core reaches 65.85 tok/s,
M1 Pro reaches 36.41 tok/s, M3 Pro reaches 30.74 tok/s, and the
base M4 reaches 24.11 tok/s. Use this slice to answer "which Apple Silicon tier buys me more headroom?"
before you drill into model-specific rows like Qwen 3 32B or Llama 3.3 70B.
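Those published baseline numbers convert directly into per-tier headroom ratios. The speeds below are copied from the rows above; `headroom_vs` is an illustrative helper, not part of the dataset:

```python
# Published generation speeds (tok/s) from the shared Llama 2 7B Q4_0
# llama.cpp baseline batch.
BASELINE_TOK_S = {
    "M2 Ultra": 94.27,
    "M3 Max 40-core": 65.85,
    "M1 Pro": 36.41,
    "M3 Pro": 30.74,
    "M4": 24.11,
}

def headroom_vs(chip, reference="M4"):
    """How many times faster a chip tier is than the reference on this workload."""
    return BASELINE_TOK_S[chip] / BASELINE_TOK_S[reference]

for chip in BASELINE_TOK_S:
    print(f"{chip}: {headroom_vs(chip):.2f}x vs base M4")
```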
The LocalScore M4-family suite now adds a second apples-to-apples lens: on shared `Q4_K - Medium` slices,
Llama 3.2 1B Instruct spans 111.94 to 182.56 tok/s,
Llama 3.1 8B Instruct spans 32.53 to 52.41 tok/s, and
Qwen 2.5 14B Instruct spans 16.13 to 28.75 tok/s across current M4 Pro and M4 Max configurations.
That makes it much easier to answer whether a cheaper M4 Pro still delivers enough headroom for a specific local model target.
Run an apples-to-apples chip comparison
Pick any benchmark slice that SiliconBench can hold constant across multiple chips, then rank the resulting rows by
generation speed, prompt-ingest speed, or time to first token. This is the fastest way to compare M4 Max versus M4 Pro
without mixing unrelated models or quantizations.
SiliconBench is assembling comparable chip slices.
#
Chip
Prompt tok/s
Generation tok/s
TTFT
Gap
Source
Quantization ladders on one M4 Max 64 GB
Same-chip quantization ladders are more useful than isolated peak numbers because they reveal the real trade-off:
how much memory each step saves, and how much generation speed it buys back. SiliconBench now exposes that
directly for two modern Qwen 3 families on the same M4 Max 64 GB MLX stack.
Plan by RAM budget
Most Apple Silicon purchase decisions are memory-budget decisions first. These planning cards are derived from
the published rows with explicit RAM footprints, so they answer a simple question fast: which model sizes
already fit at each unified-memory tier, and what speed range has actually been observed there?
Pick hardware by model target
This is the buyer view: for each model family, SiliconBench surfaces the smallest published Apple Silicon fit
and the fastest published row so far. That compresses the table into the decision readers actually need to make:
what is the minimum Mac tier that can run this model, and how much speed headroom is visible if you spend up?
Plan by context window
Context length changes the buying decision as much as parameter count does. A short 2k prompt can feel fast on a much
smaller Mac than a 128k coding or research session requires, so these cards regroup the published rows by context band
instead of by model name alone.
Coverage matrix by chip tier
This matrix compresses the corpus into the editorial view SiliconBench needs every day: which Apple Silicon
chip tiers already have published rows in each model-size band, and how much verification, prompt-ingest,
and TTFT signal each tier carries today.
Chip tier
0.5B-4B
7B-8B
14B
27B-32B
70B
200B+
Signal
Coverage gaps by chip
These cards turn the dataset into an editorial queue. Each one shows which model-size bands a chip tier
already covers, which benchmark bands are still missing, and whether prompt-ingest, TTFT, and verified-lab
instrumentation exist yet.
Prompt ingest and warm-start latency
Generation tok/s is only half the local-LLM experience. Prompt-processing speed determines how quickly long
coding contexts load, and time to first token determines whether the model feels immediate or sluggish.
SiliconBench now exposes every published prompt-eval and TTFT slice already present in the dataset.
Published leaderboards
Need a fast answer before buying hardware? These ranked slices surface the fastest published rows, the best
tokens-per-second per GB, the deepest published model fits, and the longest context windows currently in the dataset.
Browse by chip tier
The fastest way to use SiliconBench is to start from the Mac you already own, then compare the model sizes
and quantization levels that fit its unified memory ceiling.
These cards are derived from the live dataset, so each chip family expands automatically as new Apple Silicon
variants and memory tiers land.
Browse by model family
Model-first views are where hardware buying questions become concrete. Each card below is derived directly
from the published dataset, so it updates automatically as new rows land.
Browse by quantization
Compare memory footprint, throughput, and long-context coverage by quant level without relying on fuzzy text search.
Browse by runtime
Apple Silicon performance depends on the runtime as much as the chip. These runtime cards group the current
dataset into comparable slices so you can see where SiliconBench already has real MLX, LM Studio, llama.cpp,
and first-party lab coverage.
Method and scope
Each row is a representative local measurement captured with short, standardized prompts in a single benchmark harness.
Runs warm up for 30 seconds before measurement, then average tokens per second across a
fixed prompt block for apples-to-apples comparison.
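The warm-up-then-average loop described above can be sketched as follows. `measure_tok_s` and the `generate` callable are illustrative stand-ins under the stated method, not the actual SiliconBench harness:

```python
import time

def measure_tok_s(generate, warmup_s=30.0, runs=3):
    """Warm up for a fixed period, then average tokens per second
    across repeated runs of the same fixed prompt block.
    `generate` is a stand-in callable returning a token count for one run."""
    deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < deadline:
        generate()                      # warm-up runs are discarded
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        speeds.append(tokens / (time.perf_counter() - start))
    return sum(speeds) / len(speeds)    # mean tok/s across measured runs
```

Averaging over several measured runs after a warm-up window is what makes rows captured on different days comparable.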
RAM required is the model weight footprint at the given quantization, not including OS and runtime overhead.
Plan for an additional 2–4 GB of system baseline. For KV-cache-heavy workloads (long context), budget on the order of a few hundred KB per token, depending on layer count, attention layout, and KV precision.
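As a sanity check on KV-cache sizing, the standard per-token formula (keys plus values, per layer, per KV head) can be computed directly. The Qwen-3-32B-like dimensions in the example are assumptions for illustration:

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: keys + values, for every layer, per token.
    bytes_per_elem is 2 for fp16/bf16 KV, 1 for 8-bit KV quantization."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 1024**3

# Assumed 32B-class dimensions: 64 layers, 8 GQA KV heads, head_dim 128,
# fp16 KV cache, 128K context.
print(f"{kv_cache_gb(128 * 1024, 64, 8, 128):.1f} GB")
```

With these assumed dimensions the estimate lands in the tens of gigabytes at 128K context, which is why long-context sessions dominate the RAM budget on 64 GB machines.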
Rows marked verified are from direct factory lab measurements on owned hardware.
Reference rows are from reproducible runs with explicit environment notes.
Future additions: memory pressure under load, time-to-first-token, thermal throttle behavior,
runtime-specific comparisons, and deeper same-model quantization ladders.
To request a specific model–chip combination, email
hello@siliconbench.radicchio.page.
Next measurement batches
Factory lab priority hardware: M4 Max Mac Studio 40-core GPU with 64 GB RAM is active now.
M5 MacBook 128 GB arrives on March 18, 2026, and an M4 Max Mac Studio 256 GB configuration
is expected in May 2026. Those incoming machines matter because first-party Apple Silicon LLM
measurements are still scarce, especially above 64 GB unified memory.
Planned benchmark sets: Qwen 3 32B at Q4 and Q8 on M4 Max, Llama 3.3 70B at Q4 with longer context windows,
and head-to-head runtime comparisons once the same prompt harness is captured across Ollama and MLX.
If you need a broader hardware cost view, pair this table with
AI Data Center Index for external inference options.
How much RAM do you need for local LLM inference?
Apple Silicon uses unified memory: the GPU and CPU share the same pool, so model weights, KV cache,
and OS overhead all draw from the same installed RAM rather than a separate VRAM ceiling.
The rough rule for model weight footprint by quantization:
7B model at Q4_K_M: ~4.5 GB weights. Fits in 8 GB with room for context.
13B model at Q4_K_M: ~8 GB weights. 16 GB recommended for comfortable inference.
32B model at Q4_K_M: ~18–20 GB weights. 24 GB is tight; 36 GB is comfortable.
70B model at Q4_K_M: ~38–40 GB weights. Requires 48 GB minimum; 64 GB for long context.
70B model at Q8_0: ~70 GB weights. Needs 96 GB or higher.
As a practical guide: 16 GB handles 7B–13B models well.
36 GB handles 32B at full quality.
64 GB handles 32B at Q8 or 70B at Q4 with generous context windows.
128 GB (M2/M3/M4 Ultra) handles multiple large models or 70B at Q8.
KV cache grows with context window depth. At 128K context, the KV cache for a 32B model at Q8_0
consumes roughly 40 GB on top of weight RAM, fully saturating a 64 GB machine.
Budget RAM for both weights and the context you actually use.
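The weight-footprint rule above can be sketched as a one-line estimate. The bits-per-weight constants are approximations (published footprints vary by a few percent per architecture), and the helper name is ours:

```python
# Approximate effective bits per weight for common llama.cpp quant levels;
# real footprints differ slightly by model architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.6, "Q8_0": 8.5}

def weight_gb(params_billion, quant):
    """Model weight footprint in GB, before KV cache and OS overhead."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for size in (7, 13, 32, 70):
    print(f"{size}B at Q4_K_M: ~{weight_gb(size, 'Q4_K_M'):.1f} GB")
```

Add the 2–4 GB system baseline and your KV cache on top of this number before picking a memory tier.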
FAQ
Which Mac should I buy for running local LLMs?
If you want to run 7B–13B models for daily coding assistance, a MacBook Pro M4 Pro with 24 GB RAM
is the practical starting point. If you want to run 32B models at full quality or 70B models at all,
target 64 GB RAM — that means M4 Max, M3 Max, or M2/M3/M4 Ultra territory.
The GPU core count matters less than the RAM ceiling for most LLM workloads.
Is M4 worth upgrading from M3 for LLM work?
The M4 generation offers roughly 20–30% higher memory bandwidth than M3, which translates directly to
higher tokens per second for memory-bandwidth-bound inference. If you are buying new hardware, M4 Max
is the clear choice in its tier. If you already own an M3 Max, the throughput gain alone is unlikely
to justify the upgrade cost — the RAM ceiling matters more.
Can I request a specific benchmark?
Yes. Send a request and new model-chip combinations will be prioritized for the next measurement batch.
What makes these rows different from random blog charts?
The rows are maintained as reproducible entries with versioned run metadata and explicit environment
notes. The goal is one place to resolve model-vs-chip confusion, not a one-off speed screenshot.
Can I trust these numbers for production planning?
Treat these as directional until you reproduce the workload in your exact stack. Hardware cooling, context windows,
and prompt style can change real-world throughput significantly. Verified rows have a higher confidence floor.
What quantization should I use on Apple Silicon?
Q4_K_M is the best general starting point — good quality-to-size ratio, runs fully on-GPU across all
Apple Silicon chips. Q8_0 delivers near-full-precision quality but roughly doubles RAM requirements.
IQ2 and IQ3 variants reach extreme compression but may show quality degradation on reasoning tasks.
For most use cases: start at Q4_K_M, step up to Q6_K if RAM allows.
Know your RAM budget? The LLM Hardware Calculator
shows which models fit at each quantization — pair it with this table to go from "does it fit?" to "how fast?"
Other tools for local model builders:
EnvLint (validate your .env files),
Cronfig (test cron expressions),
LinkScrub (strip tracking from URLs),
JWTchop (decode JWTs in-browser),
and support.