Mistral 7B on Apple Silicon
One of the most downloaded local LLMs. Fast, capable, and comfortable on any Mac with 16 GB or more.
Mistral 7B was released in 2023 and quickly became one of the most popular base models for local inference. It outperforms Llama 2 7B on most benchmarks while using similar compute — and at Q4_K_M, it runs at 30–55+ tok/s depending on your chip. Mistral uses sliding window attention and grouped-query attention (GQA), which make it fast on bandwidth-limited hardware like Apple Silicon.
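To illustrate why GQA helps on bandwidth-limited hardware, the sketch below estimates the fp16 KV-cache size using Mistral 7B's published configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128) and compares it against a hypothetical full multi-head baseline. The numbers are a rough approximation, not a measurement of any particular runtime.

```python
# Rough KV-cache size estimate for Mistral 7B's grouped-query attention (GQA).
# Config values below are from the published Mistral 7B architecture.
n_layers = 32
n_heads = 32         # query heads
n_kv_heads = 8       # KV heads under GQA
head_dim = 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(seq_len: int, kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each of shape [kv_heads, seq_len, head_dim].
    return 2 * n_layers * kv_heads * seq_len * head_dim * bytes_per_value

seq = 4096  # Mistral's sliding-window size
gqa = kv_cache_bytes(seq, n_kv_heads)
mha = kv_cache_bytes(seq, n_heads)  # hypothetical multi-head baseline
print(f"GQA cache: {gqa / 2**30:.2f} GiB vs MHA: {mha / 2**30:.2f} GiB "
      f"({mha // gqa}x smaller)")
# → GQA cache: 0.50 GiB vs MHA: 2.00 GiB (4x smaller)
```

With only 8 KV heads instead of 32, the cache that must be streamed from memory on every decode step is 4x smaller, which matters most on unified-memory machines like Apple Silicon.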
Mistral 7B speed estimates by chip
Mistral 7B uses the same basic transformer architecture as Llama 2 7B and Llama 3 8B, so its speed should track Llama 3.1 8B closely. Where Llama 3.1 8B measurements exist in our dataset, we use them as a proxy, scaled up slightly to account for Mistral's smaller parameter count (7.2B vs 8.0B).
| Chip | RAM | Llama 3.1 8B (measured) | Mistral 7B (estimated) | Source |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | 64 GB | 55.1 tok/s | ~58–62 tok/s | estimated |
| M4 Max (40-core GPU, 48 GB) | 48 GB | 55.1 tok/s | ~58–62 tok/s | estimated |
| M4 Pro (20-core GPU, 24 GB) | 24 GB | 32.5 tok/s | ~34–37 tok/s | estimated |
| M3 Max (40-core GPU, 48 GB) | 48 GB | 37.5 tok/s | ~39–43 tok/s | estimated |
| M3 Max (30-core GPU, 36 GB) | 36 GB | 37.5 tok/s | ~39–43 tok/s | estimated |
| M3 Pro (18-core GPU, 36 GB) | 36 GB | 22.1 tok/s | ~23–26 tok/s | estimated |
| M2 Ultra (60-core GPU, 64 GB) | 64 GB | 59.5 tok/s | ~62–68 tok/s | estimated |
| M1 Pro (14-core GPU) | 16–32 GB | ~25–28 tok/s | ~26–30 tok/s | estimated |
See the full Llama 3.1 8B benchmark data for measured chip-by-chip comparisons at Q4_K_M →
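The estimated column can be reproduced with a simple parameter-count scaling heuristic (a back-of-the-envelope estimate, not a measurement): decode speed on bandwidth-limited hardware is roughly inversely proportional to model size, so each measured Llama 3.1 8B figure is scaled by 8.0/7.2 for Mistral 7B.

```python
# Back-of-the-envelope: scale measured Llama 3.1 8B decode speed by the
# parameter ratio to estimate Mistral 7B speed (both at Q4_K_M).
LLAMA_PARAMS = 8.0e9
MISTRAL_PARAMS = 7.2e9

def estimate_mistral_tok_s(llama_tok_s: float) -> float:
    # Decode is memory-bandwidth bound, so tok/s scales ~inversely with size.
    return llama_tok_s * (LLAMA_PARAMS / MISTRAL_PARAMS)

measured = {"M4 Max": 55.1, "M4 Pro": 32.5, "M3 Max": 37.5, "M2 Ultra": 59.5}
for chip, tok_s in measured.items():
    print(f"{chip}: {tok_s} tok/s measured -> ~{estimate_mistral_tok_s(tok_s):.1f} tok/s estimated")
```

Applied to the M4 Max's measured 55.1 tok/s, this gives ~61 tok/s, which falls inside the ~58–62 tok/s range in the table above.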
Mistral 7B vs Llama 3.1 8B
Both are the workhorses of local inference. Key differences:
| Feature | Mistral 7B | Llama 3.1 8B Instruct |
|---|---|---|
| Parameters | 7.2B | 8.0B |
| Context window | 32K (with sliding window) | 128K |
| RAM at Q4_K_M | ~4.1 GB | ~4.7 GB |
| Speed (relative) | Slightly faster | Baseline |
| Instruction following | Good (Instruct variant) | Excellent |
| Code quality | Good | Better (trained on more code) |
| Ollama model tag | mistral | llama3.1 |
Llama 3.1 8B has a 128K context window vs Mistral 7B's 32K (with sliding window), significantly better instruction following, and comparable or better performance on most benchmarks. Mistral 7B remains a solid choice for simple chat and generation tasks, and benefits from a large ecosystem of fine-tunes. For most new setups, Llama 3.1 8B is the recommended 7B-class starting point.
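The RAM figures in the table can be sanity-checked with a rough bits-per-weight calculation. Q4_K_M averages roughly 4.85 bits per weight; this is an approximation, since real GGUF files store some tensors (such as embeddings) at higher precision, so actual file sizes run a bit higher.

```python
# Rough weight-memory estimate for a Q4_K_M quantized model.
Q4_K_M_BITS_PER_WEIGHT = 4.85  # approximate average; GGUF mixes quant types

def q4_k_m_gib(params: float) -> float:
    # params * bits / 8 gives bytes; divide by 2^30 for GiB.
    return params * Q4_K_M_BITS_PER_WEIGHT / 8 / 2**30

print(f"Mistral 7B:   ~{q4_k_m_gib(7.2e9):.1f} GiB")
print(f"Llama 3.1 8B: ~{q4_k_m_gib(8.0e9):.1f} GiB")
```

For Mistral 7B this lands at ~4.1 GiB, matching the table; the Llama 3.1 8B estimate comes out slightly under the listed ~4.7 GB because of the higher-precision layers noted above.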
Running Mistral 7B with Ollama
```shell
# Install Ollama (if not installed)
# Download from https://ollama.ai

# Run Mistral 7B (auto-selects Q4_K_M)
ollama run mistral

# Run Mistral 7B Instruct (better for conversation)
ollama run mistral:instruct

# Run Mixtral 8x7B (MoE — needs 26+ GB RAM)
ollama run mixtral
```
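Beyond the CLI, Ollama also exposes a local REST API (by default at http://localhost:11434), which is handy for scripting. A minimal Python sketch, assuming the Ollama server is running and the mistral model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_payload(prompt: str, model: str = "mistral") -> dict:
    # stream=False returns a single JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "mistral") -> str:
    """Send a non-streaming generate request to a local Ollama server."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Explain sliding window attention in one sentence."))
```

Swap the model string for mistral:instruct or mixtral to target the other tags shown above.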
Related model benchmarks
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export