Offline Chat
Offline chat means running the full model locally — your conversations never leave your machine. The sweet spot is 7B–14B models for speed, or 32B for noticeably better quality.
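To make "never leaves your machine" concrete, here is a minimal sketch of one chat turn against a locally running Ollama server, using only the Python standard library. Port 11434 is Ollama's default; the model tag llama3.1:8b is an assumption, so substitute whatever you have pulled.

```python
# Minimal sketch: one chat turn against a local Ollama server.
# Assumes Ollama is listening on its default port (11434) and that a
# model tagged "llama3.1:8b" has already been pulled -- swap in your own tag.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint

payload = {
    "model": "llama3.1:8b",  # assumed model tag
    "messages": [
        {"role": "user", "content": "Explain quantization in one sentence."}
    ],
    "stream": False,  # request a single JSON response instead of a stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# The whole round trip stays on localhost -- nothing leaves the machine.
print(reply["message"]["content"])
```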
Why these models for this use case
Offline chat spans a wide model range. For casual use, a 7B model at Q8 runs at 60–80 tok/s and feels fast. For more thoughtful responses, 14B at Q4 is a good middle ground. If you want GPT-3.5-class quality offline, 32B models are the target, and you need at least 24 GB of RAM since a Q4 quantization weighs in around 20 GB. Ollama and LM Studio both run all of these setups out of the box.
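The memory figures follow from simple arithmetic: weight footprint ≈ parameter count × bits per weight ÷ 8. A small sketch using assumed bits-per-weight rates for common GGUF quant types (roughly 4.85 for Q4_K_M, 8.5 for Q8_0; approximations, not figures from this page) reproduces the ~20 GB number:

```python
# Back-of-envelope check of the "~20 GB at Q4" figure.
# Bits-per-weight values are rough rates for common GGUF quant types
# (Q4_K_M and Q8_0) -- approximations, not exact numbers from the text.
BITS_PER_WEIGHT = {"Q4": 4.85, "Q8": 8.5}

def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"7B  at Q8: {weight_gb(7, 'Q8'):.1f} GB")   # ~7.4 GB
print(f"14B at Q4: {weight_gb(14, 'Q4'):.1f} GB")  # ~8.5 GB
print(f"32B at Q4: {weight_gb(32, 'Q4'):.1f} GB")  # ~19.4 GB, leaving headroom in 24 GB
```

Note the 32B figure leaves only a few GB of headroom in 24 GB of RAM for the KV cache and the OS, which is why 24 GB is the floor rather than a comfortable target.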
Benchmark results — fastest rows first
Filtered to models commonly used for offline chat. Sorted by avg tok/s descending.
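If you want to reproduce this ordering from the raw data linked in the Data section, a short sketch like the following would do it. The field names (use_cases, avg_tok_s, model) are guesses at the benchmarks.json schema, not confirmed by this page:

```python
# Hedged sketch: rebuild the table ordering from benchmarks.json.
# Keep rows flagged for offline chat, then sort by average tok/s
# descending. Field names below are assumed, not documented here.
import json

with open("benchmarks.json") as f:
    rows = json.load(f)

chat_rows = [r for r in rows if "offline-chat" in r.get("use_cases", [])]
chat_rows.sort(key=lambda r: r["avg_tok_s"], reverse=True)  # fastest first

for r in chat_rows:
    print(f"{r['model']:<24} {r['avg_tok_s']:.1f} tok/s")
```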
Recommended chips for this use case
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv