Local LLMs on macOS in 2026: when they're worth the GPU
Local LLMs got dramatically better in 2025-2026. They're competitive with frontier APIs for some workflows; not all. Here's the honest picture.
On macOS, the improvement is concrete: the Apple Silicon unified memory architecture is genuinely well-suited to inference, Ollama and LM Studio have matured, and 70B-class models are now usable on workstations.
But "better" doesn't mean "as good as frontier APIs." This post is the honest 2026 picture — which workflows fit local, which don't, and what to actually run.
TL;DR
- Frontier coding/reasoning/long-context: APIs still win.
- Privacy-required tasks: local is the answer (no other choice).
- Offline work: local is the answer.
- Cost-sensitive, very heavy use: local can break even within months at high volume (see the cost analysis below).
- Most casual use: APIs are simpler and good enough.
What local can do well in 2026
After running both daily for months, local models handle these well:
Code review for non-critical work
Llama 4 70B (or Qwen3 70B) does competent code review. Not as polished as Claude, but identifies most defects.
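A minimal sketch of what that looks like with Ollama (the model tag is illustrative, and this assumes a recent Ollama build that accepts piped stdin):
# pipe a diff into a local model for review (model tag illustrative)
git diff main...HEAD | ollama run llama3.3:70b "Review this diff for bugs, unclear naming, and missing error handling."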
Summarization
Long-document summarization is a strength. 70B models summarize 30-page PDFs accurately.
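For PDFs, extract the text first and pipe it in. A sketch assuming poppler's pdftotext is installed (brew install poppler); the filename is a placeholder:
# extract the PDF's text, then summarize it locally
pdftotext report.pdf - | ollama run llama3.3:70b "Summarize this document in 10 bullet points."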
Translation, paraphrase
Both reliable. For text transformations, frontier API isn't needed.
Simple tool calling
7B-13B models can call simple tools (web search, calculator). Complex multi-step tool use is harder; frontier API is more reliable.
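Ollama's chat endpoint accepts OpenAI-style tool definitions for models that support them. A hedged sketch of the request shape (the calculator tool is hypothetical, the model tag is illustrative, and field names may differ by version; check the Ollama API docs):
# ask a small local model to call a hypothetical calculator tool
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "messages": [{"role": "user", "content": "What is 1842 * 37?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "calculator",
      "description": "Evaluate an arithmetic expression",
      "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"]
      }
    }
  }],
  "stream": false
}'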
Boilerplate generation
Generate a React component, a Pytest test, a CLI script — local handles fine.
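A one-shot prompt is enough; for example (the function name and output file are illustrative):
# generate a test module straight into a file (names are illustrative)
ollama run llama3.3:70b "Write a pytest test module for a parse_iso_date(s) function that returns a datetime or raises ValueError." > test_parse_iso_date.py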
What local still can't do well
Frontier coding tasks
Multi-file refactor across 50 files? Long context understanding? Subtle bug debugging? Frontier APIs are still ahead. The gap narrowed but isn't gone.
Long context (200k+ tokens)
Local models in 2026 handle 32k-128k context typically. Frontier APIs do 1M+. For tasks that need very long context, API wins.
Latest-knowledge questions
Local models have a knowledge cutoff. Frontier APIs are updated more frequently, can retrieve from the web, and have more recent context.
Subtle reasoning
Multi-step logical chains, math beyond basic arithmetic, contradiction detection: frontier APIs are better. Local models are improving, but the gap is still real.
Hardware reality
What you can run on common Apple Silicon:
| Mac | RAM | Best local model | Notes |
|---|---|---|---|
| M3 | 8GB | 3B quantized | Slow; mostly toy use |
| M3 Pro | 16GB | 7B-8B | Usable for simple tasks |
| M3 Pro | 32GB | 13B-15B | Usable for medium tasks |
| M3 Max | 64GB | 70B 4-bit | Frontier-class local |
| M4 Max | 128GB | 70B 6-bit, multiple loaded | Best in class for an individual workstation |
| Mac Studio (M-series Ultra) | 192GB+ | 405B partial | Specialized; rare |
For typical AI dev work in 2026, M3 Max 64GB is the sweet spot; M4 Max 128GB is best-in-class but expensive. The rule of thumb behind this: a 70B model at 4-bit quantization needs roughly 35-40GB for the weights alone (70B parameters at about half a byte each), before KV cache and the rest of the system, so 64GB is the practical floor for 70B-class work.
What to actually run
In 2026, three model families are worth knowing:
Llama 4 (Meta)
70B variant is the workhorse. Strong general capability, good code quality, free for commercial use.
ollama pull llama3.3:70b
(The exact name and tag vary; check Ollama's library.)
Qwen3 (Alibaba)
Strong on multilingual and code. 70B-class variant rivals Llama.
ollama pull qwen3:70b
DeepSeek (or successors)
Excellent at coding specifically. Smaller variants (16B mixture-of-experts) deliver disproportionate code quality per parameter.
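Following the same pattern as above (the tag is a guess; check Ollama's library for the current name):
# tag may differ; check ollama.com/library
ollama pull deepseek-coder-v2:16b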
The toolchain
Ollama
CLI-first runtime. Excellent for programmatic use.
brew install ollama
ollama serve
ollama pull llama3.3:70b
ollama run llama3.3:70b "Summarize this file"
API server at localhost:11434. Aider, custom scripts, and any OpenAI-compatible client can hit it.
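For scripts, hitting the REST endpoint directly is enough. A minimal non-streaming request (prompt and model tag are just examples):
# one-shot completion against the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Explain unified memory in one paragraph.",
  "stream": false
}'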
LM Studio
GUI for downloading and running models. Easier for non-CLI users. Good for evaluation and casual use.
Open WebUI
ChatGPT-like web interface, runs locally, talks to Ollama. Good if your team wants a shared local LLM service.
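A typical single-container setup, adapted from the project's quick start (check the Open WebUI docs for the current flags):
# run Open WebUI in Docker, pointed at the host's Ollama server (flags per the quick start)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main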
MLX (Apple's ML framework)
Direct Apple Silicon optimization. More setup; potentially better performance per watt. Niche.
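If you want to try it, the mlx-lm package is the usual entry point. A sketch assuming a quantized community model from the mlx-community org on Hugging Face:
# install the MLX LLM tooling and run a quantized model
pip install mlx-lm
# model name is one example from the mlx-community org; pick any quantized model you like
python -m mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Summarize the tradeoffs of 4-bit quantization."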
When to use local vs API
Use local for:
Privacy-required tasks
- Confidential client work that can't leave your machine.
- Health records, financial data, anything regulated.
- Internal proprietary code that shouldn't go to a third-party API.
Offline work
- On a flight, or in a remote location with bad internet.
- During API outages (yes, frontier APIs sometimes go down).
Cost-heavy workflows at scale
- If you're making thousands of API calls a day for routine tasks (summarization, classification), local breaks even after 3-6 months. For scale: 2,000 calls a day at roughly a cent each is about $600/month, so a $3,000-4,000 machine pays for itself in roughly six months, faster at higher volume.
Experimentation
- Free to try things; you can't blow the budget.
Use APIs for:
Frontier-quality work
- Hard coding tasks, subtle bugs, multi-file refactors.
Long context
- Anything beyond 100k tokens.
Casual use
- Light daily use; API cost is trivial; setup time isn't worth it.
Tasks that benefit from continuous model updates
- Recent libraries/frameworks; model needs to know about them.
A worked example: same task on local vs API
Task: review a 500-line PR for code quality issues.
Local (Llama 4 70B on M4 Max 128GB)
Time: ~30 seconds. Quality: identifies most issues. Misses some subtle ones. Cost: $0 marginal (one-time hardware). Privacy: code stays on machine.
API (Claude Sonnet)
Time: ~10 seconds. Quality: identifies more issues, including subtle ones. Cost: ~$0.01. Privacy: code goes to Anthropic.
For non-sensitive PRs, the API's better quality is worth $0.01. For sensitive PRs (proprietary, client-confidential), local is the answer.
The decision rule: privacy needs → local. Otherwise → API.
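The rule is simple enough to script. A hedged sketch: the .private marker file is our own convention, not a standard, and the cloud branch is left as a reminder rather than a full pipeline:
#!/bin/sh
# review-pr.sh: privacy needs -> local; otherwise -> cloud
# convention (ours, not a standard): a .private marker file flags confidential repos
if [ -f .private ]; then
  # confidential: the diff never leaves the machine
  git diff main...HEAD | ollama run llama3.3:70b "Review this diff for code quality issues."
else
  # non-sensitive: the cloud model's better review is worth a cent
  echo "No .private marker: send this PR to your cloud reviewer (aider, Claude Code, etc.)"
fi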
Setting up Ollama
If you want to try:
# install
brew install ollama
# start service
ollama serve &
# pull a model (size varies — Llama 4 70B is ~40GB)
ollama pull llama3.3:70b
# test
ollama run llama3.3:70b
> Tell me about Swift's strict concurrency mode in 50 words.
That's it. The model lives in ~/.ollama/models/. The service runs on localhost:11434.
Integrating with editors
Aider + local
aider --model ollama/llama3.3:70b
Aider talks to Ollama. You get a Claude-Code-style agent on a local model.
Quality: lower than Claude, but real. Privacy: full.
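If aider can't find your Ollama server, point it at the local endpoint explicitly (this is the environment variable aider's Ollama backend reads; the value shown is the default):
# tell aider where the local Ollama server lives, then start it
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama/llama3.3:70b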
Cursor + local
Cursor has experimental local model support. Setup is more involved; expect rough edges. For most users, sticking with Cursor's cloud models is smoother. Use local separately via Aider.
Claude Code + local
Not supported. Claude Code is Anthropic-locked.
File-manager setup
The mq-dir layout for local LLM workflows is the same as for cloud LLMs:
- Pane 1: project repo.
- Pane 2: cmux pane running Ollama session.
- Pane 3: artifacts.
- Pane 4: notes.
The agent location (cloud or local) is hidden behind the same interface (e.g., cmux + Aider). mq-dir doesn't care.
Cost analysis (as of 2026)
For a heavy AI dev user:
| Setup | Up-front | Monthly |
|---|---|---|
| Cloud APIs (Claude + GPT-5) | $0 | $50-150 |
| Local (M3 Max 64GB) | $3000-4000 hardware | ~$10 (electricity) |
| Local (M4 Max 128GB) | $5000-7000 hardware | ~$15 (electricity) |
| Hybrid: cloud for hard, local for routine | varies | $20-50 cloud + $10 electricity |
Pure local breaks even at roughly 24 months for heavy users ($3,000-4,000 of hardware against $100-150/month of avoided API spend). Hybrid is the cost-effective sweet spot.
What's coming
A few 2026-trending directions worth watching:
1M+ context local models
Currently rare on local. Quantized versions of long-context models are getting closer. Expect viable options by mid-2026.
Open-weight frontier models
Llama 4, Qwen3, and others are pushing the open-weight frontier. The gap to closed APIs may continue narrowing. Worth re-evaluating quarterly.
Apple's own models
Apple has been steadily improving on-device models. macOS 26+ may include genuinely capable local models tied to Apple's APIs. Worth watching.
Mixture-of-experts at scale
DeepSeek-style MoE models are efficient. More variants likely.
Verdict
Local LLMs in 2026 are a viable option for specific workflows:
- Privacy-required.
- Offline.
- Very heavy use (cost amortization).
- Experimentation.
For frontier-quality work, multi-file refactors, and long context, cloud APIs still win.
Hybrid is the right answer for most heavy users: cloud for hard work, local for routine. Total cost ~$30-60/month.
Hardware: M3 Max 64GB is the minimum for credible 70B-class local work and the price/performance sweet spot; M4 Max 128GB is the premium workstation tier.
Toolchain: Ollama for CLI, Aider for agentic, LM Studio for GUI evaluation.
mq-dir doesn't care which agent you use — the file manager is the visualization layer regardless. Free, MIT.