Context engineering: the hidden lever for agent quality
Prompt engineering tutorials peaked around 2023. Context engineering is the underdiscussed lever in 2026 — what's in the agent's working memory, what's not, what's at the start vs middle vs end. Same model, different context, dramatically different output quality.
This post is the practical playbook.
What context engineering is
The agent's "context" is everything in its window when it generates a response:
- System prompt (CLAUDE.md, instructions).
- Tool definitions (what tools are available).
- Loaded files (anything the agent has read).
- Tool output (results of previous tool calls).
- Conversation history (previous turns).
Each of these contributes. Together they shape the agent's behavior more than the individual prompt does.
Why it matters
Three observations from running large numbers of agent sessions:
1. The same agent with different contexts performs very differently
Give Claude (or GPT-5) a vague CLAUDE.md and a giant codebase dump → mediocre work.
Give the same model a tight CLAUDE.md, a focused 5-file context, and a clear prompt → excellent work.
The model is identical. The context isn't.
2. Context window size is a budget, not a license
A 1M+ token window doesn't mean "load everything." Models attend more to recent and opening tokens; mid-context content gets less weight (the "lost in the middle" effect). A 500k-token context with the relevant content scattered between tokens 50k and 450k may underperform a 50k-token context where the same content is concentrated.
3. Context cost is real
Every token is processed on every call. Inflated context means more cost and higher latency. Caching helps but doesn't eliminate either.
The practical playbook
1. Scope discipline
Decide what files the agent needs before the agent decides.
For a focused task:
- List the files explicitly in the prompt.
- "Read only X, Y, Z. Don't read other files unless you ask."
For an exploratory task:
- Specify the directory boundary.
- "Look only under src/auth/. Don't read src/users/ or test fixtures."
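As a sketch, the scoping instruction can be generated mechanically rather than retyped per task. The function name, wording, and file paths below are illustrative, not a fixed convention:

```python
def scoped_prompt(task: str, allowed: list[str]) -> str:
    """Build a prompt that names the in-scope files explicitly,
    mirroring the "read only X, Y, Z" pattern above."""
    file_list = "\n".join(f"- {p}" for p in allowed)
    return (
        f"{task}\n\n"
        "Read only these files:\n"
        f"{file_list}\n"
        "Don't read other files unless you ask first."
    )

prompt = scoped_prompt(
    "Fix the apostrophe bug in login.",
    ["src/auth/login.ts", "src/auth/login.test.ts"],
)
```

The point is that the scope boundary is decided in code, before the agent sees anything.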
2. Summary-first
For larger codebases, give the agent summaries instead of full files:
- File-level: a notes/architecture.md with 200-word descriptions of each module.
- Function-level: docstrings or type signatures, not full implementations (until needed).
- Cross-cutting: a notes/conventions.md with the codebase's grammar.
The agent uses summaries to decide which files to actually read.
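One minimal version of that decision step: match the task against module descriptions before any full file is loaded. The SUMMARIES dict and naive keyword match below are illustrative stand-ins for a real notes/architecture.md:

```python
SUMMARIES = {  # hypothetical module descriptions, as notes/architecture.md would hold
    "src/auth": "Login, session tokens, middleware guards.",
    "src/users": "User CRUD and profile settings.",
}

def pick_modules(task_keywords: set[str]) -> list[str]:
    """Crude keyword overlap between task and summaries: decide which
    modules are worth reading in full, without loading any of them."""
    hits = []
    for module, desc in SUMMARIES.items():
        words = set(desc.lower().replace(",", "").replace(".", "").split())
        if task_keywords & words:
            hits.append(module)
    return hits
```

A real setup would let the agent do this matching itself; the budget win is the same either way.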
3. Tool output management
Tool outputs (especially terminal outputs) bloat context fast. Mitigation:
- After a tool call, summarize the output explicitly: "The test output shows 5 failures, all in auth/middleware.test.ts. Discard the full output; we'll work with the summary."
- The agent then has a 50-token summary instead of a 5000-token raw output.
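When you can't ask the model to summarize, a mechanical fallback is to truncate oversized output while keeping the head and tail, which usually hold the command echo and the final error. A minimal sketch (the 2000-char threshold is arbitrary):

```python
def compact_tool_output(output: str, max_chars: int = 2000) -> str:
    """Keep head and tail of oversized tool output; elide the middle.
    A crude stand-in for an explicit model-written summary."""
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    omitted = len(output) - max_chars
    return output[:half] + f"\n... [{omitted} chars omitted] ...\n" + output[-half:]
```

Run this on every tool result before it enters the transcript, not after the context is already bloated.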
4. Conversation pruning
Long sessions accumulate turns. Periodically:
- "Summarize what we've established so far in 200 words. Then we'll continue from that summary."
The agent produces a compact synthesis. Subsequent turns ride on the summary; the early turns can be conceptually discarded.
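Mechanically, pruning amounts to replacing the early turns with one summary message and keeping only the recent tail. A sketch, assuming the common role/content message shape and a summary the model already produced:

```python
def prune_history(messages: list[dict], summary: str, keep_last: int = 4) -> list[dict]:
    """Drop all but the last few turns; lead with a summary message
    so the next generation rides on the synthesis, not the raw history."""
    recent = messages[-keep_last:] if keep_last else []
    header = {"role": "user", "content": f"Summary of the session so far:\n{summary}"}
    return [header] + recent
```

The early turns still exist in your logs; they just stop costing tokens on every subsequent call.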
5. Context budget discipline
For long-running sessions, set yourself a budget:
- "I want this session's context to stay under 100k tokens before generation."
- Track via the API's response metadata.
- When approaching the budget, prune (summarize, drop old tool outputs, restart with summary).
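The budget check itself is one comparison against the input-token count the API reports back (in the Anthropic SDK this arrives as usage metadata on the response). The 100k figure and 90% trigger below are the illustrative numbers from this section, not recommendations:

```python
BUDGET = 100_000  # tokens of context before generation, per this section's example

def over_budget(input_tokens: int, budget: int = BUDGET) -> bool:
    """input_tokens comes from the API response's usage metadata.
    Trigger pruning before the cap, not at it."""
    return input_tokens >= int(budget * 0.9)
```

Check after every call; when it trips, summarize, drop old tool outputs, or restart from a summary.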
6. Position matters
Critical content goes at the start (system prompt, CLAUDE.md) or the end (current task). Mid-context content gets less attention.
If you have a 50k-token reference document, it competes for attention with everything in the middle. Consider:
- Reference at the start (cached system content).
- Current file just before the user prompt.
- User prompt at the end.
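The three-position layout above can be expressed directly in how the message list is assembled. A sketch, again assuming the common role/content shape:

```python
def assemble_context(reference: str, current_file: str, user_prompt: str) -> list[dict]:
    """Order context so stable reference material leads (cacheable, high
    attention) and the actual task lands last, where attention is strongest."""
    return [
        {"role": "system", "content": reference},    # start: cached reference
        {"role": "user", "content": current_file},   # just before the prompt
        {"role": "user", "content": user_prompt},    # end: the current task
    ]
```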
7. Selective inclusion
The agent doesn't need everything. For a task:
- Code being changed: include.
- Tests for the code being changed: include.
- Code that calls the code being changed: include if signatures change.
- Adjacent unrelated code: don't include.
- Build configs, lockfiles: don't include unless directly relevant.
Be aggressive about exclusion.
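Those inclusion rules are simple enough to encode as a filter. The exclusion patterns and the .test.ts naming convention below are illustrative assumptions, not universal:

```python
import fnmatch

# Illustrative exclusion patterns: lockfiles, build configs, build output.
EXCLUDE = ["*.lock", "package-lock.json", "*.config.*", "dist/*"]

def in_scope(path: str, changed: set[str], callers: set[str]) -> bool:
    """Apply the rules above: changed code, its tests, and affected
    callers are in; lockfiles and configs are out; everything else is out."""
    if any(fnmatch.fnmatch(path, pat) for pat in EXCLUDE):
        return False
    if path in changed or path in callers:
        return True
    # Include tests for changed files (assumes foo.ts -> foo.test.ts naming).
    return path.endswith(".test.ts") and path.replace(".test.ts", ".ts") in changed
```

Note the default is exclusion: a path gets in only by matching a rule, never by failing to match one.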
What good context looks like
For "fix this bug in auth/login.ts":
[system: CLAUDE.md global preferences, ~50 lines]
[system: project CLAUDE.md, ~100 lines]
[system: session CLAUDE.md "scope is auth/", ~20 lines]
[file: src/auth/login.ts (the file being changed)]
[file: src/auth/login.test.ts (its tests)]
[file: src/auth/types.ts (shared types if relevant)]
[user: "There's a bug where login fails for usernames with apostrophes. Fix."]
Maybe 5k tokens total. Tight, focused. The agent's attention is concentrated.
What bad context looks like
For the same task:
[system: 500-line generic CLAUDE.md from a template]
[file: entire src/ tree (50 files, 30k tokens)]
[file: README.md, CHANGELOG.md, CONTRIBUTING.md]
[file: package.json, tsconfig.json, eslint.config.mjs]
[file: 10 unrelated test files]
[user: "Fix the login bug"]
50k+ tokens. The agent's attention is diluted across 60 files; it'll spend half its turns deciding what to read.
The same model produces noticeably worse output. Context engineering is the difference.
Strategies for specific scenarios
Scenario 1: large codebase exploration
Goal: understand a 500k-token codebase well enough to make a change.
Approach:
- Surface map first (top-level dirs, 200 words).
- Hot files (5-10 most relevant, ~3k tokens).
- Conventions doc (extracted from hot files, ~1k tokens).
- Specific change task with relevant files only.
Total budget: ~10k tokens of context for exploration; ~5k for the actual change.
Scenario 2: long-running refactor
Goal: refactor a pattern across 30 files over 2 hours.
Approach:
- Strong CLAUDE.md scoping which files are in/out.
- After every 5 file edits, summarize what's been done.
- Drop old tool outputs (test runs from earlier).
- Refresh context periodically with a "we've changed files A, B, C; pattern is settled; continue with D, E, F" reset.
Without this discipline, the session drifts into chaos by hour 1.5.
Scenario 3: documentation task
Goal: write a 2000-word post from a 5000-line research thread.
Approach:
- Summarize the thread first (extraction step).
- Use the summary as context for the post (compose step).
- Don't load the full thread into context for the compose step.
Two passes. Each pass has tight context.
Scenario 4: agent loops with iteration
Goal: have the agent iterate on its own output.
Approach:
- After each iteration, the agent reviews its previous attempt explicitly.
- Drop the attempt's reasoning trace; keep only the artifact.
- The next iteration sees: prompt + previous artifact + critique. Not the whole history.
Compounds well; otherwise context bloats fast.
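The "artifact + critique, not history" rule maps to a context builder like this sketch (role/content message shape assumed; the critique framing is illustrative):

```python
def next_iteration_context(prompt: str, artifact: str, critique: str) -> list[dict]:
    """Carry only the prompt, the latest artifact, and its critique
    forward; reasoning traces and earlier attempts are dropped."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": artifact},
        {"role": "user", "content": f"Critique of the attempt above:\n{critique}\nRevise."},
    ]
```

Context stays roughly constant per iteration instead of growing with each attempt.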
Tools that help
Anthropic's prompt caching
Cache stable context (CLAUDE.md, references) at 10% input cost. See the prompt caching post for mechanics.
Claude Code's automatic context management
Claude Code summarizes automatically when the context gets long, so you don't have to prune by hand. Trust this for short-to-medium sessions; intervene manually for very long ones.
Aider's /add and /drop
Aider lets you explicitly add/drop files from context. Direct control. Good if you want to manage manually.
Custom summarization scripts
For very large codebases, write a script that produces summaries on demand. The agent uses the summary; loads specific files when needed.
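A minimal sketch of such a script: walk the tree and emit a one-line-per-file index (path, length, first line) that the agent scans instead of reading files. The extension list and output format are illustrative choices:

```python
from pathlib import Path

def summarize_tree(root: str, exts: tuple[str, ...] = (".py", ".ts")) -> str:
    """One line per source file: path, line count, and the first line
    (often a docstring, import, or signature). Cheap but surprisingly useful."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            text = path.read_text(errors="ignore").splitlines()
            head = text[0].strip() if text else ""
            lines.append(f"{path} ({len(text)}L): {head}")
    return "\n".join(lines)
```

Richer versions extract docstrings or ask a cheap model to write the 200-word module descriptions; the shape is the same.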
Common context engineering mistakes
Loading "just in case"
Adding a file because it might be relevant burns context budget. If it isn't directly needed, exclude.
Forgetting tool output bloat
A git log of 1000 commits is 50k tokens. Truncate or summarize before letting the agent process it.
Stale summaries
A summary written 6 months ago may not reflect the current code. Periodically regenerate.
Position-blind context
Putting critical instructions in the middle. Move them to the start or end.
Not cleaning between sessions
Starting a new session with the previous session's context = context pollution. Reset.
What models do (and don't) infer about context
To keep expectations honest:
- Models do weight recent and opening tokens more than middle.
- Models do sometimes miss critical mid-context content.
- Models don't automatically realize when context is too small (they'll generate plausibly without it).
- Models don't know what they don't know — they'll guess if context is missing.
The implication: you can't trust the model to ask for missing context. You have to provide it correctly upfront.
File-manager setup for context-aware workflows
mq-dir's structure helps with context engineering:
- Pane 1: source code (the actual files in scope).
- Pane 2: notes/summaries (the summaries the agent uses).
- Pane 3: cmux session (the agent).
- Pane 4: scratchpad (your synthesis of what's been established).
Visible separation between "actual code" and "summaries" reinforces the practice of curating, not dumping.
Verdict
Context engineering is the 2026 leverage point. The model is mostly fixed; the context is what you control.
The patterns:
- Scope discipline.
- Summary-first for large codebases.
- Tool output management.
- Conversation pruning.
- Context budget awareness.
- Position-aware structuring.
- Selective inclusion (be aggressive about exclusion).
Same model, different context = different quality. Spend the engineering effort on context, not on incrementally tweaking prompt phrasing.
mq-dir pairs naturally — the file manager is where you decide what's in scope, the summaries pane is where you curate, the agent operates within the bounds you set.
mq-dir is fully open source.
MIT licensed, zero telemetry. Read the source, file an issue, send a PR.