Reviewing AI-generated PRs: a checklist that catches drift
AI agents produce diffs that look right and are subtly wrong. Here's the structured review checklist that catches the drift before it ships.
AI-generated PRs are subtly different from human-written ones. They look more correct than they are. The standard review checklist misses AI-specific failure modes; the agent passes its own tests by construction; and surface-level pattern-matching looks like real understanding until it doesn't.
This post is the structured review checklist for catching AI-PR drift before merge.
What "drift" means in AI-PRs
Five distinct failure modes:
1. Scope drift
The agent did what was asked AND adjacent things it thought were helpful. The PR is bigger than the task.
Example:
- Task: add a `formatDate` utility.
- Agent: adds `formatDate`, also refactors three existing date-handling files to use it, also fixes a "bug" it noticed in unrelated date-parsing code.
2. Convention drift
The agent pattern-matched to something that looks similar but isn't. The code follows the wrong pattern for this case.
Example:
- Codebase has a `Repository` pattern in `src/users/users-repo.ts`.
- Agent adds `src/auth/auth-helpers.ts` (instead of `auth-repo.ts`).
- Looks fine; doesn't follow the convention.
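To make the contrast concrete, here's a hypothetical sketch of the convention and the drift. Every name here (`UsersRepo`, `Database`, the file paths) is invented for illustration:

```typescript
// Hypothetical types, for illustration only.
type User = { id: string; email: string };
type Session = { token: string; userId: string };
interface Database {
  get<T>(table: string, key: string): Promise<T | null>;
}

// src/users/users-repo.ts: the existing convention, one Repo class per domain.
export class UsersRepo {
  constructor(private db: Database) {}
  findById(id: string): Promise<User | null> {
    return this.db.get<User>("users", id);
  }
}

// What the convention implies the agent should add: src/auth/auth-repo.ts.
export class AuthRepo {
  constructor(private db: Database) {}
  findSessionByToken(token: string): Promise<Session | null> {
    return this.db.get<Session>("sessions", token);
  }
}

// What it added instead: a loose helper in src/auth/auth-helpers.ts.
// It works, but it silently introduces a second pattern for the same concern.
export function getSession(db: Database, token: string): Promise<Session | null> {
  return db.get<Session>("sessions", token);
}
```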
3. Surface coverage
The tests pass but only cover the happy path. The agent wrote tests to make the implementation pass; edge cases are missing.
Example:
- New function. Test: "happy case works."
- Missing: "what if input is empty? null? wrong type?"
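In test form, the gap looks like this (a sketch in vitest syntax; `formatDate` and its contract are hypothetical):

```typescript
import { describe, it, expect } from "vitest";
import { formatDate } from "./format-date"; // hypothetical module under review

describe("formatDate", () => {
  // What the agent wrote: exactly the path its implementation handles.
  it("formats a valid date", () => {
    expect(formatDate(new Date("2026-01-15"))).toBe("2026-01-15");
  });

  // What's missing: the tests that actually specify behavior.
  it("throws on an invalid Date", () => {
    expect(() => formatDate(new Date("not a date"))).toThrow();
  });
  it("rejects non-Date input", () => {
    expect(() => formatDate(null as unknown as Date)).toThrow();
  });
});
```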
4. Removal without warrant
The agent removed code it judged "unused" or "obsolete" — but the removal is wrong.
Example:
- Removes a const that "isn't referenced" but is actually exported and consumed by another file.
- Inlines a constant that was named for documentation purposes.
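A hypothetical two-file sketch of the trap (file names invented). The constant looks dead from inside `config.ts`; only a repo-wide grep shows the consumer:

```typescript
// src/config.ts
// "Unused" to an agent that only read this file; a named constant that also
// serves as documentation for the retry policy.
export const MAX_RETRIES = 3;

// src/http/client.ts (a file the agent never opened)
import { MAX_RETRIES } from "../config";

export async function fetchWithRetry(url: string): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await fetch(url);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```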
5. Side effect blindness
The agent's change has effects beyond the visible diff — bundle size, performance, public API surface, dependency tree.
Example:
- Adds a 50KB dependency to do something a 5-line helper would handle.
- Changes a function's return type, breaking downstream callers.
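The return-type case is the sneakiest because the diff looks local. A hypothetical sketch (strict TypeScript surfaces it at compile time; plain JavaScript finds out at runtime):

```typescript
type User = { id: string; name: string };

// Before: getUser(id): Promise<User>, so callers assumed a non-null result.
// After the agent's change, the signature widened:
async function getUser(id: string): Promise<User | null> {
  // Hypothetical lookup; now returns null on a miss instead of throwing.
  return id === "u1" ? { id, name: "Ada" } : null;
}

// Every downstream caller written against the old signature is now wrong,
// and none of them appear in the visible diff:
export async function greet(id: string): Promise<string> {
  const user = await getUser(id);
  // return `Hello, ${user.name}`;  // <- type error under strictNullChecks
  return user ? `Hello, ${user.name}` : "Hello, stranger";
}
```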
The review checklist
Five sections. Apply in order.
Section 1: Scope match (2 min)
Check: does the PR do exactly what was asked, and nothing more?
- Read the PR description (or session CLAUDE.md scope).
- Skim the file list. Does each file relate to the task?
- Files that don't obviously relate: investigate. If the agent edited a "nearby" file, ask why.
Red flag: more than 1.5x the files you'd expect for the task.
Action if drift found: comment on extraneous changes; have the agent revert them. Don't merge with scope creep — it sets a bad precedent.
Section 2: Convention match (3 min)
Check: does the new code follow the codebase's actual conventions, not just look like them?
- For new files: are they in the right location? Named per convention?
- For new functions: do they follow the codebase's naming and shape?
- For new types: is the structure consistent with existing types?
- Any new "patterns" introduced — were they discussed?
Red flag: a new file that has a slightly-off name (`auth-helper.ts` instead of `auth-utils.ts`).
Action: comment specifically. Reference the existing pattern. Have the agent re-do.
Section 3: Test adequacy (5 min)
Check: do tests cover the actual behavior, including edge cases?
- Run the tests and confirm they pass.
- Read the test names. Do they cover happy path + edge cases + error paths?
- For each public function in the diff: is there a test for at least empty input, null/undefined, wrong type?
- Are there integration tests if the change crosses module boundaries?
Red flag: only happy-path tests. Or tests that look like they were generated to pass the implementation rather than to specify behavior.
Action: write at least one additional test that the agent didn't anticipate. Run it. If it fails, the implementation has a bug.
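For instance, continuing the hypothetical `formatDate` example, one adversarial test written without looking at the implementation (vitest syntax; the expected contract is an assumption you'd pin down from the task spec):

```typescript
import { it, expect } from "vitest";
import { formatDate } from "./format-date"; // hypothetical module under review

// One test the agent didn't anticipate: a boundary the implementation was
// never run against. If this fails, the implementation has the bug.
it("pads single-digit months and days", () => {
  expect(formatDate(new Date(2026, 0, 5))).toBe("2026-01-05");
});
```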
Section 4: Removal warrants (2 min)
Check: any removed code? Is the removal justified?
- Lines removed in the diff: are they actually unused (cross-check with grep)?
- Constants/types removed: any external consumers?
- Inlined values: was the original a documentation aid?
Red flag: removal of "unused" code that you can't immediately confirm is unused.
Action: revert any unjustified removals. Don't accept "looks unused" — verify or restore.
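grep is usually enough, but if you want the cross-check scriptable, a small Node sketch works too (the `src/` root and the `.ts` filter are assumptions about your layout):

```typescript
// check-removed.ts: usage `npx tsx check-removed.ts MAX_RETRIES formatDate`
// Flags any removed identifier that is still referenced somewhere under src/.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

function* walk(dir: string): Generator<string> {
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    if (statSync(path).isDirectory()) yield* walk(path);
    else if (path.endsWith(".ts")) yield path;
  }
}

for (const id of process.argv.slice(2)) {
  const hits = [...walk("src")].filter((f) =>
    readFileSync(f, "utf8").includes(id),
  );
  console.log(
    hits.length
      ? `${id}: still referenced in ${hits.join(", ")}`
      : `${id}: no references`,
  );
}
```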
Section 5: Side effects (3 min)
Check: what does this PR change beyond the visible diff?
- Bundle size (`npm run build` and check the output size).
- Public API surface (exported types/functions changed?).
- Dependency tree (`npm install` produced new entries?).
- Performance (any obvious O(n²) where the existing code was O(n log n)?).
- Database/external service interaction patterns (new queries, new HTTP calls).
Red flag: new dependency without rationale; bundle size jump; behavioral change unrelated to the stated task.
Action: comment specifically. For new dependencies, ask why this dep instead of writing 5 lines.
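To make "why this dep instead of 5 lines" concrete: the classic case is a date-formatting package where the platform's `Intl` API already does the job. A sketch, assuming the task only needs one fixed format:

```typescript
// Instead of a 50KB date library: the en-CA locale formats as YYYY-MM-DD.
export function formatDate(d: Date): string {
  if (Number.isNaN(d.getTime())) throw new TypeError("invalid Date");
  return new Intl.DateTimeFormat("en-CA", {
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).format(d);
}
```

If the agent reaches for a dependency here, the burden of proof is on the dependency.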
The 15-minute review
If you do all five sections, ~15 minutes per PR. For trivial PRs (1-3 files, simple change), it can compress to 5 minutes. For sprawling PRs, longer — but the right reaction to a sprawling PR is to bounce it back, not spend 60 minutes reviewing.
Letting the agent self-review
Before you review, you can have the agent review its own diff:
```
Before I look at this PR, review your diff for:
- Files outside the original scope (<paste session CLAUDE.md scope>).
- Tests that only cover happy path.
- Removed code that might be used elsewhere (run grep for the removed identifiers).
- New dependencies (any not in package.json before?).
Output: list of concerns. If any, propose fixes.
```
Sometimes the agent catches its own drift. More often it doesn't (agents are largely blind to their own output). Treat this as a useful pre-pass, not a substitute for human review.
What good AI-PRs look like
For balance, what a clean AI-PR has:
- Files in the diff are exactly what the task required (no adjacent edits).
- New code follows the codebase's existing patterns explicitly.
- Tests cover edges (empty, null, wrong-type, large input, error path).
- No removals (or removals are explicitly justified in the PR description).
- No new dependencies (or new dependencies have a "why this dep" justification).
These PRs sail through review in 5 minutes. Most of an AI fleet's PRs should hit this bar; the checklist is for catching the ones that don't.
File-manager setup for review
mq-dir's quad-pane works well for review:
- Pane 1: PR diff (open the worktree of the session that produced it).
- Pane 2: original code being changed (the main branch worktree of the same repo).
- Pane 3: tests folder.
- Pane 4: notes — your review checklist progress, comments to write back.
⌘1-4 cycles. You see the diff, original, tests, and your notes simultaneously. No alt-tabbing.
What changes after enforcing the checklist
After a couple of weeks of consistent checklist use:
- Agents produce cleaner PRs. Not because the checklist trains them, but because you re-prompt with the right scope/conventions and the next session's CLAUDE.md gets sharper.
- You catch drift earlier. Mid-task you can spot-check, not just end-of-task.
- Review time drops. The checklist becomes muscle memory; 15 minutes feels generous.
- PRs ship faster overall. Less iteration, fewer revert-and-retry cycles.
What this checklist doesn't cover
For honesty:
- Architectural decisions — checklist won't catch "this should be a Repository pattern, not a Service."
- Long-term maintainability — code that works today but will be hard to extend.
- Cross-team conventions — checklist assumes you know your codebase's patterns.
- Security — separate concern. Run `npm audit` and use a security-focused review pass.
These need human judgment beyond the checklist. The checklist catches mechanical drift; deeper concerns need deeper review.
Verdict
AI-PRs need a different review checklist than human-PRs. The five sections (scope, convention, test, removal, side effects) catch the drift modes the AI specifically falls into.
15 minutes per PR. Apply religiously for the first month; it becomes habit; the resulting PRs are higher quality.
If you're merging AI-generated PRs without this kind of structured review, you're shipping subtle drift. The checklist is small; the impact is real.
mq-dir's quad-pane is well-suited for this kind of review work — diff, original, tests, notes all visible at once. Free, MIT, pairs with the workflow.