Reviewing AI-generated PRs: a checklist that catches drift
AI agents produce diffs that look right and are subtly wrong. Here's the structured review checklist that catches the drift before it ships.
AI-generated PRs are subtly different from human-written ones. They look more correct than they are. The standard review checklist misses AI-specific failure modes; the agent passes its own tests by construction; and surface-level pattern-matching looks like real understanding until it doesn't.
This post is the structured review checklist for catching AI-PR drift before merge.
What "drift" means in AI-PRs
Five distinct failure modes:
1. Scope drift
The agent did what was asked AND adjacent things it thought were helpful. The PR is bigger than the task.
Example:
- Task: add a formatDate utility.
- Agent: adds formatDate, also refactors three existing date-handling files to use it, also fixes a "bug" it noticed in unrelated date-parsing code.
2. Convention drift
The agent pattern-matched to something that looks similar but isn't. The code follows the wrong pattern for this case.
Example:
- Codebase has a Repository pattern in src/users/users-repo.ts.
- Agent adds src/auth/auth-helpers.ts (instead of auth-repo.ts).
- Looks fine; doesn't follow the convention.
3. Surface coverage
The tests pass but only cover the happy path. The agent wrote tests to make the implementation pass; edge cases are missing.
Example:
- New function. Test: "happy case works."
- Missing: "what if input is empty? null? wrong type?"
4. Removal without warrant
The agent removed code it judged "unused" or "obsolete" — but the removal is wrong.
Example:
- Removes a const that "isn't referenced" but is actually exported and consumed by another file.
- Inlines a constant that was named for documentation purposes.
5. Side effect blindness
The agent's change has effects beyond the visible diff — bundle size, performance, public API surface, dependency tree.
Example:
- Adds a 50KB dependency to do something a 5-line helper would handle.
- Changes a function's return type, breaking downstream callers.
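One way to make a changed return type visible is to compare the exported surface before and after the PR. The sketch below assumes you already have a map of export name to signature text for each side; how those maps are produced is left open, and the names ApiSurface and diffApiSurface are hypothetical.

```typescript
// Export name -> signature text. The maps used below are hand-written
// for illustration; producing them from real code is out of scope here.
type ApiSurface = Record<string, string>;

interface ApiChange {
  name: string;
  kind: "removed" | "changed" | "added";
}

// Compare two surfaces and list every export that was removed, changed, or added.
function diffApiSurface(before: ApiSurface, after: ApiSurface): ApiChange[] {
  const changes: ApiChange[] = [];
  for (const name of Object.keys(before)) {
    if (!(name in after)) {
      changes.push({ name, kind: "removed" });
    } else if (before[name] !== after[name]) {
      changes.push({ name, kind: "changed" });
    }
  }
  for (const name of Object.keys(after)) {
    if (!(name in before)) {
      changes.push({ name, kind: "added" });
    }
  }
  return changes;
}

// Example: the task was "add formatDate", but parseDate's return type moved too.
const changes = diffApiSurface(
  { parseDate: "(input: string) => Date" },
  {
    parseDate: "(input: string) => Date | null",
    formatDate: "(date: Date) => string",
  },
);
console.log(changes);
```

An "added" entry is expected for this task; the "changed" entry on parseDate is the side effect worth a review comment.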
The review checklist
Five sections. Apply in order.
Section 1: Scope match (2 min)
Check: does the PR do exactly what was asked, and nothing more?
- Read the PR description (or session CLAUDE.md scope).
- Skim the file list. Does each file relate to the task?
- Files that don't obviously relate: investigate. If the agent edited a "nearby" file, ask why.
Red flag: more than 1.5x the files you'd expect for the task.
Action if drift found: comment on extraneous changes; have the agent revert them. Don't merge with scope creep — it sets a bad precedent.
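The scope check can be partly mechanized. This is a minimal sketch, assuming you can list the PR's changed files and write down the path prefixes the task was allowed to touch; checkScope, allowedPrefixes, and expectedFileCount are hypothetical names, not part of any tool.

```typescript
interface ScopeReport {
  outOfScope: string[];
  exceedsExpectedSize: boolean;
}

// Flag changed files outside the agreed scope.
// `allowedPrefixes` would come from the PR description or the session scope.
function checkScope(
  changedFiles: string[],
  allowedPrefixes: string[],
  expectedFileCount: number,
): ScopeReport {
  const outOfScope = changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
  // The red flag from this section: more than 1.5x the files you'd expect.
  const exceedsExpectedSize = changedFiles.length > expectedFileCount * 1.5;
  return { outOfScope, exceedsExpectedSize };
}

// Example: the task was "add a formatDate utility" (expected: 2 files).
const report = checkScope(
  [
    "src/utils/format-date.ts",
    "src/utils/format-date.test.ts",
    "src/billing/invoice-dates.ts",
    "src/reports/parse-date.ts",
  ],
  ["src/utils/format-date"],
  2,
);
console.log(report.outOfScope);
console.log(report.exceedsExpectedSize);
```

A prefix match is crude on purpose: it gives you a list of files to ask "why?" about, which is the real review step.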
Section 2: Convention match (3 min)
Check: does the new code follow the codebase's actual conventions, not just look like them?
- For new files: are they in the right location? Named per convention?
- For new functions: do they follow the codebase's naming and shape?
- For new types: is the structure consistent with existing types?
- Any new "patterns" introduced — were they discussed?
Red flag: a new file that has a slightly-off name (auth-helpers.ts instead of auth-repo.ts).
Action: comment specifically. Reference the existing pattern. Have the agent re-do.
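If a convention can be written down, it can be checked. The sketch below assumes a made-up rule modeled on src/users/users-repo.ts: a new file under src/<module>/ must be named <module>-repo.ts. Your codebase's real rule will differ; CONVENTION and findNamingDrift are hypothetical.

```typescript
// Assumed convention (hypothetical): src/<module>/<module>-repo.ts.
// The backreference \1 forces the file name to repeat the folder name.
const CONVENTION = /^src\/([a-z]+)\/\1-repo\.ts$/;

// Return the new files whose path does not match the convention.
function findNamingDrift(newFiles: string[]): string[] {
  return newFiles.filter((file) => !CONVENTION.test(file));
}

const drift = findNamingDrift([
  "src/users/users-repo.ts",
  "src/auth/auth-helpers.ts",
]);
console.log(drift);
```

This only catches naming and location. Whether the code inside the file follows the pattern still needs a human read.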
Section 3: Test adequacy (5 min)
Check: do tests cover the actual behavior, including edge cases?
- Run the tests and confirm they pass.
- Read the test names. Do they cover happy path + edge cases + error paths?
- For each public function in the diff: is there a test for at least empty input, null/undefined, wrong type?
- Are there integration tests if the change crosses module boundaries?
Red flag: only happy-path tests. Or tests that look like they were generated to pass the implementation rather than to specify behavior.
Action: write at least one additional test that the agent didn't anticipate. Run it. If it fails, the implementation has a bug.
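As an illustration of the edge cases listed in this section, here is a sketch of checks for a formatDate utility. The implementation is hypothetical and exists only so the checks have something to run against; the real function's contract may differ, for example by returning a fallback instead of throwing.

```typescript
// Hypothetical implementation: formats a Date as a UTC calendar date.
function formatDate(input: unknown): string {
  if (!(input instanceof Date)) {
    throw new TypeError("formatDate expects a Date");
  }
  if (Number.isNaN(input.getTime())) {
    throw new TypeError("formatDate expects a valid Date");
  }
  return input.toISOString().slice(0, 10);
}

// Small helper so the sketch needs no test framework.
function throwsTypeError(fn: () => unknown): boolean {
  try {
    fn();
    return false;
  } catch (err) {
    return err instanceof TypeError;
  }
}

// One happy-path check plus the edge cases an agent tends to skip.
const results = {
  happyPath: formatDate(new Date(Date.UTC(2024, 2, 5))) === "2024-03-05",
  nullInput: throwsTypeError(() => formatDate(null)),
  undefinedInput: throwsTypeError(() => formatDate(undefined)),
  emptyString: throwsTypeError(() => formatDate("")),
  wrongType: throwsTypeError(() => formatDate(20240305)),
  invalidDate: throwsTypeError(() => formatDate(new Date("not a date"))),
};
console.log(results);
```

If the agent's test file contains only the equivalent of happyPath, that is the surface-coverage pattern this section is looking for.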
Section 4: Removal warrants (2 min)
Check: any removed code? Is the removal justified?
- Lines removed in the diff: are they actually unused (cross-check with grep)?
- Constants/types removed: any external consumers?
- Inlined values: was the original a documentation aid?
Red flag: removal of "unused" code that you can't immediately confirm is unused.
Action: revert any unjustified removals. Don't accept "looks unused" — verify or restore.
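The grep cross-check can be sketched in code. This version assumes you have the PR as unified-diff text and the remaining files as in-memory strings; the declaration regex is a deliberate simplification (it ignores identifiers containing $ and anything that is not a simple declaration), and all names here are hypothetical.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

// Collect identifiers declared on lines that the diff removes.
function removedIdentifiers(diff: string): string[] {
  const declaration =
    /^-\s*(?:export\s+)?(?:const|let|var|function|class|type|interface)\s+([A-Za-z_]\w*)/;
  const names: string[] = [];
  for (const line of diff.split("\n")) {
    if (line.startsWith("---")) continue; // file header, not a removed line
    const name = declaration.exec(line)?.[1];
    if (name) names.push(name);
  }
  return names;
}

// For each removed identifier, list the files that still mention it.
function stillReferenced(
  names: string[],
  files: SourceFile[],
): { name: string; paths: string[] }[] {
  const hits: { name: string; paths: string[] }[] = [];
  for (const name of names) {
    const word = new RegExp(`\\b${name}\\b`);
    const paths = files.filter((f) => word.test(f.content)).map((f) => f.path);
    if (paths.length > 0) hits.push({ name, paths });
  }
  return hits;
}

const diff = [
  "--- a/src/config.ts",
  "+++ b/src/config.ts",
  "-export const MAX_RETRIES = 3;",
  "-const legacyFlag = false;",
  " export const TIMEOUT_MS = 5000;",
].join("\n");

const files: SourceFile[] = [
  { path: "src/http/client.ts", content: 'import { MAX_RETRIES } from "../config";' },
  { path: "src/index.ts", content: 'export * from "./config";' },
];

const unsafe = stillReferenced(removedIdentifiers(diff), files);
console.log(unsafe);
```

Anything this reports is a removal to revert or justify. An empty result is not proof of safety, since re-exports (like the one in src/index.ts above) and consumers outside the repo are invisible to a text search.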
Section 5: Side effects (3 min)
Check: what does this PR change beyond the visible diff?
- Bundle size (npm run build and check output size).
- Public API surface (exported types/functions changed?).
- Dependency tree (npm install produced new entries?).
- Performance (any obvious O(n²) where the existing was O(n log n)?).
- Database/external service interaction patterns (new queries, new HTTP calls).
Red flag: new dependency without rationale; bundle size jump; behavioral change unrelated to the stated task.
Action: comment specifically. For new dependencies, ask why this dep instead of writing 5 lines.
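The dependency check is easy to script. A minimal sketch, assuming you have the parsed package.json from the main branch and from the PR branch; newDependencies is a hypothetical helper and some-date-lib is a placeholder package name.

```typescript
interface PackageManifest {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// List packages present in `after` but not in `before`, sorted by name.
function newDependencies(
  before: PackageManifest,
  after: PackageManifest,
): string[] {
  const namesOf = (manifest: PackageManifest): string[] => [
    ...Object.keys(manifest.dependencies ?? {}),
    ...Object.keys(manifest.devDependencies ?? {}),
  ];
  const known = new Set(namesOf(before));
  return namesOf(after)
    .filter((pkg) => !known.has(pkg))
    .sort();
}

// Placeholder manifests for illustration.
const mainManifest: PackageManifest = {
  dependencies: { react: "^18.2.0" },
  devDependencies: { typescript: "^5.4.0" },
};
const prManifest: PackageManifest = {
  dependencies: { react: "^18.2.0", "some-date-lib": "^1.0.0" },
  devDependencies: { typescript: "^5.4.0" },
};

const added = newDependencies(mainManifest, prManifest);
console.log(added);
```

This only sees direct dependencies. Transitive additions show up in the lockfile diff, which is worth a skim whenever this list is non-empty.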
The 15-minute review
If you do all five sections, ~15 minutes per PR. For trivial PRs (1-3 files, simple change), it can compress to 5 minutes. For sprawling PRs, longer — but the right reaction to a sprawling PR is to bounce it back, not spend 60 minutes reviewing.
Letting the agent self-review
Before you review, you can have the agent review its own diff:
Before I look at this PR, review your diff for:
- Files outside the original scope (<paste session CLAUDE.md scope>).
- Tests that only cover happy path.
- Removed code that might be used elsewhere (run grep for the removed identifiers).
- New dependencies (any not in package.json before?).
Output: list of concerns. If any, propose fixes.
Sometimes the agent catches its own drift. More often it doesn't: it tends to be blind to its own mistakes. Treat self-review as a useful pre-pass, not a substitute for human review.
What good AI-PRs look like
For balance, what a clean AI-PR has:
- Files in the diff are exactly what the task required (no adjacent edits).
- New code follows the codebase's existing patterns explicitly.
- Tests cover edges (empty, null, wrong-type, large input, error path).
- No removals (or removals are explicitly justified in the PR description).
- No new dependencies (or new dependencies have a "why this dep" justification).
These PRs sail through review in 5 minutes. Most of an AI fleet's PRs should hit this bar; the checklist is for catching the ones that don't.
File-manager setup for review
mq-dir's quad-pane works well for review:
- Pane 1: PR diff (open the worktree of the session that produced it).
- Pane 2: original code being changed (the main branch worktree of the same repo).
- Pane 3: tests folder.
- Pane 4: notes — your review checklist progress, comments to write back.
⌘1-4 cycles. You see the diff, original, tests, and your notes simultaneously. No alt-tabbing.
What changes after enforcing the checklist
After a couple of weeks of consistent checklist use:
- Agents produce cleaner PRs. Not because the checklist trains them, but because you re-prompt with the right scope/conventions and the next session's CLAUDE.md gets sharper.
- You catch drift earlier. Mid-task you can spot-check, not just end-of-task.
- Review time drops. The checklist becomes muscle memory; 15 minutes feels generous.
- PRs ship faster overall. Less iteration, fewer revert-and-retry cycles.
What this checklist doesn't cover
For honesty:
- Architectural decisions — checklist won't catch "this should be a Repository pattern, not a Service."
- Long-term maintainability — code that works today but will be hard to extend.
- Cross-team conventions — checklist assumes you know your codebase's patterns.
- Security: separate concern. Run npm audit and use a security-focused review pass.
These need human judgment beyond the checklist. The checklist catches mechanical drift; deeper concerns need deeper review.
Verdict
AI-PRs need a different review checklist than human-PRs. The five sections (scope, convention, test, removal, side effects) catch the drift modes the AI specifically falls into.
15 minutes per PR. Apply religiously for the first month; it becomes habit; the resulting PRs are higher quality.
If you're merging AI-generated PRs without this kind of structured review, you're shipping subtle drift. The checklist is small; the impact is real.
mq-dir's quad-pane is well-suited for this kind of review work — diff, original, tests, notes all visible at once. Free, MIT, pairs with the workflow.