AI Workflows

Reviewing AI-generated PRs: a checklist that catches drift

AI agents produce diffs that look right and are subtly wrong. Here's the structured review checklist that catches the drift before it ships.

Honam Kang · 7 min read

AI-generated PRs are subtly different from human-written ones. They look more correct than they are. The standard review checklist misses AI-specific failure modes; the agent passes its own tests by construction; and surface-level pattern-matching looks like real understanding until it doesn't.

This post is the structured review checklist for catching AI-PR drift before merge.

What "drift" means in AI-PRs

Five distinct failure modes:

1. Scope drift

The agent did what was asked AND adjacent things it thought were helpful. The PR is bigger than the task.

Example:

  • Task: add formatDate utility.
  • Agent: adds formatDate, also refactors three existing date-handling files to use it, also fixes a "bug" it noticed in unrelated date-parsing code.
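To make the boundary concrete, here is roughly what the scoped deliverable could look like (a hypothetical sketch; the file name and shape are assumptions). The whole task fits in one new file; everything beyond it is drift.

  // src/utils/format-date.ts -- hypothetical; the entire task fits here.
  // Formats a Date as YYYY-MM-DD and rejects invalid input. Scope ends at
  // this file: no refactors of existing callers, no adjacent "fixes".
  export function formatDate(date: Date): string {
    if (!(date instanceof Date) || Number.isNaN(date.getTime())) {
      throw new TypeError("formatDate expects a valid Date");
    }
    const year = date.getFullYear();
    const month = String(date.getMonth() + 1).padStart(2, "0");
    const day = String(date.getDate()).padStart(2, "0");
    return `${year}-${month}-${day}`;
  }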

2. Convention drift

The agent pattern-matched to something that looks similar but isn't. The code follows the wrong pattern for this case.

Example:

  • Codebase has Repository pattern in src/users/users-repo.ts.
  • Agent adds src/auth/auth-helpers.ts (instead of auth-repo.ts).
  • Looks fine; doesn't follow the convention.
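For contrast, a sketch of what following the convention might look like, mirroring an assumed shape for users-repo.ts (the codebase isn't shown, so every detail here is hypothetical):

  // src/auth/auth-repo.ts -- hypothetical sketch mirroring the assumed
  // shape of src/users/users-repo.ts, rather than a loose helpers file.
  export interface Session {
    token: string;
    userId: string;
    expiresAt: Date;
  }

  export class AuthRepository {
    private sessions = new Map<string, Session>();

    findByToken(token: string): Session | undefined {
      return this.sessions.get(token);
    }

    save(session: Session): void {
      this.sessions.set(session.token, session);
    }
  }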

3. Surface coverage

The tests pass but only cover the happy path. The agent wrote tests to make the implementation pass; edge cases are missing.

Example:

  • New function. Test: "happy case works."
  • Missing: "what if input is empty? null? wrong type?"
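A hypothetical vitest-style sketch of the gap, using the formatDate example from earlier. The first test is the kind the agent writes; the others are the edge cases a reviewer has to demand.

  import { describe, expect, it } from "vitest"; // hypothetical test setup
  import { formatDate } from "./format-date";

  describe("formatDate", () => {
    // Usually the only test the agent writes.
    it("formats a valid date", () => {
      expect(formatDate(new Date(2024, 0, 15))).toBe("2024-01-15");
    });

    // The cases surface coverage misses.
    it("throws on an invalid Date", () => {
      expect(() => formatDate(new Date("not a date"))).toThrow();
    });

    it("throws on non-Date input", () => {
      expect(() => formatDate(null as unknown as Date)).toThrow();
    });
  });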

4. Removal without warrant

The agent removed code it judged "unused" or "obsolete" — but the removal is wrong.

Example:

  • Removes a const that "isn't referenced" but is actually exported and consumed by another file.
  • Inlines a constant that was named for documentation purposes.
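A hypothetical sketch of the first case. Inside its own file the constant has zero references, so it "looks unused"; the consumer lives in a file the agent never opened, which is why grep across the repo, not the diff, is the only safe check.

  // src/config/limits.ts -- hypothetical. No local references: "looks unused".
  export const MAX_RETRIES = 3;

  // src/api/client.ts -- a different file the agent never opened.
  import { MAX_RETRIES } from "../config/limits";

  export async function fetchWithRetry(url: string): Promise<Response> {
    let lastError: unknown;
    for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
      try {
        return await fetch(url);
      } catch (error) {
        lastError = error;
      }
    }
    throw lastError;
  }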

5. Side effect blindness

The agent's change has effects beyond the visible diff — bundle size, performance, public API surface, dependency tree.

Example:

  • Adds a 50KB dependency to do something a 5-line helper would handle.
  • Changes a function's return type, breaking downstream callers.
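The return-type case in TypeScript terms (all names hypothetical): the change looks like a cleanup inside the diff, but a caller outside the diff quietly changes meaning.

  interface User {
    id: string;
    name: string;
  }

  const users = new Map<string, User>([["u1", { id: "u1", name: "Ada" }]]);

  // v1: a missing user is a value the caller can check for.
  function findUserV1(id: string): User | null {
    return users.get(id) ?? null;
  }

  // v2, the agent's "improvement": a missing user is now an exception.
  function findUserV2(id: string): User {
    const user = users.get(id);
    if (!user) throw new Error(`no user ${id}`);
    return user;
  }

  // A caller written against v1. Swap findUserV2 in and this falsy check
  // silently becomes dead code, while a missing user now throws at runtime.
  const user = findUserV1("unknown");
  if (!user) {
    console.log("prompt signup");
  }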

The review checklist

Five sections. Apply in order.

Section 1: Scope match (2 min)

Check: does the PR do exactly what was asked, and nothing more?

  • Read the PR description (or session CLAUDE.md scope).
  • Skim the file list. Does each file relate to the task?
  • Files that don't obviously relate: investigate. If the agent edited a "nearby" file, ask why.

Red flag: more than 1.5x the files you'd expect for the task.

Action if drift found: comment on extraneous changes; have the agent revert them. Don't merge with scope creep — it sets a bad precedent.
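If you want to mechanize the file-list skim, a small Node script along these lines can flag out-of-scope files. A sketch only: it assumes a main base branch, git on PATH, and something like tsx to run TypeScript directly.

  // check-scope.ts -- hypothetical helper; run with: npx tsx check-scope.ts
  import { execSync } from "node:child_process";

  // Paths you expect the task to touch; everything else is a scope question.
  const expectedPrefixes = ["src/utils/format-date", "src/utils/__tests__/"];

  const changed = execSync("git diff --name-only main...HEAD", {
    encoding: "utf8",
  })
    .split("\n")
    .filter(Boolean);

  const outOfScope = changed.filter(
    (file) => !expectedPrefixes.some((prefix) => file.startsWith(prefix)),
  );

  if (outOfScope.length > 0) {
    console.log("Files outside expected scope:");
    outOfScope.forEach((file) => console.log(`  ${file}`));
  } else {
    console.log(`All ${changed.length} changed files are in scope.`);
  }

It won't judge whether a "nearby" edit was warranted, but it turns the skim into a yes/no list.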

Section 2: Convention match (3 min)

Check: does the new code follow the codebase's actual conventions, not just look like them?

  • For new files: are they in the right location? Named per convention?
  • For new functions: do they follow the codebase's naming and shape?
  • For new types: is the structure consistent with existing types?
  • Any new "patterns" introduced — were they discussed?

Red flag: a new file that has a slightly-off name (auth-helper.ts instead of auth-utils.ts).

Action: comment specifically. Reference the existing pattern. Have the agent re-do.

Section 3: Test adequacy (5 min)

Check: do tests cover the actual behavior, including edge cases?

  • Run the tests. They'll pass (the agent wrote them to pass); treat that as the baseline, not the verdict.
  • Read the test names. Do they cover happy path + edge cases + error paths?
  • For each public function in the diff: is there a test for at least empty input, null/undefined, wrong type?
  • Are there integration tests if the change crosses module boundaries?

Red flag: only happy-path tests. Or tests that look like they were generated to pass the implementation rather than to specify behavior.

Action: write at least one additional test that the agent didn't anticipate. Run it. If it fails, the implementation has a bug.
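For example, one adversarial probe against the formatDate sketch from earlier (hypothetical; pick whatever the agent's suite is least likely to contain):

  import { expect, it } from "vitest"; // hypothetical test setup
  import { formatDate } from "./format-date";

  // One input the agent's suite is unlikely to contain: a pre-epoch date.
  it("formats a pre-epoch date", () => {
    expect(formatDate(new Date(1969, 6, 20))).toBe("1969-07-20");
  });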

Section 4: Removal warrants (2 min)

Check: any removed code? Is the removal justified?

  • Lines removed in the diff: are they actually unused (cross-check with grep)?
  • Constants/types removed: any external consumers?
  • Inlined values: was the original a documentation aid?

Red flag: removal of "unused" code that you can't immediately confirm is unused.

Action: revert any unjustified removals. Don't accept "looks unused" — verify or restore.
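A sketch of that verification as a script (hypothetical names; assumes grep and a src/ tree), so "looks unused" becomes "zero references outside the diff":

  // check-removals.ts -- hypothetical; run with: npx tsx check-removals.ts
  import { execSync } from "node:child_process";

  // Identifiers the PR removes; pulled from the diff by hand.
  const removedIdentifiers = ["MAX_RETRIES", "LegacyDateFormat"];

  for (const identifier of removedIdentifiers) {
    let hits = "";
    try {
      // -r: recursive, -n: line numbers; grep exits non-zero on no matches.
      hits = execSync(`grep -rn "${identifier}" src/`, { encoding: "utf8" });
    } catch {
      // No matches: genuinely unreferenced.
    }
    console.log(
      hits
        ? `RESTORE ${identifier}: still referenced:\n${hits}`
        : `OK ${identifier}: no remaining references in src/`,
    );
  }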

Section 5: Side effects (3 min)

Check: what does this PR change beyond the visible diff?

  • Bundle size (npm run build and check output size).
  • Public API surface (exported types/functions changed?).
  • Dependency tree (did package.json or the lockfile gain new entries?).
  • Performance (any obvious O(n²) where the existing was O(n log n)?).
  • Database/external service interaction patterns (new queries, new HTTP calls).

Red flag: new dependency without rationale; bundle size jump; behavioral change unrelated to the stated task.

Action: comment specifically. For new dependencies, ask why this dep instead of writing 5 lines.
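The dependency-tree check is the easiest of these to mechanize. A sketch, again assuming a main base branch:

  // check-deps.ts -- hypothetical; run with: npx tsx check-deps.ts
  import { execSync } from "node:child_process";
  import { readFileSync } from "node:fs";

  function dependencyNames(packageJson: string): Set<string> {
    const parsed = JSON.parse(packageJson);
    return new Set([
      ...Object.keys(parsed.dependencies ?? {}),
      ...Object.keys(parsed.devDependencies ?? {}),
    ]);
  }

  const before = dependencyNames(
    execSync("git show main:package.json", { encoding: "utf8" }),
  );
  const after = dependencyNames(readFileSync("package.json", "utf8"));

  const added = [...after].filter((name) => !before.has(name));
  if (added.length > 0) {
    console.log(`New dependencies needing a rationale: ${added.join(", ")}`);
  } else {
    console.log("No new dependencies.");
  }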

The 15-minute review

If you do all five sections, ~15 minutes per PR. For trivial PRs (1-3 files, simple change), it can compress to 5 minutes. For sprawling PRs, longer — but the right reaction to a sprawling PR is to bounce it back, not spend 60 minutes reviewing.

Letting the agent self-review

Before you review, you can have the agent review its own diff:

Before I look at this PR, review your diff for:

  1. Files outside the original scope (<paste session CLAUDE.md scope>).
  2. Tests that only cover happy path.
  3. Removed code that might be used elsewhere (run grep for the removed identifiers).
  4. New dependencies (any not in package.json before?).

Output: list of concerns. If any, propose fixes.

Sometimes the agent catches its own drift. More often it doesn't; it is blind to its own assumptions. Treat this as a useful pre-pass, not a substitute for human review.

What good AI-PRs look like

For balance, what a clean AI-PR has:

  • Files in the diff are exactly what the task required (no adjacent edits).
  • New code follows the codebase's existing patterns explicitly.
  • Tests cover edges (empty, null, wrong-type, large input, error path).
  • No removals (or removals are explicitly justified in the PR description).
  • No new dependencies (or new dependencies have a "why this dep" justification).

These PRs sail through review in 5 minutes. Most of an AI fleet's PRs should hit this bar; the checklist is for catching the ones that don't.

File-manager setup for review

mq-dir's quad-pane works well for review:

  • Pane 1: PR diff (open the worktree of the session that produced it).
  • Pane 2: original code being changed (the main branch worktree of the same repo).
  • Pane 3: tests folder.
  • Pane 4: notes — your review checklist progress, comments to write back.

⌘1-4 switches between panes. You see the diff, original, tests, and your notes simultaneously. No alt-tabbing.

What changes after enforcing the checklist

After a couple of weeks of consistent checklist use:

  1. Agents produce cleaner PRs. Not because the checklist trains them, but because you re-prompt with the right scope/conventions and the next session's CLAUDE.md gets sharper.
  2. You catch drift earlier. Mid-task you can spot-check, not just end-of-task.
  3. Review time drops. The checklist becomes muscle memory; 15 minutes feels generous.
  4. PRs ship faster overall. Less iteration, fewer revert-and-retry cycles.

What this checklist doesn't cover

For honesty:

  • Architectural decisions — checklist won't catch "this should be a Repository pattern, not a Service."
  • Long-term maintainability — code that works today but will be hard to extend.
  • Cross-team conventions — checklist assumes you know your codebase's patterns.
  • Security — separate concern. Run npm audit and use a security-focused review pass.

These need human judgment beyond the checklist. The checklist catches mechanical drift; deeper concerns need deeper review.

Verdict

AI-PRs need a different review checklist than human-PRs. The five sections (scope, convention, test, removal, side effects) catch the drift modes the AI specifically falls into.

15 minutes per PR. Apply religiously for the first month; it becomes habit; the resulting PRs are higher quality.

If you're merging AI-generated PRs without this kind of structured review, you're shipping subtle drift. The checklist is small; the impact is real.

mq-dir's quad-pane is well-suited for this kind of review work — diff, original, tests, notes all visible at once. Free, MIT, pairs with the workflow.


Frequently asked questions

Why do AI-generated PRs need a different review checklist than human-written ones?

Two reasons. (1) AI is great at pattern-matching to surface conventions but misses the why behind them. (2) AI tends to over-deliver: it adds error handling that wasn't asked for, refactors adjacent code, and removes 'unused' things that were used. Human review needs to flag these specifically.

