Upgrade code review for AI-assisted work
Redesign review so that AI-assisted changes are judged by risk, understanding, and behavioral correctness, not by surface polish or author confidence.
- Situation
- AI-assisted code generation is common, but review practices have not adapted.
- Goal
- Preserve real engineering judgment and code quality when generated output volume rises faster than human review habits evolve.
- Do not use when
- AI-assisted development is rare and low-impact in the team
- Primary owner
- tech lead
- Roles involved
- tech lead
- reviewers
- contributors
- engineering manager
- staff engineer or architect for high-risk domains
Context
The situation
Deciding whether to reach for this playbook: when it fits, and when it doesn't.
Use when
Conditions where this playbook is the right tool.
- Developers are using copilots or LLMs regularly
- Pull requests are getting larger or arriving faster without explanation quality keeping pace
- Reviewers are increasingly approving code they did not fully reason through
- The team senses that output is rising faster than confidence
Do not use when
Contexts where this playbook will waste effort or make things worse.
- AI-assisted development is rare and low-impact in the team
- The real review problem is unrelated to AI and stems entirely from deeper ownership or staffing gaps
- Leaders want faster approval but not stronger review expectations
Stakes
Why this matters
What this playbook protects against, and why skipping or half-running it tends to be expensive.
AI changes review economics. Generated code often looks cleaner than human-authored code, which makes shallow approval easier. The danger is not only bad code. It is that the team gradually loses the habit of proving understanding before merge.
Quality bar
What good looks like
The observable qualities of a team or system that is actually doing this well. Not just going through the motions.
Signs of the playbook done well
- Review depth is matched to change risk, not to how polished the diff looks
- Authors can explain the intent, failure modes, and constraints of AI-assisted changes
- Reviewers look for behavior, design fit, and hidden risk rather than style alone
- Large generated diffs are broken down or framed so they remain reviewable
- AI-assisted work is visible enough to trigger the right review posture
Preparation
Before you start
What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.
Inputs
Material you'll want to gather first.
- Current review norms
- Sample AI-assisted pull requests
- Review latency and defect patterns
- Team expectations on disclosure and authorship
- Hotspot areas where generated code is especially risky
Prerequisites
Conditions that should be true for this to work.
- The team acknowledges that AI-assisted code is materially changing review conditions
- Reviewers are willing to challenge polished but weakly understood changes
- There is permission to slow high-risk merges when needed
Procedure
The procedure
Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.
Define what review must still prove
Anchor review in judgment rather than ritual.
Actions
- State the minimum questions a review must answer for AI-assisted code
- Separate syntax cleanliness from behavioral correctness and design fit
- Clarify that authorship still includes accountability for generated sections
Outputs
- AI review principles
Require better author framing
Reduce reviewer guesswork on generated diffs.
Actions
- Ask authors to summarize what changed, why, and what was AI-assisted if relevant
- Require explicit notes on risky assumptions, touched boundaries, and validation approach
- Make giant generated diffs unacceptable without decomposition or guiding context
Outputs
- AI-assisted PR template
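The framing requirement can be made checkable before review even starts. Below is a minimal sketch, assuming a Python CI step; the section names and the pass/fail rule are illustrative assumptions, not a standard template.

```python
# Minimal sketch of a CI-style gate on PR framing, assuming a Python check step.
# The required section names and the pass/fail rule are illustrative assumptions.

REQUIRED_SECTIONS = [
    "## What changed and why",
    "## AI assistance",        # what was generated vs. written by hand
    "## Risky assumptions",    # boundaries touched, known edge cases
    "## Validation",           # how behavior was checked beyond CI
]


def missing_sections(pr_body: str) -> list[str]:
    """Return the required headings absent from a PR description."""
    lowered = pr_body.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]


if __name__ == "__main__":
    example_body = (
        "## What changed and why\n"
        "Replaced the retry loop in the billing client.\n"
        "## Validation\n"
        "Added a unit test for the timeout path.\n"
    )
    gaps = missing_sections(example_body)
    if gaps:
        print("PR framing incomplete; missing:", ", ".join(gaps))
    else:
        print("All framing sections present")
```

A gate like this only checks that the sections exist; whether the content in them is honest still depends on the reviewer.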
Tier review by risk
Preserve speed for low-risk work while increasing scrutiny for consequential changes.
Actions
- Define review tiers for low-risk, medium-risk, and high-risk AI-assisted changes
- Reserve deeper review for generated logic in critical domains, hidden side-effect areas, or large refactors
- Ensure risky changes require reviewers with domain understanding
Outputs
- Risk-tier review model
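One way to make the tier model concrete is to express it as data that maps touched paths to a review posture. The sketch below assumes a Python helper; the path patterns, tier names, and policies are placeholders to replace with your own hotspot inventory.

```python
# Sketch of a risk-tier review model expressed as data, assuming a Python helper.
# Path patterns, tier names, and policies are placeholders, not a prescribed layout.

from fnmatch import fnmatch

# Checked in order; the first matching tier wins, and anything unmatched is "low".
# Note: fnmatch's "*" also crosses directory separators.
TIER_RULES = [
    ("high",   ["payments/*", "auth/*", "migrations/*"]),
    ("medium", ["services/*", "api/*"]),
]

TIER_POLICY = {
    "high":   "domain reviewer required; diff decomposed; author walkthrough",
    "medium": "one experienced reviewer; behavior-focused questions expected",
    "low":    "standard review; style nits optional",
}


def tier_for_change(changed_paths: list[str]) -> str:
    """Assign the whole change the highest tier that any touched path matches."""
    for tier, patterns in TIER_RULES:
        if any(fnmatch(path, pat) for path in changed_paths for pat in patterns):
            return tier
    return "low"


if __name__ == "__main__":
    paths = ["payments/ledger.py", "docs/README.md"]
    tier = tier_for_change(paths)
    print(tier, "->", TIER_POLICY[tier])
```

Keeping the rules as data makes them easy to review and revisit as AI usage patterns shift.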
Audit review behavior after merge
Check whether faster review became weaker review.
Actions
- Review escaped defects and brittle changes tied to AI-assisted work
- Sample approved PRs for actual review depth and comprehension
- Adjust norms when evidence shows approval outpacing understanding
Outputs
- Review quality audit
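Part of the audit can be mechanical. A hedged sketch, assuming per-PR review records with hypothetical field names and thresholds to calibrate locally:

```python
# Hedged sketch of the post-merge audit, assuming per-PR review records with
# hypothetical field names. Thresholds and the notion of a "substantive"
# comment are assumptions to calibrate locally.

import random
from dataclasses import dataclass


@dataclass
class MergedPR:
    id: int
    ai_assisted: bool
    lines_changed: int
    substantive_comments: int  # questions about behavior, not style nits
    review_minutes: int


def flag_shallow_reviews(prs: list[MergedPR], sample_size: int = 20) -> list[MergedPR]:
    """Sample recent AI-assisted merges and flag ones that look rubber-stamped."""
    pool = [p for p in prs if p.ai_assisted]
    sample = random.sample(pool, min(sample_size, len(pool)))
    return [
        p for p in sample
        if p.substantive_comments == 0 or p.review_minutes < p.lines_changed // 50
    ]


if __name__ == "__main__":
    history = [
        MergedPR(1, True, 900, 0, 6),
        MergedPR(2, True, 120, 3, 25),
        MergedPR(3, False, 40, 1, 10),
    ]
    for pr in flag_shallow_reviews(history):
        print(f"PR {pr.id}: approved with little evidence of comprehension")
```

Flagged PRs are a starting point for conversation, not a verdict; the point is to read them and judge whether approval reflected understanding.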
Train reviewers on AI-specific failure patterns
Build new instincts, not just new forms.
Actions
- Teach common failure modes like plausible wrong abstractions, hidden duplication, and dead-path logic
- Review example diffs that looked good but failed behaviorally
- Promote questions that surface understanding rather than style preference
Outputs
- Reviewer guidance pack
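A concrete teaching example helps here. The snippet below is hypothetical, written for the guidance pack rather than taken from any real codebase: it reads cleanly and would survive a style-only review, but the final branch is unreachable, so the behavior a PR summary might promise never runs.

```python
# Hypothetical teaching example for the guidance pack. The helper reads cleanly
# and would pass a style-only review, but the final branch is dead code.

def normalize_discount(percent: float) -> float:
    """Clamp a discount percentage into the 0-100 range."""
    if percent < 0:
        return 0.0
    if percent > 100:
        return 100.0
    # Dead path: any value above 100 already returned on the previous check,
    # so this condition can never be true and the "bonus" adjustment never runs.
    if 100 < percent <= 110:
        return percent - 10.0
    return percent
```

Walking reviewers through examples like this trains the question "which branch actually executes?" rather than "does this look tidy?".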
Judgment
Judgment calls and pitfalls
The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.
Decision points
Moments where judgment and trade-offs matter more than procedure.
- What kinds of AI-assisted changes require disclosure or stronger review framing?
- Which code areas should have stricter human-review expectations?
- What signals show that approval is outpacing understanding?
- Where is automation enough and where must domain reviewers stay central?
Questions worth asking
Prompts to use on yourself, the team, or an AI assistant while running the procedure.
- Can the author explain why this generated solution fits this system?
- What part of this diff is highest-risk despite looking clean?
- Would this still pass review if it were handwritten and equally large?
Common mistakes
Patterns that surface across teams running this playbook.
- Treating generated code as lower-effort review because it reads fluently
- Allowing huge AI-assisted diffs to merge because breaking them down feels slower
- Assuming green tests compensate for shallow conceptual review
- Letting authors merge code they cannot explain under pressure
Warning signs you are doing it wrong
Signals that the playbook is being executed but not landing.
- Review comments get shorter while diff size rises
- Authors cannot explain important edge cases in their own PRs
- The team merges generated code mainly because it compiles and looks coherent
- Post-merge surprises rise while review speed improves
Outcomes
Outcomes and signals
What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.
Artifacts to produce
Durable outputs the playbook should leave behind.
- AI review principles
- AI-assisted PR template
- Risk-tier review model
- Review quality audit
- Reviewer guidance pack
Success signals
Observable changes that mean the playbook landed.
- Review comments stay behavior-focused even as AI use rises
- Risky AI-assisted changes receive stronger review and clearer framing
- Escaped defects from shallow generated-code review decline
- Contributors understand that AI use changes review posture, not authorship responsibility
Follow-up actions
Moves that keep the playbook's effects compounding after it finishes.
- Revisit review tiers as AI usage patterns change
- Connect recurring review failures to training or ownership work
- Update onboarding so new engineers understand AI-era review expectations from day one
Metrics or signals to watch
Longer-horizon indicators that the underlying problem is receding.
- Median review time by AI-assisted risk tier
- Post-merge defect rate for AI-assisted changes
- Average diff size for AI-assisted PRs
- Percentage of reviewed PRs with meaningful reviewer questions
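If review data is exportable, these signals can be computed rather than estimated. The sketch below assumes a simple per-PR record with hypothetical field names; adapt it to whatever your review tooling actually provides.

```python
# Sketch of computing the watch signals from review data, assuming a simple
# per-PR record with hypothetical field names.

from collections import defaultdict
from statistics import median


def review_metrics(prs: list[dict]) -> dict:
    """Compute the four watch signals from a list of PR records."""
    ai = [p for p in prs if p["ai_assisted"]]
    minutes_by_tier = defaultdict(list)
    for p in ai:
        minutes_by_tier[p["risk_tier"]].append(p["review_minutes"])
    return {
        "median_review_minutes_by_tier": {t: median(v) for t, v in minutes_by_tier.items()},
        "ai_defect_rate": sum(p["escaped_defect"] for p in ai) / max(len(ai), 1),
        "avg_ai_diff_size": sum(p["lines_changed"] for p in ai) / max(len(ai), 1),
        "pct_prs_with_reviewer_questions": sum(p["reviewer_questions"] > 0 for p in prs) / max(len(prs), 1),
    }


if __name__ == "__main__":
    sample = [
        {"ai_assisted": True, "risk_tier": "high", "review_minutes": 45,
         "escaped_defect": False, "lines_changed": 300, "reviewer_questions": 4},
        {"ai_assisted": True, "risk_tier": "low", "review_minutes": 10,
         "escaped_defect": True, "lines_changed": 80, "reviewer_questions": 0},
    ]
    print(review_metrics(sample))
```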
AI impact
AI effects on this playbook
How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.
AI can help with
Where AI tooling genuinely reduces the cost of running this playbook well.
- Summarizing large diffs into behavior-focused review notes
- Highlighting risky files, generated patterns, and duplicated logic
- Suggesting reviewer checklists for specific change types
AI can make worse by
Distortions AI introduces that make the underlying problem harder to see.
- Making weak designs look elegant
- Encouraging reviewers to trust summaries instead of code paths
- Inflating diff volume beyond what the team can meaningfully inspect
AI synthesis
The core rule is simple: AI can assist authorship, but it must not weaken accountability. Review should shift from surface polish toward behavioral scrutiny and explicit understanding.
Relationships
Connected playbooks
Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.