Upgrade code review for AI-assisted work
Redesign review so that AI-assisted changes are judged by risk, understanding, and behavioral correctness, not by surface polish or author confidence.
- Situation
- AI-assisted code generation is common, but review practices have not adapted.
- Goal
- Preserve real engineering judgment and code quality when generated output volume rises faster than human review habits evolve.
- Do not use when
- AI-assisted development is rare and low-impact in the team
- Primary owner
- tech lead
- Roles involved
- tech lead
- reviewers
- contributors
- engineering manager
- staff engineer or architect for high-risk domains
Context
The situation
Deciding whether to reach for this playbook: when it fits, and when it doesn't.
Use when
Conditions where this playbook is the right tool.
- Developers are using copilots or LLMs regularly
- Pull requests are getting larger or arriving faster without explanation quality keeping pace
- Reviewers are increasingly approving code they did not fully reason through
- The team senses that output is rising faster than confidence
Do not use when
Contexts where this playbook will waste effort or make things worse.
- AI-assisted development is rare and low-impact in the team
- The real review problem is unrelated to AI and stems entirely from deeper ownership or staffing gaps
- Leaders want faster approval but not stronger review expectations
Stakes
Why this matters
What this playbook protects against, and why skipping or half-running it tends to be expensive.
AI changes review economics. Generated code often looks cleaner than human-authored code, which makes shallow approval easier. The danger is not only bad code. It is that the team gradually loses the habit of proving understanding before merge.
Quality bar
What good looks like
The observable qualities of a team or system that is actually doing this well. Not just going through the motions.
Signs of the playbook done well
- Review depth is matched to change risk, not to how polished the diff looks
- Authors can explain the intent, failure modes, and constraints of AI-assisted changes
- Reviewers look for behavior, design fit, and hidden risk rather than style alone
- Large generated diffs are broken down or framed so they remain reviewable
- AI-assisted work is visible enough to trigger the right review posture
Preparation
Before you start
What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.
Inputs
Material you'll want to gather first.
- Current review norms
- Sample AI-assisted pull requests
- Review latency and defect patterns
- Team expectations on disclosure and authorship
- Hotspot areas where generated code is especially risky
Prerequisites
Conditions that should be true for this to work.
- The team acknowledges that AI-assisted code is materially changing review conditions
- Reviewers are willing to challenge polished but weakly understood changes
- There is permission to slow high-risk merges when needed
Procedure
The procedure
Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.
Define what review must still prove
Anchor review in judgment rather than ritual.
Actions
- State the minimum questions a review must answer for AI-assisted code
- Separate syntax cleanliness from behavioral correctness and design fit
- Clarify that authorship still includes accountability for generated sections
Outputs
- AI review principles
Require better author framing
Reduce reviewer guesswork on generated diffs.
Actions
- Ask authors to summarize what changed, why, and what was AI-assisted if relevant
- Require explicit notes on risky assumptions, touched boundaries, and validation approach
- Make giant generated diffs unacceptable without decomposition or guiding context
Outputs
- AI-assisted PR template
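The framing requirement can be made checkable before review even starts. Below is a minimal sketch, assuming a Python CI step; the section names and the pass/fail rule are illustrative assumptions, not a standard template.

```python
# Minimal sketch of a CI-style gate on PR framing, assuming a Python check step.
# The required section names and the pass/fail rule are illustrative assumptions.

REQUIRED_SECTIONS = [
    "## What changed and why",
    "## AI assistance",        # what was generated vs. written by hand
    "## Risky assumptions",    # boundaries touched, known edge cases
    "## Validation",           # how behavior was checked beyond CI
]


def missing_sections(pr_body: str) -> list[str]:
    """Return the required headings absent from a PR description."""
    lowered = pr_body.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]


if __name__ == "__main__":
    example_body = (
        "## What changed and why\n"
        "Replaced the retry loop in the billing client.\n"
        "## Validation\n"
        "Added a unit test for the timeout path.\n"
    )
    gaps = missing_sections(example_body)
    if gaps:
        print("PR framing incomplete; missing:", ", ".join(gaps))
    else:
        print("All framing sections present")
```

A gate like this only checks that the sections exist; whether the content in them is honest still depends on the reviewer.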
Tier review by risk
Preserve speed for low-risk work while increasing scrutiny for consequential changes.
Actions
- Define review tiers for low-risk, medium-risk, and high-risk AI-assisted changes
- Reserve deeper review for generated logic in critical domains, hidden side-effect areas, or large refactors
- Ensure risky changes require reviewers with domain understanding
Outputs
- Risk-tier review model
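One way to make the tier model concrete is to express it as data that maps touched paths to a review posture. The sketch below assumes a Python helper; the path patterns, tier names, and policies are placeholders to replace with your own hotspot inventory.

```python
# Sketch of a risk-tier review model expressed as data, assuming a Python helper.
# Path patterns, tier names, and policies are placeholders, not a prescribed layout.

from fnmatch import fnmatch

# Checked in order; the first matching tier wins, and anything unmatched is "low".
# Note: fnmatch's "*" also crosses directory separators.
TIER_RULES = [
    ("high",   ["payments/*", "auth/*", "migrations/*"]),
    ("medium", ["services/*", "api/*"]),
]

TIER_POLICY = {
    "high":   "domain reviewer required; diff decomposed; author walkthrough",
    "medium": "one experienced reviewer; behavior-focused questions expected",
    "low":    "standard review; style nits optional",
}


def tier_for_change(changed_paths: list[str]) -> str:
    """Assign the whole change the highest tier that any touched path matches."""
    for tier, patterns in TIER_RULES:
        if any(fnmatch(path, pat) for path in changed_paths for pat in patterns):
            return tier
    return "low"


if __name__ == "__main__":
    paths = ["payments/ledger.py", "docs/README.md"]
    tier = tier_for_change(paths)
    print(tier, "->", TIER_POLICY[tier])
```

Keeping the rules as data makes them easy to review and revisit as AI usage patterns shift.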
Audit review behavior after merge
Check whether faster review became weaker review.
Actions
- Review escaped defects and brittle changes tied to AI-assisted work
- Sample approved PRs for actual review depth and comprehension
- Adjust norms when evidence shows approval outpacing understanding
Outputs
- Review quality audit
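Part of the audit can be mechanical. A hedged sketch, assuming per-PR review records with hypothetical field names and thresholds to calibrate locally:

```python
# Hedged sketch of the post-merge audit, assuming per-PR review records with
# hypothetical field names. Thresholds and the notion of a "substantive"
# comment are assumptions to calibrate locally.

import random
from dataclasses import dataclass


@dataclass
class MergedPR:
    id: int
    ai_assisted: bool
    lines_changed: int
    substantive_comments: int  # questions about behavior, not style nits
    review_minutes: int


def flag_shallow_reviews(prs: list[MergedPR], sample_size: int = 20) -> list[MergedPR]:
    """Sample recent AI-assisted merges and flag ones that look rubber-stamped."""
    pool = [p for p in prs if p.ai_assisted]
    sample = random.sample(pool, min(sample_size, len(pool)))
    return [
        p for p in sample
        if p.substantive_comments == 0 or p.review_minutes < p.lines_changed // 50
    ]


if __name__ == "__main__":
    history = [
        MergedPR(1, True, 900, 0, 6),
        MergedPR(2, True, 120, 3, 25),
        MergedPR(3, False, 40, 1, 10),
    ]
    for pr in flag_shallow_reviews(history):
        print(f"PR {pr.id}: approved with little evidence of comprehension")
```

Flagged PRs are a starting point for conversation, not a verdict; the point is to read them and judge whether approval reflected understanding.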
Train reviewers on AI-specific failure patterns
Build new instincts, not just new forms.
Actions
- Teach common failure modes like plausible wrong abstractions, hidden duplication, and dead-path logic
- Review example diffs that looked good but failed behaviorally
- Promote questions that surface understanding rather than style preference
Outputs
- Reviewer guidance pack
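A concrete teaching example helps here. The snippet below is hypothetical, written for the guidance pack rather than taken from any real codebase: it reads cleanly and would survive a style-only review, but the final branch is unreachable, so the behavior a PR summary might promise never runs.

```python
# Hypothetical teaching example for the guidance pack. The helper reads cleanly
# and would pass a style-only review, but the final branch is dead code.

def normalize_discount(percent: float) -> float:
    """Clamp a discount percentage into the 0-100 range."""
    if percent < 0:
        return 0.0
    if percent > 100:
        return 100.0
    # Dead path: any value above 100 already returned on the previous check,
    # so this condition can never be true and the "bonus" adjustment never runs.
    if 100 < percent <= 110:
        return percent - 10.0
    return percent
```

Walking reviewers through examples like this trains the question "which branch actually executes?" rather than "does this look tidy?".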
Judgment
Judgment calls and pitfalls
The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.
Decision points
Moments where judgment and trade-offs matter more than procedure.
- What kinds of AI-assisted changes require disclosure or stronger review framing?
- Which code areas should have stricter human-review expectations?
- What signals show that approval is outpacing understanding?
- Where is automation enough and where must domain reviewers stay central?
Questions worth asking
Prompts to use on yourself, the team, or an AI assistant while running the procedure.
- Can the author explain why this generated solution fits this system?
- What part of this diff is highest-risk despite looking clean?
- Would this still pass review if it were handwritten and equally large?
Common mistakes
Patterns that surface across teams running this playbook.
- Treating generated code as lower-effort review because it reads fluently
- Allowing huge AI-assisted diffs to merge because breaking them down feels slower
- Assuming green tests compensate for shallow conceptual review
- Letting authors merge code they cannot explain under pressure
Warning signs you are doing it wrong
Signals that the playbook is being executed but not landing.
- Review comments get shorter while diff size rises
- Authors cannot explain important edge cases in their own PRs
- The team merges generated code mainly because it compiles and looks coherent
- Post-merge surprises rise while review speed improves
Outcomes
Outcomes and signals
What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.
Artifacts to produce
Durable outputs the playbook should leave behind.
- AI review principles
- AI-assisted PR template
- Risk-tier review model
- Review quality audit
- Reviewer guidance pack
Success signals
Observable changes that mean the playbook landed.
- Review comments stay behavior-focused even as AI use rises
- Risky AI-assisted changes receive stronger review and clearer framing
- Escaped defects from shallow generated-code review decline
- Contributors understand that AI use changes review posture, not authorship responsibility
Follow-up actions
Moves that keep the playbook's effects compounding after it finishes.
- Revisit review tiers as AI usage patterns change
- Connect recurring review failures to training or ownership work
- Update onboarding so new engineers understand AI-era review expectations from day one
Metrics or signals to watch
Longer-horizon indicators that the underlying problem is receding.
- Median review time by AI-assisted risk tier
- Post-merge defect rate for AI-assisted changes
- Average diff size for AI-assisted PRs
- Percentage of reviewed PRs with meaningful reviewer questions
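If review data is exportable, these signals can be computed rather than estimated. The sketch below assumes a simple per-PR record with hypothetical field names; adapt it to whatever your review tooling actually provides.

```python
# Sketch of computing the watch signals from review data, assuming a simple
# per-PR record with hypothetical field names.

from collections import defaultdict
from statistics import median


def review_metrics(prs: list[dict]) -> dict:
    """Compute the four watch signals from a list of PR records."""
    ai = [p for p in prs if p["ai_assisted"]]
    minutes_by_tier = defaultdict(list)
    for p in ai:
        minutes_by_tier[p["risk_tier"]].append(p["review_minutes"])
    return {
        "median_review_minutes_by_tier": {t: median(v) for t, v in minutes_by_tier.items()},
        "ai_defect_rate": sum(p["escaped_defect"] for p in ai) / max(len(ai), 1),
        "avg_ai_diff_size": sum(p["lines_changed"] for p in ai) / max(len(ai), 1),
        "pct_prs_with_reviewer_questions": sum(p["reviewer_questions"] > 0 for p in prs) / max(len(prs), 1),
    }


if __name__ == "__main__":
    sample = [
        {"ai_assisted": True, "risk_tier": "high", "review_minutes": 45,
         "escaped_defect": False, "lines_changed": 300, "reviewer_questions": 4},
        {"ai_assisted": True, "risk_tier": "low", "review_minutes": 10,
         "escaped_defect": True, "lines_changed": 80, "reviewer_questions": 0},
    ]
    print(review_metrics(sample))
```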
AI impact
AI effects on this playbook
How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.
AI can help with
Where AI tooling genuinely reduces the cost of running this playbook well.
- Summarizing large diffs into behavior-focused review notes
- Highlighting risky files, generated patterns, and duplicated logic
- Suggesting reviewer checklists for specific change types
AI can make worse by
Distortions AI introduces that make the underlying problem harder to see.
- Making weak designs look elegant
- Encouraging reviewers to trust summaries instead of code paths
- Inflating diff volume beyond what the team can meaningfully inspect
AI synthesis
The core rule is simple: AI can assist authorship, but it must not weaken accountability. Review should shift from surface polish toward behavioral scrutiny and explicit understanding.
Relationships
Connected playbooks
Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.