The Hard Parts.dev
EP-01 · AI Engineering Playbook

Upgrade code review for AI-assisted work

Redesign review so that AI-assisted changes are judged by risk, understanding, and behavioral correctness, not by surface polish or author confidence.

Difficulty
medium-high
Time horizon
days to define new norms, weeks to reinforce them
Primary owner
tech lead
Confidence
high
At a glance · EP-01
Situation
AI-assisted code generation is common, but review practices have not adapted.
Goal
Preserve real engineering judgment and code quality when generated output volume rises faster than human review habits evolve.
Do not use when
AI-assisted development is rare and low-impact in the team
Primary owner
tech lead
Roles involved

  • tech lead
  • reviewers
  • contributors
  • engineering manager
  • staff engineer or architect for high-risk domains

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • Developers are using copilots or LLMs regularly
  • Pull requests are getting larger or faster without corresponding explanation quality
  • Reviewers are increasingly approving code they did not fully reason through
  • The team senses that output is rising faster than confidence

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

AI changes review economics. Generated code often looks cleaner than human-authored code, which makes shallow approval easier. The danger is not only bad code. It is that the team gradually loses the habit of proving understanding before merge.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • Review depth is matched to change risk, not to how polished the diff looks
  • Authors can explain the intent, failure modes, and constraints of AI-assisted changes
  • Reviewers look for behavior, design fit, and hidden risk rather than style alone
  • Large generated diffs are broken down or framed so they remain reviewable
  • AI-assisted work is visible enough to trigger the right review posture

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Current review norms
  • Sample AI-assisted pull requests
  • Review latency and defect patterns
  • Team expectations on disclosure and authorship
  • Hotspot areas where generated code is especially risky

Prerequisites

Conditions that should be true for this to work.

  • The team acknowledges that AI-assisted code is materially changing review conditions
  • Reviewers are willing to challenge polished but weakly understood changes
  • There is permission to slow high-risk merges when needed

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Define what review must still prove

    Anchor review in judgment rather than ritual.

    Actions

    • State the minimum questions a review must answer for AI-assisted code
    • Separate syntax cleanliness from behavioral correctness and design fit
    • Clarify that authorship still includes accountability for generated sections

    Outputs

    • AI review principles
  2. Require better author framing

    Reduce reviewer guesswork on generated diffs.

    Actions

    • Ask authors to summarize what changed, why, and what was AI-assisted if relevant
    • Require explicit notes on risky assumptions, touched boundaries, and validation approach
    • Reject large generated diffs that arrive without decomposition or guiding context

    Outputs

    • AI-assisted PR template
  3. Tier review by risk

    Preserve speed for low-risk work while increasing scrutiny for consequential changes.

    Actions

    • Define review tiers for low-risk, medium-risk, and high-risk AI-assisted changes
    • Reserve deeper review for generated logic in critical domains, hidden side-effect areas, or large refactors
    • Ensure risky changes require reviewers with domain understanding

    Outputs

    • Risk-tier review model
  4. Audit review behavior after merge

    Check whether faster review became weaker review.

    Actions

    • Review escaped defects and brittle changes tied to AI-assisted work
    • Sample approved PRs for actual review depth and comprehension
    • Adjust norms when evidence shows approval outpacing understanding

    Outputs

    • Review quality audit
  5. Train reviewers on AI-specific failure patterns

    Build new instincts, not just new forms.

    Actions

    • Teach common failure modes like plausible wrong abstractions, hidden duplication, and dead-path logic
    • Review example diffs that looked good but failed behaviorally
    • Promote questions that surface understanding rather than style preference

    Outputs

    • Reviewer guidance pack
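The risk-tier model from step 3 can be sketched as a small classifier. Everything below is an illustrative assumption, not a prescribed standard: the hotspot path patterns, the size thresholds, and the tier names would all come from your own hotspot inventory and review norms.

```python
import fnmatch

# Hypothetical hotspot patterns: paths where generated code is riskiest.
# A real team would derive these from its own defect and ownership data.
HIGH_RISK_PATHS = ["src/billing/*", "src/auth/*", "migrations/*"]

def review_tier(changed_paths, diff_lines, ai_assisted):
    """Map a change to a review tier. Thresholds are illustrative."""
    touches_hotspot = any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_paths
        for pattern in HIGH_RISK_PATHS
    )
    if touches_hotspot or diff_lines > 800:
        return "high"    # domain reviewer required, decomposition expected
    if ai_assisted and diff_lines > 200:
        return "medium"  # behavior-focused review, author framing required
    return "low"         # normal review speed is fine

# A small change in a hotspot still lands in the high tier,
# which is the point: risk, not diff polish, sets the posture.
print(review_tier(["src/billing/invoice.py"], 40, ai_assisted=True))   # high
print(review_tier(["docs/readme.md"], 300, ai_assisted=True))          # medium
print(review_tier(["docs/readme.md"], 20, ai_assisted=True))           # low
```

The useful property of writing the model down, even this crudely, is that the team can argue about the thresholds explicitly instead of each reviewer applying a private one.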

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What kinds of AI-assisted changes require disclosure or stronger review framing?
  • Which code areas should have stricter human-review expectations?
  • What signals show that approval is outpacing understanding?
  • Where is automation enough and where must domain reviewers stay central?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • Can the author explain why this generated solution fits this system?
  • What part of this diff is highest-risk despite looking clean?
  • Would this still pass review if it were handwritten and equally large?

Common mistakes

Patterns that surface across teams running this playbook.

  • Treating generated code as lower-effort review because it reads fluently
  • Allowing huge AI-assisted diffs to merge because breaking them down feels slower
  • Assuming green tests compensate for shallow conceptual review
  • Letting authors merge code they cannot explain under pressure

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Review comments get shorter while diff size rises
  • Authors cannot explain important edge cases in their own PRs
  • The team merges generated code mainly because it compiles and looks coherent
  • Post-merge surprises rise while review speed improves

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • AI review principles
  • AI-assisted PR template
  • Risk-tier review model
  • Review quality audit
  • Reviewer guidance pack
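A minimal version of the AI-assisted PR template artifact might look like the following. The section names are illustrative assumptions; the non-negotiable part is that the author, not the tool, answers each one.

```markdown
## Summary
What changed and why, in the author's own words.

## AI assistance
Which parts were AI-assisted, and how they were validated.

## Risk notes
- Assumptions that could be wrong
- Boundaries, contracts, or side-effect areas touched
- The highest-risk part of the diff, even if it looks clean

## Validation
Tests run, manual checks performed, and anything deliberately not verified.
```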

Success signals

Observable changes that mean the playbook landed.

  • Review comments stay behavior-focused even as AI use rises
  • Risky AI-assisted changes receive stronger review and clearer framing
  • Escaped defects from shallow generated-code review decline
  • Contributors understand that AI use changes review posture, not authorship responsibility

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Revisit review tiers as AI usage patterns change
  • Connect recurring review failures to training or ownership work
  • Update onboarding so new engineers understand AI-era review expectations from day one

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Median review time by AI-assisted risk tier
  • Post-merge defect rate for AI-assisted changes
  • Average diff size for AI-assisted PRs
  • Percentage of reviewed PRs with meaningful reviewer questions
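If PR metadata is exportable, these signals are cheap to compute. The record fields below (`tier`, `review_hours`, `escaped_defect`, `reviewer_questions`) are hypothetical; substitute whatever your tooling actually emits.

```python
from statistics import median

# Hypothetical PR records for illustration only.
prs = [
    {"tier": "high", "review_hours": 6.0, "escaped_defect": True,  "reviewer_questions": 5},
    {"tier": "high", "review_hours": 4.0, "escaped_defect": False, "reviewer_questions": 3},
    {"tier": "low",  "review_hours": 0.5, "escaped_defect": False, "reviewer_questions": 0},
]

def median_review_hours_by_tier(records):
    """Median review time per risk tier: watch for high-tier time collapsing."""
    by_tier = {}
    for pr in records:
        by_tier.setdefault(pr["tier"], []).append(pr["review_hours"])
    return {tier: median(hours) for tier, hours in by_tier.items()}

def escaped_defect_rate(records):
    """Fraction of merged PRs later tied to an escaped defect."""
    return sum(pr["escaped_defect"] for pr in records) / len(records)

def pct_with_reviewer_questions(records):
    """Share of PRs where the reviewer asked at least one question."""
    return 100 * sum(pr["reviewer_questions"] > 0 for pr in records) / len(records)

print(median_review_hours_by_tier(prs))   # {'high': 5.0, 'low': 0.5}
```

The trend lines matter more than the absolute numbers: high-tier review time shrinking while escaped defects rise is the approval-outpacing-understanding pattern this playbook exists to catch.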

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Summarizing large diffs into behavior-focused review notes
  • Highlighting risky files, generated patterns, and duplicated logic
  • Suggesting reviewer checklists for specific change types

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Making weak designs look elegant
  • Encouraging reviewers to trust summaries instead of code paths
  • Inflating diff volume beyond what the team can meaningfully inspect

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.