Run a phased migration · thehardparts.dev

Difficulty: high
Time horizon: multi-sprint to multi-quarter
Primary owner: tech lead
Confidence: high

At a glance · EP-17

Situation: You need to replace or move a live system without stopping delivery.
Goal: Reduce migration risk by replacing behavior incrementally instead of betting everything on one cutover.
Do not use when: the target system is still conceptually undefined
Roles involved: tech lead, architect, delivery lead, service owner, QA or quality lead, operations or platform owner, product owner if user-facing impact exists

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

A legacy system must be replaced or decomposed
A monolith capability is moving into a new service or platform
A schema, provider, or runtime migration is unavoidable
The existing system is painful but still business-critical

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Migrations tend to fail when ambition outruns displacement: the new system keeps growing while the old one keeps carrying the same load. A phased migration keeps the team focused on moving real behavior, not just producing new code.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

The migration is divided into business-meaningful slices
Old and new paths are both observable during transition
Every slice has explicit entry and exit criteria
Teams can say what was displaced, not just what was built
The legacy surface shrinks over time in visible ways
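
One way to make "explicit entry and exit criteria" concrete is to record them as data that can be checked instead of prose that can be reinterpreted. The sketch below assumes Python 3.9+; the `SliceCriteria` class, the slice name, and the criteria wording are all hypothetical illustrations, not part of the playbook.

```python
from dataclasses import dataclass, field


@dataclass
class SliceCriteria:
    slice_name: str
    # Must hold before work on the slice starts.
    entry: dict[str, bool] = field(default_factory=dict)
    # Must hold before the legacy side of the slice is retired.
    exit: dict[str, bool] = field(default_factory=dict)

    def unmet(self, phase: str) -> list[str]:
        """Names of criteria in the given phase that are not yet met."""
        criteria = self.entry if phase == "entry" else self.exit
        return [name for name, met in criteria.items() if not met]

    def passed(self, phase: str) -> bool:
        criteria = self.entry if phase == "entry" else self.exit
        # An empty criteria set means "not defined yet", not "passed".
        return bool(criteria) and not self.unmet(phase)


# Hypothetical criteria for one slice.
criteria = SliceCriteria(
    slice_name="invoice-pdf-rendering",
    entry={"behavior inventory reviewed": True, "rollback path tested": True},
    exit={"parity gaps closed": True, "legacy code path removed": False},
)
print(criteria.passed("entry"), criteria.unmet("exit"))
```

Treating an empty set as a failure is a deliberate choice here: a slice with no written criteria should not pass by default.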

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

Current system map
Dependency inventory
Business-critical workflows
Cutover constraints
Rollback options
Operational readiness expectations

Prerequisites

Conditions that should be true for this to work.

Shared understanding of why migration is needed
Minimum observability on the current path
Named owners for both source and target behavior
Agreement on what counts as displacement

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

01. Define the migration unit
Purpose: Turn the migration into slices that move real behavior instead of abstract layers.
Actions
- Map the current system by business capability, not only by code structure
- Identify the smallest slice that can be moved and verified independently
- Write down what remains in the old system after that slice moves
Outputs
- Migration slice map
- First slice definition
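
A migration slice map can be as simple as a list of records that name the capability, what moves, and what the old system still owns afterwards. The following is a minimal sketch assuming Python 3.9+; the `MigrationSlice` type, the `first_candidates` helper, and the invoicing example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationSlice:
    """One independently movable, independently verifiable unit of behavior."""
    name: str                    # business capability, not a code layer
    moves: list[str]             # behaviors that leave the legacy system
    stays_in_legacy: list[str]   # what the old system still owns afterwards
    depends_on: list[str] = field(default_factory=list)  # slices that must move first


def first_candidates(slices: list[MigrationSlice]) -> list[MigrationSlice]:
    """Slices with no prerequisites, smallest (fewest moved behaviors) first."""
    ready = [s for s in slices if not s.depends_on]
    return sorted(ready, key=lambda s: len(s.moves))


# Hypothetical slice map for an invoicing monolith.
slice_map = [
    MigrationSlice(
        name="tax-calculation",
        moves=["compute VAT", "compute regional surcharges"],
        stays_in_legacy=["invoice numbering"],
    ),
    MigrationSlice(
        name="invoice-pdf-rendering",
        moves=["render invoice PDF"],
        stays_in_legacy=["invoice numbering", "tax calculation"],
    ),
    MigrationSlice(
        name="invoice-numbering",
        moves=["allocate invoice number"],
        stays_in_legacy=[],
        depends_on=["tax-calculation"],
    ),
]

first = first_candidates(slice_map)[0]
print(first.name, "-> legacy keeps:", first.stays_in_legacy)
```

Counting moved behaviors is only a rough proxy for "smallest"; a real team would weigh risk and verifiability as well.
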
02. Make legacy behavior explicit
Purpose: Avoid losing hidden behavior that nobody remembered until production proved it mattered.
Actions
- Capture current inputs, outputs, side effects, edge cases, and operational expectations
- Review incident history for hidden assumptions
- Identify consumers that depend on undocumented behavior
Outputs
- Behavior inventory
- Known edge-case list
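
One common way to build a behavior inventory is characterization testing: run the legacy path over representative inputs and record whatever it does, including failures. The sketch below illustrates the idea; `legacy_format_amount` is a hypothetical stand-in for real legacy code, and `capture_behavior` is an illustrative helper, not an existing library function.

```python
import json
from typing import Any, Callable


def legacy_format_amount(cents: int) -> str:
    """Hypothetical legacy function. Undocumented behavior: negative
    amounts are wrapped in parentheses instead of carrying a minus sign."""
    value = f"{abs(cents) / 100:.2f}"
    return f"({value})" if cents < 0 else value


def capture_behavior(fn: Callable[..., Any], cases: list[tuple]) -> list[dict]:
    """Record what the legacy path returns or raises for each input, so that
    edge cases become written-down facts rather than tribal memory."""
    inventory = []
    for args in cases:
        try:
            outcome = {"returns": fn(*args)}
        except Exception as exc:  # a failure is behavior too
            outcome = {"raises": type(exc).__name__}
        inventory.append({"args": list(args), **outcome})
    return inventory


cases = [(1999,), (0,), (-250,), ("12",)]
inventory = capture_behavior(legacy_format_amount, cases)
print(json.dumps(inventory, indent=2))
```

The recorded inventory can later serve as the expected values when the target slice is validated for parity.
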
03. Design cutover and coexistence
Purpose: Ensure the team can compare the two paths, phase the move, and reverse it safely.
Actions
- Define how old and new paths will coexist
- Decide what traffic, data, or users move first
- Define rollback conditions and rollback mechanics
Outputs
- Cutover strategy
- Rollback approach
- Coexistence model
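
As one possible coexistence model, traffic can be split by a stable key so that a fixed percentage of users or tenants takes the new path, with a rollback condition agreed in advance. This is a sketch under assumptions: `CutoverConfig`, `routes_to_new`, `next_config`, and the error-rate threshold are hypothetical, and a real rollout would usually sit behind a feature-flag or routing layer.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CutoverConfig:
    new_path_percent: int    # 0 = all legacy, 100 = all new
    max_error_rate: float    # rollback condition, agreed before cutover


def routes_to_new(key: str, percent: int) -> bool:
    """Deterministically bucket a user or tenant key into 0..99 so the same
    key always takes the same path while the percentage is unchanged."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def next_config(cfg: CutoverConfig, observed_error_rate: float) -> CutoverConfig:
    """Rollback mechanics: if the agreed condition trips, send all traffic
    back to the legacy path; otherwise leave the rollout unchanged."""
    if observed_error_rate > cfg.max_error_rate:
        return CutoverConfig(new_path_percent=0, max_error_rate=cfg.max_error_rate)
    return cfg


cfg = CutoverConfig(new_path_percent=10, max_error_rate=0.01)
moved = sum(routes_to_new(f"user-{i}", cfg.new_path_percent) for i in range(10_000))
print(f"{moved / 100:.1f}% of sample keys on the new path")

cfg = next_config(cfg, observed_error_rate=0.03)
print("new path percent after rollback check:", cfg.new_path_percent)
```

Because buckets are compared against a threshold, raising the percentage only adds keys to the new path; nobody who has already moved is sent back unless a rollback is triggered.
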
04. Move one slice and retire one slice
Purpose: Prevent endless parallel ownership by pairing every slice moved with a legacy slice retired.
Actions
- Implement the target slice
- Validate parity and operational behavior
- Retire or isolate the old slice explicitly once confidence is sufficient
Outputs
- Migrated slice
- Legacy retirement note
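
Parity validation can be sketched as a shadow comparison: feed the same inputs to both paths and collect every difference, treating the legacy result as the reference while both paths coexist. The helper `compare_paths` and both `normalize` functions below are hypothetical examples, not the playbook's own tooling.

```python
from typing import Any, Callable, Iterable


def compare_paths(
    legacy: Callable[[Any], Any],
    target: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list[dict]:
    """Run both implementations on the same inputs and collect mismatches."""
    mismatches = []
    for item in inputs:
        expected = legacy(item)
        try:
            actual = target(item)
        except Exception as exc:  # a crash in the new path is a parity gap
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            mismatches.append({"input": item, "legacy": expected, "target": actual})
    return mismatches


# Hypothetical implementations: the rewrite missed that legacy trims whitespace.
def legacy_normalize(code: str) -> str:
    return code.strip().upper()


def target_normalize(code: str) -> str:
    return code.upper()


gaps = compare_paths(legacy_normalize, target_normalize, ["ab1", " ab1 ", "AB1"])
ready_to_retire = not gaps
print(gaps, ready_to_retire)
```

An empty mismatch list is evidence for retirement only to the extent that the inputs cover the behavior inventory, so the sample set matters as much as the comparison.
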
05. Measure displacement, not activity
Purpose: Keep the program honest by reporting what the legacy system no longer does, not how much new work was produced.
Actions
- Track legacy dependency reduction
- Track which users, workflows, or traffic moved
- Review whether old operational load actually fell
Outputs
- Displacement dashboard
- Migration progress review
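
A displacement dashboard can start from a handful of numbers compared against a baseline. The sketch below is illustrative only: the `Snapshot` fields, the `displacement` function, and all figures are hypothetical, and real values would come from whatever telemetry the team already has.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    legacy_dependents: int   # consumers still calling the old system
    requests_legacy: int     # requests served by the old path in the period
    requests_new: int        # requests served by the new path in the period
    retired_slices: int      # slices whose legacy code is actually gone


def displacement(before: Snapshot, now: Snapshot) -> dict:
    """Report what left the legacy system, not how much was built."""
    total = now.requests_legacy + now.requests_new
    return {
        "traffic_on_new_pct": round(100 * now.requests_new / total, 1) if total else 0.0,
        "legacy_dependents_removed": before.legacy_dependents - now.legacy_dependents,
        "slices_retired": now.retired_slices - before.retired_slices,
    }


# Hypothetical figures for two review periods.
baseline = Snapshot(legacy_dependents=14, requests_legacy=90_000, requests_new=0, retired_slices=0)
current = Snapshot(legacy_dependents=11, requests_legacy=63_000, requests_new=27_000, retired_slices=1)
report = displacement(baseline, current)
print(report)
```

Nothing in the report counts code written or services deployed; every field can only improve when something actually moves off the legacy system.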

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

What is the first slice: technical seam or business capability?
Do we mirror behavior first or improve behavior during the move?
Can old and new paths run in parallel, or do we need progressive cutover?
What level of parity is required before retirement?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

What is the smallest migration slice that removes real legacy responsibility?
What hidden behaviors does the current system provide that nobody wrote down?
How will we prove this slice is displaced rather than duplicated?

Common mistakes

Patterns that surface across teams running this playbook.

Making the slice too technical and not user- or behavior-meaningful
Adding net-new product ambition before parity is reached
Measuring generated code instead of displaced legacy behavior
Keeping the old path alive indefinitely out of vague caution
Assuming undocumented legacy behavior is accidental

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

Nobody can name what has been retired so far
The new system keeps expanding in scope while the old one stays fully alive
Migration status sounds impressive but no user or workload has actually moved
Teams describe the migration using architecture language only, not business behavior

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

Migration slice map
Behavior parity checklist
Cutover plan
Rollback plan
Legacy retirement record

Success signals

Observable changes that mean the playbook landed.

Specific legacy paths are retired on a steady cadence
Support and operational burden drops on the source system
Cutover decisions are made with evidence, not optimism
Stakeholders can see what has actually moved

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

Clean up coexistence code quickly after each slice
Review whether the target design still fits reality after each major slice
Update ownership and runbooks as the center of gravity shifts

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

Legacy dependency count
Percentage of traffic or workflows on new path
Rollback frequency
Incidents caused by parity gaps
Time spent maintaining dual paths

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

Mapping legacy code paths and dependencies
Summarizing incident history around the old system
Generating migration checklists and parity matrices
Finding likely consumers of hidden behaviors

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

Accelerating speculative replacement code before behavior is understood
Making parallel systems grow faster than the team can validate
Creating false confidence through polished migration documentation
