a worked example
Where a request is actually decided.
Two real features, two very different shapes through the same five layers. Most teams bolt AI onto the typing step. The work that decides whether either ships well happens long before the typing.
the request · the straight line
“Let people cancel a booking and get their deposit back.”
This looks like one Stripe call. Watch what it actually contains, and who carries each layer.
the cast
Five roles. Watch which ones carry each layer.
Judgment lives at the front, machines at the back. Each action below is tagged with who performs it, so you can track the driver as the request moves through the model.
Holds business intent and the rulings only they can make.
Facilitates alignment, owns the context and the spec.
Claude, orchestrated at M3. Interrogates, drafts, builds, writes.
Reviews every PR, approves, configures the guardrails.
Automated. Scans, allowlists, logs, generates the audit pack, measures.
Engagement Context
/context
Before any code, the request is interrogated until the hidden work is on the table. Eight decisions surface. Not one is a coding problem.
Sits with the owner and turns “give the deposit back” into questions.
Reads the codebase and RO / EU consumer law, surfaces eight hidden decisions and the constraints behind them.
States the business intent and the numbers that matter: revenue retained, no-show rate.
Writes the /context doc: intent, stakeholders, KPIs, constraints, seven open questions.
Confirms it reflects reality, and the engagement proceeds.
- Intent · Fair, self-serve cancellation that protects revenue and ends the manual phone-and-delete workflow.
- Constraint · RO OUG 34/2014 exempts fixed-date stays from the 14-day withdrawal right. Non-refundable is legal; tiers are a goodwill choice.
- KPIs · Self-serve resolution, time-to-refund, dispute rate, revenue retained, refund accuracy.
Seven open questions routed to L2. Nothing here is code.
Spec Engineering
spec + PR
Every ambiguity becomes a decided, testable line, reviewed and merged before execution. Once locked, there is nothing left for the build to get wrong.
Proposes a ruling for each open question: the tier schedule, the fee, force majeure.
Drafts the spec as eight testable acceptance criteria and flags two contradictions to resolve.
Makes the business calls: the tiers, the event-cabin band, who absorbs the Stripe fee.
Reviews the spec PR for technical soundness and missing edge cases.
Signs off with finance and legal. Merging the PR locks the rulings.
AC-1  ≥14d → refund 100% of deposit, status cancelled
AC-2  7–13d → refund exactly 50% (rounded to bani)
AC-3  <7d / no-show → forfeit, recorded with reason
AC-4  reschedule presented before any refund
AC-5  force majeure → reschedule, else 100% any lead time
AC-6  Stripe fee not deducted from guest; on business
AC-7  idempotent: at most one refund per booking
AC-8  no personal data reaches the LLM
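The tier rules in AC-1 to AC-3 and AC-5 reduce to a small piece of arithmetic. The sketch below is illustrative only, not the shipped refund engine. It assumes lead time is counted in whole calendar days between the cancellation date and check-in, that "rounded to bani" means half-up rounding to 0.01 RON, and that the force majeure branch applies once the guest has declined the reschedule. The function and parameter names are hypothetical.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

BAN = Decimal("0.01")  # one ban = 0.01 RON


def refund_amount(
    deposit: Decimal,
    check_in: date,
    cancelled_on: date,
    no_show: bool = False,
    force_majeure: bool = False,
) -> Decimal:
    """Return the deposit refund under the tier schedule (hypothetical helper)."""
    if force_majeure:
        # AC-5: 100% at any lead time (assumes the reschedule was declined).
        return deposit
    lead_days = (check_in - cancelled_on).days  # assumption: calendar days
    if no_show or lead_days < 7:
        # AC-3: deposit is forfeited.
        return Decimal("0.00")
    if lead_days >= 14:
        # AC-1: full refund.
        return deposit
    # AC-2: exactly 50%, rounded to bani (rounding mode is an assumption).
    return (deposit / 2).quantize(BAN, rounding=ROUND_HALF_UP)
```

Under these assumptions, a 100.01 RON deposit cancelled seven days before check-in refunds 50.01 RON. The rounding mode and the day-counting rule are exactly the kind of detail the spec PR would need to pin down.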
Owner, finance, legal and ops sign off. Merging the PR locks the rulings. The signatures are the deliverable.
Agentic Execution
provenance log
With the contract locked, execution is the short part. The agent builds against the spec and the human reviews the trail.
Writes the tests first, one per acceptance criterion, orchestrated at M3.
Implements the refund engine, the reschedule flow and the new refunds table to pass them.
Reviews the PR. Every agent action is in the provenance log; no rule was invented.
Approves and merges behind the refunds.enabled flag.
test  AC-1 refund_full_ge_14d            ✓ written first
test  AC-2 refund_half_7_13d             ✓ written first
test  AC-3 forfeit_lt_7d_or_noshow       ✓ written first
impl  refunds table + cancel flow        agent
impl  stripe.refunds.create(deposit)     agent
note  0 policy rules invented; all trace to spec
pr    #142 reviewed-by: engineer         merged (flag off)
Tests-first. The provenance log ties every change back to an acceptance criterion.
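AC-7 asks for at most one refund per booking. One way the refund engine could honour that is sketched below, using an in-memory ledger for brevity. The `RefundLedger` class and the `send_refund` callable are hypothetical stand-ins. A real build would more likely rely on a uniqueness constraint in the refunds table, and possibly on the payment provider's own idempotency mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RefundLedger:
    """Issues at most one refund per booking (illustrative, in-memory only)."""

    # Hypothetical gateway: takes (booking_id, amount_bani), returns a refund id.
    send_refund: Callable[[str, int], str]
    _issued: Dict[str, str] = field(default_factory=dict)

    def refund_once(self, booking_id: str, amount_bani: int) -> str:
        # A repeated request returns the original refund id and does not
        # reach the gateway a second time.
        if booking_id in self._issued:
            return self._issued[booking_id]
        refund_id = self.send_refund(booking_id, amount_bani)
        self._issued[booking_id] = refund_id
        return refund_id
```

A test written first for AC-7 would call `refund_once` twice for the same booking and assert that the gateway was hit exactly once.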
Runtime Guardrails
audit pack
The guardrails enforce the contract at runtime, and produce the evidence a regulator could check.
Blocks personal data before any model call. Tier math sees dates and amounts only.
Holds Stripe on an allowlist and logs every refund with actor, amount and reason.
Runs the flag in shadow mode first, then turns it live.
Generates the audit pack each release: every refund the system decided, and why.
guard  PII to LLM ........ 0 events (dates + amounts only)
guard  stripe calls ...... allowlist: refunds.create only
log    refund #142-03 .... 50% · 7–13d tier · actor: system
flag   refunds.enabled ... shadow → live (engineer)
out    audit-pack-2026-06  every refund + reason, signed
Generated, not authored. Near-zero cost per release.
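The first two guards can be pictured as plain allowlists. The sketch below is one possible shape, not the actual guardrail configuration: the field names in `MODEL_SAFE_FIELDS` and both helper functions are assumptions made for illustration. It drops every booking field except dates and amounts before a model call, and rejects any Stripe operation other than `refunds.create`.

```python
from typing import Any, Dict

# Hypothetical field names: only dates and amounts may reach the model (AC-8).
MODEL_SAFE_FIELDS = frozenset({"check_in", "cancelled_on", "deposit_bani"})

# The only Stripe operation the refund feature is allowed to invoke.
STRIPE_ALLOWLIST = frozenset({"refunds.create"})


class GuardrailViolation(Exception):
    """Raised when a call falls outside an allowlist."""


def model_safe(booking: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with only allowlisted fields; names, emails and the like are dropped."""
    return {key: value for key, value in booking.items() if key in MODEL_SAFE_FIELDS}


def require_allowed(operation: str) -> None:
    """Raise unless the Stripe operation is on the allowlist."""
    if operation not in STRIPE_ALLOWLIST:
        raise GuardrailViolation(f"stripe call not allowlisted: {operation}")
```

Allowlisting what may pass, instead of listing what must be blocked, means a newly added booking field stays out of the model call until someone deliberately admits it.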
Outcome Telemetry
variance brief
Telemetry measures the feature against the exact numbers L1 agreed. The model is a loop, not a line.
Streams the L1 KPIs to one dashboard: resolution rate, time-to-refund, dispute rate, revenue retained.
Runs the weekly variance review against the baseline with the owner.
Decides: hold, adjust a tier, or expand. The loop closes and feeds the next L1.
kpi                   baseline   target    cadence
self-serve refunds    0%         > 80%     weekly
time-to-refund        manual     < 60s     weekly
deposit disputes      n/a        trend ↓   weekly
revenue retained      baseline   hold      weekly
Numbers are illustrative of what gets tracked. Targets are set at L1 and measured against a real baseline.
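The mechanical half of the weekly variance review is a comparison of observed values against the agreed targets. A minimal sketch follows, assuming each numeric KPI has a single threshold and a direction. The `Target` type, the `variance_brief` function and the KPI keys are hypothetical. The decision to hold, adjust or expand stays with the owner; the code only flags which targets were met.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Target:
    kpi: str
    threshold: float
    higher_is_better: bool


def variance_brief(
    targets: List[Target], observed: Dict[str, float]
) -> List[Tuple[str, float, float, bool]]:
    """Return (kpi, observed, threshold, met) for each target, in order."""
    rows = []
    for target in targets:
        value = observed[target.kpi]
        if target.higher_is_better:
            met = value > target.threshold
        else:
            met = value < target.threshold
        rows.append((target.kpi, value, target.threshold, met))
    return rows


# Thresholds mirror the illustrative table: > 80% self-serve, < 60s to refund.
TARGETS = [
    Target("self_serve_rate", 0.80, higher_is_better=True),
    Target("time_to_refund_s", 60.0, higher_is_better=False),
]
```

Trend-based KPIs such as deposit disputes need a baseline series and are left out of this sketch.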
That was the straightforward case. See the same model on an ambiguous request.
the point
The cost of change is lowest before the first line of code.
Two requests, one model. A clean ask runs almost straight. An ambiguous one changes its mind four times, and the layers absorb it for the cost of a conversation, not a rebuild. That is the point.
Book the diagnostic
Paid, fixed-price, walk-away-friendly. You keep the starter kit either way.