a worked example

Where a request is actually decided.

Two real features, two very different shapes through the same five layers. Most teams bolt AI onto the typing step. The work that decides whether either ships well happens long before the typing.

the request · the straight line

Let people cancel a booking and get their deposit back.

This looks like one Stripe call. Watch what it actually contains, and who carries each layer.

the cast

Five roles. Watch which ones carry each layer.

Judgment lives at the front, machines at the back. Each action below is tagged with who performs it, so you can track the driver as the request moves through the model.

Owner

Holds business intent and the rulings only they can make.

Delivery Lead

Facilitates alignment, owns the context and the spec.

AI Agent

Claude, orchestrated at M3. Interrogates, drafts, builds, writes.

Engineer

Reviews every PR, approves, configures the guardrails.

Platform

Automated. Scans, allowlists, logs, generates the audit pack, measures.

L1

Engagement Context

/context

Before any code, the request is interrogated until the hidden work is on the table. Eight decisions surface. Not one is a coding problem.

Delivery Lead

Sits with the owner and turns “give the deposit back” into questions.

AI Agent

Reads the codebase and RO / EU consumer law, surfaces eight hidden decisions and the constraints behind them.

Owner

States the business intent and the numbers that matter: revenue retained, no-show rate.

Delivery Lead

Writes the /context doc: intent, stakeholders, KPIs, constraints, seven open questions.

Owner

Confirms it reflects reality, and the engagement proceeds.

/contextL1 artifact
  • Intent · Fair, self-serve cancellation that protects revenue and ends the manual phone-and-delete workflow.
  • Constraint · RO OUG 34/2014 exempts fixed-date stays from the 14-day withdrawal right. Non-refundable is legal; tiers are a goodwill choice.
  • KPIs · Self-serve resolution, time-to-refund, dispute rate, revenue retained, refund accuracy.

Seven open questions routed to L2. Nothing here is code.

L2

Spec Engineering

spec + PR

Every ambiguity becomes a decided, testable line, reviewed and merged before execution. Once locked, there is nothing left for the build to get wrong.

Delivery Lead

Proposes a ruling for each open question: the tier schedule, the fee, force majeure.

AI Agent

Drafts the spec as eight testable acceptance criteria and flags two contradictions to resolve.

Owner

Makes the business calls: the tiers, the event-cabin band, who absorbs the Stripe fee.

Engineer

Reviews the spec PR for technical soundness and missing edge cases.

Owner

Signs off with finance and legal. Merging the PR locks the rulings.

spec + PRL2 artifact
AC-1  ≥14d  → refund 100% of deposit, status cancelled
AC-2  7–13d → refund exactly 50% (rounded to bani)
AC-3  <7d / no-show → forfeit, recorded with reason
AC-4  reschedule presented before any refund
AC-5  force majeure → reschedule, else 100% any lead time
AC-6  Stripe fee not deducted from guest; on business
AC-7  idempotent: at most one refund per booking
AC-8  no personal data reaches the LLM

Owner, finance, legal and ops sign off. Merging the PR locks them. The signatures are the deliverable.

L3

Agentic Execution

provenance log

With the contract locked, execution is the short part. The agent builds against the spec and the human reviews the trail.

AI Agent

Writes the tests first, one per acceptance criterion, orchestrated at M3.

AI Agent

Implements the refund engine, the reschedule flow and the new refunds table to pass them.

Engineer

Reviews the PR. Every agent action is in the provenance log; no rule was invented.

Engineer

Approves and merges behind the refunds.enabled flag.

provenance logL3 artifact
test  AC-1 refund_full_ge_14d        ✓ written first
test  AC-2 refund_half_7_13d         ✓ written first
test  AC-3 forfeit_lt_7d_or_noshow   ✓ written first
impl  refunds table + cancel flow    agent
impl  stripe.refunds.create(deposit) agent
note  0 policy rules invented; all trace to spec
pr    #142  reviewed-by: engineer  merged (flag off)

Tests-first. The provenance log ties every change back to an acceptance criterion.

L4

Runtime Guardrails

audit pack

The guardrails enforce the contract at runtime, and produce the evidence a regulator could check.

Platform

Blocks personal data before any model call. Tier math sees dates and amounts only.

Platform

Holds Stripe on an allowlist and logs every refund with actor, amount and reason.

Engineer

Runs the flag in shadow mode first, then turns it live.

Platform

Generates the audit pack each release: every refund the system decided, and why.

audit packL4 artifact
guard  PII to LLM ........ 0 events (dates + amounts only)
guard  stripe calls ...... allowlist: refunds.create only
log    refund #142-03 .... 50% · 7–13d tier · actor: system
flag   refunds.enabled ... shadow → live (engineer)
out    audit-pack-2026-06  every refund + reason, signed

Generated, not authored. Near-zero cost per release.

L5

Outcome Telemetry

variance brief

Telemetry measures the feature against the exact numbers L1 agreed. The model is a loop, not a line.

Platform

Streams the L1 KPIs to one dashboard: resolution rate, time-to-refund, dispute rate, revenue retained.

Delivery Lead

Runs the weekly variance review against the baseline with the owner.

Owner

Decides: hold, adjust a tier, or expand. The loop closes and feeds the next L1.

variance briefL5 artifact
kpi                  baseline   target     cadence
self-serve refunds     0%        > 80%      weekly
time-to-refund        manual     < 60s      weekly
deposit disputes       n/a       trend ↓    weekly
revenue retained     baseline    hold       weekly

Numbers are illustrative of what gets tracked. Targets are set at L1 and measured against a real baseline.

That was the straightforward case. See the same model on an ambiguous request.

the point

The cost of change is lowest before the first line of code.

Two requests, one model. A clean ask runs almost straight. An ambiguous one changes its mind four times, and the layers absorb it for the cost of a conversation, not a rebuild. That is the point.

Book the diagnostic

Paid, fixed-price, walk-away-friendly. You keep the starter kit either way.