ClientsFlow · Skill guide · 2026-06-23

bugfix-newfeature-qa-ultra

The autonomous bug-fix / feature factory — how to use it, and how to iterate on it. Scrambled input → EBO → one human sign-off → built + merged + QA-proven on a safe branch → PENDING_REVIEW.

🟢 the 2 human moments · 🔵 automatic phase · 🟡 a mechanical guard · 🟣 a place you iterate the skill

When to fire it

Hand it scrambled input and let it run. Two human touches: sign the EBO, accept the result.

Fire it when… | What you give it | Don't, when…
A bug-report chat went long and you want it driven to a merged, QA-proven fix | The chat transcript (.jsonl) | It's a trivial one-file edit; just do it
You have a rambling feature idea and want it built + QA'd hands-off | The ramble (--text / chat) | You want to plan ONE fuzzy feature from scratch → use plan-orchestrate (the sequential sibling)
Several scoped fixes need isolating, merging + live-QA-proving as one build | The asks (it creates + merges its own branches) | You want to merge someone else's pre-built branches: there is no such mode (single-purpose)

The one upfront human moment: the EBO sign-off (BLOCKING)

The Expected-Behavior Oracle is the answer key the whole run is judged against. You sign it once, at the start. Until you do, nothing builds.

What you actually do: open the EBO (in EBO.md or its Notion page), fix anything wrong, add anything missing, delete rows that don't apply, set Status → 🟢 Signed off, and reply done. That reply is your signature — compile_ebo.py mechanically refuses (exit 7) to compile any QA slice until ebo.signed exists, so no builder can ever go green against an un-reviewed oracle. The skill never self-signs.
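The gate described above can be sketched in a few lines. This is a minimal illustration, not the actual compile_ebo.py: only the marker name ebo.signed and exit code 7 come from this guide, while the function name gate_exit_code and the idea that the marker sits in a per-run directory are assumptions.

```python
import tempfile
from pathlib import Path

EXIT_OK = 0
EXIT_UNSIGNED = 7  # exit code this guide attributes to compile_ebo.py


def gate_exit_code(run_dir):
    """Return 0 when the sign-off marker exists, otherwise 7.

    Hypothetical helper: the marker location (inside run_dir) is assumed.
    """
    marker = Path(run_dir) / "ebo.signed"
    return EXIT_OK if marker.exists() else EXIT_UNSIGNED


# Demo in a throwaway directory: blocked first, allowed once the marker exists.
with tempfile.TemporaryDirectory() as tmp:
    before = gate_exit_code(tmp)
    (Path(tmp) / "ebo.signed").touch()
    after = gate_exit_code(tmp)

print(before, after)
```

Because the check is a plain file-existence test, no builder or QA agent can talk its way past it; only the human reply that creates the marker opens the gate.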

What runs between your two touches — the 9 phases

🔵 0–2 · Intake → EBO → Confirm
transcript → intent.json → /user-journeys EBO → 🟢 sign
Parses type:"user" AND queued_command (the follow-ups a naive parser drops). Authors the oracle via /user-journeys, mirrors to Notion, blocks on your sign-off.
🔵 3 · Decompose
EBO rows → tasks → pick model → compile slices
God-module (flows.py/dash.py) touchers serialise; file-disjoint work fans out. Each slice is read-only and never guesses a probe path.
🔵 4 · Fan-out
builder (worktree+TDD) → Sonnet QA twin → return + qa
One builder per task on its own branch; a Sonnet QA twin (no shared context) runs visual-qa-ultra on the slice + audits each test for "theater". board-card-qa during, visual-qa-ultra after.
🔵 5–7 · Converge → Final QA → Fix loop
bundle backup → --no-ff merge → whole-EBO QA → bug→fix→fresh QA
Sequential merges, suite green after each (red ⇒ STOP, never ship red). Final QA over the whole EBO catches cross-feature bugs; each fix is re-verified in a fresh QA context.

Caps: ≤3 concurrent builders · ≤4 product-fix attempts/row · ≤3 whole-build cycles · ≤5 question rounds/agent.
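One way to picture the Decompose rule together with the builder cap is the sketch below. It is an assumption-laden illustration rather than the real group_tasks.py: the plan function, the task names, and the greedy wave-packing strategy are all hypothetical; only the god-module file names and the cap of 3 concurrent builders come from this guide.

```python
from typing import Dict, List, Set, Tuple

GOD_MODULES = {"flows.py", "dash.py"}  # named in this guide
MAX_CONCURRENT = 3                     # the "≤3 concurrent builders" cap


def plan(tasks: Dict[str, Set[str]]) -> Tuple[List[str], List[List[str]]]:
    """Split tasks into a serial lane and parallel waves (hypothetical helper).

    Tasks touching a god module run one at a time. The rest are packed
    greedily into waves whose members share no files and which never
    exceed MAX_CONCURRENT tasks.
    """
    serial = [name for name, files in tasks.items() if files & GOD_MODULES]
    waves: List[List[str]] = []
    for name, files in tasks.items():
        if name in serial:
            continue
        for wave in waves:
            disjoint = all(files.isdisjoint(tasks[other]) for other in wave)
            if len(wave) < MAX_CONCURRENT and disjoint:
                wave.append(name)
                break
        else:
            waves.append([name])  # no existing wave could take this task
    return serial, waves


tasks = {
    "t1": {"flows.py", "a.py"},
    "t2": {"b.py"},
    "t3": {"c.py"},
    "t4": {"b.py", "d.py"},
    "t5": {"dash.py"},
    "t6": {"e.py"},
    "t7": {"f.py"},
}
serial, waves = plan(tasks)
print(serial)  # ['t1', 't5']
print(waves)   # [['t2', 't3', 't6'], ['t4', 't7']]
```

Here t4 is pushed to a second wave because it shares b.py with t2, and t7 joins it because the first wave is already at the cap.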

The heartbeat (long runs)

On a long run the orchestrator sets a 30-minute heartbeat (ScheduleWakeup(1800s)). On each beat it appends a checkpoint line and resumes from run_state.json if anything stalled. A background wait that times out while work is still running is not a stop — it re-issues. You can ignore the run until it pings you for the EBO sign-off (start) or the accept (end). Check progress any time with init_run.py status --run-id <id>.
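The checkpoint-and-resume step on each beat could look roughly like the following. This is a sketch under stated assumptions: the on_beat function, the checkpoints.log file name, and the "phase" key inside run_state.json are hypothetical, and the actual wake-up scheduling (ScheduleWakeup) is left out entirely.

```python
import json
import tempfile
import time
from pathlib import Path

HEARTBEAT_SECONDS = 1800  # the 30-minute interval from this guide


def on_beat(run_dir, now=None):
    """Append one checkpoint line and return the phase to resume from.

    Hypothetical helper: assumes run_state.json holds a "phase" key.
    """
    run_dir = Path(run_dir)
    state_path = run_dir / "run_state.json"
    state = json.loads(state_path.read_text()) if state_path.exists() else {"phase": 0}
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    with (run_dir / "checkpoints.log").open("a") as log:
        log.write(f"{stamp} phase={state['phase']}\n")
    return state["phase"]


# Demo: two beats, 30 minutes apart, against a saved state at phase 4.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_state.json").write_text(json.dumps({"phase": 4}))
    resumed = on_beat(tmp, now=0)
    on_beat(tmp, now=HEARTBEAT_SECONDS)
    lines = (Path(tmp) / "checkpoints.log").read_text().splitlines()

print(resumed)
print(lines)
```

Keeping the log append-only means a stalled run can be diagnosed after the fact from the gaps between checkpoint lines.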

The second human moment — reading the final report & accepting

The run ends at PENDING_REVIEW — it will not push until you accept.

In the report, look for… | What it means
The verdict table (behavior → PASS / STATE_MISSING / BLOCKED → the one deciding fact) | Every EBO row + the single observation that settled it. PASS only on a LIVE frame + a real state delta.
run-trust = scenario_confidence × oracle_coverage | Trust in the asserted invariants, never "the system works". PASSED_WITH_GAPS whenever coverage < 100%.
residual human TODOs (bucket-d rows) | The genuinely-can't-automate residue (e.g. a real Stripe LIVE charge) for you to finish by hand.
probe-path TODOs in a slice | A state row the compiler refused to guess a path for; fill it to deepen the oracle (see "iterate" below).
Accepting: when the report is clean, the skill announces the integration + feature branches and pushes only then — and main stays an explicit trigger. Nothing reaches a remote before you say so.
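The run-trust figure in the report is a simple product, which a short sketch makes concrete. The function names are hypothetical, and treating both inputs as fractions between 0 and 1 is an assumption; this guide only gives the formula and the PASSED_WITH_GAPS rule.

```python
def run_trust(scenario_confidence, oracle_coverage):
    """run-trust = scenario_confidence × oracle_coverage (inputs assumed in [0, 1])."""
    for name, value in (("scenario_confidence", scenario_confidence),
                        ("oracle_coverage", oracle_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return scenario_confidence * oracle_coverage


def gap_label(oracle_coverage):
    """Return "PASSED_WITH_GAPS" whenever coverage is below 100%, else None.

    The guide does not name a label for full coverage, so None is used here.
    """
    return "PASSED_WITH_GAPS" if oracle_coverage < 1.0 else None


print(run_trust(0.75, 0.5))  # 0.375
print(gap_label(0.5))        # PASSED_WITH_GAPS
print(gap_label(1.0))        # None
```

Note how a confident run over a thin oracle still scores low: high scenario confidence cannot compensate for rows the oracle never covered.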

The mechanical guards (the locked Q1–Q10 decisions) 🟡

Guard | What it prevents | Where it lives
Never-guess-path compiler (Q2) | False greens: a test passing while the data under it is wrong | compile_ebo.py (known-path allowlist + _todo)
BLOCKING sign-off gate (Q7) | Building against an un-reviewed oracle | compile_ebo.py exit 7 until ebo.signed
QA twin can't self-sign + audits tests (Q4) | "Test theater" (null/exists/assert True) going green | QA twin wrapper (always Sonnet, no shared context)
Serialise god-module touchers (Q5) | Two builders trashing flows.py/dash.py in parallel | group_tasks.py
Bundle backup + suite-after-each + STOP-on-red (Q9) | Shipping a red merge; an unrecoverable merge | converge recipe (wrapper §E)
No push before accept (Q10) | A remote getting an unreviewed build | run_state.pushed, Phase 8
AUTOSEND on · sentinel-gate · send-blocklist · warmup-redact | Emailing a real lead; flipping a live switch; leaking warmup mail | safety invariants, every phase
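The never-guess-path guard (Q2) comes down to an allowlist lookup that emits a TODO instead of inventing a path. The sketch below illustrates that idea only: compile_row, the output dictionary shape, and the example path lead.flags.neg_reply are hypothetical; the _todo key and the neg_reply field name are taken from this guide.

```python
# Hypothetical allowlist: state field -> probe path already verified by a human.
KNOWN_PATHS = {"neg_reply": "lead.flags.neg_reply"}


def compile_row(field, expected, known_paths=KNOWN_PATHS):
    """Compile one state row, refusing to guess a probe path.

    A known field gets its allowlisted path. An unknown field gets a
    _todo entry, so the row cannot silently go green.
    """
    row = {"field": field, "expected": expected}
    path = known_paths.get(field)
    if path is None:
        row["_todo"] = "probe path unknown; fill in by hand"
    else:
        row["probe_path"] = path
    return row


print(compile_row("neg_reply", True))
print(compile_row("card_is_booked", True))
```

The second call returns a row with _todo and no probe_path, which is the kind of probe-path TODO that surfaces in the final report.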

How to iterate on the skill itself 🟣

Every seam is a deterministic script with a pinned self-test. Change the constant, add a self-test assertion, re-run python3 scripts/selftest.py.

You want to… | Do this
Add a bucket-(c) state-setup helper (a precondition with no UI, like neg_reply=true) | state_setup.py scaffold --spec precond.json → fill the TODO click path (verify each selector live; never guess) → wire it into the QA twin's run before the live drive
Tune the Sonnet→Opus escalation thresholds | Edit the trigger constants in pick_model.py decide() (≥4 files / ≥120 LOC / ≥5 rows / risk flags); the self-test cases pin current behavior
Stop the intake parser missing a new kind of ask, or leaking a secret | Add to intake_extract.py SKILL_DUMP_PREFIXES / NOISE_TAG_RE / REDACTIONS + a self-test assertion so the miss can't recur
Make a new state field count as an "obvious" probe path | Add it to compile_ebo.py DEFAULT_KNOWN_PATHS (or pass --known-paths); this is what makes "obvious path" mechanical
Improve file resolution for decomposition | Supply --surface-map from graphify path for exact resolution, or extend group_tasks.py GOD_MODULES
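As an illustration of the "change the constant, pin it with a self-test" loop, here is a sketch of the escalation rule from the table above. It is not the real pick_model.py: the constant names, the decide signature, the lowercase model labels, and the assumption that any single trigger is enough to escalate are all hypothetical; only the threshold values come from this guide.

```python
MIN_FILES = 4   # ≥4 files
MIN_LOC = 120   # ≥120 LOC
MIN_ROWS = 5    # ≥5 EBO rows


def decide(files, loc, rows, risk_flags=()):
    """Pick a builder model; escalate when any trigger fires (assumed OR logic)."""
    escalate = (
        files >= MIN_FILES
        or loc >= MIN_LOC
        or rows >= MIN_ROWS
        or bool(risk_flags)
    )
    return "opus" if escalate else "sonnet"


# Pinned self-test cases: editing a constant above must be matched by
# updating these, which keeps every threshold change deliberate.
assert decide(files=3, loc=119, rows=4) == "sonnet"
assert decide(files=4, loc=0, rows=0) == "opus"
assert decide(files=1, loc=120, rows=1) == "opus"
assert decide(files=1, loc=10, rows=5) == "opus"
assert decide(files=1, loc=10, rows=1, risk_flags=("touches-billing",)) == "opus"
print("self-test ok")
```

The boundary cases (3 vs 4 files, 119 vs 120 LOC) are the ones worth pinning, since an off-by-one there changes which model builds the task.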

Proof it works (last validation run)

Check | Result
Booked-card transcript replay (intake→EBO→decompose→compile→builder/QA contracts→PENDING_REVIEW) | 22/22 PASS: extracted the queued V1/V2, classified the row (b)+(c), gate refused unsigned, never-guess TODO on the prose row, no AUTOSEND flip / no real deals / no push
Converge machinery vs the 4 existing branches (bundle backup → SHA-pin → sequential --no-ff → suite-after-each → no push) | 12/12 PASS: real merges, suite green (392→407) after each merged state, isolated worktree cleaned up
Offline self-tests (scripts/selftest.py) | ALL PASS: 6 unit + 7 end-to-end (both guards)