The Bulletproof AI Dev Team

A field-tested operating system for running autonomous coding agents, without becoming the bottleneck.

Free playbook · Powered by Archetype


Who this is for

You run (or want to run) AI coding agents (Claude Code, Codex CLI, or both), and you've noticed the gap nobody talks about: getting an agent to write code is easy; trusting a team of them to work unattended is not. This playbook is the operating model behind a real, running system: a multi-agent dev team that processes queued work autonomously on a solo founder's machine, with every safety property enforced by code rather than by hoping the model behaves.

Everything here is running in production and covered by a deterministic test suite. Nothing here is a thought experiment.


Part 1: Why agent teams actually fail

The public failures of autonomous coding agents all share a shape. Not one of them was the model being insufficiently smart. Every one was an operational seam the system wasn't built to notice about itself:

And the quiet failure mode, the one that doesn't make headlines: an agent marks a task "done" and it isn't. No incident report gets written; the rot just accumulates until you stop trusting the system, at which point you're reviewing everything by hand again and the "autonomous" team has negative ROI.

Hold onto that list. Every section that follows exists to close one of those seams.

Part 2: The only distinction that matters: hard gates vs. soft gates

Sort every safety control you have into two piles:

The pattern across every incident above and every clean long-duration run on record (including OpenAI's 25-hour, 13-million-token unattended Codex run, which completed safely inside a hard sandbox): failures walk through soft gates; hard gates hold.

The rule that falls out of this:

Anything irreversible or externally visible sits behind a hard gate. Soft gates are for quality, never for safety.

Deletes, force-pushes, production credentials, payments, public posts, merges to a protected branch: infra-enforced, no exceptions. Code style, routing decisions, review depth: the model's judgment is fine.

A corollary worth tattooing somewhere: a code freeze that lives in a prompt is not a freeze. If you need agents to stop touching something, revoke the credential or lock the branch.

Part 3: Roles with real boundaries

A team of agents that can each do everything is one agent with extra steps, and extra failure modes. Give every agent a profile: what it may do, and (more important) what it may never do. Ours:

| Profile | Purpose | Sandbox | Implements | Reviews | Approves | Merges |
|---|---|---|---|---|---|---|
| implementer | Scoped changes on a non-main branch | write | | | | |
| qa_reviewer | Fresh-context independent QA | read-only | | | | |
| test_engineer | Deterministic test coverage | write | | | | |
| architect | Architecture/boundary review | read-only | | | | |
| appsec_reviewer | Security-sensitive review lane | read-only | | | | |
| devops_release_reviewer | CI/release/settings review | read-only | | | | |
| docs_handoff | Low-risk docs work | write | | | | |

Three invariants make the table work:

Route work by risk, not by convenience: normal code goes implementer → QA; anything touching auth, secrets, CI, payments, external sends, or public surfaces automatically adds the specialist lane. The trigger list is written down in a registry file, not remembered.
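
One way to implement the risk-based routing described above (an added example, not the author's registry or code). The lane names come from the table; the glob patterns, the category-to-lane mapping, and the function names are constructed assumptions.

```python
import fnmatch
import posixpath

# Constructed registry (assumption): trigger category, lowercase globs, lane.
REGISTRY = [
    ("auth", ["src/auth/*", "*/login*"], "appsec_reviewer"),
    ("secrets", ["*.env", "*.pem", "config/secrets/*"], "appsec_reviewer"),
    ("payments", ["src/billing/*"], "appsec_reviewer"),
    ("ci", [".github/workflows/*", "dockerfile"], "devops_release_reviewer"),
]
BASE_LANES = ["implementer", "qa_reviewer"]  # normal code: implementer -> QA


def normalize(path):
    """Canonicalise a changed path so `./x/../` segments, backslashes or
    mixed case cannot be used to slip past a trigger glob."""
    clean = posixpath.normpath(path.replace("\\", "/")).lower()
    if clean == ".." or clean.startswith(("../", "/")):
        raise ValueError(f"path escapes the repository: {path!r}")
    return clean


def route(changed_paths):
    """Return (ordered lanes, {lane: trigger categories that added it})."""
    lanes, reasons = list(BASE_LANES), {}
    for raw in changed_paths:
        path = normalize(raw)
        for category, globs, lane in REGISTRY:
            if any(fnmatch.fnmatchcase(path, g) for g in globs):
                reasons.setdefault(lane, set()).add(category)
                if lane not in lanes:
                    lanes.append(lane)
    return lanes, reasons
```

The returned reasons map records which trigger added each specialist lane, so the routing decision can be stored with the job and audited later.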

Part 4: Never trust "done"

The single highest-leverage line of code in our whole system re-runs the checks.

When an agent reports a job complete, the runner, a plain Python process the agent doesn't control, independently re-executes the job's required checks (test suite, validators, linters). If a check fails, the job is marked failed no matter what the agent said. The agent's self-report is treated as a claim, and the claim gets audited, every time, mechanically.

Two companion rules:

Part 5: The autonomy loop

Unattended operation needs a boring, deterministic outer loop. Ours is a file-based queue (inbox/ → running/ → outbox|blocked|failed/) processed by a single-purpose runner. The contract that makes it bulletproof:

The result is a specific texture of autonomy: silence while things work, one sharp signal the moment they don't.
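
The audit step from Part 4 and the queue flow from Part 5, combined in one added example (not the author's runner). The job schema (required_checks, agent_claim, failed_checks), the "done" claim string, and parking jobs that list no checks in blocked/ are assumptions.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

def run_checks(checks, cwd, timeout=600):
    """Re-execute each required check; return the argv lists that failed."""
    failed = []
    for argv in checks:
        try:
            ok = subprocess.run(argv, cwd=cwd, timeout=timeout,
                                capture_output=True).returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False  # a check that cannot run counts as a failure
        if not ok:
            failed.append(argv)
    return failed

def process_one(root, agent, workdir):
    """Claim the oldest job, run the agent, audit its claim, file the result."""
    jobs = sorted((root / "inbox").glob("*.json"))
    if not jobs:
        return None
    running = root / "running" / jobs[0].name
    jobs[0].rename(running)  # atomic claim when both dirs share a filesystem
    job = json.loads(running.read_text())
    checks = job.get("required_checks") or []
    if not checks:
        state = "blocked"  # assumption: nothing to audit against, so park it
    else:
        job["agent_claim"] = agent(job, workdir)  # recorded, never trusted
        job["failed_checks"] = run_checks(checks, workdir)
        passed = job["agent_claim"] == "done" and not job["failed_checks"]
        state = "outbox" if passed else "failed"
    running.write_text(json.dumps(job, indent=2))
    running.rename(root / state / running.name)
    return state

def demo():
    """Constructed jobs: the agent claims "done" on both, but one check fails."""
    root = Path(tempfile.mkdtemp())
    for state in ("inbox", "running", "outbox", "blocked", "failed"):
        (root / state).mkdir()
    for name, code in (("001-ok", 0), ("002-bad", 1)):
        check = [sys.executable, "-c", f"raise SystemExit({code})"]
        (root / "inbox" / f"{name}.json").write_text(
            json.dumps({"required_checks": [check]}))
    agent = lambda job, workdir: "done"  # always reports success
    return [process_one(root, agent, root) for _ in range(3)]
```

The job file itself carries the audit trail: whichever directory it lands in, it records both what the agent claimed and which checks the runner saw fail.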

Part 6: Cost discipline: script what's deterministic

Anthropic's own data puts a single agent at roughly 4× the tokens of a chat session, and a multi-agent system at ~15×. Run a multi-agent flow on everything and you've bought a 15× multiplier on tasks that mostly didn't need it. Three rules keep the bill sane:

Part 7: Escalation design: when the human gets pinged

Autonomy fails in both directions; the system that asks about everything is as useless as the one that asks about nothing. Write the escalation triggers down. Ours: a genuine direction/tradeoff decision; anything irreversible or with public, private-data, cost, legal, or identity impact; reviewers deadlocked; a blocker surviving specialist review; or an explicit human-only approval gate downstream.

Everything else (routine routing, QA loops, fix cycles, evidence checking) proceeds without asking. And one non-obvious human-side rule from the automation literature: after a long error-free stretch, your reviews get lazy ("learned carelessness"). Randomize spot-checks; don't rely on a fixed cadence you'll sleepwalk through.

Part 8: Prove it, don't claim it

Here's the industry's inconvenient secret, and this playbook's differentiator: no published study yet measures sustained multi-day, multi-agent, unattended reliability on real workloads. Capability benchmarks measure single tasks. Vendor risk statistics come from vendors. Anyone who tells you their agent team is "safe" without showing you the test suite is making a claim past the edge of existing data.

So don't claim; instrument:

Run the suite before every change to the harness. The first time it catches a real regression, it has paid for itself forever.


The checklist


Want this running on your machine?

This playbook tells you what to build. If you'd rather not spend the weeks building and debugging it:

Plug-and-Play: the complete kit (role profiles, registry templates, the hardened queue runner pattern, validators, and the test suite), adapted to your stack, with setup instructions.

Done-For-You: I set the whole operating system up on your repos, tuned to your risk profile, and hand it over running: green test suite, live end-to-end proof, and an escalation design you actually control.



Built by Sebastien Poulet. I have the kind of brain that traditional productivity systems punish, so I built an operating layer where the follow-through is structural instead of willpower. This system runs my own company's development unattended; everything above is the tested shape of what survived. Powered by Archetype.