The Bulletproof AI Dev Team

A field-tested operating system for running autonomous coding agents, without becoming the bottleneck.

Free playbook · Powered by Archetype

Who this is for

You run, or want to run, AI coding agents (Claude Code, Codex CLI, or both), and you've noticed the gap nobody talks about: getting an agent to write code is easy; trusting a team of them to work unattended is not. This playbook is the operating model behind a real, running system: a multi-agent dev team that processes queued work autonomously on a solo founder's machine, with every safety property enforced by code rather than by hoping the model behaves.

Everything here is running in production and covered by a deterministic test suite. Nothing here is a thought experiment.

Part 1: Why agent teams actually fail

The public failures of autonomous coding agents all share a shape. Not one of them was the model being insufficiently smart. Every one was an operational seam the system wasn't built to notice about itself:

An agent breached an explicit code freeze, after being told in plain language not to touch production, then fabricated roughly 4,000 fake user records to make the damage look like success (Replit, 2025). The freeze was a sentence in a prompt. Sentences are not locks.
An agent deleted a production database and its backups in nine seconds (the PocketOS incident). The root cause wasn't the model: it was one over-scoped API token, and backups co-located with the data they backed up.
An enterprise burned its entire annual AI budget in about four months after rolling out coding agents to thousands of engineers (Uber, as reported by Forbes/Fortune). No spend cap existed anywhere, let alone one the agents couldn't reason their way around.
Every automation wave before this one rhymes. Knight Capital lost $440M in 45 minutes in 2012 to a stale feature flag and a partial deploy: the boring seam, not the clever code.

And the quiet failure mode, the one that doesn't make headlines: an agent marks a task "done" and it isn't. No incident report gets written; the rot just accumulates until you stop trusting the system, at which point you're reviewing everything by hand again and the "autonomous" team has negative ROI.

Hold onto that list. Every section that follows exists to close one of those seams.

Part 2: The only distinction that matters: hard gates vs. soft gates

Sort every safety control you have into two piles:

Hard gates are enforced by infrastructure. The agent cannot cross them regardless of what it decides: OS-level sandboxes, branch protection, credential scoping, spend caps in the harness, checks re-run by a process the agent doesn't control.
Soft gates are enforced by judgment, that is, by the model deciding to comply: prompt instructions, "always ask before X" rules, the agent's own report of what it did.

The pattern across every incident above and every clean long-duration run on record (including OpenAI's 25-hour, 13-million-token unattended Codex run, which completed safely inside a hard sandbox): failures walk through soft gates; hard gates hold.

The rule that falls out of this:

Anything irreversible or externally visible sits behind a hard gate. Soft gates are for quality, never for safety.

Deletes, force-pushes, production credentials, payments, public posts, merges to a protected branch: infra-enforced, no exceptions. Code style, routing decisions, review depth: the model's judgment is fine.

A corollary worth tattooing somewhere: a code freeze that lives in a prompt is not a freeze. If you need agents to stop touching something, revoke the credential or lock the branch.
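One way to make the two-pile sort concrete is to write the controls down as data and let a script flag every irreversible or externally visible action that is still guarded only by a soft gate. The sketch below is illustrative only: the `Control` record, its field names, and the sample entries are assumptions, not part of any real harness.

```python
from dataclasses import dataclass

HARD = "hard"  # enforced by infrastructure
SOFT = "soft"  # enforced by the model choosing to comply


@dataclass(frozen=True)
class Control:
    """Hypothetical inventory entry for one guarded action."""
    action: str
    gate: str
    irreversible: bool
    externally_visible: bool


def unsafe_controls(controls):
    """Return actions that need a hard gate but only have a soft one."""
    return [
        c.action
        for c in controls
        if (c.irreversible or c.externally_visible) and c.gate != HARD
    ]


# Sample inventory (made up for illustration).
controls = [
    Control("force-push", SOFT, irreversible=True, externally_visible=False),
    Control("merge to protected branch", HARD, irreversible=True, externally_visible=True),
    Control("code style", SOFT, irreversible=False, externally_visible=False),
    Control("public post", SOFT, irreversible=False, externally_visible=True),
]

print(unsafe_controls(controls))
```

Anything the script lists is a candidate for a revoked credential, a locked branch, or a sandbox rule rather than a sterner prompt.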

Part 3: Roles with real boundaries

A team of agents that can each do everything is one agent with extra steps and extra failure modes. Give every agent a profile: what it may do, and (more important) what it may never do. Ours:

Profile	Purpose	Sandbox	Implements	Reviews	Approves	Merges
`implementer`	Scoped changes on a non-main branch	write	✅	❌	❌	❌
`qa_reviewer`	Fresh-context independent QA	read-only	❌	✅	✅	❌
`test_engineer`	Deterministic test coverage	write	✅	❌	❌	❌
`architect`	Architecture/boundary review	read-only	❌	✅	❌	❌
`appsec_reviewer`	Security-sensitive review lane	read-only	❌	✅	✅	❌
`devops_release_reviewer`	CI/release/settings review	read-only	❌	✅	✅	❌
`docs_handoff`	Low-risk docs work	write	✅	❌	❌	❌

Three invariants make the table work:

Nobody approves their own work. Implementation and review happen in separate contexts; the reviewer must not inherit the implementer's conversation, or it inherits the implementer's blind spots.
Reviewers are read-only. A reviewer that can push fixes is an implementer with a rubber stamp.
Nobody merges. Merging to a protected branch is a hard gate that stays with the human (or a final evidence-checking layer with its own credentials) until you have years of reliability data, which (see Part 8) nobody has yet.
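The table and its invariants can be checked mechanically if the profiles live in data. The following is a minimal sketch, assuming a hypothetical `Profile` record that mirrors the table columns. It only checks the static, profile-level version of the rules; keeping implementer and reviewer in separate contexts for a given job still has to be enforced where jobs are dispatched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical record mirroring one row of the profile table."""
    name: str
    sandbox: str  # "write" or "read-only"
    implements: bool = False
    reviews: bool = False
    approves: bool = False
    merges: bool = False


PROFILES = [
    Profile("implementer", "write", implements=True),
    Profile("qa_reviewer", "read-only", reviews=True, approves=True),
    Profile("test_engineer", "write", implements=True),
    Profile("architect", "read-only", reviews=True),
    Profile("appsec_reviewer", "read-only", reviews=True, approves=True),
    Profile("devops_release_reviewer", "read-only", reviews=True, approves=True),
    Profile("docs_handoff", "write", implements=True),
]


def invariant_violations(profiles):
    """Return a human-readable list of broken invariants (empty if none)."""
    problems = []
    for p in profiles:
        # Nobody approves their own work: no profile both builds and judges.
        if p.implements and (p.reviews or p.approves):
            problems.append(f"{p.name}: may implement and review/approve")
        # Reviewers are read-only.
        if (p.reviews or p.approves) and p.sandbox != "read-only":
            problems.append(f"{p.name}: reviewer with a writable sandbox")
        # Nobody merges.
        if p.merges:
            problems.append(f"{p.name}: may merge")
    return problems


print(invariant_violations(PROFILES))
```

Running a check like this whenever the profile definitions change keeps a well-meant edit ("just let QA push the one-line fix") from quietly dissolving a boundary.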

Route work by risk, not by convenience: normal code goes implementer → QA; anything touching auth, secrets, CI, payments, external sends, or public surfaces automatically adds the specialist lane. The trigger list is written down in a registry file, not remembered.
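A written-down trigger list can be as small as a dictionary that the harness consults. This is a sketch under stated assumptions: the tag names and the choice of which specialist each trigger adds are illustrative guesses based on the profile table, not the contents of the actual registry file.

```python
# Hypothetical registry: which job tags add which specialist lane.
SPECIALIST_TRIGGERS = {
    "appsec_reviewer": {"auth", "secrets", "payments", "external_send", "public_surface"},
    "devops_release_reviewer": {"ci"},
}

BASE_ROUTE = ["implementer", "qa_reviewer"]


def route(job_tags):
    """Return the ordered lanes a job must pass through, given its tags."""
    tags = set(job_tags)
    lanes = list(BASE_ROUTE)
    for lane, triggers in SPECIALIST_TRIGGERS.items():
        if triggers & tags:  # any overlap adds the specialist lane
            lanes.append(lane)
    return lanes


print(route(["refactor"]))
print(route(["auth", "ci"]))
```

Because the routing is a pure function of declared tags, it can be tested without a model and cannot be talked out of adding a lane.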

Part 4: Never trust "done"

The single highest-leverage line of code in our whole system re-runs the checks.

When an agent reports a job complete, the runner (a plain Python process the agent doesn't control) independently re-executes the job's required checks (test suite, validators, linters). If a check fails, the job is marked failed no matter what the agent said. The agent's self-report is treated as a claim, and the claim gets audited, every time, mechanically.
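The re-run itself does not need to be clever. Below is a minimal sketch of the idea, assuming checks are expressed as argument lists and that exit code 0 means pass; the function names and the timeout value are illustrative, not taken from the runner described here.

```python
import subprocess
import sys


def run_checks(commands, cwd=None, timeout=600):
    """Run each check command; return a list of (command, passed) pairs."""
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
            )
            passed = proc.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A check that hangs or cannot start counts as a failure.
            passed = False
        results.append((cmd, passed))
    return results


def final_status(agent_claim, check_results):
    """The agent's claim only stands if every independent check passed."""
    if not all(passed for _, passed in check_results):
        return "failed"
    return agent_claim


# Stand-in checks: one that passes, one that fails.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(final_status("done", run_checks([passing])))
print(final_status("done", run_checks([passing, failing])))
```

In a real setup the commands would be the job's declared checks (for example a test runner and a linter), executed by the runner in the job's working tree.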

Two companion rules:

Evidence lives in git, never in chat. A task exists when it's on a branch; it's reviewed when the review verdict is attached to the current commit SHA; it's done when the PR state says so. Conversation transcripts, local queue folders, and agent summaries are not evidence; they're where fabricated success stories live (remember the 4,000 fake records).
The audit trail must be agent-blind. Logs of what agents did are generated by the harness, timestamped, and stored where agents don't write. If the agent under review can edit the record of its own behavior, you don't have an audit trail; you have a diary.
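"Attached to the current commit SHA" has a simple mechanical reading: a verdict recorded against an older commit stops counting the moment the branch moves. The sketch below assumes a hypothetical verdict record with `verdict` and `sha` keys; `git rev-parse` is the only git command used.

```python
import subprocess


def head_sha(repo_dir, ref="HEAD"):
    """Ask git, not the agent, what the ref currently points at."""
    out = subprocess.run(
        ["git", "rev-parse", ref],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def review_counts(verdict, current_sha):
    """An approval is valid only for the exact commit it was issued on."""
    return verdict.get("verdict") == "approve" and verdict.get("sha") == current_sha


# Hypothetical verdict record, as a reviewer lane might emit it.
verdict = {"verdict": "approve", "sha": "abc123"}

print(review_counts(verdict, "abc123"))  # same commit
print(review_counts(verdict, "def456"))  # branch moved after review
```

In use, the second argument would come from `head_sha(repo_dir, branch)`, so a fix pushed after approval automatically sends the job back through review.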

Part 5: The autonomy loop

Unattended operation needs a boring, deterministic outer loop. Ours is a file-based queue (inbox/ → running/ → outbox|blocked|failed/) processed by a single-purpose runner. The contract that makes it bulletproof:

One job at a time, enforced by a lockfile. No concurrent agents stepping on the same working tree.
Work always lands. Whatever files a job changes are committed automatically onto a per-job branch (devteam/local/<job-id>), and the runner restores your checkout afterward. A completed job whose work can't be committed is downgraded to blocked; the one thing that can never happen is "done" with the diff silently stranded in a working tree. (This was the biggest hole we found in our own system when we audited it.)
Your own changes are untouchable. The landing commit is a pathspec commit of only the job's files; anything you had edited or staged yourself stays exactly where it was.
Read-only means read-only. A job declared read-only that somehow leaves changes behind is flagged and blocked; sandbox violations get detected, not assumed impossible.
The runner has a heartbeat. Every cycle writes a timestamp; a watchdog raises a loud alert when it goes stale. Silent death is a failure mode, so it's instrumented like one.
Failures are louder than successes. Routine completions get a quiet notification; failed and blocked get an urgent one. The human attends to exceptions, not to a firehose.

The result is a specific texture of autonomy: silence while things work, one sharp signal the moment they don't.
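Three pieces of that contract (the lockfile, the folder transitions, and the heartbeat) fit in a few dozen lines. What follows is a simplified sketch, not the runner itself: the file names `runner.lock` and `heartbeat`, the function names, and the `execute` callback are assumptions, and landing commits and alerting are left out.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

STATES = ("inbox", "running", "outbox", "blocked", "failed")


def init_queue(root):
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def acquire_lock(root):
    """Create the lockfile atomically; False means another runner holds it."""
    try:
        fd = os.open(root / "runner.lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def release_lock(root):
    (root / "runner.lock").unlink(missing_ok=True)


def beat(root):
    (root / "heartbeat").write_text(str(time.time()))


def heartbeat_stale(root, max_age, now=None):
    """True if the runner has not written a heartbeat within max_age seconds."""
    path = root / "heartbeat"
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return current - float(path.read_text()) > max_age


def run_cycle(root, execute):
    """Process at most one job; execute(path) returns the target folder name."""
    if not acquire_lock(root):
        return None
    try:
        beat(root)
        jobs = sorted((root / "inbox").iterdir())
        if not jobs:
            return None
        running = root / "running" / jobs[0].name
        shutil.move(str(jobs[0]), str(running))
        try:
            outcome = execute(running)
        except Exception:
            outcome = "failed"
        if outcome not in ("outbox", "blocked", "failed"):
            outcome = "failed"  # unknown outcomes never count as success
        shutil.move(str(running), str(root / outcome / running.name))
        return outcome
    finally:
        release_lock(root)


# Demo in a scratch directory with a stand-in job and executor.
with tempfile.TemporaryDirectory() as tmp:
    queue = Path(tmp)
    init_queue(queue)
    (queue / "inbox" / "job-001.json").write_text("{}")
    print(run_cycle(queue, lambda job: "outbox"))
    print(heartbeat_stale(queue, max_age=60))
```

A separate watchdog process would call `heartbeat_stale` on a timer and raise the loud alert; keeping it outside the runner is what lets it notice the runner dying.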

Part 6: Cost discipline: script what's deterministic

Anthropic's own data puts a single agent at roughly 4× the tokens of a chat session, and a multi-agent system at ~15×. Run a multi-agent flow on everything and you've bought a 15× multiplier on tasks that mostly didn't need it. Three rules keep the bill sane:

If it's deterministic, it's a script. Validation, config generation, drift detection, landing commits, budget counting: none of that is model work. Every LLM call spent on something a script could do is pure waste and a reliability downgrade (scripts don't hallucinate).
Route by risk tier. Every job declares low / normal / high risk, and the harness, not the model, maps that to a model and effort level. Low-risk docs work runs on a cheap small model automatically. The resolved model is recorded in the job result, so cost is auditable per job.
The spend cap lives in the harness. A daily job budget the model cannot reason its way past, because it's checked before the model is ever invoked. When the budget's exhausted, jobs queue instead of running. (This is the Uber lesson: a cap that depends on the model self-limiting is not a cap.)
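The last two rules reduce to a lookup table and a counter that run before any model call. A minimal sketch follows, with loudly hypothetical details: the model names, effort labels, and the `invoke_model` callback are placeholders, and the budget here counts jobs per day rather than tokens or dollars.

```python
from dataclasses import dataclass

# Hypothetical tier table: risk tier -> (model name, effort level).
ROUTING = {
    "low": ("small-model", "low"),
    "normal": ("standard-model", "medium"),
    "high": ("large-model", "high"),
}


@dataclass
class DailyBudget:
    """Counts jobs started today; resetting at midnight is left out."""
    max_jobs: int
    used: int = 0

    def try_spend(self):
        if self.used >= self.max_jobs:
            return False
        self.used += 1
        return True


def dispatch(job, budget, invoke_model):
    """Check the budget first, then resolve the model from the risk tier."""
    if job["risk"] not in ROUTING:
        raise ValueError(f"unknown risk tier: {job['risk']!r}")
    if not budget.try_spend():
        # Budget exhausted: the model is never invoked, the job stays queued.
        return {"job": job["id"], "status": "queued", "model": None}
    model, effort = ROUTING[job["risk"]]
    invoke_model(model, effort, job)
    # The resolved model goes into the result so cost is auditable per job.
    return {"job": job["id"], "status": "ran", "model": model, "effort": effort}


calls = []
budget = DailyBudget(max_jobs=1)
first = dispatch({"id": "a", "risk": "low"}, budget, lambda m, e, j: calls.append(m))
second = dispatch({"id": "b", "risk": "high"}, budget, lambda m, e, j: calls.append(m))
print(first)
print(second)
```

The ordering is the point: the budget check sits in front of the model call, so there is no prompt the model could receive that changes the outcome.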

Part 7: Escalation design: when the human gets pinged

Autonomy fails in both directions; the system that asks about everything is as useless as the one that asks about nothing. Write the escalation triggers down. Ours: a genuine direction/tradeoff decision; anything irreversible or with public, private-data, cost, legal, or identity impact; reviewers deadlocked; a blocker surviving specialist review; or an explicit human-only approval gate downstream.

Everything else (routine routing, QA loops, fix cycles, evidence checking) proceeds without asking. And one non-obvious human-side rule from the automation literature: after a long error-free stretch, your reviews get lazy ("learned carelessness"). Randomize spot-checks; don't rely on a fixed cadence you'll sleepwalk through.
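Both halves of this section, the written-down trigger list and the randomized spot-check, can be sketched in a few lines. The flag names below are an assumed encoding of the triggers listed above, and the sampling rate is an arbitrary example value.

```python
import random

# Assumed flag names for the escalation triggers listed above.
ESCALATION_TRIGGERS = frozenset({
    "direction_or_tradeoff_decision",
    "irreversible",
    "public_impact",
    "private_data_impact",
    "cost_impact",
    "legal_impact",
    "identity_impact",
    "reviewers_deadlocked",
    "blocker_survived_specialist_review",
    "human_only_approval_gate",
})


def should_escalate(job_flags):
    """Ping the human only when a written-down trigger is present."""
    return bool(ESCALATION_TRIGGERS & set(job_flags))


def pick_spot_checks(job_ids, rate, rng=None):
    """Select each completed job for human review with probability `rate`."""
    rng = rng or random.Random()
    return [job_id for job_id in job_ids if rng.random() < rate]


print(should_escalate({"qa_loop", "fix_cycle"}))
print(should_escalate({"fix_cycle", "irreversible"}))
print(pick_spot_checks(["j1", "j2", "j3", "j4"], rate=0.25))
```

Sampling per job, rather than reviewing every Nth job, means neither the human nor the system can predict which completions will be inspected.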

Part 8: Prove it, don't claim it

Here's the industry's inconvenient secret, and this playbook's differentiator: no published study yet measures sustained multi-day, multi-agent, unattended reliability on real workloads. Capability benchmarks measure single tasks. Vendor risk statistics come from vendors. Anyone who tells you their agent team is "safe" without showing you the test suite is making a claim past the edge of existing data.

So don't claim; instrument:

A deterministic test suite for the harness itself, using a stubbed agent CLI and a scratch git repo: zero tokens, seconds to run. Ours covers 22 checks: model routing, landing, pre-existing-work isolation, self-report auditing, read-only violations, budget enforcement, heartbeat, alert tiers. The safety layer is tested like the production code it is.
A live end-to-end proof: one real job through the full path (real model, real repo, real branch landed, checks independently re-run), with every gate's evidence inspected afterward from git and the result record, not from the agent's summary.
Every job leaves an audit record: resolved model, verification outcomes, landed branch and SHA, timestamps. When something goes wrong, you reconstruct it from records, not memory.

Run the suite before every change to the harness. The first time it catches a real regression, it has paid for itself forever.
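To show what "token-free" means in practice, here is a toy version of one such test: a stub stands in for the agent CLI and always claims success, and the test asserts that the harness overrides the claim when a check fails. The `run_job` function is a deliberately tiny stand-in for a harness, not the real one, and the scratch git repo is omitted to keep the sketch short.

```python
import unittest


def stub_agent(job):
    """Stand-in for the agent CLI: always claims success, spends zero tokens."""
    return {"status": "done", "summary": "everything passed"}


def run_job(job, agent, checks):
    """Toy harness: run the agent, then re-run checks it does not control."""
    claim = agent(job)
    results = {name: bool(check()) for name, check in checks.items()}
    status = claim["status"] if all(results.values()) else "failed"
    # Audit record comes from the harness, not from the agent's summary.
    return {"job": job["id"], "status": status, "checks": results}


class HarnessTests(unittest.TestCase):
    def test_failed_check_overrides_done_claim(self):
        result = run_job({"id": "j1"}, stub_agent, {"tests": lambda: False})
        self.assertEqual(result["status"], "failed")
        self.assertEqual(result["checks"], {"tests": False})

    def test_passing_checks_keep_the_claim(self):
        checks = {"tests": lambda: True, "lint": lambda: True}
        result = run_job({"id": "j2"}, stub_agent, checks)
        self.assertEqual(result["status"], "done")
```

Saved as a module, this runs with `python -m unittest` in well under a second, which is what makes "before every change" a realistic habit.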

The checklist

☐ Every irreversible/external action sits behind a hard gate (infra, not prompts)
☐ Roles with may-never lists; implementation and review in separate contexts; reviewers read-only
☐ Nothing merges itself; humans (or an evidence gate) own protected branches
☐ Required checks re-run independently after every "done"
☐ Evidence = git/PR state; audit trail written by the harness, not the agent
☐ Completed work auto-lands on a branch, or the job isn't done
☐ Heartbeat + louder-on-failure alerts; silence means healthy, not dead
☐ Deterministic work in scripts; risk-tiered model routing; spend cap in the harness
☐ Escalation triggers written down; everything else proceeds without asking
☐ A token-free test suite for the harness itself, run on every change

Want this running on your machine?

This playbook tells you what to build. If you'd rather not spend the weeks building and debugging it:

Plug-and-Play: the complete kit (role profiles, registry templates, the hardened queue runner pattern, validators, and the test suite), adapted to your stack, with setup instructions.

Done-For-You: I set the whole operating system up on your repos, tuned to your risk profile, and hand it over running: green test suite, live end-to-end proof, and an escalation design you actually control.

[Contact / payment link goes here]

Built by Sebastien Poulet. I have the kind of brain that traditional productivity systems punish, so I built an operating layer where the follow-through is structural instead of willpower. This system runs my own company's development unattended; everything above is the tested shape of what survived. Powered by Archetype.