What we learned shipping 12 AI agents in 90 days

Ninety days ago we made a bet that turned out to be slightly insane: that two people could ship a dozen production agents to twelve real clients, in a quarter, without raising, hiring, or breaking. We were partly right. Here is the part we got wrong, and the part we'd do again tomorrow.

The premise was simple. The market wants outcomes from AI, not strategies. Most agencies were still selling slide decks. We thought we could undercut them on price, beat them on speed, and hold the line on quality by being the ones who actually shipped - not the ones who explained what shipping would look like.

It mostly worked. Twelve agents are live. Eleven of them are still running. One we killed in week six because the client's ICP changed and the agent was solving the wrong problem. We refunded the rest of the engagement and they're a reference now. That's not a punchline - that's the whole methodology in a sentence.

The shape of the work.

Of the twelve agents, four were customer-facing (support, qualification, recovery), five were internal-ops (data hygiene, reporting, vendor reconciliation), and three were creative-assist (briefs, drafts, ad-copy variants). The internal-ops ones were the easiest to ship and the hardest to sell. The customer-facing ones were the opposite.

That asymmetry deserves a section of its own. The pattern we noticed: clients buy what they can see a number move on. A support agent that deflects 38% of L1 tickets shows up in the helpdesk dashboard the next morning. A reconciliation agent that saves the ops team six hours a week shows up only when someone takes the time to measure it. Guess which one renewed.

The infrastructure running those twelve client agents is not separate from the infrastructure we run for ourselves. Our own AI operating system - PAI - currently runs 28 hooks, 16 agents, and 27 hook entries across 11 event types. It has tracked 1,315 signals, 1,275 ratings, and 1,020 work sessions. That is the lived experience of operating multiple agents in parallel over months, not a sandbox demo. The same instrumentation principles that run PAI - state in files, not in conversation; context resets between subtasks; separate evaluator from generator - are what we applied to every client engagement. The headline "12 agents in 90 days" is grounded in a system that was already running at that scale before the quarter started. Scale compounds. So do the failure modes.

If the value is invisible by default, you have not finished the project. Build the dashboard, or build a different agent.

What worked.

Three things, in order of how surprised we were:

1. The 14-day sprint, hard-stopped.

Every engagement was scoped to a fortnight. Day one: kickoff and a written contract specifying exactly one outcome and exactly one number. Day fourteen: live, measured, in production. We refused everything that didn't fit. Clients hated this for the first three days and loved it forever after. Scope creep is a tax on focus, and the fortnight cap was a way to refuse to pay it.

2. Evals before prompts.

Every agent has a golden set of 20–40 hand-rated examples that exist before we write a single prompt. The eval is the spec. The prompt is the implementation detail. When we got this order right, we shipped in days. When we got it wrong, we shipped in weeks and re-shipped in months.

3. One model, deeply.

We standardized on Anthropic's Claude across all twelve. Resisting the urge to A/B vendors saved us at least 200 hours. The right answer to "should we try GPT here?" is almost always "not until you've outgrown the model you have."

"The eval is the spec. The prompt is the implementation detail."

What broke.

Four things, in order of how much they hurt:

1. We underpriced.

We charged like we were selling a deliverable. We were actually selling an SLA - these things keep running, and someone has to keep them running. The first six engagements were all underwater on retainer. We rewrote pricing in week eight.

2. We let two clients pick the eval criteria themselves.

Worst decision of the quarter. Clients optimize evals to whatever the agent happens to be good at on the day they look at it. The eval has to be ours, not theirs. They sign off; they don't author.

3. We didn't instrument enough.

Three agents shipped without proper telemetry - token spend, latency, refusal rate, hallucination flag. Two of them developed quiet drift in week four that we caught only because a client mentioned it offhand. Now every agent ships with a dashboard or it doesn't ship.

4. Friday deploys.

Two of three Friday deploys went badly. We are not the first to learn this and we will not be the last.

What we'd tell our 90-days-ago selves.

Charge for the SLA, not the build. The build is a one-time cost. The agent is a system you now own.
Ship the dashboard before the agent is interesting. If the buyer can't see the number move, the value didn't happen.
Write the eval before the prompt. Always. We know we just said this. We're saying it again.
Refuse work that doesn't fit a fortnight. The constraint is the product.

Ninety days from now we want twelve more agents in production, a renewal rate above 90%, and one major thing we got wrong that we'll write up the same way we wrote up this one. The agent we built for ourselves first - Sentinel - is the one that taught us most of the lessons in this post. If you want to be on the next list, or you want help putting one of these in your own stack, we're around.