An AI development company in 2026 is supposed to mean something specific. Twelve months ago it stopped meaning anything at all. Every dev shop that could spell OpenAI changed its homepage in a weekend, every consultancy with a slide template added an AI practice, and the buyers ended up where buyers always end up after a category gold rush - staring at a list of forty firms whose websites are indistinguishable, trying to figure out which three will return calls in November.
This guide is the answer to that problem, written by people who care about the answer because we have to defend it ourselves. We are JAAX Labs. Two operators, Sid and Arjun. We run Sentinel, a live AI analytics product on Shopify with paying merchants and the bugs you get when real money flows through the system at three in the morning. Every habit on this page was earned shipping it. The pricing on our flagship AI development page is the same pricing we apply to ourselves.
We have shipped twelve agents in ninety days across twelve clients, with eleven of them still running and a 92% renewal rate, and we have made every one of the mistakes we are about to warn you against. The framework below is what we wish someone had handed us on day one of the first quarter - both as the buyers we used to be, and as the firm we became.
What changed, and why the category broke.
Until late 2022, "AI development" was a phrase used by people who had been doing it for a decade. They built recommendation systems, fraud-detection models, computer-vision pipelines, the kind of work that made it into engineering blogs because it ran in production and moved a number. The category was small, technical, and mostly serious.
Then ChatGPT shipped to a hundred million users in two months and the entire B2B services industry rebranded. The honest cohort - maybe two hundred firms in the United States - kept doing the work. The other ten thousand discovered that you could prepend the word "AI" to whatever you had been selling on Tuesday and a Series B board would approve a fresh budget by Friday. Everyone became an AI development company. Nobody became an AI development company.
This is not a moral complaint. It is a market-information complaint. When the same label covers a firm that has shipped twenty production agents and a firm that has shipped two PowerPoints, the label has stopped doing work. Buyers who relied on it to filter their shortlist are now standing in a field full of identical signs, and the only question that matters is which signs are attached to a building.
The pattern repeats in every wave of enterprise tech. Cloud, then mobile, then crypto, now AI. The category fills with costume firms in eighteen months and shakes them out in the following thirty-six. The buyers who pay for the shake-out are the ones who hired during month nine. The buyers who avoid it are the ones who learn to read four signals - the ones we are about to walk through - instead of reading the homepage.
The four signals that separate operators from deck-writers.
None of these are clever. We resisted writing them down for a long time because they felt embarrassingly obvious. Then we ran a few hundred sales calls and realized that the firms losing those calls were losing them by failing one or more of the four. The signals are obvious in the same way that "ask for the keys before you sign the lease" is obvious. It is still the right thing to ask.
1. Do they ship their own AI in production?
Not as a portfolio piece. Not as a demo on a partner's stage. As a living product that paying customers use, that the team operates themselves, that goes down occasionally and gets fixed. There is no substitute for this signal. The firm that runs its own production AI knows what it costs to keep an LLM-backed system inside its SLA. The firm that has only built for clients knows what it costs to ship; it does not know what it costs to keep. The difference shows up in week six of every engagement, when the eval drifts and somebody has to be on the hook.
For us this signal is Sentinel. It is the reason we wrote a credibility argument into the masthead of our flagship service page and the reason we will keep operating it long after the agency math says we should outsource the support rotation. The product is the proof. We tell prospects to read the changelog before they read the deck.
If a firm cannot point you to one such product - and "we use AI internally" does not count, neither does "we have a chatbot on our marketing site" - you are talking to consultants who will learn on your project. They may be excellent consultants. They will still learn on your project, and the learning will show up in the bill.
2. Do they write the eval before they write the prompt?
This is the load-bearing technical signal and the one that catches the most costume firms. Ask the engineer they will assign you what their eval harness looks like for an engagement of your size. Watch what they say.
The right answer is some version of this: "We write a hand-rated golden set of 20–40 examples before we write a single prompt. The eval is the spec and the contract. The prompt is an implementation detail, and it is the part we iterate on." The right answer is delivered without rehearsal because the engineer says it three times a week. The wrong answer is "we A/B prompts in production and look at user feedback," which is a phrase you should hear as "we have not built one of these before, and we will be discovering the problem with your money."
We learned this the way most operators learn it: by getting it backwards on two of our first six engagements. We wrote the prompt first. We tuned it. We declared it good. Then a client mentioned in week four that it was hallucinating product SKUs. We had no eval, so we had nothing to roll back to, and we burned a fortnight rebuilding the spec we should have written on day one. The full write-up includes the part where we charged ourselves for the lesson.
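To make "the eval is the spec" concrete, here is a minimal sketch of a golden-set harness in Python. It is an illustration under stated assumptions, not a description of any particular firm's tooling: the names GoldenExample, run_eval, and toy_agent are hypothetical, the grading rule is a naive exact match, and the four examples stand in for the 20–40 a real golden set would hold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenExample:
    """One hand-rated example: an input and the answer a human approved."""
    prompt_input: str
    expected: str


def run_eval(agent: Callable[[str], str], golden_set: List[GoldenExample]) -> float:
    """Return the fraction of golden examples the agent answers correctly.

    Exact match after trimming and lowercasing is an assumption made for
    this sketch; a real eval may need a rubric or a human grader.
    """
    if not golden_set:
        raise ValueError("golden set must not be empty")
    passed = sum(
        1
        for ex in golden_set
        if agent(ex.prompt_input).strip().lower() == ex.expected.strip().lower()
    )
    return passed / len(golden_set)


def toy_agent(text: str) -> str:
    """Hypothetical stand-in for a prompt plus a model call."""
    known_skus = {"blue mug": "SKU-1001", "red mug": "SKU-1002"}
    return known_skus.get(text, "UNKNOWN")


# A real golden set would hold 20-40 hand-rated examples; four keep this short.
golden = [
    GoldenExample("blue mug", "SKU-1001"),
    GoldenExample("red mug", "SKU-1002"),
    GoldenExample("green mug", "SKU-1003"),   # the toy agent gets this one wrong
    GoldenExample("purple mug", "UNKNOWN"),   # correct behaviour is to admit ignorance
]

score = run_eval(toy_agent, golden)
print(f"pass rate: {score:.2f}")  # prints "pass rate: 0.75"
```

Because the golden set exists independently of the prompt, any later prompt change can be scored against the same examples, which is what gives a team something to roll back to.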
3. Do they price by the SLA, not the build?
An AI agent is not a deliverable. It is a system that runs, drifts, and needs to be tended. The firm that prices by the build is selling you a one-time cost and a goodbye letter. The firm that prices by the SLA is selling you the running system, which is what you actually want.
The pricing question to ask is direct. "Six months from now, when the model provider deprecates the version we are on and the eval drops by four points, who fixes it and how is that paid for?" The right answer is a retainer, an SLA, or a clearly named hourly rate against an agreed runbook. The wrong answer is "we'll quote a change order," which means you are about to renegotiate from a position of weakness because the system is already in production and you cannot turn it off.
We underpriced our first six engagements doing exactly this - quoting a build, billing a deliverable, then absorbing the cost of keeping the agent alive. We rewrote pricing in week eight of the first quarter. The firms that have been running production AI for years figured this out before we did. Firms that have not run it never figure this out, because they never get to month four with the same client.
4. Can they show you the dashboard the agent is running on right now?
This is the instrumentation signal, and it is the easiest one to test on a sales call. Ask the firm to share their screen and show you the operations dashboard for one of their live engagements. Ask to see the token spend, the latency distribution, the refusal rate, the hallucination flag rate, the cost per inference broken down by tenant. Watch what happens.
A real AI development company has this dashboard open in a tab on the day you ask, because they look at it every morning. They will redact the client name and walk you through the chart that surprised them most last week. A costume firm will say they will follow up. The follow-up will arrive as a screenshot of a Notion page with three sparklines on it, two days later, after the sales engineer has built it.
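For a sense of how little it takes to produce those numbers, here is a hedged Python sketch that rolls a list of inference log records up into the metrics named above. The record shape (InferenceRecord and its fields), the summarize function, the tenant names, and the nearest-rank percentile are all assumptions made for illustration; a real dashboard would read from whatever logging the system already has.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InferenceRecord:
    """Hypothetical shape of one logged model call."""
    tenant: str
    tokens: int
    latency_ms: float
    cost_usd: float
    refused: bool
    hallucination_flagged: bool


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(records: List[InferenceRecord]) -> Dict[str, object]:
    """Aggregate raw records into the handful of numbers a buyer asks to see."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    latencies = sorted(r.latency_ms for r in records)
    cost_by_tenant: Dict[str, float] = defaultdict(float)
    calls_by_tenant: Dict[str, int] = defaultdict(int)
    for r in records:
        cost_by_tenant[r.tenant] += r.cost_usd
        calls_by_tenant[r.tenant] += 1
    return {
        "total_tokens": sum(r.tokens for r in records),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "refusal_rate": sum(r.refused for r in records) / n,
        "hallucination_flag_rate": sum(r.hallucination_flagged for r in records) / n,
        "cost_per_inference_usd": {
            tenant: cost_by_tenant[tenant] / calls_by_tenant[tenant]
            for tenant in cost_by_tenant
        },
    }


# Made-up records for two made-up tenants.
records = [
    InferenceRecord("acme", 1200, 800.0, 0.012, False, False),
    InferenceRecord("acme", 900, 650.0, 0.009, False, True),
    InferenceRecord("globex", 1500, 1200.0, 0.015, True, False),
    InferenceRecord("globex", 700, 400.0, 0.007, False, False),
]

summary = summarize(records)
for name, value in summary.items():
    print(name, value)
```

The point of the sketch is the shape, not the arithmetic: if every call is logged with a tenant, a cost, and two flags, the dashboard is a rollup rather than a project.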
If the value is invisible by default, the project is not finished. Buyers feel this intuitively. The dashboard is the artifact that lets them feel it explicitly, on the day the question is asked, with a real number on the screen. We ship the dashboard before the agent is interesting, because if the buyer cannot see the number move, the value did not happen.
"The eval is the spec. The dashboard is the receipt. Everything else is decoration."
That operating principle is worth stating explicitly as a hiring criterion, not just a pull-quote. If a pilot cannot show you the dashboard the day it goes live, it is not a pilot - it is a demo with extra steps. The eval defines what "working" means before a line is written. The dashboard is the ongoing receipt that working is still true. Any firm that separates those two - that defines success after the build, or that shows you metrics only on request - has inverted the sequence that prevents expensive surprises. The principle applies at every scale: twelve-agent sprint or enterprise-wide deployment. Write the eval first. Ship the dashboard before the agent is interesting. The number that moves is the deliverable; everything else is scaffolding.
The MBB and Big Four trap.
The most expensive way to buy AI development in 2026 is from a top-tier strategy or audit consultancy. We are about to be uncharitable; we are also about to be specific. The four firms most prospects evaluate at the top of their RFP - and the three boutique cousins that follow them in the deck - share a structural problem that has nothing to do with the talent of the people who will be staffed on your project. The problem is that those people will not, in the meaningful sense, be doing the build.
Strategy is sold by the partner. The partner has spent the last decade selling strategy. The build is subcontracted, sometimes inside the same firm to a separate practice with separate leadership and a separate utilization target, sometimes to a system integrator they have a referral relationship with, sometimes to an offshore captive. The contract closes. The engagement begins. The strategy team builds the deck. The build team builds the system. By week six the deck has named a feature the build team cannot ship under the architecture the strategy team approved, and by week ten you are running a three-way reconciliation meeting that does not appear on any project plan.
The numbers on this are now public and ugly. The MIT NANDA "State of AI in Business" report circulated in mid-2025 found that roughly 95% of corporate generative AI initiatives produced no measurable return on investment. The figure has been picked over for definitional issues - what counts as a pilot, what counts as ROI - but every reasonable cut of the data lands in the same neighborhood. The implementation gap is real and it is large. A trillion dollars of GenAI capex sat on enterprise balance sheets in 2024 against a level of visible business outcome that would not justify a tenth of it.
That gap is mostly not a model problem. The models work. The gap is a buy-side problem and a delivery-model problem. Buyers paid strategy fees for advice that nobody held the build to, and delivery teams shipped against specs that nobody held the strategy to, and the seam between them was paper.
This is not a punch at the people. The senior associates and managers staffed on those projects are often excellent. We have hired from those firms; we have referred work into those firms; some of our best AI strategy consulting conversations have been with executives who used to staff those engagements and now buy them, and who can spot the failure mode in the first thirty minutes because they have lived inside it. The structural argument is about the contract, not the talent. If the partner who closes the deal does not write the code that ships, the firm cannot price honestly against the SLA, because the SLA is held by people the partner does not manage.
The honest fix is to buy from a firm that does both, in order, by the same people. That is the operator-led model. It is not magic; it just removes the seam. We do it because we are too small to subcontract and the constraint turned out to be a feature.
The offshore dev shop trap.
The other end of the price ladder has its own failure mode. Offshore body-shops will bid an AI development engagement at half the day rate of any domestic firm and staff a team of six against it. The math looks great in procurement. The math fails in delivery, and it fails for reasons that are specific to AI development as a category rather than to offshoring as a model.
Body-shop economics work for projects with three properties: a stable specification, abundant clean training or reference data, and a buyer who can write a perfect ticket. Build me this CRUD app, here is the schema, here are the user stories, please ship by August. AI development in 2026 has none of those properties. The specification changes weekly because the eval reveals what the system can and cannot do. The data is dirty until the integration team builds the pipeline that cleans it, and that pipeline is part of the project. The buyer is finding out what they want during the build, because nobody knew in advance what the agent would be good at.
An offshore team trying to ship under those conditions is working without any of the three properties its economics depend on. The team will eventually need a senior architect on retainer to translate the changing spec, write the eval harness, manage the model selection, and handle the integration seam. At that point the buyer is paying twice - the offshore day rate plus the architect - and has a coordination tax on top of it.
The exception is real and worth naming. Offshore teams are excellent at the parts of an AI engagement that look like normal software: the integration plumbing, the dashboard frontend, the back-office tooling, the test harness around the eval. We have used offshore vendors for those layers ourselves. The mistake is asking the offshore team to own the eval, the model choice, and the production-hardening work. That is the part where the senior architect has to be in the room every day, and you cannot outsource the room.
What an actual engagement looks like.
If the four signals and the two traps describe what to avoid, the question becomes what to buy. Here is the rhythm we run, written down as a buyer would experience it. The shape is opinionated; that is the point. Most engagements that fail, fail because nobody enforced a shape.
Day zero. Use-case triage.
One written assessment per project against a fixed rubric - data availability, integration cost, signal-to-noise, who owns the metric on the other side. The output is a list with reds, yellows, and greens, and a paragraph each on why. Most clients we meet are budgeted for three projects and trying to ship six. The kindest engagement is sometimes one we refuse to start until they pick. The full AI development page has the longer version of this rubric.
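A rubric like this can be written down as a few lines of code, which forces the thresholds into the open. The Python sketch below is one possible encoding, not the rubric referenced above: the 0-to-2 scale, the cut-offs, the rule that a missing metric owner is an automatic red, and the example projects are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """One candidate project scored against the four rubric criteria."""
    name: str
    data_availability: int    # 0 (none) to 2 (ready to use)
    integration_cost: int     # 0 (prohibitive) to 2 (cheap)
    signal_to_noise: int      # 0 (mostly noise) to 2 (clear signal)
    metric_owner_named: bool  # does someone on the client side own the number?


def triage(case: UseCase) -> str:
    """Return "red", "yellow", or "green" under the assumed thresholds."""
    scores = (case.data_availability, case.integration_cost, case.signal_to_noise)
    if any(not 0 <= s <= 2 for s in scores):
        raise ValueError("scores must be 0, 1, or 2")
    # Assumption: no named metric owner, or any zero score, blocks the project.
    if not case.metric_owner_named or min(scores) == 0:
        return "red"
    # Assumption: at most one criterion may fall short of the top score.
    if sum(scores) >= 5:
        return "green"
    return "yellow"


# Hypothetical candidate projects.
candidates = [
    UseCase("support deflection", 2, 2, 1, True),
    UseCase("churn prediction", 1, 1, 2, True),
    UseCase("pricing copilot", 2, 2, 2, False),
]

for case in candidates:
    print(f"{case.name}: {triage(case)}")
# prints:
# support deflection: green
# churn prediction: yellow
# pricing copilot: red
```

The code only produces the color. The paragraph on why still has to be written by a person.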
Days one to fourteen. The fortnight sprint.
Every engagement is scoped to fourteen days. Day one is a kickoff and a written contract that names exactly one outcome and exactly one number. Day fourteen is live, measured, in production. We refuse anything that does not fit. Clients hate this for the first three days and love it forever after. Scope creep is a tax on focus, and the fortnight cap is how we refuse to pay it.
Inside the fortnight. Eval before prompt.
A hand-rated golden set of 20–40 examples exists before any prompt is written. The eval is the spec; the prompt is the implementation. When we get this order right, we ship in days. When we get it wrong - and we have, twice, on engagements where the client wrote the eval themselves - we ship in weeks and re-ship in months. The client signs off on the eval. The client does not author it. Authorship is ours because the eval has to survive the client's bad day.
Inside the fortnight. Dashboard before agent.
The instrumentation ships before the model is interesting. Token spend, latency, refusal rate, hallucination flag, cost per tenant, drift signal. The buyer can read it on day five, before the agent is solving anything important. By the time the agent is in front of users, the dashboard has been live long enough that the team trusts the chart.
Day fifteen onward. The SLA.
Pricing flips from the build to the running system. The agent is now a thing you own and we operate. We charge a monthly retainer against a published runbook with named response times for the alerts that matter. When the model provider deprecates the version we are on, the upgrade is on us. When the eval drops by four points because the corpus shifted, the diagnosis is on us. The SLA is the product. The build was the entry point.
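The "eval drops by four points" trigger is easy to automate once the golden-set score is recorded at go-live. Here is a minimal Python sketch, assuming scores are expressed as percentages from 0 to 100 and that a drop of four points or more counts as a breach. The names EvalCheck and check_eval_regression, and the choice of "or more" rather than "more than", are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCheck:
    """Result of comparing the current eval score with the go-live baseline."""
    baseline_score: float
    current_score: float
    drop_points: float
    breached: bool


def check_eval_regression(
    baseline_score: float,
    current_score: float,
    max_drop_points: float = 4.0,
) -> EvalCheck:
    """Flag a breach when the score has fallen by max_drop_points or more.

    Scores are percentages (0-100). The four-point default mirrors the
    example in the text; the exact threshold belongs in the runbook.
    """
    for score in (baseline_score, current_score):
        if not 0.0 <= score <= 100.0:
            raise ValueError("scores must be percentages between 0 and 100")
    drop = baseline_score - current_score
    return EvalCheck(baseline_score, current_score, drop, drop >= max_drop_points)


# Made-up scores: baseline recorded at go-live, then two later re-runs.
after_deprecation = check_eval_regression(baseline_score=91.0, current_score=86.5)
routine_rerun = check_eval_regression(baseline_score=91.0, current_score=89.0)

print(after_deprecation.drop_points, after_deprecation.breached)  # prints "4.5 True"
print(routine_rerun.drop_points, routine_rerun.breached)          # prints "2.0 False"
```

Run on a schedule against the same golden set, a check like this turns "who fixes it" from a negotiation into an alert with a named responder.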
For deeper context on the pieces around the build itself - the integration layer, the consulting work that scopes it, the strategy work that pre-dates it - the cluster pages are AI integration, AI consulting, and the parent AI development service page. For evidence in the wild, the case studies are Meraki Wraps (zero digital pipeline to 13× return on ad spend in eight weeks) and the rest of the work archive.