Why 95% of GenAI pilots fail.

Where the number actually comes from, the three failure modes hiding inside it, and the playbook the 5% use to ship. Written by people who have been on both sides of the stat.

Ninety-five percent. The number is the reason this post exists, and the reason most boards still will not fund a second AI pilot. It is also the most misread statistic in enterprise software since "85% of data projects fail," and the most expensive thing to misunderstand if you are about to greenlight a generative AI initiative this quarter.

The figure traces to MIT's Project NANDA - the "State of AI in Business 2025" report released in mid-2025, surveying 372 enterprises and 153 senior leaders against $30 to $40 billion in disclosed GenAI spend. The headline: roughly 95% of those initiatives produced no measurable P&L impact within the first six months of deployment. IBM's 2024 IBV report indicated fewer than one in four enterprise AI projects had delivered the expected ROI. Gartner, in a June 2025 advisory, forecast that more than 40% of agentic AI projects will be canceled by the end of 2027. Three independent sources, three different methodologies, one consistent shape.

More than half of B2B software buyers now begin product research inside an AI chatbot rather than Google - a share G2 Research pegged at 51 percent in late 2025, up from 29 percent earlier that year. By 2028, Gartner predicts 90 percent of B2B buying will be AI agent-intermediated, routing over $15 trillion in spend through AI agent exchanges. The category is not failing because the technology does not work. It is failing because the way the work is bought, scoped, and verified has not caught up with what the technology demands. We have been on both sides of the 95%. This is the post we wish someone had written for us in our first quarter.

Where the 95% actually comes from.

The MIT NANDA survey is the load-bearing citation, and the methodology matters because the headline number gets weaponized in two opposite directions. AI skeptics treat 95% as proof of a bubble. AI maximalists treat it as a measurement artifact and dismiss it. Both misread it, and for the same reason - they did not read past the first chart.

Project NANDA distinguished three states: pilots that produced no measurable change in the named metric within six months; pilots that produced a measurable change but no P&L attribution; and pilots that produced both. The 95% figure aggregates the first two. The 5% that succeeded concentrated in firms that picked one workflow, instrumented it before launch, and bought rather than built. Inside that subset the success rate is closer to a third. Outside it, the failure rate is closer to ninety-eight percent.

Anthropic's enterprise share - about 40% of Fortune 500 LLM API spend in late 2025, per Menlo Ventures' tracking - is the data point quietly omitted from the doom narrative. The firms that ship are concentrating their spend on the same handful of model vendors and integrators. The market is not collapsing. It is bifurcating. Half of buyers give up after pilot one. The other half triple down because the second pilot, run differently, returns the cost of the first three combined.

The three failure modes hiding inside the stat.

"Failed" is doing too much work in the headline. When we audited our first six engagements, and when we post-mortemed the dozen-odd pilots clients walked in carrying the ashes of, the failures sort cleanly into three buckets. The fix is different for each, which is why the single 95% number is so unhelpful.

1. Wrong-eval failures.

The pilot scored well on the demos the vendor controlled and could not survive the eval set the buyer actually cared about. This is the most common failure mode - roughly forty percent of the 95% in our sample. A consultancy builds a PoC against a curated dataset of forty examples that look like the buyer's data. The PoC nails it. Then real production traffic arrives, carrying the long tail of edge cases the curated set never represented, and accuracy collapses by week three.

We shipped one of these in week six of our first quarter. The agent passed our internal eval at 91% accuracy. The client's actual support queue contained a subgenre of refund-with-exception tickets we had not sampled, and the agent was answering those wrong with confidence. We killed the engagement, refunded the back half, and rebuilt the eval set from the client's last ninety days of real tickets before writing a single new prompt. It is now one of the eleven still running.

2. No-instrumentation failures.

The pilot worked. Nobody could prove it worked. The renewal conversation died because the ROI case relied on a vibe. About thirty-five percent of the 95% in our sample. This is the failure mode that bothers us most because the underlying system is doing real work - in a black box no one can read.

We saw this with an internal-ops agent we shipped in our first quarter. It reconciled vendor invoices against POs and saved the ops team four to eight hours a week. We knew because the ops lead told us. Nobody else did. There was no dashboard, no monthly invoice savings tally. By month four, the CFO asked what they were paying us for, and the ops lead's testimonial was not the answer the CFO needed. The agent went away. It worked. It still went away. The fix is simple: ship the dashboard before the agent is interesting, with token spend, latency, refusal rate, hallucination flag, and the named business metric on a single screen the CFO can pull up unsupervised.

3. Right-thing-built-wrong-buyer failures.

The pilot delivered exactly what was specified. The specification was wrong. About twenty percent of the 95% in our sample. This is the failure mode that hurts most because nobody notices it is happening until month four, after the agent has been "succeeding" against a metric no one outside the project room cares about.

MBB and Big Four engagements concentrate this failure mode, and not because the people running them are unintelligent. Strategy is sold by the partner, the build is subcontracted, and by week six the deck has named a feature the build team cannot ship under the architecture the strategy team approved. The agent is faithful to the deck. The deck was faithful to the workshop. The workshop never met the user. Our longer post on how to actually buy AI development walks through the structural fix - refuse to separate strategy from build at the contract level.

"If a pilot cannot show you the dashboard the day it goes live, it is not a pilot. It is a demo with extra steps."

From a recent audit.

The three failure modes above are tidy in retrospect. In the field they arrive interleaved, inside a codebase that already looks like it works. A B2C app we audited last month is the cleanest recent example. The team had wired AppsFlyer, Meta SDK, Firebase, and Mixpanel. Dashboards rendered. Events appeared in consoles. The pre-launch claim was that paid acquisition could begin the following week.

We pulled the audit. Mixpanel's thirty-day funnel showed 137 completed onboardings, 1 trial start, 1 paid conversion. Roughly 0.7% from onboarded user to revenue. Beneath that headline sat ten confirmed-broken telemetry items, each traceable to a specific file and line in production code. The single most revenue-relevant moment in the app - the subscription success path - had its tracking call commented out at store_manager.dart:33-38. On iOS, a Swift cast in the Meta SDK bridge silently nil-ed every event's parameters; events arrived in Meta's console with names but no payload. Google Ads' "Subscribe" goal was bidding against phone calls, on a pre-launch app that had no phone number.
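The funnel arithmetic above is small enough to check by hand, but a sketch makes the step-by-step drop-off explicit. The counts are the ones reported in the audit; the stage names are hypothetical labels, not the app's real event names.

```python
# Counts come from the audit above; the stage names are hypothetical labels.
funnel = [
    ("onboarding_completed", 137),
    ("trial_started", 1),
    ("paid_conversion", 1),
]


def step_rates(stages):
    """Return (from_stage, to_stage, rate) for each adjacent pair of stages."""
    rates = []
    for (src, src_count), (dst, dst_count) in zip(stages, stages[1:]):
        rate = dst_count / src_count if src_count else 0.0
        rates.append((src, dst, rate))
    return rates


overall = funnel[-1][1] / funnel[0][1]  # 1 / 137

for src, dst, rate in step_rates(funnel):
    print(f"{src} -> {dst}: {rate:.1%}")
print(f"onboarded -> paid: {overall:.1%}")
```

The per-step view shows that the whole loss sits between onboarding and trial start, which is the step to investigate first.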

None of that was visible from the dashboard. All of it was visible the moment we read the code against the live event stream. The deeper hazard, and the one most relevant to the 95% framing, is that the public GitHub repository was not the production source. The CI pipeline overrode the version string at build time; the live binary reported a build number ninety-five revisions ahead of main. Three different "ground truths" for the same library were circulating - one in the lockfile, one in the prior diagnosis doc, one in the live event stream. A buyer reading the public repo would have concluded the integrations were sound. A buyer reading the production telemetry would have concluded the opposite.

"The plumbing is more built-out than I expected, but the pipes are empty in the places that matter most."

This is what a no-instrumentation failure actually looks like before it is labeled one. The plumbing was real. The pipes were empty. Sentinel, the production AI we run on Shopify, exists because operators do not catch their own 137-to-1 problems for months - the dashboard says green while the revenue path is muted. The same anti-pattern, in a different shape, is what we found behind the Shopify attribution gap on a different engagement: GA4 was pointed at a domain the store no longer owned, and every dashboard above it was confidently wrong for ninety days.

What the 5% do differently.

The 5% are not magicians. They are operationally boring in a way that is impossible to fake. We have worked alongside enough firms in the successful cohort - and inside enough firms that thought they were and were not - to write down the habits that separate them. None are clever. All are unusual.

They write the eval before the prompt. They ship the dashboard before the agent is interesting. They alert on refusal rate and hallucination flag from week one, not week twenty. They price the engagement as an SLA against a runbook, not a one-time deliverable. They cap the engagement at fourteen days and refuse anything that does not fit. The long version of this lives in our 12 agents in 90 days post, the engineering memo behind the marketing argument you are reading now.

The receipt that cost us the most to earn is the fortnight cap. Day one is a written contract naming one outcome and one number. Day fourteen is live, instrumented, and measured in production, or the engagement does not bill the back half. We refuse roughly thirty percent of inbound on this constraint alone. The refused work is the work that would have been a 95% statistic with our name on it.

The most counterintuitive habit is the dashboard-before-agent rule. In every project where we shipped the agent first, the dashboard arrived too late to save the renewal. In every project where the dashboard shipped by day five - even when the agent itself was still hand-rolled - the renewal conversation in month four took fifteen minutes instead of three meetings. The dashboard is the receipt. The agent is the work. Ship the receipt early.

The pilot-to-production playbook.

If the 95% number is mostly a buy-side problem, the 5% number is mostly a process problem. The six steps below are the rhythm we run on every engagement. They are also the rhythm we look for when we audit a pilot that is in trouble. If three or more are missing, the pilot is on the 95% trajectory regardless of the model behind it.

Step one. Use-case triage before the contract.

Every prospective project gets a one-page assessment against a fixed rubric - data availability, integration cost, signal-to-noise on the target metric, who owns the metric on the buyer's side. Reds, yellows, greens. Most clients we meet are budgeted for three projects and trying to ship six. The most useful thing we do in the first hour is refuse the four they do not need. Our flagship AI development page contains the long version of the rubric.
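A rubric like this can be written down as data before the first call. The sketch below is illustrative only: the four field names mirror the criteria listed above, but the decision rule (any red refuses, any yellow rescopes) is an assumption made for the example, not the rubric's actual scoring.

```python
from dataclasses import dataclass, fields

VALID = {"red", "yellow", "green"}


@dataclass(frozen=True)
class Assessment:
    """One-page triage: each criterion is rated red, yellow, or green."""
    data_availability: str
    integration_cost: str
    signal_to_noise: str
    metric_owner: str


def triage(assessment):
    """Hypothetical decision rule: any red refuses, any yellow rescopes."""
    ratings = [getattr(assessment, f.name) for f in fields(assessment)]
    unknown = set(ratings) - VALID
    if unknown:
        raise ValueError(f"unknown rating(s): {sorted(unknown)}")
    if "red" in ratings:
        return "refuse"
    if "yellow" in ratings:
        return "rescope"
    return "accept"


# Example: good data, but nobody on the buyer's side owns the metric.
print(triage(Assessment("green", "yellow", "green", "red")))
```

Writing the rule as code forces the uncomfortable part of triage into the open: someone has to decide in advance what a red actually costs.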

Step two. Eval before prompt.

A hand-rated golden set of 20 to 40 examples is built before any prompt is written. The eval is the spec. The prompt is the implementation. The client signs off on the eval. The client does not author it; authorship has to live with the firm because the eval has to survive the client's bad day. When we got this order right we shipped in days. When we got it wrong - twice, on engagements where the client wrote the eval - we shipped in weeks and re-shipped in months.
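To make "the eval is the spec" concrete, here is a minimal sketch of a golden-set harness. Everything in it is an assumption for illustration: the `GoldenExample` shape, exact-match scoring, the category labels, and the stand-in `toy_agent`, which takes the place of a real model call. A real eval would likely need a richer scoring function than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    input_text: str
    expected_label: str  # the hand-rated answer
    category: str        # used to expose weak spots in the long tail


def run_eval(agent: Callable[[str], str], golden_set: list[GoldenExample]) -> dict:
    """Score an agent against the golden set, overall and per category."""
    tallies = {}  # category -> (hits, total)
    for ex in golden_set:
        got = agent(ex.input_text).strip().lower()
        passed = got == ex.expected_label.strip().lower()
        hits, total = tallies.get(ex.category, (0, 0))
        tallies[ex.category] = (hits + int(passed), total + 1)
    total = sum(t for _, t in tallies.values())
    correct = sum(h for h, _ in tallies.values())
    return {
        "accuracy": correct / total if total else 0.0,
        "by_category": {c: h / t for c, (h, t) in tallies.items()},
    }


def toy_agent(text: str) -> str:
    # Stand-in for a real model call; it always approves.
    return "approve"


golden = [
    GoldenExample("Refund request, within policy", "approve", "refund_standard"),
    GoldenExample("Refund outside window, exception claimed", "escalate", "refund_with_exception"),
    GoldenExample("Duplicate charge on one order", "approve", "billing"),
]

report = run_eval(toy_agent, golden)
print(report)
```

The per-category breakdown is the design choice that matters here: a healthy overall accuracy can hide a category that fails every time, which is the shape of the refund-with-exception miss described earlier.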

Step three. Dashboard before agent.

By day five, the operations dashboard ships. Token spend, latency p50 and p95, refusal rate, hallucination flag rate, drift signal, and the single named business metric. The agent can still be a hand-rolled prototype. The dashboard cannot. Without the chart, no chart-reading happens, and a year later nobody can name a number that moved.
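As a rough sketch of what sits behind that single screen, the function below rolls per-call log records into the numbers named above. The `CallRecord` fields and the nearest-rank percentile are assumptions chosen for the example, and the drift signal is left out because it depends on how drift is defined for the workflow in question.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    latency_ms: float
    tokens: int
    refused: bool
    hallucination_flagged: bool


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(records, business_metric_value):
    """Roll call records into the numbers a single dashboard screen shows."""
    n = len(records)
    if n == 0:
        raise ValueError("no calls recorded yet")
    latencies = [r.latency_ms for r in records]
    return {
        "calls": n,
        "token_spend": sum(r.tokens for r in records),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "refusal_rate": sum(r.refused for r in records) / n,
        "hallucination_flag_rate": sum(r.hallucination_flagged for r in records) / n,
        "business_metric": business_metric_value,
    }


# Twenty synthetic calls: latencies 10..200 ms, two refusals, one flag.
records = [
    CallRecord(latency_ms=10.0 * i, tokens=100, refused=i in (3, 7),
               hallucination_flagged=(i == 5))
    for i in range(1, 21)
]
summary = summarize(records, business_metric_value=6.0)
print(summary)
```

The business metric is passed in rather than computed because it comes from the buyer's system of record, not from the agent's own logs.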

One detail the dashboard has to surface explicitly: the volume of the metric you are optimizing against. Most failed pilots are starved of training signal long before they are starved of model capability. As a working rule, an ML system - whether it is Meta's bidder, an internal ranker, or a quality classifier - needs roughly forty to sixty mid-funnel conversions per month before the optimizer can find a stable pattern. Below that, the model is curve-fitting noise and the dashboard is reporting on a sample size too small to act on. If your named metric fires once or twice a month, the first job is to pick a closer mid-funnel proxy and instrument that, not to wait six months for the headline number to accumulate.
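The working rule above can be expressed as a check that walks back up the funnel from the headline metric until it finds an event with enough monthly volume. This is a sketch under stated assumptions: the threshold uses the lower bound of the forty-to-sixty range, and the event names are hypothetical. The counts reuse the audit funnel from earlier in this post.

```python
MIN_MONTHLY_CONVERSIONS = 40  # lower bound of the working rule; an assumption


def pick_optimization_event(monthly_counts, funnel_order,
                            minimum=MIN_MONTHLY_CONVERSIONS):
    """Return the deepest funnel event with enough monthly volume, else None.

    funnel_order lists events from top of funnel to the headline metric.
    """
    for event in reversed(funnel_order):
        if monthly_counts.get(event, 0) >= minimum:
            return event
    return None


funnel_order = ["onboarding_completed", "trial_started", "paid_conversion"]
monthly_counts = {
    "onboarding_completed": 137,
    "trial_started": 1,
    "paid_conversion": 1,
}

proxy = pick_optimization_event(monthly_counts, funnel_order)
print(proxy)
```

With these counts the only event that clears the bar is the completed onboarding, so that is the proxy to instrument and optimize against until the deeper events gain volume.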

Step four. Fortnight cap on scope.

Day fourteen is live in production or the engagement does not bill the back half. The cap forces a discipline the buyer rarely volunteers - the named outcome and the named metric have to be small enough to ship in two weeks. Pilots that "need more time" almost never need more time. They need a smaller question.

Step five. SLA pricing from day fifteen.

Pricing flips from the build to the running system. The agent is now a thing the buyer owns and we operate, against a published runbook with named response times. When the model provider deprecates the version we are on, the upgrade is on us. When the eval drops because the corpus shifted, the diagnosis is on us. The SLA is the product. AI consulting covers the strategy-level alternative for buyers not yet ready for a build engagement.

Step six. Refusal-rate alerting from week one.

The single instrumentation choice that has saved us most is alerting on refusal rate from the first day in production. When an agent starts refusing more queries than it did yesterday, something has changed - the input distribution, the model behavior, an upstream prompt edit, or a guardrail update. Most silent failures in other teams' pilots showed up here first, two weeks before the renewal conversation went sideways. Sentinel, our own production AI on Shopify, alerts on refusal rate before it alerts on revenue.
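A refusal-rate alert does not need much machinery. The sketch below compares today's rate against the mean of recent daily rates and fires only when the jump clears both an absolute and a relative margin. The thresholds, the minimum call count, and the trailing-mean baseline are all assumptions for illustration; comparing against yesterday alone, as described above, is the special case of a one-day baseline.

```python
from statistics import mean


def should_alert(today_refused, today_total, baseline_rates,
                 min_calls=50, relative_jump=0.5, absolute_jump=0.02):
    """Return True when today's refusal rate is well above the recent baseline.

    baseline_rates: daily refusal rates for recent days (fractions, 0..1).
    All thresholds are illustrative defaults, not recommended values.
    """
    if today_total < min_calls or not baseline_rates:
        return False  # too little traffic or no history to judge against
    today_rate = today_refused / today_total
    baseline = mean(baseline_rates)
    big_enough = (today_rate - baseline) >= absolute_jump
    fast_enough = today_rate >= baseline * (1 + relative_jump)
    return big_enough and fast_enough


recent = [0.04, 0.05, 0.03, 0.04]  # synthetic daily refusal rates

print(should_alert(16, 200, recent))  # 8% today against a ~4% baseline
print(should_alert(9, 200, recent))   # 4.5% today, within normal variation
```

Requiring both margins keeps the alert quiet on low-baseline days, where a single extra refusal would otherwise look like a large relative change.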

Frequently asked.

Is the 95% GenAI pilot failure number real?

Yes, with caveats. The figure traces to MIT's Project NANDA "State of AI in Business 2025" report, which surveyed 372 enterprises and 153 senior leaders against $30–40 billion in disclosed GenAI spend. The report concluded that roughly 95% of generative AI initiatives produced no measurable P&L impact within their first six months. The 5% that did succeed concentrated almost entirely in firms that picked one workflow, instrumented it before launch, and bought rather than built. The number is defensible. The interpretation that all GenAI is failing is not.

What does '95% of pilots fail' actually mean?

It means three different things stitched together. About 40% (in our sample) are wrong-eval failures - the pilot scored well on internal demos but never survived the eval set the buyer actually cared about. About 35% (in our sample) are no-instrumentation failures - the pilot worked but could not prove it worked, so the renewal conversation died. The remaining 20% (in our sample) are right-thing-built-wrong-buyer failures - the team shipped what was asked for, into a workflow the user did not actually have. Lumping all three under one stat hides the fix, which is different for each.

How long should an AI pilot take?

Fourteen days, hard-stopped, with one named outcome and one named metric written into the contract on day one. Pilots that run longer than a fortnight are not pilots; they are budget-dressed exploration. The 5% that succeed treat the fortnight cap as a feature. They refuse anything that does not fit. They write the eval before the prompt, ship the dashboard before the agent is interesting, and price the months that follow as an SLA, not a build.

What is the cost of a failed GenAI pilot?

The visible cost is the pilot fee, typically $50,000 to $250,000 for an enterprise PoC. The invisible cost is larger and rarely accounted for. Based on our internal observation, an average failed pilot consumes around 280 hours of internal stakeholder time, freezes the budget for a follow-up project for 9 to 12 months, and creates organizational antibodies against the next attempt. The MIT NANDA report estimates that the implementation gap - the delta between $30–40 billion of GenAI spend and the small fraction generating measurable ROI - is the dominant cost item, not the pilots themselves.

What do the 5% of successful GenAI pilots do differently?

Five habits, in the order of how often we see them missed. They write a hand-rated golden eval of 20 to 40 examples before the first prompt is written. They ship the operations dashboard before the agent is interesting. They price the running system as an SLA, not a one-time deliverable. They alert on refusal rate and hallucination flag from week one, not week twenty. And they cap the engagement at fourteen days, refusing scope that does not fit. None of these are clever. All of them are unusual.

Will Gartner's prediction that 40% of agentic AI projects will be canceled by 2027 change the picture?

Gartner's June 2025 forecast that more than 40% of agentic AI projects will be canceled by the end of 2027 is a tightening of the curve, not a reversal of it. The 95% figure is about pilots that produced no ROI; the 40% figure is about projects that are eventually killed. The gap between them is the projects that continue without measurable returns - pilots in zombie state. The data point that matters most for buyers is a different one: Anthropic's enterprise LLM share grew to roughly 40% in 2025, meaning the firms that do ship are concentrating, fast. The market is bifurcating into operators and observers.


What to do tomorrow if you do not want to be the 95%.

If you are the executive deciding whether to greenlight a GenAI pilot this quarter, the homework is small and unpleasant. Pick the one workflow with the most legible number attached to it. Write that number into the contract before the kickoff call. Insist on the eval before the prompt and the dashboard before the agent. Cap the engagement at fourteen days. Refuse vendors who cannot show you, on the day you ask, the operations dashboard for an engagement they are running right now. Ninety percent of costume-cohort firms fail one of those tests. The ten percent that pass are the cohort the 5% comes from.

If you are the operator delivering the pilot, the homework is the same with the order reversed. Refuse engagements without a named outcome and metric. Ship the dashboard first. Alert on refusal rate from day one. Price the months after the build as an SLA, not a change order. Our case studies - including Meraki Wraps, where we took a Shopify operator from zero digital pipeline to thirteen times return on ad spend in eight weeks - and the rest of the work archive are the receipts. The methodology is the product.

One caveat we have learned to put in writing. Fixing the instrumentation is necessary, but it is not sufficient for growth. A clean dashboard on a workflow that does not have product-market fit will produce, with high precision, a chart of a thing that is not working. The 137-to-1 funnel above does not become a 137-to-100 funnel because the events finally fire correctly; it becomes a 137-to-1 funnel that everyone can finally see. The instrumentation step is what makes the next decision a real decision instead of a vibe. The decision still has to be made.

The honest answer to "why do 95% of GenAI pilots fail?" is that 95% of GenAI pilots are run as if they were software pilots from 2014. The technology has changed. The buying behavior has not. The 5% that ship are the ones that updated the buying behavior. Send a paragraph through the contact form with the workflow and the number you want moved, and we come back the same day with a yes, a no, or a sharper question. Two operators, fourteen-day sprints, a dashboard you can read on Monday, and an honest opinion about whether the pilot you are about to fund will land in the 5% or the 95%.

Siddharth Jaiman

Co-founder of JAAX Labs. Builds and runs Sentinel, a live AI analytics product on Shopify. Previously product and growth at two startups you have probably used. Writes about building, shipping, and the agency model in an AI-native world.