Ninety-five percent. The number is the reason this post exists, and the reason most boards still will not fund a second AI pilot. It is also the most misread statistic in enterprise software since "85% of data projects fail," and the most expensive thing to misunderstand if you are about to greenlight a generative AI initiative this quarter.
The figure traces to MIT's Project NANDA - the "State of AI in Business 2025" report released in mid-2025, surveying 372 enterprises and 153 senior leaders against $30 to $40 billion in disclosed GenAI spend. The headline: roughly 95% of those initiatives produced no measurable P&L impact within the first six months of deployment. IBM's 2024 IBV report indicated fewer than one in four enterprise AI projects had delivered the expected ROI. Gartner, in a June 2025 advisory, forecast that more than 40% of agentic AI projects will be canceled by the end of 2027. Three independent sources, three different methodologies, one consistent shape.
None of this has slowed adoption. More than half of B2B software buyers now begin product research inside an AI chatbot rather than Google - a share G2 Research pegged at 51 percent in late 2025, up from 29 percent earlier that year. By 2028, Gartner predicts 90 percent of B2B buying will be AI agent-intermediated, routing over $15 trillion in spend through AI agent exchanges. The category is not failing because the technology does not work. It is failing because the way the work is bought, scoped, and verified has not caught up with what the technology demands. We have been on both sides of the 95%. This is the post we wish someone had written for us in our first quarter.
Where the 95% actually comes from.
The MIT NANDA survey is the load-bearing citation, and the methodology matters because the headline number gets weaponized in two opposite directions. AI skeptics treat 95% as proof of a bubble. AI maximalists treat it as a measurement artifact and dismiss it. Both misread it, and for the same reason - they did not read past the first chart.
Project NANDA distinguished three states: pilots that produced no measurable change in the named metric within six months; pilots that produced a measurable change but no P&L attribution; and pilots that produced both. The 95% figure aggregates the first two. The 5% that succeeded concentrated in firms that picked one workflow, instrumented it before launch, and bought rather than built. Inside that subset the success rate is closer to a third. Outside it, the failure rate is closer to ninety-eight percent.
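The aggregation described above can be made concrete with a minimal sketch. The function names and state labels are our own, and the pilot counts in the usage note are invented for illustration, not taken from the report.

```python
from collections import Counter

def classify(metric_changed, pnl_attributed):
    """Map one pilot to one of the three states described above."""
    if not metric_changed:
        return "no_change"
    if not pnl_attributed:
        return "change_no_pnl"
    return "change_with_pnl"

def headline_failure_rate(pilots):
    """Share of pilots in the first two states.

    pilots: iterable of (metric_changed, pnl_attributed) boolean pairs.
    """
    counts = Counter(classify(m, p) for m, p in pilots)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    failed = counts["no_change"] + counts["change_no_pnl"]
    return failed / total
```

With a made-up cohort of twenty pilots (twelve with no change, seven with a metric change but no P&L attribution, one with both), the function returns 0.95. The point of the sketch is that the headline folds two quite different outcomes into one number.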
Anthropic's enterprise share - about 40% of Fortune 500 LLM API spend in late 2025, per Menlo Ventures' tracking - is the data point quietly omitted from the doom narrative. The firms that ship are concentrating their spend on the same handful of model vendors and integrators. The market is not collapsing. It is bifurcating. Half of buyers give up after pilot one. The other half triple down because the second pilot, run differently, returns the cost of the first three combined.
The three failure modes hiding inside the stat.
"Failed" is doing too much work in the headline. When we audited our first six engagements, and when we post-mortemed the dozen-odd pilots clients walked in carrying the ashes of, the failures sort cleanly into three buckets. The fix is different for each, which is why the single 95% number is so unhelpful.
1. Wrong-eval failures.
The pilot scored well on the demos the vendor controlled and could not survive the eval set the buyer actually cared about. This is the most common failure mode - roughly forty percent of the 95% in our sample. A consultancy builds a PoC against a curated dataset of forty examples that look like the buyer's data. The PoC nails it. Then real production traffic enters the system, carrying the long tail of edge cases the curated set never represented, and accuracy collapses by week three.
We shipped one of these in week six of our first quarter. The agent passed our internal eval at 91% accuracy. The client's actual support queue contained a subgenre of refund-with-exception tickets we had not sampled, and the agent was answering those wrong with confidence. We killed the engagement, refunded the back half, and rebuilt the eval set from the client's last ninety days of real tickets before writing a single new prompt. It is now one of the eleven still running.
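One way to rebuild an eval set from real traffic is sketched below. This is an illustration, not the code from that engagement: the ticket fields (`id`, `category`, `created_at`), the `build_eval_set` helper, and the default sizes are all hypothetical. The idea is to sample from the last ninety days while guaranteeing every category, including rare ones like refund-with-exception, a minimum number of examples.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

def build_eval_set(tickets, now, days=90, per_category_min=5, total=200, seed=0):
    """Sample an eval set from recent tickets, keeping rare categories represented.

    tickets: list of dicts with hypothetical keys "id", "category", "created_at".
    """
    cutoff = now - timedelta(days=days)
    recent = [t for t in tickets if t["created_at"] >= cutoff]

    by_category = defaultdict(list)
    for t in recent:
        by_category[t["category"]].append(t)

    rng = random.Random(seed)
    chosen = []
    chosen_ids = set()

    # First pass: guarantee a floor for every category, including the long tail.
    for category in sorted(by_category):
        items = by_category[category]
        for t in rng.sample(items, min(per_category_min, len(items))):
            chosen.append(t)
            chosen_ids.add(t["id"])

    # Second pass: fill the remaining slots uniformly from what is left, so the
    # rest of the set stays roughly proportional to real traffic.
    remaining = [t for t in recent if t["id"] not in chosen_ids]
    slots = max(0, total - len(chosen))
    chosen.extend(rng.sample(remaining, min(slots, len(remaining))))
    return chosen
```

The per-category floor is the design choice that matters. A purely proportional sample of a support queue can easily contain zero examples of the subgenre that sinks the pilot.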
2. No-instrumentation failures.
The pilot worked. Nobody could prove it worked. The renewal conversation died because the ROI case relied on a vibe. About thirty-five percent of the 95% in our sample. This is the failure mode that bothers us most because the underlying system is doing real work - in a black box no one can read.
We saw this with an internal-ops agent we shipped in our first quarter. It reconciled vendor invoices against POs and saved the ops team four to eight hours a week. We knew because the ops lead told us. Nobody else did. There was no dashboard, no monthly invoice savings tally. By month four, the CFO asked what they were paying us for, and the ops lead's testimonial was not the answer the CFO needed. The agent went away. It worked. It still went away. The fix is simple: ship the dashboard before the agent is interesting, with token spend, latency, refusal rate, hallucination flag, and the named business metric on a single screen the CFO can pull up unsupervised.
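A minimal sketch of what feeds that single screen, assuming each agent call is logged as one record. The `AgentCall` fields, the `summarize` helper, and the use of minutes saved as the named business metric are hypothetical choices for illustration. A real deployment would substitute its own metric and pricing.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class AgentCall:
    tokens: int
    latency_ms: float
    refused: bool
    hallucination_flagged: bool
    minutes_saved: float  # the named business metric; hypothetical for this sketch

def summarize(calls, usd_per_1k_tokens):
    """Collapse raw call logs into the numbers for a one-screen dashboard."""
    n = len(calls)
    if n == 0:
        return {"calls": 0}
    total_tokens = sum(c.tokens for c in calls)
    return {
        "calls": n,
        "token_spend_usd": round(total_tokens / 1000 * usd_per_1k_tokens, 2),
        "median_latency_ms": median(c.latency_ms for c in calls),
        "refusal_rate": sum(c.refused for c in calls) / n,
        "hallucination_flag_rate": sum(c.hallucination_flagged for c in calls) / n,
        "hours_saved": round(sum(c.minutes_saved for c in calls) / 60, 1),
    }
```

The summary deliberately puts cost (`token_spend_usd`) next to value (`hours_saved`). That pairing is what lets a CFO read the screen without someone from the project in the room.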
3. Right-thing-built-wrong-buyer failures.
The pilot delivered exactly what was specified. The specification was wrong. About twenty percent of the 95% in our sample. This is the failure mode that hurts most because nobody notices it is happening until month four, after the agent has been "succeeding" against a metric no one outside the project room cares about.
MBB and Big Four engagements concentrate this failure mode, and not because the people running them are unintelligent. Strategy is sold by the partner, the build is subcontracted, and by week six the deck has named a feature the build team cannot ship under the architecture the strategy team approved. The agent is faithful to the deck. The deck was faithful to the workshop. The workshop never met the user. Our longer post on how to actually buy AI development walks through the structural fix - refuse to separate strategy from build at the contract level.
"If a pilot cannot show you the dashboard the day it goes live, it is not a pilot. It is a demo with extra steps."
From a recent audit.
The three failure modes above are tidy in retrospect. In the field they arrive interleaved, inside a codebase that already looks like it works. A B2C app we audited last month is the cleanest recent example. The team had wired AppsFlyer, Meta SDK, Firebase, and Mixpanel. Dashboards rendered. Events appeared in consoles. The pre-launch claim was that paid acquisition could begin the following week.
We pulled the audit. Mixpanel's thirty-day funnel showed 137 completed onboardings, 1 trial start, 1 paid conversion. Roughly 0.7% from onboarded user to revenue. Beneath that headline sat ten confirmed-broken telemetry items, each traceable to a specific file and line in production code. The single most revenue-relevant moment in the app - the subscription success path - had its tracking call commented out at store_manager.dart:33-38. On iOS, a Swift cast in the Meta SDK bridge silently nil-ed every event's parameters; events arrived in Meta's console with names but no payload. Google Ads' "Subscribe" goal was bidding against phone calls, on a pre-launch app that had no phone number.
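Failures like the commented-out tracking call and the empty Meta payloads are the kind a simple comparison between a tracking plan and a sample of live events can surface. The sketch below assumes events can be exported as name-plus-parameters records. The `audit_events` helper and the event and parameter names in the usage note are hypothetical, not the app's real schema.

```python
def audit_events(expected, observed):
    """Compare an expected tracking plan against a sample of live events.

    expected: dict mapping event name -> set of required parameter names.
    observed: list of dicts like {"name": str, "params": dict or None}.
    Returns (never_seen, missing_params): a sorted list of expected event names
    that never appeared, and a dict of event name -> sorted list of required
    params absent from at least one observed event.
    """
    seen = set()
    missing_params = {}
    for event in observed:
        name = event["name"]
        if name not in expected:
            continue  # events outside the plan are ignored in this sketch
        seen.add(name)
        params = event.get("params") or {}  # treat a nil payload as empty
        absent = expected[name] - set(params)
        if absent:
            missing_params.setdefault(name, set()).update(absent)
    never_seen = sorted(set(expected) - seen)
    return never_seen, {k: sorted(v) for k, v in missing_params.items()}
```

Given a plan that expects `subscribe_success` with `price` and `currency`, and a live sample in which that event never fires, the first return value names it directly. An event that arrives with a name but no payload shows up in the second.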
None of that was visible from the dashboard. All of it was visible the moment we read the code against the live event stream. The deeper hazard, and the one most relevant to the 95% framing, is that the public GitHub repository was not the production source. The CI pipeline overrode the version string at build time; the live binary reported a build number ninety-five revisions ahead of main. Three different "ground truths" for the same library were circulating - one in the lockfile, one in the prior diagnosis doc, one in the live event stream. A buyer reading the public repo would have concluded the integrations were sound. A buyer reading the production telemetry would have concluded the opposite.
"The plumbing is more built-out than I expected, but the pipes are empty in the places that matter most."
This is what a no-instrumentation failure actually looks like before it is labeled one. The plumbing was real. The pipes were empty. Sentinel, the production AI we run on Shopify, exists because operators do not catch their own 137-to-1 problems for months - the dashboard says green while the revenue path is muted. The same anti-pattern, in a different shape, is what we found behind the Shopify attribution gap on a different engagement: GA4 was pointed at a domain the store no longer owned, and every dashboard above it was confidently wrong for ninety days.
What the 5% do differently.
The 5% are not magicians. They are operationally boring in a way that is impossible to fake. We have worked alongside enough firms in the successful cohort - and inside enough firms that thought they were and were not - to write down the habits that separate them. None are clever. All are unusual.
They write the eval before the prompt. They ship the dashboard before the agent is interesting. They alert on refusal rate and hallucination flag from week one, not week twenty. They price the engagement as an SLA against a runbook, not a one-time deliverable. They cap the engagement at fourteen days and refuse anything that does not fit. The long version of this lives in our "12 agents in 90 days" post, the engineering memo behind the marketing argument you are reading now.
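Week-one alerting does not need much machinery. Here is a minimal sketch, assuming the rates are already computed per reporting window. The `check_alerts` helper, the metric names, and the thresholds in the usage note are hypothetical.

```python
def check_alerts(window, thresholds):
    """Return alert messages for any rate in `window` above its threshold.

    window: dict of metric name -> observed rate in [0, 1] for the period.
    thresholds: dict of metric name -> maximum acceptable rate.
    """
    alerts = []
    for metric, limit in sorted(thresholds.items()):
        observed = window.get(metric)
        if observed is None:
            # A metric with no data is reported too: silence is a failure signal.
            alerts.append(f"{metric}: no data")
        elif observed > limit:
            alerts.append(f"{metric}: {observed:.1%} exceeds {limit:.1%}")
    return alerts
```

Called with an observed `refusal_rate` of 0.12 against a threshold of 0.05, it returns a single message for that metric. The "no data" branch is there because a missing metric is how a muted revenue path looks from the outside.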
The receipt that cost us the most to earn is the fortnight cap. Day one is a written contract naming one outcome and one number. Day fourteen is live, instrumented, and measured in production, or the engagement does not bill the back half. We refuse roughly thirty percent of inbound on this constraint alone. The refused work is the work that would have been a 95% statistic with our name on it.
The most counterintuitive habit is the dashboard-before-agent rule. In every project where we shipped the agent first, the dashboard arrived too late to save the renewal. In every project where the dashboard shipped by day five - even when the agent itself was still hand-rolled - the renewal conversation in month four took fifteen minutes instead of three meetings. The dashboard is the receipt. The agent is the work. Ship the receipt early.