The chatbot pitch is always the same. We'll train a model on your data. Point it at your customers. Watch the tickets solve themselves. It works great in their demo. It works great in your pilot. Then you go live, and within two weeks, you're in a Slack channel with sales, support, and finance asking why the chatbot just closed a $50K deal that your team hadn't approved yet.

The gap between a chatbot that demos well and one that works in production isn't the model. It's everything else. It's the fallback paths when the model is wrong. It's the eval loop that spots drift before customers do. It's the escalation design that keeps humans in control. It's the monitoring layer that knows when something breaks. Most chatbot vendors sell you the model. Almost none of them sell you the operational structure to run it.

If you're evaluating a chatbot development company right now, here's what separates the ones who build systems that survive from the ones who build systems that blow up.

Production-grade chatbot systems require two operating principles, both derived from multi-agent architecture research.

First: persist state in files, not in conversation. A chatbot that stores its context only in the active conversation thread loses that context on every restart, every timeout, and every escalation handoff. State that lives in files (a structured progress log, a per-session JSON file, a git-committed checkpoint) survives the end of the conversation. Systems built without this principle accumulate invisible technical debt: each session restart is a cold start, the agent has no memory of what it told the same customer three days ago, and the "continuity" the pitch promised is really just the customer re-explaining their problem.

Second: apply the ten-step anti-hallucination playbook before any chatbot touches customers. Its steps include the following. Explicitly permit the model to say it does not have enough information rather than infer. Extract verbatim quotes before generating claims. Run a fresh-context evaluator that has never seen the generator's reasoning. Script a deterministic check that every cited fact resolves to a real source. A chatbot vendor who cannot explain which of these steps they implement, which they skip, and why has not thought rigorously about the failure modes their system will produce in production.
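To make the first principle concrete, here is a minimal sketch of per-session state kept in a JSON file. It is one possible shape, not a reference implementation: the function names `save_session` and `load_session`, the directory layout, and the fields in the default state are all assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_session(state_dir: Path, session_id: str, state: dict) -> Path:
    """Atomically write one session's state to <state_dir>/<session_id>.json."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{session_id}.json"
    # Write to a temp file in the same directory, then rename it into place,
    # so a crash mid-write never leaves a half-written state file behind.
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, target)
    return target


def load_session(state_dir: Path, session_id: str) -> dict:
    """Return the saved state, or a fresh empty state for a new session."""
    target = state_dir / f"{session_id}.json"
    if not target.exists():
        # Hypothetical default shape; a real system would define its own fields.
        return {"session_id": session_id, "history": [], "open_issue": None}
    with target.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Because the state lives on disk, a restarted process or a human agent picking up an escalation can call `load_session` and see what the customer was already told, instead of starting cold.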

The demo vs. production problem, named.

A demo chatbot runs on curated queries. Known inputs. Known good outputs. Production is chaos. Your customers will ask things the chatbot wasn't trained on. They'll use typos and jargon. They'll feed it malformed data. And when the chatbot doesn't know what to do, it still has to do something. Either it sends a response that's confidently wrong, or it breaks trying to recover, or it escalates to a human. The vendor who builds the demo rarely designs the escalation path. That's your job. Or it becomes a support problem for the next six months.

The operational gap shows up in three places:


These problems are predictable. The vendors who solve them are the ones worth signing with.

The four questions to ask any chatbot development company.

Before you sign a contract, ask these four questions. The answers tell you whether you're getting a production system or a demo that will blow up at 2 AM on a Friday.

Question 1: How do you design fallback paths?

This is the clearest test. Ask them to walk you through a scenario: "The customer asks something completely outside the chatbot's knowledge base. What happens?" Good vendors will tell you about confidence thresholds, escalation queues, routing logic, human review workflows. They'll show you code or design docs. They'll name the scenarios where different fallbacks trigger.

Bad vendors will say "it just sends a default message" or worse, "the model is so good, it won't happen." If they can't explain the fallback, they haven't built one. That means the first truly weird customer query might either crash the system or return something confident and wrong. Don't sign with that vendor.
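The kind of answer a good vendor gives (confidence thresholds, escalation queues, routing logic) can be sketched in a few lines. The sketch below assumes the model or retrieval layer exposes a confidence score and a flag for whether a supporting document was found; the `ModelReply` fields, the `choose_route` function, and the threshold values are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ANSWER = "answer"        # send the model's reply as-is
    CLARIFY = "clarify"      # ask the customer a follow-up question
    ESCALATE = "escalate"    # hand off to a human review queue


@dataclass
class ModelReply:
    text: str
    confidence: float          # assumed to be a score in [0, 1]
    in_knowledge_base: bool    # did retrieval find a supporting document?
    involves_commitment: bool  # pricing, refunds, contracts, and similar


def choose_route(reply: ModelReply,
                 answer_at: float = 0.8,
                 clarify_at: float = 0.5) -> Route:
    """Pick a fallback path; thresholds are illustrative, not recommended values."""
    # Anything that commits the business goes to a human, however confident.
    if reply.involves_commitment:
        return Route.ESCALATE
    # Outside the knowledge base: do not let the model improvise.
    if not reply.in_knowledge_base:
        return Route.ESCALATE
    if reply.confidence >= answer_at:
        return Route.ANSWER
    if reply.confidence >= clarify_at:
        return Route.CLARIFY
    return Route.ESCALATE
```

The ordering is the design choice worth asking a vendor about: commitment and out-of-scope checks run before the confidence check, so a confident model still cannot approve a deal on its own.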

Question 2: What does your post-launch eval process look like?

Ask them: "My chatbot launches successfully. Six months from now, how do you know if it's still working?" Great vendors have a cadence. Weekly spot checks on a sample of real conversations. Quarterly deep dives on new intents. A process for retraining when drift is detected. They'll tell you the specific metrics they track and the thresholds that trigger action.

If they say "we test before launch" or "you'll monitor it yourself," that's a red flag. Post-launch eval is where most chatbots fail. A vendor who skips it is betting on luck. You'll lose that bet within six months.
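A weekly spot check with a drift threshold, as described above, might look roughly like this. It assumes human reviewers grade each sampled conversation as pass or fail; the function names, the sample size, and the 5-point `max_drop` tolerance are illustrative assumptions rather than anyone's standard.

```python
import random


def sample_for_review(conversation_ids: list[str], k: int, seed: int) -> list[str]:
    """Pick a reproducible random sample of conversations for human grading."""
    rng = random.Random(seed)  # fixed seed so the same week can be re-audited
    return rng.sample(conversation_ids, min(k, len(conversation_ids)))


def drift_detected(baseline_pass_rate: float,
                   graded: list[bool],
                   max_drop: float = 0.05) -> bool:
    """Flag drift when this week's pass rate falls more than max_drop below baseline."""
    if not graded:
        raise ValueError("no graded conversations this week")
    pass_rate = sum(graded) / len(graded)
    return (baseline_pass_rate - pass_rate) > max_drop
```

A vendor with a real eval process should be able to name their equivalent of `baseline_pass_rate` and `max_drop`, and say what happens when the check fires.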

Question 3: Walk me through your monitoring and alerting strategy.

Ask them to show you the monitoring dashboard. What gets tracked? Accuracy? Latency? Error rate? Escalation volume? Cost per interaction? A good vendor will have built a view into the system that shows real-time health. They'll have alerting rules. "If error rate spikes above 5%, the team gets notified within 15 minutes."

If they don't have a monitoring strategy, or if it's "we log things and you check them," they're shipping a system that can fail for days without anyone knowing. Use Sentinel or similar to fill the gap, but the vendor should have opinions about what matters and how to catch it. If they don't, they haven't shipped enough production chatbots.
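An alerting rule like the 5% error-rate example can be expressed as a sliding-window check. The sketch below is one way to do it under stated assumptions: the `ErrorRateAlert` class, the 15-minute window, and the `min_events` floor (added so a single failure on a quiet night does not page anyone) are all hypothetical, and a real deployment would more likely configure this in its monitoring tool than hand-roll it.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorRateAlert:
    """Fire when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self,
                 threshold: float = 0.05,
                 window: timedelta = timedelta(minutes=15),
                 min_events: int = 20):
        self.threshold = threshold
        self.window = window
        self.min_events = min_events  # avoid alerting on tiny samples
        self._events = deque()        # (timestamp, is_error) pairs, oldest first

    def record(self, when: datetime, is_error: bool) -> bool:
        """Record one interaction; return True if the alert should fire now."""
        self._events.append((when, is_error))
        # Drop events that have aged out of the window.
        cutoff = when - self.window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if len(self._events) < self.min_events:
            return False
        errors = sum(1 for _, err in self._events if err)
        return errors / len(self._events) > self.threshold
```

The same structure extends to the other signals listed above (latency, escalation volume, cost per interaction) by swapping the boolean for a numeric value and the rate for a percentile or sum.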

Question 4: What's your approach to knowledge base maintenance?

Chatbots live on knowledge bases. The knowledge base gets stale. Documents change. Policies update. Customer intent evolves. Ask the vendor: "How do we keep the knowledge base fresh? What's the maintenance cadence? Do you help with updates, or is that our job?" A vendor who has answers about versioning, update workflows, and refresh intervals understands the long-term cost of chatbots. One who says "it's a one-time build and go" understands nothing.
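The versioning and refresh-interval answer can be made checkable with a small staleness report. This is a sketch under assumptions: the `KbDocument` fields and the per-document `refresh_days` interval are hypothetical, and a real knowledge base would pull this metadata from wherever its documents are stored.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KbDocument:
    doc_id: str
    version: int
    last_reviewed: date
    refresh_days: int  # hypothetical per-document review interval


def stale_documents(docs: list[KbDocument], today: date) -> list[str]:
    """Return IDs of documents whose review interval has elapsed, oldest first."""
    overdue = [
        d for d in docs
        if today - d.last_reviewed > timedelta(days=d.refresh_days)
    ]
    overdue.sort(key=lambda d: d.last_reviewed)
    return [d.doc_id for d in overdue]
```

Run on a schedule, a report like this turns "keep the knowledge base fresh" into a list someone owns, which is the part of the maintenance question that a one-time build never answers.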

Most chatbot vendors sell you the demo. The ones worth paying are the ones who build the operations around it.

What a production-grade engagement actually looks like.

A vendor who passes the four questions will propose something that looks different from the typical "we build it in three weeks" pitch. Here's what to expect:

This approach costs more than the vendors who ship and ghost. It's also the difference between a chatbot that's still working after a year and one that's shut down by month three. The ROI is obvious if you do the math: a chatbot that fails in three months is expensive and useless. One that works and improves is an asset.

Who to hire: The vendors who understand production.

If you're building a chatbot for a real business problem, hire a partner who understands the full cycle. Generative AI consultants who've shipped agents in production know these gaps. They know fallback design. They know eval frameworks. They know how to avoid the failures that crater projects.

Or work with your AI development partner directly. The best ones own the whole stack: model, fallbacks, evals, monitoring, ops. They'll spec it upfront. They'll build it right. They'll stay with it post-launch. That's not the cheapest option. It's the only option that actually works.

The 40% failure rate for AI projects without governance applies to chatbots too. Most chatbots fail because the vendor shipped the model and skipped the governance. Don't let that be you. Ask the four questions. Vet the vendor. Hire the one who understands that production is harder than the demo. That's the difference between a chatbot that works and one that becomes a support nightmare.