Integration is the hard part. Not the model. Not the prompt. Not even the eval. The hard part is the gap between the model's output and the system that has to trust and act on it. You can have the smartest model in the world. If it lives in a black box that no one understands, that changes nothing. The model is 20% of the AI integration work. The architecture is the other 80%.
We learned this the hard way. We have shipped twelve AI integrations in the last ninety days - into HubSpot, Salesforce, Zendesk, Stripe, Slack, Snowflake. We've seen integrations that worked and integrations that half-worked and integrations that worked until they didn't, usually at month three when someone changed an API contract and nobody noticed. Most of the pain didn't come from the model. It came from the seams.
The seams are where AI integrations live and die. The seam between the data the integration needs and the data the system actually has. The seam between what the model outputs and what the downstream action handler expects. The seam between the confidence the team needs to ship and the observability they actually have. The seam between "this worked in staging" and "this broke in production at 3am." The seam between the model's hallucination and the system's audit log - if the hallucination is not logged, you cannot debug it. The seam between the cost estimate and the actual cost when a runaway loop hits the API at scale.
The seams are numerous and they are real. Most teams underestimate them. They build the model. They assume the integration will "just work." Six weeks later, they have a broken production system and no way to debug it because they never instrumented it.
A layered anti-hallucination approach closes most of the high-risk seams before they open. The ten-step playbook:
(1) Give the model explicit permission to say "I don't have enough information" rather than infer.
(2) Extract verbatim quotes from source material before generating prose, so output is grounded in what was actually read.
(3) Pair each claim to a quote after generation and retract any claim that has no supporting quote.
(4) Restrict the model to provided documents and a curated knowledge base - no free-form recall for citations or numbers.
(5) Use a Citations API or equivalent for sentence-level source attribution.
(6) Run a fresh-context evaluator with no knowledge of the generator's reasoning.
(7) Script a deterministic citation-existence check separate from any LLM.
(8) For high-stakes claims, generate twice with different seeds and flag divergences as hallucination canaries.
(9) Whitelist source quality - only named, verifiable sources pass.
(10) Require chain-of-thought reasoning before the model states a claim, making faulty reasoning visible before it reaches the output.
One production receipt: using steps 4 and 5 via the Citations API, a client dropped source hallucinations from 10% to 0%. The type-drift failure mode - where a shared type like a consent state is modified by one integration layer without the consuming layer updating - illustrates why these steps cannot be treated as optional. A type that accepts both enum and null during a migration window produces model outputs grounded in stale schema. Verbatim-extract discipline (step 2) forces the source to be read fresh on every call and catches that class of drift before it compounds.
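Steps 3 and 7 are the easiest to script, because they need no model at all. What follows is a minimal sketch, not a definitive implementation. It assumes each claim arrives as a dict with a hypothetical "quote" field, and that whitespace- and case-insensitive matching is strict enough for your sources. The function name and field names are illustrative.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting noise does not fail the check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(claims: list[dict], source: str) -> list[dict]:
    """Return every claim whose quote is missing or does not appear in the source."""
    haystack = normalize(source)
    failures = []
    for claim in claims:
        quote = normalize(claim.get("quote") or "")
        if not quote or quote not in haystack:
            failures.append(claim)
    return failures


# Illustrative data only: the source text and claims are made up for the example.
source = "Refunds are issued within 14 days.\nShipping is free over $50."
claims = [
    {"text": "Refunds take up to 14 days.", "quote": "Refunds are issued within 14 days."},
    {"text": "Shipping is always free.", "quote": "Shipping is always free."},
    {"text": "Support is open 24/7.", "quote": None},
]
retract = unsupported_claims(claims, source)
```

Here the second and third claims come back for retraction: one cites a quote that is not in the source, the other cites nothing. The check is deliberately dumb. It cannot tell whether a quote actually supports the claim, only that the quote exists, which is why it complements the fresh-context evaluator in step 6 rather than replacing it.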
This is the playbook for building integrations that survive the seams. Not the ones that pass the eval. The ones that survive the unexpected failure mode at 3am and can be debugged with the audit log.
What AI integration actually means.
Before we get to architecture, we need to be precise about what we're integrating. AI integration is not "use ChatGPT in our workflow." That is a demo. AI integration means wiring a model into the system that pays the bills - your CRM, your helpdesk, your order pipeline, your data warehouse - so that the system behaves better with the model than without it, and so that the team can prove it. It's different from AI development (which builds the model) and different from AI strategy (which decides which AI projects to pursue). Understanding the distinction matters because the skills and the timeline are different for each.
There are three canonical shapes of integration. Recognizing which shape you're building determines which problems you have to solve first.
Shape 1. API-call integration.
The model is called synchronously from inside the main application flow. A user triggers an action - they submit a support ticket, they fill out a lead form, they hit a Stripe webhook - and the model responds before the workflow continues. The model is on the critical path. Latency matters. Cost matters. Hallucinations get seen immediately.
This is the shape most teams try first. It is also the shape where most integration failures happen, because latency is adversarial to quality. Push the model for speed and you get a prompt that makes mistakes. Push for quality and you get latency that breaks the user experience. The trade-off lives right at the integration boundary, and there is no hiding it.
Shape 2. Background inference integration.
The model runs asynchronously. A user action triggers an async job. The model scores, classifies, or enriches the data in the background. The results sit in a database. The user sees them later, or they get picked up by a downstream system. Latency is not critical. Accuracy is. This is where cost caps matter most, because you can run the model over larger datasets without latency pressure killing you. You can also retry failed inferences, backfill historical data, and run batch predictions without breaking the user experience.
Sentinel, our live product, uses this shape. The Shopify webhook lands. We queue the enrichment. The model runs whenever the queue has space. The insight sits in the database. The user reads it the next morning. No latency pressure. Better outputs. And if the model gets an inference wrong, we can backfill the entire dataset and users never notice.
Shape 3. Real-time streaming integration.
The model processes a continuous stream of events. Fraud detection, recommendations, dynamic pricing. The model has to make a decision in milliseconds. State has to be shared across inferences. Old inferences have to roll off when new data arrives. This is the shape where infrastructure becomes the work. The model is 5% of the problem. The streaming framework is the rest.
Most teams do not need this shape. We mention it because when you do, every architectural decision upstream has to change. You are no longer thinking about "calls" or "jobs." You're thinking about windowed aggregation and state management and what happens to a decision made five minutes ago when new data arrives.
The five failure modes hiding inside every integration.
Understanding the shapes is the first cut. Understanding the failure modes is the difference between shipping and shipping twice. We have categorized five seams where integrations break. Each has a fix. Each requires a decision upfront or an expensive retrofit later.
Failure mode 1. Model inputs are wrong.
The model's context window is shorter than the data it needs to see. The schema of the data is not what the prompt assumes. The data is dirty, so the model hallucinates facts that are not there. The cardinality is wrong - you asked the model to classify one record but you passed it five hundred.
This failure mode is invisible until it happens in production. In staging, you tested with clean data, full schema, reasonable context windows. In production, you get nulls and edge cases and schema drift from the data team that last changed the pipeline six months ago. The fix is a data validation layer that sits between the source system and the prompt. Before the prompt sees the data, the data has been checked against a known schema, redacted for PII, windowed to fit the context window, and logged so you can reconstruct what the model saw when output quality regresses.
Failure mode 2. Model outputs are not trusted.
The model produces a recommendation and no one acts on it. Not because the recommendation is wrong, but because the team has no idea how the model arrived at it. There is no explainability. There is no audit trail. There is no way to trace back from "why did the system recommend this?" to "what data did the model see?" So the team ignores the recommendation. The integration sits in production, producing insights no one uses, because no one understands the model's reasoning.
This failure mode kills integrations more often than actual inaccuracy does. A team can tolerate a model that is 80% accurate if they understand how it arrived at the answer. They cannot tolerate a model that is 90% accurate but opaque - because on the 10% it gets wrong, they have no way to know which recommendation to ignore.
The fix is structured outputs. Instead of asking the model to "summarize" - which produces text that humans have to read - ask it to return a JSON with specific fields: recommendation, confidence, evidence, alternatives_considered. The JSON is machine-actionable. The confidence score is machine-readable. The evidence field is what gets shown to the human who has to trust the decision. If you want the team to act on the model, you have to make the model's thinking visible.
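As a rough sketch of what that contract could look like on the consuming side: the four field names come from the paragraph above, while the types, the 0-to-1 confidence range, and the rule that evidence must be non-empty are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = {"recommendation", "confidence", "evidence", "alternatives_considered"}


@dataclass(frozen=True)
class Recommendation:
    recommendation: str
    confidence: float  # assumed to be a probability-like score in [0, 1]
    evidence: list
    alternatives_considered: list


def parse_recommendation(raw: str) -> Recommendation:
    """Parse the model's raw reply; raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["recommendation"], str):
        raise ValueError("recommendation must be a string")
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if not isinstance(data["evidence"], list) or not data["evidence"]:
        raise ValueError("evidence must be a non-empty list")
    if not isinstance(data["alternatives_considered"], list):
        raise ValueError("alternatives_considered must be a list")
    return Recommendation(
        recommendation=data["recommendation"],
        confidence=float(confidence),
        evidence=data["evidence"],
        alternatives_considered=data["alternatives_considered"],
    )
```

Requiring non-empty evidence is the design choice that matters here: a recommendation with nothing to show the human is rejected before anyone is asked to trust it.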
Failure mode 3. No feedback loop.
The model makes a prediction in week one. By week four, you know the prediction was wrong. The model never learns it. It keeps making the same mistake because there is no mechanism to feed the ground truth back into the eval. The eval set was built from old data. It is not being refreshed with production data. Output quality is drifting and you have no signal for it.
The fix is a feedback loop that captures ground truth and refreshes the eval set monthly. You do not necessarily rewrite the prompt on that schedule - you re-run the eval against the production data to see if output quality has regressed. If it has, you rewrite the prompt or you change the temperature or you upgrade the model. The eval is the canary. The audit log is the data source.
Failure mode 4. Brittle API contracts.
The third-party API you depend on changed its response schema in week three. The integration silently breaks because the parsing code assumes the old schema. The error handling does not catch it. The model is still running but it is producing output the downstream system does not understand. By the time someone notices, you have a backlog of corrupt records.
This happens because APIs change. Vendors deprecate fields. They add optional fields. They change error codes. In staging, with clean data, the parsing code works fine. In production, with years of data history and edge cases, the parsing code breaks. We have seen this happen with CRM APIs, payment processors, and warehouse systems. The system continues to run, but it silently produces garbage.
The fix is defensive parsing with explicit error handling. Every field is optional until proven required. Every error condition is logged. Every unexpected response shape is routed to a dead letter queue. You do not silently drop bad data and you do not assume the API behaves the same way it did in staging. You also version the contract. When the upstream API changes, you have a way to know about it immediately, not six weeks later when a customer mentions the data looks wrong.
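A sketch of defensive parsing under stated assumptions: the payload shape (an "id" and an optional "email") is a made-up stand-in for a CRM contact, the in-memory list stands in for a real dead letter queue, and CONTRACT_VERSION is a hypothetical label you would bump whenever the upstream schema changes.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("integration.parsing")

CONTRACT_VERSION = "v1"  # hypothetical label for the upstream schema we coded against
dead_letter_queue: list = []  # stand-in for a real queue or table


def _dead_letter(payload: Any, reason: str) -> None:
    """Log the problem and park the payload for review instead of dropping it."""
    logger.warning("unexpected response shape (%s): %s", CONTRACT_VERSION, reason)
    dead_letter_queue.append(
        {"payload": payload, "reason": reason, "contract_version": CONTRACT_VERSION}
    )


def parse_contact(payload: Any) -> Optional[dict]:
    """Return a normalized contact, or None if the payload was dead-lettered."""
    if not isinstance(payload, dict):
        _dead_letter(payload, "payload is not an object")
        return None
    contact_id = payload.get("id")
    if not isinstance(contact_id, str) or not contact_id:
        _dead_letter(payload, "missing or non-string id")
        return None
    # Optional until proven required: absence is fine, a wrong type is not.
    email = payload.get("email")
    if email is not None and not isinstance(email, str):
        _dead_letter(payload, "email has unexpected type")
        return None
    return {"id": contact_id, "email": email, "contract_version": CONTRACT_VERSION}
```

Unknown extra fields are ignored rather than rejected, so a vendor adding an optional field does not take the integration down. A field changing type does land in the queue, with the reason attached.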
Failure mode 5. Cost runaway.
The integration goes live and works fine for two weeks. Then the upstream process changes and starts sending ten times as many events. The model processes all of them. The inference bill hits in week three and it is five times what the budget was. There is no cost cap. There is no circuit breaker. The system just keeps calling the API until someone notices.
The fix is a hard cost cap per inference and per tenant, enforced in code. You do not negotiate with this cap. You do not lower it as a favor. You hard-stop inferences when the cap would be exceeded and you route the request to a queue for manual review. Better to lose a few inferences than to lose the budget for three months of development.
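One way to enforce that in code, sketched with in-memory state. The class name, the dollar amounts, and the idea of passing in a pre-computed cost estimate are all assumptions. A real system would persist per-tenant spend and reset it on a billing window.

```python
from collections import defaultdict


class CostCapExceeded(Exception):
    """Raised when a request would break the per-inference or per-tenant cap."""


class CostGuard:
    def __init__(self, per_inference_cap_usd: float, per_tenant_cap_usd: float):
        self.per_inference_cap = per_inference_cap_usd
        self.per_tenant_cap = per_tenant_cap_usd
        self.spent = defaultdict(float)  # tenant_id -> dollars authorized so far
        self.review_queue = []  # stand-in for a manual-review queue

    def authorize(self, tenant_id: str, estimated_cost_usd: float, request: dict) -> None:
        """Reserve budget for one inference, or queue the request and hard-stop."""
        if estimated_cost_usd > self.per_inference_cap:
            self._reject(tenant_id, request, "per-inference cap exceeded")
        if self.spent[tenant_id] + estimated_cost_usd > self.per_tenant_cap:
            self._reject(tenant_id, request, "per-tenant cap exceeded")
        self.spent[tenant_id] += estimated_cost_usd

    def _reject(self, tenant_id: str, request: dict, reason: str) -> None:
        self.review_queue.append({"tenant": tenant_id, "request": request, "reason": reason})
        raise CostCapExceeded(reason)
```

The check runs before the model call, so a rejected request costs nothing. Budget is reserved on the estimate, and reconciling it against the actual billed amount afterwards is left out of the sketch.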
"Most AI integrations fail in the seams. The model works. The surrounding system doesn't trust it."
The integration architecture that survives production.
Every integration needs seven layers. They sit between the source system and the decision that gets made on the other side. Skip one and you are carrying risk you could have designed out. Run all seven and you have a system that can fail gracefully and be debugged when it does.
Layer 1. Data ingestion and validation.
Data lands from the source system. The first thing that happens is validation against a schema. Required fields are checked. Data types are validated. PII is redacted before the data continues downstream. Nulls are handled explicitly. The validation layer is strict: if the data does not match the schema, it gets logged and routed to a dead letter queue. The model never sees invalid data. This layer is where you catch data quality issues before they become model hallucinations. A null value in a field the model expects? The validation layer stops it. A string in a field that should be a number? Caught. A field that used to be optional but is now required upstream? Caught. The validation layer is your first defense. It's not glamorous but it saves you weeks of debugging.
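A minimal sketch of such a validation step, using only the standard library. The ticket schema is hypothetical, and the email regex is a deliberately simple stand-in for real PII redaction, which needs far more than one pattern.

```python
import re

# Hypothetical schema for a support ticket: field name -> required type.
TICKET_SCHEMA = {"ticket_id": str, "body": str, "priority": int}

# Simplistic email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def validate_ticket(record: dict) -> tuple:
    """Return (True, clean_record) or (False, list_of_errors)."""
    errors = []
    for field, expected in TICKET_SCHEMA.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing or null")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{field}: expected {expected.__name__}, got {type(value).__name__}")
    if errors:
        return False, errors
    # Keep only known fields, then redact before anything continues downstream.
    clean = {field: record[field] for field in TICKET_SCHEMA}
    clean["body"] = EMAIL_RE.sub("[REDACTED_EMAIL]", clean["body"])
    return True, clean
```

The caller routes a False result, together with its error list, to the dead letter queue. Returning every error at once, rather than stopping at the first, tends to make schema drift easier to diagnose from the queue alone.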
Layer 2. Prompt construction.
The validated data is templated into the prompt. The prompt is constructed deterministically. The same data always produces the same prompt text. Nothing is random at this layer. The prompt is logged so you can replay it later. If the model's output is wrong, you can reconstruct the exact prompt that produced it.
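Deterministic construction mostly comes down to fixing the serialization order. A small sketch, assuming the validated record is a JSON-serializable dict. The template text and the idea of logging a SHA-256 of the prompt as a replay key are illustrative choices, not a prescribed format.

```python
import hashlib
import json

# Hypothetical template; the only placeholder is {payload}.
PROMPT_TEMPLATE = (
    "Classify the support ticket below.\n"
    "Return JSON with the fields: category, priority, confidence.\n"
    "Ticket data:\n{payload}\n"
)


def build_prompt(record: dict) -> tuple:
    """Return (prompt_text, prompt_id); identical data always yields identical output."""
    # sort_keys removes dependence on dict insertion order.
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False, indent=2)
    prompt = PROMPT_TEMPLATE.format(payload=payload)
    prompt_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return prompt, prompt_id
```

Logging the prompt text alongside its hash means two inferences can be compared by id first, and the full text only has to be pulled when the ids differ.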
Layer 3. Inference with rate-limit handling.
The prompt is sent to the model. The call is wrapped in retry logic. If the rate limit is hit, the request is queued for backoff rather than failing immediately. If the API is down, the request is retried with exponential backoff - not forever, but with a fixed budget of three retries. If the retry budget is exhausted, the request goes to a dead letter queue for manual review. The cost cap is checked before the call is made. If the call would exceed the cap - not the monthly cap, but the per-inference cap and the per-tenant cap - the request is rejected immediately and queued for escalation. Without the cost cap, a single runaway loop can drain a month's budget in an hour. We have seen this happen. It is not fun.
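The retry budget described above can be sketched as follows. RateLimited is a hypothetical exception standing in for whatever your model client raises on a rate-limit response, and the one-second base delay is an assumption. The sleep function is injectable so the logic can be tested without waiting.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client's rate-limit error."""


class RetryBudgetExhausted(Exception):
    """Raised after the final retry fails; the caller dead-letters the request."""


def call_with_backoff(call, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(); on a transient failure wait 1x, 2x, 4x base_delay, then give up."""
    attempt = 0
    while True:
        try:
            return call()
        except (RateLimited, ConnectionError) as exc:
            if attempt == max_retries:
                raise RetryBudgetExhausted(f"gave up after {max_retries} retries") from exc
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Only failures known to be transient are retried. Anything else, such as a validation error, propagates immediately, because retrying it would spend budget without changing the outcome. Production versions usually add random jitter to the delay, which is left out here to keep the sketch deterministic.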
Layer 4. Output parsing and validation.
The model's response comes back as text. It is parsed - either as JSON if the model was asked for structured output, or as free text if it was asked for a summary. The parsed output is validated against a schema. If the output does not match the schema, it is routed to a dead letter queue. If it does match, it continues.
Layer 5. Action routing and idempotency.
The validated output is routed to the system that acts on it. The CRM record is updated. The ticket is escalated. The Slack message is sent. This layer is idempotent: applying the same output twice has the same effect as applying it once, so a retried action never creates a duplicate update, escalation, or message. If the downstream system is unavailable, the action is queued for retry.
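A common way to get that property is an idempotency key derived from the action itself. The sketch below assumes the key can be built from a record id, an action name, and the payload. The dict is a stand-in for a durable store, and the handler is whatever actually calls the downstream system.

```python
import hashlib
import json


def idempotency_key(record_id: str, action: str, payload: dict) -> str:
    """Derive a stable key: the same record, action, and payload always hash the same."""
    canonical = json.dumps(
        {"record_id": record_id, "action": action, "payload": payload}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ActionRouter:
    def __init__(self, handler):
        self.handler = handler  # callable(action, payload) that performs the side effect
        self.completed = {}  # stand-in for a durable key -> result store

    def route(self, record_id: str, action: str, payload: dict):
        key = idempotency_key(record_id, action, payload)
        if key in self.completed:
            # Already applied: return the stored result, do not repeat the side effect.
            return self.completed[key]
        result = self.handler(action, payload)
        self.completed[key] = result
        return result
```

If the handler raises, nothing is recorded, so the retry is free to run it again. Many downstream APIs also accept an idempotency key of their own, and forwarding this one closes the gap where the side effect succeeds but the local write does not.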
Layer 6. Audit logging.
Everything - the input, the prompt, the model output, the action taken, the downstream feedback - is logged. The audit log is immutable. It is the source of truth for reconstructing what happened when something goes wrong. It is also the source of data for refreshing the eval set.
Layer 7. Feature flagging and monitoring.
The entire path can be toggled off with a feature flag. If something is wrong, you can flip the flag and the system routes back to the old behavior without a deploy. Metrics are emitted at every layer: latency, error count, refusal rate, hallucination flag. Alerts are configured to fire when the metrics drift from the baseline. You know something is wrong before the customer does.
Structured outputs versus free text.
The question that determines whether your integration can scale is whether the model outputs structured data or free text. This is not a model question. It is an architecture question that becomes a management burden if you get it wrong.
If you ask the model to "summarize the ticket," you get prose. A human can read it. A machine cannot act on it. You have bought yourself a years-long debt of parsing. Every downstream system that needs to act on the summary has to do natural-language understanding to extract the actionable pieces. The support team sees a prose summary. The CRM integration sees a prose summary. The analytics pipeline sees a prose summary. Each one has to parse it. Each parse introduces drift. By month four, you have eight different versions of "what the model said" floating around your systems.
If you ask the model to return a JSON with summary, category, priority, escalation_reason, and confidence fields, you get a machine-readable response. A downstream system can read the JSON and act immediately. The schema is the contract. The downstream system knows exactly what fields to expect and in what format. No natural-language parsing. No drift between consumers. The handoff is deterministic.
The difference between these two approaches is the difference between a demo and a production system. In a demo, free text is fine. In production, with teams depending on the output and the system making decisions based on the output, structured outputs are non-negotiable. Modern models support this through function calling and structured output modes. Use them. Every integration ships with structured outputs or it does not ship.
The evaluation problem.
Most teams test models in isolation. You run a benchmark. The model scores 90% accuracy. You declare success. You ship to production. The model hallucinates on a class of inputs you never tested. The integration breaks. You spend a week debugging. You change the temperature. The accuracy drops to 87%. You change it back. You give up. You decide "AI just doesn't work for this use case."
The problem is not the model. The problem is the eval. You tested the model's accuracy, not the integration's accuracy. The model's accuracy is not the same as the integration's accuracy. A model can be 90% accurate on its own and the integration can be 40% accurate in production because four other layers are broken.
The eval is not the model's accuracy. The eval is the integration's accuracy. It is the full stack: data ingestion, prompt, inference, parsing, action routing, and downstream feedback. If any layer is wrong, the eval fails. You can have the smartest model in the world. If the data layer is feeding it garbage, the eval is garbage.
A production eval is hand-rated. You sample fifty to a hundred real examples from your actual production logs. You rate them. You build a golden eval set. The eval harness scores your integration against these examples. You run it before every deploy. You run it every week in production. When the score drops below the baseline, an alert fires. You know something changed. It could be the model. It could be the data. It could be the prompt. The eval doesn't care. The eval just tells you the truth: "the integration is degraded."
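The harness itself can be very small. A sketch under loud assumptions: each golden example is a dict with hypothetical "input" and "expected" keys, scoring is exact match (hand-rated rubrics are often more nuanced), and the alert threshold borrows the five-point drop mentioned in the FAQ at the end of this post.

```python
def run_eval(golden_set: list, integration, baseline_score: float, max_drop: float = 5.0) -> dict:
    """Score the full integration against the golden set; flag a drop past the threshold."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = 0
    failures = []
    for example in golden_set:
        try:
            output = integration(example["input"])
        except Exception as exc:  # a crash in any layer counts as a failed example
            failures.append({"input": example["input"], "error": repr(exc)})
            continue
        if output == example["expected"]:
            passed += 1
        else:
            failures.append({"input": example["input"], "got": output})
    score = 100.0 * passed / len(golden_set)
    return {
        "score": score,
        "alert": score < baseline_score - max_drop,
        "failures": failures,
    }
```

The integration argument is the whole pipeline as one callable, from raw input to final action payload, not the model call alone. That is what makes the score an integration score: a broken parser or validation layer fails examples exactly as a bad prompt would.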
If your team does not have an eval, you do not have an integration. You have a demo. The demo works on clean data and known examples. The integration has to work on messy data and edge cases. If you cannot measure the difference, you cannot know which one you have built.
Feature flags, canaries, and rollback.
Every AI integration ships behind a feature flag. Day one of production, 5% of traffic goes to the AI path. 95% goes to the baseline. You measure latency, error rate, and the downstream action quality. After three days, you increase it to 10%. After a week, 50%. After two weeks, 100%.
The flag lets you roll back in seconds if something is wrong. No deploy. No hotfix. You flip the flag and traffic routes back to the baseline. The canary approach means you find problems in the 5% before they affect everyone.
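One simple way to implement the percentage split is stable hash bucketing. This is a sketch, not any particular flag vendor's API: the function name and the salt are assumptions, and in practice the rollout percentage would be read from your flag service at request time.

```python
import hashlib


def in_ai_cohort(unit_id: str, rollout_percent: int, salt: str = "ai-path") -> bool:
    """Deterministically assign a unit (user, tenant, ticket) to the AI path."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Because the bucket depends only on the id and the salt, a given user stays on the same path between requests, and everyone in the 5% cohort is still in it at 10%. Setting the percentage to 0 is the rollback: no deploy, every request takes the baseline.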
When to build versus when to buy.
Not every integration requires building. Some do. Some don't. The decision happens early or it costs you later. Get this wrong and you either build something that exists, or you try to buy something that requires understanding your data.
Buy if the task is a known category. You need summarization, classification, entity extraction, sentiment analysis, translation, semantic search. These are solved problems. There is a tool or an API for each one. Some are open source. Some are SaaS. Some are built into the platform you already use. Use them. You are not smarter than the twenty companies that solved this problem before you. You are also not smarter than the open-source community. If you find yourself writing your own summarizer, you have wasted two weeks.
Build if the task requires understanding your proprietary data, your business logic, your workflows. You need a model that understands your customer cohorts, your pricing tiers, your internal terminology, your escalation workflows. No vendor knows your Slack culture. No API knows which customer complaints are actually sales objections versus which ones are legitimate support issues. No off-the-shelf tool understands your data. You have to build it.
The dividing line is usually data. If the task depends on understanding generic text - "is this email spam?" or "what language is this in?" - buy. If it depends on understanding your data - "does this customer fit our ICP?" or "should this order be escalated to the fraud team?" - build.
We built Sentinel rather than buying an analytics product because the intelligence layer requires understanding how Shopify data maps to a specific merchant's cohort definitions. No third-party analytics tool has that model. So we built it. And now we operate it. And now we know what it costs to keep an LLM-backed system inside the SLA.
What a real engagement looks like.
Most teams that try to build an integration on their own run into the same problem: they underestimate the infrastructure. A model that works in a notebook does not work at scale. A prompt that works in a chat interface does not work in production. You end up building all seven layers we described above, and by the time you're done, you've spent six months and three engineers on something that should have been four weeks.
A production AI integration engagement follows a rhythm that separates the architecture work from the implementation work, and separates the implementation from the validation. The rhythm forces you to get the big decisions right before you start writing code.
Week 1-2: Discovery and data audit. We understand which systems need to talk to each other. We look at the data - its shape, its cleanliness, its schema drift history. We ask hard questions about what success looks like. Not "will the model be accurate" but "what number will move if this integration works." We write a one-page spec naming the one outcome and the one metric. This spec becomes the contract. If the metric moves by the amount we specify, we have succeeded. If not, we have failed and we owe you a retrofit.
Week 3-4: Architecture and eval design. We design the seven-layer architecture. We decide which layer will be the bottleneck and we build the observability for that layer. We build the golden eval set from production data, not from hunches. The eval is 20-40 hand-rated examples that represent the distribution of inputs the integration will see. We write the evaluation harness. We get the team's sign-off on the eval before we write a single line of prompt code. This order is non-negotiable. Too many teams write the prompt first and then try to eval it. It does not work. The eval is the spec. The prompt is the implementation.
Week 5-8: Build and staging. We build the integration. Prompt, parsing, action routing, idempotent webhook handlers, cost caps, PII redaction, logging. We run it against the eval every single day. When the eval passes, we move to staging. We run it against production-like data in staging. We deliberately break things to see how the error handling works. We measure latency under load. We check the infrastructure costs.
Week 9-10: Canary and measure. We ship with a 5% traffic flag. 5% of the production load goes to the AI path. 95% goes to the baseline. We measure latency, error rate, action quality. We track the metric from week one. Is it moving? After three days of clean canary, we increase to 10%. After another three days, 25%. After a week, 50%. We watch for drift - latency regressions, error spikes, metric plateau. The flag lets us roll back in seconds if something is wrong.
Week 11-12: Full cutover and monitor. The flag is at 100%. The integration is the primary path. We hand over the runbook to your team - the playbook for incident response, rollback procedures, when to escalate to us. We stay on call for two weeks. Then we step back and you own it.
For a mid-complexity integration into one system - HubSpot or Zendesk or Stripe - this timeline is twelve weeks. For a simple one with low API complexity and clean data, it's eight weeks. For a complex one with multiple interdependent systems and data governance requirements, it's sixteen to twenty weeks. We price the work by engagement shape, not by the hour. A strategy sprint runs $25-45k. A production PoC runs $50-150k. A full implementation runs $150k and up.
The plays we've seen fail and succeed.
We have watched teams try to shortcut this process. They skip the eval to save time. They skip the canary because "we need to move fast." They deploy to 100% on day one. They almost always regret it within three weeks.
The failed case we see most often: team spends three weeks writing prompts. Team tests the prompt against ten examples in a Jupyter notebook. The prompt nails nine out of ten. Team declares victory. Team deploys to production on a Friday. By Monday, the integration is producing nonsense on a class of inputs that were not in the ten-example test set. The team spends week four rewriting the prompt. By week six, the metric has not moved. By week seven, the project is canceled.
What went wrong? The team tested the model. They never tested the integration. The model was fine. The eval was garbage. By the time they realized this, they had already committed to the wrong path and it was too expensive to reboot.
The teams that succeed are the ones that move slowly upfront and quickly later. They spend two weeks building and vetting the eval before writing a single prompt. They spend another two on the architecture and the observability. They get the hard decisions right. Then they move fast because the foundation is solid. They iterate on the prompt daily against the eval. The eval drives the work. By week five, the integration is in staging. By week eight, it is in production and moving the metric.
The teams that fail are the ones that try to move fast upfront. They skip the eval to "get to code faster." They skip the canary to "reduce process overhead." They skip the runbook because "we'll just stay on call." They always end up slower, because they are chasing production bugs instead of designing for them. By week six they are where the careful teams were at week four. By week twelve they are where the careful teams finished at week eight.
If your team is hitting failure mode 2 or 3 - where the integration works but no one trusts it, or no one can tell when it has drifted - read our post on why 95% of GenAI pilots fail. The short version: most pilot failures are architectural, not model-quality failures. You don't need a better model. You need a better integration. If you want to see what this looks like in the wild, read what we learned shipping 12 agents in 90 days - the four things we'd tell our day-one selves apply to every integration project.
Frequently asked.
How long does an integration actually take?
Eight weeks for a simple integration with clean data, twelve for a mid-complexity one. Two weeks if you're integrating an off-the-shelf model into an off-the-shelf system with no custom logic. Sixteen to twenty weeks if you're integrating multiple systems with custom data governance. The variable is complexity, not the team's skill.
How much does this cost?
A strategy sprint that maps out the architecture and risks runs $25k-$45k. A production proof-of-concept that ships one integration live runs $50k-$150k. A full implementation across multiple systems runs $150k+. The estimates are ranges because the final cost depends on how much existing integration infrastructure you have and how clean your API contracts are.
Can we do this in-house?
Yes, if you have two senior engineers who can spend four months on it and can afford to eat the mistakes. Most teams don't. Most teams build the integration, ship it, have it blow up in week six, then rebuild it. At that point the total cost is higher than hiring us for twelve weeks.
What happens if the integration drifts in production?
The eval catches it. The eval runs daily in production. If the score drops by more than five points, an alert fires. You rerun the eval against production data to narrow down which layer is drifting. You fix the prompt or the config or the schema. You redeploy. The eval passes. You move on. This is why the eval is the load-bearing component of the whole system.
What if we want to switch models?
The eval is model-agnostic. If you want to try GPT instead of Claude, you run the same eval against both. You compare the scores. You pick the winner. The integration code barely changes. The prompt might change. The eval stays the same. This is another reason the eval is first.