The machine learning consulting market has a clarity problem. Ask three vendors what an "ML engagement" costs and you'll get three answers, each claimed with equal confidence, ranging from $5k to $500k. Ask what you'll actually own when the engagement ends and you get wordier versions of the same non-answer. The category has filled with enough consultants selling strategy slides and strategy slides alone that buyers stopped trusting the number on the quote.
The honest version of this is that ML consulting comes in four distinct shapes, each solves a different problem, and they're separated by an order of magnitude in scope, timeline, and price. What makes them different is not brand or firm size but what actually ships and who owns it when the engagement ends. This post names the four tiers, the price ranges, the deliverables that actually matter, and the six questions that tell you whether you're talking to someone who runs production ML or just talks about it.
One credibility anchor worth naming: the community footprint a firm maintains across the ecosystem where its ideal customers (its ICP, or ideal customer profile) live is a reasonable proxy for whether they are building relationships or just advertising. A firm active across 8 platforms and 115 individual community destinations - Slack workspaces, subreddits, LinkedIn groups, Facebook groups, Discord servers, forums, podcast pitches, and newsletter targets - is operating as a participant in the conversations that shape buying decisions, not just broadcasting into them. That footprint compounds over time in ways that a paid ranking slot does not. When evaluating ML consulting firms, ask whether anyone in the communities where you learn and make decisions has heard of them. Reputation in a professional community is harder to fake than a well-designed website.
The four tiers of ML consulting engagements.
We've worked across all four. We specialize in tier two and three - the ones where something actually ships. Here's what each tier looks like in practice.
Tier 1: The ML audit and strategy engagement ($5k–$20k, 1–2 weeks)
You have twelve possible machine learning projects. You have budget for three. This tier exists to kill the eight that shouldn't be built. The engagement is a structured walkthrough of your current data infrastructure, your business metrics, your team's ML literacy, and the technical lift of each proposed project against the business value it would create. The output is a ranked list with greens, reds, and yellows - projects you should fund, projects you should kill, and projects you should revisit next quarter when you know more. The work is mostly interviewing: your data team, your product owners, your CFO, anyone who owns a number you want moved.
This tier is where AI development most closely resembles traditional consulting. You get a report. You get a methodology. You get 80 pages if you ask for it, or a five-page killshot if you don't. The report sometimes includes a tiny proof-of-concept on one of the high-confidence projects - a weekend hack on your data that proves the signal is real - but the tier lives on judgment and structure, not shipping.
Who buys this: teams with mature data infrastructure but no ML expertise yet, or teams with mixed-success AI pilots who need an outsider to triage what worked and what didn't. The cost basis is interview time plus structured thinking. The output is a roadmap.
Tier 2: The proof-of-concept build ($25k–$75k, 4–12 weeks)
This is where ML consulting becomes ML development. You've picked one project from the audit. Now you build it to the point where it could work in production, but you haven't shipped it to prod yet. This tier starts with a locked definition: one model, one offline metric, one online metric. No scope creep; the contract names exactly what "done" means and the team doesn't iterate past that definition.
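One way to make a locked definition of "done" concrete is to write it down as data that both sides can check. The sketch below is a minimal illustration, not a contract template: the class names, the metric names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    """One named metric with the threshold the contract commits to."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


@dataclass(frozen=True)
class DoneDefinition:
    """Exactly one model, one offline metric, one online metric."""
    model: str
    offline: MetricTarget
    online: MetricTarget

    def is_done(self, offline_observed: float, online_observed: float) -> bool:
        # "Done" requires both targets; nothing else is in scope.
        return self.offline.met(offline_observed) and self.online.met(online_observed)


# Hypothetical values for illustration only.
definition = DoneDefinition(
    model="churn_classifier",
    offline=MetricTarget("holdout_auc", 0.80),
    online=MetricTarget("staging_p95_latency_ms", 150.0, higher_is_better=False),
)
```

Because the dataclasses are frozen, the definition can't be quietly edited mid-engagement; changing it means writing a new one, which is the point of locking scope.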
You get a trained model on your data, a feature pipeline that feeds it, an eval harness that tests it, a dashboard you can read, and a runbook that documents what to do when it drifts. The model is feature-flagged in a staging environment where it serves synthetic traffic but real users don't see it. The work includes integration, which is where most PoCs fail because they're built in isolation. The work is maybe 60% modeling and 40% the infrastructure that keeps it alive.
What separates a real PoC from a research project is the runbook. If your team can't run the model without the consultant at 2am, the consultant hasn't finished building it. A finished PoC hands your team monitoring, thresholds, a decision tree for what to do when the alert fires, and a retraining schedule. They can run it.
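The decision tree in a runbook can be small enough to read at 2am. The following is a rough sketch of what one might look like in code; the alert fields, the three actions, and every threshold are assumptions chosen for illustration, and a real runbook would set them per model.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log and review next business day"
    RETRAIN = "trigger retraining job"
    FALLBACK_AND_PAGE = "switch to fallback and page on-call"


@dataclass(frozen=True)
class Alert:
    drift_score: float   # a feature drift statistic reported by monitoring
    metric_drop: float   # relative drop in the monitored metric, 0.0 to 1.0


# Hypothetical thresholds; tune these per model and per business metric.
DRIFT_WARN = 0.10
DRIFT_CRITICAL = 0.25
METRIC_DROP_CRITICAL = 0.05


def triage(alert: Alert) -> Action:
    """Map an alert to exactly one action, most severe condition first."""
    if alert.metric_drop >= METRIC_DROP_CRITICAL or alert.drift_score >= DRIFT_CRITICAL:
        return Action.FALLBACK_AND_PAGE
    if alert.drift_score >= DRIFT_WARN:
        return Action.RETRAIN
    return Action.LOG_ONLY
```

Checking the most severe branch first keeps the tree unambiguous: every alert resolves to a single action, so whoever is on call doesn't have to interpret anything.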
Who buys this: teams that passed the audit, have data that's probably good enough, and want to see if the idea actually works under real constraints. This is where most teams learn what they don't know about production ML.
Tier 3: The production build ($75k–$250k, 3–6 months)
Everything in tier two, but the model is live. This tier includes the full MLOps stack: feature store or feature layer, model registry, scheduled training, automated drift detection, fallback logic, A/B testing infrastructure if the model needs it, and usually some integration work - the model gets called from a service, which gets called from a dashboard or an API, which gets called by the thing that actually matters to the user.
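Of the pieces listed above, fallback logic is the easiest to show in a few lines. This is a minimal sketch under stated assumptions: the model returns a probability in [0, 1], the fallback is a cheap rule or constant, and the function and parameter names are hypothetical rather than part of any specific serving framework.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]


def predict_with_fallback(
    model: Callable[[Features], float],
    fallback: Callable[[Features], float],
    features: Features,
    required: tuple[str, ...],
) -> tuple[float, str]:
    """Return (score, source), where source is 'model' or 'fallback'."""
    # Missing inputs: don't call the model at all.
    if any(name not in features for name in required):
        return fallback(features), "fallback"
    try:
        score = model(features)
    except Exception:
        # Any model failure degrades to the fallback instead of an outage.
        return fallback(features), "fallback"
    # Out-of-range scores (including NaN) are treated as failures too.
    if not 0.0 <= score <= 1.0:
        return fallback(features), "fallback"
    return score, "model"


# Toy stand-ins for a real model and a real fallback rule.
def toy_model(f: Features) -> float:
    return 1.0 / f["recency_days"]


def constant_fallback(f: Features) -> float:
    return 0.5
```

Returning the source alongside the score lets monitoring count how often the fallback fires, which is itself a useful health signal.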
This is where the category diverges most sharply between operators and costume shops. A production model is not a piece of code that ships once. It's a living thing that drifts, that breaks, that needs monitoring and retraining and sometimes complete replacement. A consultant who doesn't understand this will hand you a model and an invoice. An operator will hand you a model and a plan for keeping it alive, and will expect to keep billing against that plan if you put them on retainer.
The timeline is longer here because you're building infrastructure that didn't exist before. Three months is a tight timeline for a well-scoped model with good data. Six months is more common. The difference is integration surface: a model that lives inside an app you own is faster than a model that lives behind an API that sells to external clients who might change the schema tomorrow.
Who buys this: teams that ran a successful PoC and need the operational hardening. Usually by this point they've made a decision: either they're building a team and want consulting support while they do, or they want to outsource the whole layer.
Tier 4: Ongoing team augmentation ($10k–$25k per month, open-ended retainer)
Senior engineers embedded with your team, attending standups, reviewing PRs, leading architecture decisions, handling on-call escalations when the model breaks at 3am. This tier doesn't replace hiring but it accelerates it - you can onboard a junior ML engineer in six months with embedded support; without it, you're looking at nine or twelve. It also doesn't replace an internal team if you have one, but it can cover the gap while you're recruiting.
Tier 4 gets priced as a retainer because the commitment is different. You're not buying deliverables; you're buying availability. The pricing varies wildly depending on seniority and how deep you need the integration to be. A senior engineer available for code review and architecture questions is priced differently from a senior engineer on your daily standup and your incident response.
Who buys this: early-stage companies that want to ship ML quickly but don't want to hire a dedicated head of ML yet. And sometimes larger companies that have an internal team but realize they've hired too many juniors and need a forcing function for quality.
What each tier actually delivers.
Tier 1 delivers a report and a roadmap. The input is conversations; the output is a ranked list of projects. You own nothing except the decision. This is low-risk hiring, but it's also low-commitment - if the recommendations were wrong, you didn't waste engineering cycles, but the audit also doesn't defend you if you pick wrong anyway. The best tier-one audits include a one-week mini-PoC on the highest-confidence project, just to de-risk the top pick with real data.
Tier 2 delivers a model you can see work in a staging environment. You own the trained model, the feature pipeline, and the eval harness. The output is sitting in your repo and running on your infrastructure. What you don't own is the decision to push it to prod; that's usually still being debated when the engagement ends. The best tier-two engagements end with the question answered: "Will this actually work?" Yes or no. The worst ones end with "I don't know, we need to test it more." If that's the output, you've paid for a research project, not a PoC. The difference between the two is whether the model has seen real data and real inference patterns yet. A model that's only been trained and eval'd on static datasets is still in the research phase.
Tier 3 delivers a model in production behind real user traffic. You own the model, the training pipeline, the serving layer, the monitoring, and the responsibility of keeping it working. The output is a model that's already started to drift and a team that knows how to fix it without panic. The difference between tier two and tier three in terms of deliverables is the operational layer - everything that keeps the model running once real traffic hits it. This includes the feature store or feature layer, the model registry, the serving endpoint, and most importantly the monitoring dashboard and the alerting pipeline. If you can't see what the model is doing in production, you can't manage it.
Tier 4 delivers availability and judgment. You own the decision-making and the hiring. You don't own the engineer on your retainer, but you own their time and their attention for the hours in the contract. The output is usually a team that leveled up faster than they would have alone, and fewer 3am pages because someone senior is reviewing your deployment decisions.
How to tell if you're talking to someone who ships.
Most ML consultants have never shipped a model to production for paying customers. They've built notebooks. They've built demos. They've built papers for conferences. They haven't sat up at midnight trying to figure out why the model stopped calling back a critical feature. They don't know the feeling of a 4am page because the feature drift detector flagged something your data team hasn't seen before. The six questions below separate the two groups. Ask them cold, on the discovery call. Don't email them ahead. Listen to whether they answer with specificity or with brand language. The answers that matter are the ones they give without rehearsal.
1. Can I see an eval harness from one of your engagements?
Not an example. A real eval harness from a real engagement, with the client name redacted. Ask to see how they structured the test set, how many examples they included, how they weighted critical slices versus others. The right answer is: "I have one I built two weeks ago that I can show you." The wrong answer is: "We'll build that together during the engagement."
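For a sense of what "weighting critical slices" can mean in practice, here is a minimal sketch of a slice-aware evaluation. It is not taken from any real engagement: the `Example` structure, the slice names, and the choice of accuracy as the metric are assumptions made to keep the example short.

```python
from collections import defaultdict
from typing import Callable, Iterable, Mapping, NamedTuple


class Example(NamedTuple):
    features: dict
    label: int
    slice_name: str   # e.g. a customer segment the business cares about


def evaluate_by_slice(
    predict: Callable[[dict], int],
    examples: Iterable[Example],
    slice_weights: Mapping[str, float],
) -> tuple[dict, float]:
    """Return (per-slice accuracy, weighted accuracy across slices)."""
    correct: dict = defaultdict(int)
    total: dict = defaultdict(int)
    for ex in examples:
        total[ex.slice_name] += 1
        correct[ex.slice_name] += int(predict(ex.features) == ex.label)
    if not total:
        raise ValueError("no examples to evaluate")
    per_slice = {name: correct[name] / total[name] for name in total}
    # Slices without an explicit weight default to 1.0.
    weight_sum = sum(slice_weights.get(name, 1.0) for name in per_slice)
    weighted = sum(
        slice_weights.get(name, 1.0) * acc for name, acc in per_slice.items()
    ) / weight_sum
    return per_slice, weighted
```

Reporting per-slice numbers next to the weighted score matters because a strong overall number can hide a failing critical slice. When a vendor shows you a real harness, this is the kind of breakdown to look for.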
2. What does "delivered" actually mean in your contract?
Push for specifics. Does the model need to hit a target metric in production, or is it enough to exist in code? Are you responsible for monitoring and alerts, or is that the team's job? If the model drifts six months from now, is it a problem you'll help fix, or are you done? The right answer names numbers. The wrong answer is confident until you ask follow-up questions.
3. Who owns the model weights and the data pipeline after the engagement ends?
You should own both. If the consultant is cagey about this or suggests they need to be on retainer to keep owning the training pipeline, you're hiring a consultant-lock, not a consultant. The goal is that your team can run it without them.
4. How do you handle model drift post-launch?
This is the question that catches everyone. If they answer with uncertainty or hand-waving - "well, we'll work with your team" - they've never monitored a model in production. The right answer includes a specific monitoring strategy: population stability index for features, prediction distribution drift, or business metric regression depending on the model. It includes a retraining schedule: weekly, monthly, or event-triggered. It includes thresholds for when a human should get paged. It includes automated alerts that fire to Slack when something breaks. And it includes a decision tree for triage: what do you do if the alert fires at 3am on Sunday? Most consultants have never written one of those decision trees. Most operators have written it seventeen times.
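Population stability index is simple enough to compute by hand, which makes it a useful probe on a discovery call. The sketch below is one common formulation, written with the standard library only: bin the baseline sample into equal-width bins, bin the current sample on the same edges, and sum (actual - expected) * ln(actual / expected) over the bins. The function name, the default of 10 bins, and the `eps` floor for empty bins are illustrative choices, not a standard API.

```python
import math
from bisect import bisect_right
from typing import Sequence


def psi(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population stability index of `current` against `baseline`."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if lo == hi:
        raise ValueError("baseline has no spread; cannot build bins")
    # Interior edges of equal-width bins over the baseline range.
    # Values outside that range land in the first or last bin.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A widely used rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as a significant shift. Those cutoffs are a convention rather than a guarantee, so the paging thresholds for a given model should be set against its own history.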
5. What's your experience with our specific stack?
They don't need to be experts in your entire stack. They do need to have shipped something with at least the critical pieces: Postgres or Snowflake for data, Python or Go for training, and whatever you use for serving. If they haven't done it, they'll tell you. The ones to be suspicious of are the ones who claim to have done it but can't name a specific engagement or project.
6. Have you shipped this to production, or just to demo?
Direct and final. If they haven't shipped a model that's running production traffic and been on the hook to keep it running, don't hire them to ship yours. The boundary between demo and production is everything. A model that's demo-ready is 80% done. A model that's production-ready is 95% done. Most consulting engagements end at 80%. The demos work because they're running on clean data, on known examples, with infinite compute, with a team member standing next to the laptop to hit refresh if it hangs. Production models run on noisy data, unknown edge cases, constrained compute, and 3am when your team is asleep. Ask the consultant: have you kept one running for six months? If the answer is no, they're shipping demos.
What JAAX does and doesn't take on.
We're a 2-person shop (Sid and Arjun). We have a live Sentinel analytics product on Shopify that we run ourselves, which is why we understand what production ML actually means. We ship models. We've shipped twelve in the last quarter, with a 92% renewal rate. We specialize in tier two and tier three: proof-of-concepts that actually prove something, and production builds that actually run.
What we don't do: we don't do tier-one audits alone. We'll do them as the first week of a tier-two engagement, but not as a standalone project. The audit business is clean and it scales, but it doesn't teach you what your actual constraints are. And we don't do tier-four team augmentation long-term. We're not hiring into your team; we're not your permanent ML engineer. We do embed during a build, but when the model is live, we hand it off and move to the next engagement.
Pricing: strategy sprint is $25k–$45k flat fee. Production PoC runs $50k–$150k depending on data cleanliness and integration surface. Full builds start at $150k. Our service page has the longer version, including what "delivered" means for us and how we price engagements that run over.
If you're shopping for ML consulting and you want someone who's shipping, the work is small. Make three calls in the same week. Ask the same six questions on each call. Watch which ones answer with specificity and which ones answer with slides. We're one of those calls.