Skip to main content

Claude consulting for teams who picked the wrong Claude.

Most teams running Claude are using Opus where Sonnet would win and Sonnet where Haiku would win. The fix is not a different model provider. It is the right tier per call, an XML prompt rig, and prompt caching wired in by default.

14d
sprint, no extensions kickoff to read-out
$25–45k
strategy sprint, flat-fee engagement
11/12
agents shipped on Claude in production
92%
renewal rate across the cohort
2
operators - no junior layer
40–70%
typical cost savings on prompt caching

Most teams are overspending on Claude by 40–70%. We fix the bill and the latency.

Most Claude deployments
Running Opus on routing prompts that Haiku would solve for one-twentieth the cost
Markdown-headered prompts that would gain four eval (automated test suite that scores model output) points in XML (the structured tagging format Claude was trained on)
No prompt caching enabled - overspending 40–70% on Sonnet and Opus calls
No eval harness - can't safely upgrade when Anthropic ships a new snapshot
Multi-turn surfaces that drift on turn seven because state management is ad-hoc
JAAX Claude consulting
Model-tier matrix per use case - let the eval pick, not vibes
XML prompt rigs under version control with thinking blocks where they earn it
Prompt caching wired at the call-site - usually 2–3x cost reduction on volume
Eval harness in CI - run the eval on new snapshots before deploying
Hallucination CI and refusal-rate alerts wired to PagerDuty (on-call alert routing)

Each is a contract for a specific problem.

Strategy sprint

Two weeks, no extensions

For teams with twelve possible Claude bets and a budget for three. Model-tier matrix, prompt-rig audit, caching breakeven, and a six-month roadmap defensible in a board meeting.

Production PoC

Fourteen days to live

For teams who have decided the bet and need a Claude feature serving real traffic. The production proof of concept (PoC) runs in your environment behind a feature flag by day fourteen, refundable if it doesn't ship.

Full implementation

Six to twelve weeks

For teams scaling from one prompt to a portfolio. Tool use, thinking blocks, 200K context strategy, hallucination CI, monitoring with refusal alerts, cost caps per tenant.

Team augmentation

Monthly retainer

For teams with ML engineers who want senior backup. Prompt review, eval design, model-tier calls, on-call escalation. We're in your Slack, not on the org chart.

Strategy sprint. Seven deliverables. One document.

The strategy sprint is the entry point. You get a model-tier matrix, a prompt-rig audit, a caching breakeven analysis, eval harness baseline, a six-month roadmap, a kill list, and a recommended next sprint costed and scoped.

Book a fit call  →
Claude Strategy Sprint · Deliverable · JAAX Labs
Claude Strategy Sprint - Deliverable
YOUR COMPANY  ·  CONFIDENTIAL
01
Model-tier matrix
Opus vs Sonnet vs Haiku per use case. When to upgrade, when to downgrade, and why.
02
Prompt-rig audit
What to rewrite from Markdown into XML, where to add cache breakpoints, thinking block recommendations.
03
Prompt caching breakeven
Per-prompt analysis with projected monthly savings once caching is wired at the call-site.
04
Eval harness baseline
Hand-rated golden set, baseline scores on Opus/Sonnet/Haiku, and the upgrade decision matrix.
05
Six-month roadmap
Prioritized Claude bets with model tier, latency budget, and integration cost per project.
06
Kill list
Projects where Claude doesn't pay off, with one-paragraph justification each.
07
Recommended next sprint
Costed and scoped proof-of-concept for the highest-leverage green item in the roadmap.

Four shapes. Ranges from $25k to $150k+.

We publish them because we have been on the other side of the table - the call where you ask the price and three weeks of email-tag begin.

Production PoC $50–150k

Fourteen days to one Claude feature live in your stack. XML rig, eval harness, prompt caching wired, dashboard, runbook. Refundable if it doesn't ship.

Full implementation $150k+

Six to twelve weeks. Tool use, thinking blocks, 200K context strategy, hallucination CI, monitoring, cost caps per tenant, full integration into your stack.

Team augmentation $/month retainer

Senior engineers embedded with your team. Prompt review, eval design, model-tier calls, on-call escalation. Quoted by scope.

/ Our own Claude product - Sentinel /

We run Sentinel on Claude. Same rig you'll inherit.

Sentinel is JAAX's live Shopify analytics product. The Claude layer that answers operator questions over merchant data runs on the same XML-structured, prompt-cached, eval-gated rig you'll inherit. Every habit on this page was earned shipping it, not theorized for a deck.

See Sentinel
11/12 agents on Claude in production
92% renewal rate
14d typical sprint
2 operators, no juniors

For teams who have picked Claude and want the engineering right.

The buyers we do our best work for share three traits:

  • A number they want moved - deflection rate, recovery rate, time-to-quote, cost-per-ticket
  • At least one AI initiative already attempted - they know the difference between a working agent and a working demo
  • A window, usually a quarter, to show something running

We work with Series A startups whose CTO is the buyer. We work with mid-market product teams handed Claude with no roadmap. We work with Fortune 1000 divisions that picked Claude through procurement and now need someone who can actually ship on it.

If you need vendor-evaluation help across Anthropic, OpenAI, and Google, see our anthropic-consulting page. This page is for product-level Claude buyers.

"The eval picks the model. The XML picks the prompt. Prompt caching is not optional once you have run the math."
JAAX methodology

Questions we get on every fit call.

Claude consulting is the engineering practice of getting Claude-backed features into production with the right model tier, the right prompt rig, and the eval harness to know whether either is working. The honest version of the category builds and runs the systems it sells. The dressed-up version sells prompt-tweaking by the hour. We are the first kind. The proof is Sentinel.

  • Opus 4.7 for high-stakes reasoning, multi-step agent loops, and the calls where being right matters more than being fast.
  • Sonnet 4.6 balances cost and quality - it wins most evals. Most production calls live here as the daily driver.
  • Haiku 4.5 for high-volume classification, routing, and the calls where p95 latency (the slowest 5% of requests) under 500ms is the constraint.

The rule is: write the eval first, run all three at the cheapest setting, and let the eval pick. Most teams arriving at us are using Opus where Sonnet would win and Sonnet where Haiku would win.

XML tags, not Markdown headers. Claude was trained to follow XML structure (, , , ) more reliably than free-form prose, and the lift from the same prompt restructured into XML is usually two to five eval points without any other change. We also use thinking blocks for hard reasoning calls, system prompts for invariants, and put the most important instruction at the end of the user message - Claude weights recency more than the middle of long contexts. The system prompt is for the role; the user message is for the task; the assistant prefix is the underrated lever.

Prompt caching breaks even at roughly two cache hits within the five-minute TTL, since the write is 1.25x and the hit is 0.1x of base cost. That makes caching a near-universal win for any prompt with a stable system message, a long retrieved context, or a multi-turn conversation. We wire caching at the call-site by default and set the cache_control (the API flag that activates prompt caching) breakpoint at the largest stable prefix - usually the system prompt plus retrieved documents. Teams who haven't enabled it are typically overspending by 40-70% on Sonnet and Opus calls.

Treat the 200K window as a budget, not a license. Claude's recall degrades on long contexts the same way every model's does - middle-of-context information gets lost more than start or end. We retrieve narrowly, place the most important information at the top and the task at the bottom, and use prompt caching to make long stable contexts cheap. Published benchmarks for long-context recall look better than real-world performance. The production reality is to design for the first 32K tokens and let caching pay for the rest.

Yes. Tool use is the right primitive for any feature that needs to fetch, calculate, or write. We define tools with strict JSON schemas, validate every tool result, and run a retry-with-correction loop when a tool returns an error. Thinking blocks (extended thinking on Opus and Sonnet) are reserved for genuinely hard reasoning calls where the answer benefits from chain-of-thought - most production prompts do not need them and pay for them anyway when teams enable thinking globally. Eval first, then decide.

Anthropic consulting is the enterprise procurement lens - vendor selection, contract terms, security review, deployment options across Anthropic API, AWS Bedrock, and Google Vertex. Claude consulting is the product-engineering lens for teams who already picked Claude and need the model tier, prompt rig, and eval harness right. OpenAI consulting is the same engineering work but on the GPT side. We do all three. This page is for teams shopping the product side.

Strategy sprints run $25–45k. Production proofs-of-concept run $50–150k. Full implementations start at $150k and scale with integration depth. Embedded team augmentation is a monthly retainer. Pricing is the same regardless of industry - we charge by engagement shape, not by model provider.

Yes, mutual NDA before any technical conversation. We do not work for clients with conflicting active engagements in the same competitive set during a quarter - a rule we enforce on ourselves more strictly than most clients ask us to.

Start something

Send a paragraph. We'll come back the same day.

Tell us which Claude tier you're on, what feature you want shipped, and the metric you want moved. We'll come back with a yes, a no, or a sharper question. No discovery deck, no pitch meeting marathon.

Book a 30-min fit call