Capability · Improve
See what’s working. Coach what isn’t. Ship the gain.
Replays, evals, experiments. The agent that runs today is measurably better than the one that ran last month, and we can prove it on a chart.
Replays. Evals. Experiments.
Three disciplines, one outcome: every agent version measurably better than the last, on conversations that map to your business.
Every conversation, fully reconstructable. Voice, chat, kiosk, email. Filter by outcome, by language, by escalation. Find what closed and what didn't.
Regression-tested, not vibes-tested. Canonical test suites with baselines and Bayesian comparison. New versions ship only when they beat the bar that closed last quarter's deals.
Thompson-sampled arms, Bayesian posteriors, automatic concluding. Run experiments on prompt, tone, voice, or escalation logic, anywhere it matters.
Real cases. Sealed rubrics. Hard gates.
Each canonical case carries an input, an expected behavior (not necessarily a verbatim string), a scoring rubric, and one or more tags. The harness runs every candidate version, scores it against the rubric, and blocks promotion if it regresses on any sealed case.
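As a rough sketch of how such a gate could work (not Improve's actual harness), the Python below scores a candidate as the weighted sum of the rubric checks it passes and blocks promotion when any case falls below its baseline. The weights and the 0.86 baseline are taken from the booking_weekend example case on this page; `CaseResult`, `score`, `gate`, and the pass/fail treatment of each rubric check are assumptions made for illustration.

```python
from dataclasses import dataclass

# Weights copied from the booking_weekend example rubric; they sum to 1.0.
RUBRIC = {
    "intent_match": 0.4,
    "content_match": 0.4,
    "tone_match": 0.15,
    "no_hallucination": 0.05,
}


@dataclass
class CaseResult:
    """Hypothetical record of one candidate run against one sealed case."""
    case_id: str
    checks: dict[str, bool]  # rubric key -> did the candidate pass this check?
    baseline: float          # score the candidate must not fall below


def score(result: CaseResult, rubric: dict[str, float] = RUBRIC) -> float:
    """Weighted sum of passed rubric checks, in [0, 1]."""
    return sum(w for key, w in rubric.items() if result.checks.get(key, False))


def gate(results: list[CaseResult]) -> tuple[bool, list[str]]:
    """Return (promote?, ids of regressed cases).

    Assumption: a single case scoring below its baseline blocks promotion.
    The small epsilon guards against floating-point rounding.
    """
    regressed = [r.case_id for r in results if score(r) < r.baseline - 1e-9]
    return (not regressed, regressed)
```

Under these assumptions, a candidate that gets intent, content, and hallucination right but misses the tone check scores 0.85, which is below the 0.86 baseline, so the gate blocks it.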
id: booking_weekend
tags: [booking, hours, hospitality]
language: en
input: |
  do you accept walk-ins on saturday afternoon?
expected:
  intent: booking_enquiry
  must_mention:
    - "saturday"
    - "weekend hours"
  must_not:
    - "i don't know"
    - external_url_pattern
  tone: friendly_professional
  knowledge_used:
    - policy:hours.weekend
rubric:
  intent_match: 0.4
  content_match: 0.4
  tone_match: 0.15
  no_hallucination: 0.05
gate:
  baseline: 0.86
  required: regression_block

Bayesian convergence in days, not 10,000 calls.
Thompson sampling with Bayesian posteriors. Traffic flows to whichever variant looks most promising while preserving exploration. The experiment auto-concludes once one variant is meaningfully better, typically after 600 to 2,000 conversations.
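To make the mechanism concrete, here is a minimal Thompson-sampling sketch in Python. It is an illustration under stated assumptions, not Improve's implementation: each conversation is treated as a binary success or failure, each variant gets a Beta(1, 1) prior, and the experiment concludes when one variant's estimated probability of being best reaches a threshold (0.95 here). The stopping rule, the threshold, and all names (`Arm`, `choose`, `prob_best`, `run`) are hypothetical.

```python
import random


class Arm:
    """One variant with a Beta(1, 1) prior over its success rate (assumed model)."""

    def __init__(self, name: str):
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample(self, rng: random.Random) -> float:
        # Draw a plausible success rate from the current Beta posterior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1


def choose(arms, rng):
    """Thompson sampling: draw once from each posterior, route to the highest draw."""
    return max(arms, key=lambda a: a.sample(rng))


def prob_best(arms, rng, draws=2000):
    """Monte Carlo estimate of P(arm is best) under the current posteriors."""
    wins = {a.name: 0 for a in arms}
    for _ in range(draws):
        wins[choose(arms, rng).name] += 1
    return {name: n / draws for name, n in wins.items()}


def run(true_rates, threshold=0.95, check_every=100, max_n=20000, seed=0):
    """Simulate an experiment; return (winner or None, conversations used)."""
    rng = random.Random(seed)
    arms = [Arm(name) for name in true_rates]
    for n in range(1, max_n + 1):
        arm = choose(arms, rng)
        # Simulated outcome; in production this would be the real conversation result.
        arm.update(rng.random() < true_rates[arm.name])
        if n % check_every == 0:
            p = prob_best(arms, rng)
            winner = max(p, key=p.get)
            if p[winner] >= threshold:  # assumed auto-conclude rule
                return winner, n
    return None, max_n
```

Because the better-looking variant wins more posterior draws, it receives more traffic, while the weaker one still gets occasional exposure for as long as its posterior remains uncertain. Calling `run({"control": 0.20, "variant": 0.25})` simulates one such experiment with made-up success rates.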
When it goes wrong, the answer is one click away.
Every conversation turn has a complete trace tree: every model call, every tool invocation, every retrieval, with latency at each step. Open the conversation, click the turn, and the answer is in the tree. No log scraping.
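One way to picture the data behind a trace tree is as nested spans, each carrying a name, a latency, and an optional note. The Python below is a sketch under that assumption; `Span`, `render`, and `slowest_step` are hypothetical names, not Improve's API, and the sample values are a trimmed copy of the turn_14 example on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical trace node: one step of a turn plus its sub-steps."""
    name: str
    ms: int
    note: str = ""
    children: list["Span"] = field(default_factory=list)


def render(span: Span, prefix: str = "", last: bool = True) -> list[str]:
    """Render a span and its descendants as box-drawing tree lines."""
    label = " · ".join(p for p in (span.name, f"{span.ms}ms", span.note) if p)
    lines = [prefix + ("└── " if last else "├── ") + label]
    child_prefix = prefix + ("    " if last else "│   ")
    for i, child in enumerate(span.children):
        lines += render(child, child_prefix, i == len(span.children) - 1)
    return lines


def slowest_step(turn: Span) -> Span:
    """Return the direct child of a turn that took the longest."""
    return max(turn.children, key=lambda s: s.ms)


# Trimmed copy of the turn_14 example; only one sub-span is kept for brevity.
turn = Span("turn_14", 740, "ok", [
    Span("intent_classify", 110, "booking_enquiry", [Span("route:core_local", 92)]),
    Span("knowledge_retrieve", 240, "2 chunks"),
    Span("plan", 180, "answer + offer_booking"),
    Span("compose", 160, "71 tokens out"),
    Span("emit_audit", 50, "ok"),
])

print("\n".join(render(turn)))
print("slowest:", slowest_step(turn).name)
```

With the structure in hand, a question such as "which step dominated this turn's latency" becomes a single lookup instead of a log search.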
└── turn_14 · 740ms · ok
    ├── intent_classify · 110ms · booking_enquiry · weekend
    │   └── route:core_local · 92ms
    ├── knowledge_retrieve · 240ms · 2 chunks
    │   ├── policy.hours.weekend · 0.94
    │   └── policy.walkins · 0.91
    ├── plan · 180ms · answer + offer_booking
    │   ├── route:frontier_a · 142ms · 1.2k tokens
    │   └── tool_choice · skip
    ├── compose · 160ms · 71 tokens out
    │   └── route:core_local · 134ms
    └── emit_audit · 50ms · ok

What Improve does. And what it does not.
Improve is observability and retraining for live agents. It is not a generic LLM evaluation harness, and it is not a customer-facing analytics product. The line is deliberate.
Improve handles
- Replays of every conversation across every channel
- Eval suites with regression gates per agent version
- Bayesian A/B experiments on prompt, tone, voice, escalation
- Per-call traces. Every model call, tool, retrieval, latency
- Insights. Pattern detection across recent conversations
- Multi-channel attribution reporting
Improve does not
- Run generic LLM benchmarks. Wrong tool
- Provide customer-facing dashboards your end users see
- Do predictive lead scoring (use Convert)
- Do marketing analytics (use your existing tools)
- Monitor the performance of your app (use APM)
- Replace your data warehouse
Better agents close more. See it on the pipeline.
The agent gets better. You watch the results.
Thirty minutes. We walk through real replays, your eval suite, and a live experiment. You see the loop running before the call ends.