Capability · Improve

See what’s working. Coach what isn’t. Ship the gain.

Replays, evals, experiments. The agent that runs today is measurably better than the one that ran last month, and we can prove it on a chart.

Eval suite · pass rate: 91%, ↑ 29 pts over the last 60 days (Agent v1 → v8)

Three things to know

Replays. Evals. Experiments.

Three disciplines, one outcome: every agent version measurably better than the last, on conversations that map to your business.

01
Replays

Every conversation, fully reconstructable. Voice, chat, kiosk, email. Filter by outcome, by language, by escalation. Find what closed and what didn't.

02
Evals

Regression-tested, not vibes-tested. Canonical test suites with baselines and Bayesian comparison. New versions ship only when they beat the bar that closed last quarter's deals.

03
Experiments

Thompson-sampled arms, Bayesian posteriors, automatic conclusion. Run on prompt, tone, voice, escalation logic. Anywhere it matters, scientifically.

Test case

Real cases. Sealed rubrics. Hard gates.

Each canonical case carries an input, an expected behavior (not necessarily a verbatim string), a weighted scoring rubric, and one or more tags. The harness runs every candidate version, scores it against the rubric, and blocks promotion if the version regresses on any sealed case.

eval/cases/booking_weekend.yaml
yaml
id: booking_weekend
tags: [booking, hours, hospitality]
language: en
input: |
  do you accept walk-ins on saturday afternoon?
expected:
  intent: booking_enquiry
  must_mention:
    - "saturday"
    - "weekend hours"
  must_not:
    - "i don't know"
    - external_url_pattern
  tone: friendly_professional
  knowledge_used:
    - policy:hours.weekend
rubric:
  intent_match: 0.4
  content_match: 0.4
  tone_match: 0.15
  no_hallucination: 0.05
gate:
  baseline: 0.86
  required: regression_block
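
To make the rubric and gate concrete, here is a minimal Python sketch of how a harness might turn per-criterion matches into a single score and compare it to the baseline. The weights and baseline come from the case file above; the function names, the substring-based content check, and the reading of `regression_block` as "block promotion below baseline" are assumptions for illustration, not the product's implementation. The named `external_url_pattern` rule is left out because the case file does not define it.

```python
# Weights and baseline copied from booking_weekend.yaml above.
RUBRIC = {
    "intent_match": 0.4,
    "content_match": 0.4,
    "tone_match": 0.15,
    "no_hallucination": 0.05,
}
BASELINE = 0.86


def weighted_score(matches: dict, rubric: dict) -> float:
    """Combine per-criterion match values (each in [0, 1]) into one score."""
    total_weight = sum(rubric.values())
    return sum(rubric[name] * matches.get(name, 0.0) for name in rubric) / total_weight


def content_match(reply: str, must_mention: list, must_not: list) -> float:
    """Hypothetical substring check: fraction of required phrases present,
    forced to 0.0 if any forbidden phrase appears."""
    text = reply.lower()
    if any(phrase in text for phrase in must_not):
        return 0.0
    if not must_mention:
        return 1.0
    hits = sum(phrase in text for phrase in must_mention)
    return hits / len(must_mention)


def passes_gate(score: float, baseline: float = BASELINE) -> bool:
    """Assumed meaning of regression_block: block promotion below baseline."""
    return score >= baseline


# A reply that mentions only one of the two required phrases.
partial_content = content_match(
    "We are open on Saturday.", ["saturday", "weekend hours"], ["i don't know"]
)
partial = weighted_score(
    {"intent_match": 1.0, "content_match": partial_content,
     "tone_match": 1.0, "no_hallucination": 1.0},
    RUBRIC,
)
print(round(partial, 2), passes_gate(partial))  # 0.8 False
```

With these weights, a reply that gets intent and tone right but misses "weekend hours" scores about 0.80 and falls below the 0.86 baseline, so under this reading the candidate version would be blocked.
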
Experiment, worked

Bayesian convergence in days, not 10,000 calls.

Thompson sampling with Bayesian posteriors. Traffic flows to whichever variant looks most promising while preserving exploration. The experiment auto-concludes once one variant is meaningfully better, typically after 600 to 2,000 conversations.
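
The mechanics described above can be sketched in a few lines of Python. This is an illustrative simulation, not the product's algorithm: it assumes a binary success metric, a Beta(1, 1) prior per variant, a stopping check every 100 conversations, and a 0.99 threshold on the estimated probability that one variant beats the other. That probability is a stand-in for a stopping statistic and is not claimed to be how "posterior overlap" is computed. All class and function names are hypothetical.

```python
import random
from typing import Optional


class BetaArm:
    """Beta(1, 1) prior over a variant's success rate, updated per conversation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample(self, rng: random.Random) -> float:
        # Draw one plausible success rate from the current posterior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)

    def update(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1


def choose_arm(arms: list, rng: random.Random) -> BetaArm:
    """Thompson sampling: send the next conversation to the arm with the highest draw."""
    return max(arms, key=lambda arm: arm.sample(rng))


def prob_b_beats_a(a: BetaArm, b: BetaArm, rng: random.Random, draws: int = 4000) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    wins = sum(b.sample(rng) > a.sample(rng) for _ in range(draws))
    return wins / draws


def run_experiment(true_rates: dict, max_conversations: int = 2000,
                   threshold: float = 0.99, seed: int = 0) -> tuple:
    """Simulate until one arm is very likely better, or the budget runs out.

    Returns (winner name or None, conversations used).
    """
    rng = random.Random(seed)
    a, b = BetaArm("A"), BetaArm("B")
    for n in range(1, max_conversations + 1):
        arm = choose_arm([a, b], rng)
        arm.update(rng.random() < true_rates[arm.name])
        if n % 100 == 0:  # check the stopping rule periodically
            p = prob_b_beats_a(a, b, rng)
            if p >= threshold:
                return "B", n
            if p <= 1 - threshold:
                return "A", n
    winner: Optional[str] = None
    return winner, max_conversations


# Hypothetical true rates, chosen only to exercise the loop.
winner, used = run_experiment({"A": 0.39, "B": 0.49})
print(winner, used)
```

Because each arm is chosen in proportion to how often its posterior draw comes out on top, the weaker variant keeps receiving some traffic early on and progressively less as evidence accumulates, which is the exploration-preserving behavior the paragraph describes.
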

Experiment: exp-23 · Outbound voice opener: direct vs warm
Variant A: Direct opener. 'Hi, this is the agent calling about your enquiry. Got two minutes?'
Variant B: Warm opener. 'Hi, hope you're having a good day. Quick follow-up on the question you sent us yesterday.'
Primary metric: Continuation past 30 seconds (proxy for engagement)
Day 1: A 38% · B 41% · posterior overlap 0.82. Keep exploring.
Day 4: A 40% · B 47% · posterior overlap 0.31. B winning.
Day 6: A 39% · B 49% · posterior overlap 0.04. Auto-conclude.
Decision: B promoted at 1,247 calls. Production traffic moved fully to B; A archived for reference.
Trace tree

When it goes wrong, the answer is one click away.

Every conversation turn has a complete trace tree: every model call, every tool invocation, every retrieval, with latency at each step. Open the conversation, click the turn, the answer is in the tree. No log scraping.

trace · conv_C5xN1q · turn 14
trace
└── turn_14 · 740ms · ok
    ├── intent_classify · 110ms · booking_enquiry · weekend
    │   └── route:core_local · 92ms
    ├── knowledge_retrieve · 240ms · 2 chunks
    │   ├── policy.hours.weekend · 0.94
    │   └── policy.walkins · 0.91
    ├── plan · 180ms · answer + offer_booking
    │   ├── route:frontier_a · 142ms · 1.2k tokens
    │   └── tool_choice · skip
    ├── compose · 160ms · 71 tokens out
    │   └── route:core_local · 134ms
    └── emit_audit · 50ms · ok
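
As a rough illustration of why a tree makes a slow turn easy to diagnose, the sketch below models the turn above as nested spans in Python and picks out the slowest top-level step. The `Span` shape and helper names are hypothetical, not the product's trace schema; the latencies are the ones shown in the trace. Retrieval chunks and `tool_choice` are omitted because the trace gives them scores or outcomes rather than latencies.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace tree (hypothetical shape for illustration)."""
    name: str
    latency_ms: int
    children: list = field(default_factory=list)


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in display order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def slowest_step(turn: Span) -> Span:
    """Return the direct child of the turn with the highest latency."""
    return max(turn.children, key=lambda s: s.latency_ms)


# Latencies copied from the trace for conv_C5xN1q, turn 14.
turn_14 = Span("turn_14", 740, [
    Span("intent_classify", 110, [Span("route:core_local", 92)]),
    Span("knowledge_retrieve", 240),
    Span("plan", 180, [Span("route:frontier_a", 142)]),
    Span("compose", 160, [Span("route:core_local", 134)]),
    Span("emit_audit", 50),
])

step_total = sum(child.latency_ms for child in turn_14.children)
print(step_total == turn_14.latency_ms)  # True: 110 + 240 + 180 + 160 + 50 = 740
print(slowest_step(turn_14).name)        # knowledge_retrieve
```

In this turn the five top-level steps account for the full 740 ms, and retrieval is the largest single contributor at 240 ms.
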
Boundaries

What Improve does. And what it does not.

Improve is observability and retraining for live agents. It is not a generic LLM evaluation harness, and it is not a customer-facing analytics product. The line is deliberate.

Improve handles

  • Replays of every conversation across every channel
  • Eval suites with regression gates per agent version
  • Bayesian A/B experiments on prompt, tone, voice, escalation
  • Per-call traces. Every model call, tool, retrieval, latency
  • Insights. Pattern detection across recent conversations
  • Multi-channel attribution reporting

Improve does not

  • Run generic LLM benchmarks. Wrong tool
  • Power customer-facing dashboards your end users see
  • Score leads predictively (use Convert)
  • Do marketing analytics (use your existing tools)
  • Monitor your app's performance (use APM)
  • Replace your data warehouse
The other half

Better agents close more. See it on the pipeline.

The agent gets better. You watch the results.

Thirty minutes. We walk through real replays, your eval suite, and a live experiment. You see the loop running before the call ends.