Capability · Improve

See what’s working. Coach what isn’t. Ship the gain.

Replay real conversations, test changes, and measure the result. A chart shows whether the new agent version performs better.

Book a Demo · See how it talks →

Illustrative eval suite · pass rate

91%

↑ 29 pts

Example history

Agent v1 → v8

Three things to know

Replays. Evals. Experiments.

These three tools help each new agent version perform better on the kinds of conversations your business handles.

Replays

Replay any conversation from phone, chat, kiosk, or email. Sort by outcome, language, or handoff. See what led to a sale.

Evals

Use saved test cases to check each new version. A version only goes live when it passes the same key cases as the current one.

Experiments

Test two prompts, tones, voices, or handoff rules. Fairshift sends more conversations to the stronger variant and ends the test when there is enough evidence.

Test case

Real cases. Sealed rubrics. Hard gates.

Each saved case has an input, an expected result, a weighted scoring rubric, and tags. Fairshift runs every new version against these cases. A version cannot go live if it fails a key saved case.

eval/cases/booking_weekend.yaml

yaml

id: booking_weekend
tags: [booking, hours, hospitality]
language: en
input: |
  do you accept walk-ins on saturday afternoon?
expected:
  intent: booking_enquiry
  must_mention:
    - "saturday"
    - "weekend hours"
  must_not:
    - "i don't know"
    - external_url_pattern
  tone: friendly_professional
  knowledge_used:
    - policy:hours.weekend
rubric:
  intent_match: 0.4
  content_match: 0.4
  tone_match: 0.15
  no_hallucination: 0.05
gate:
  baseline: 0.86
  required: regression_block
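
To make the gate concrete, here is a minimal Python sketch of how a weighted rubric score could be compared with the baseline. The weights and the 0.86 baseline come from the case file above. The function names, the 0-to-1 scale for each criterion, and the rule "block when the score falls below baseline" are assumptions for illustration, not Fairshift's documented behaviour.

```python
# Weights copied from the rubric in booking_weekend.yaml; they sum to 1.0.
RUBRIC = {
    "intent_match": 0.4,
    "content_match": 0.4,
    "tone_match": 0.15,
    "no_hallucination": 0.05,
}
BASELINE = 0.86  # gate.baseline from the case file


def weighted_score(component_scores, rubric=RUBRIC):
    """Combine per-criterion scores (assumed to lie in [0, 1]) into one score."""
    missing = set(rubric) - set(component_scores)
    if missing:
        raise ValueError(f"missing rubric components: {sorted(missing)}")
    return sum(weight * component_scores[name] for name, weight in rubric.items())


def passes_gate(component_scores, baseline=BASELINE):
    """Assumed gate semantics: fail when the weighted score is below baseline."""
    # Small tolerance so float rounding does not fail a score equal to baseline.
    return weighted_score(component_scores) >= baseline - 1e-9


# Hypothetical grader output for one candidate version.
candidate = {
    "intent_match": 1.0,
    "content_match": 0.75,
    "tone_match": 1.0,
    "no_hallucination": 1.0,
}
print(round(weighted_score(candidate), 2), passes_gate(candidate))
```

With these hypothetical scores the weighted total is 0.4 + 0.3 + 0.15 + 0.05 = 0.90, which clears the 0.86 baseline. Dropping content_match to 0.5 gives 0.80, which would not.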

Experiment, worked

Find the better version in fewer conversations.

Fairshift routes more conversations to the variant that looks better while still sampling the other. The test ends when one variant has enough evidence behind it, often after 600 to 2,000 conversations.
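
One common way to get this behaviour is Thompson sampling over Beta posteriors. The Python sketch below shows that technique in general terms. It is an assumption that Fairshift uses something like it; the function names, the uniform Beta(1, 1) prior, and the simulated success rates are all hypothetical.

```python
import random


def choose_variant(stats, rng):
    """Thompson sampling step.

    stats maps variant name -> (successes, failures). Draw one sample from
    each variant's Beta posterior and route the next call to the highest draw.
    """
    draws = {
        name: rng.betavariate(1 + successes, 1 + failures)
        for name, (successes, failures) in stats.items()
    }
    return max(draws, key=draws.get)


def simulate(true_rates, n_calls, seed=0):
    """Simulate n_calls and return how many calls each variant received."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_rates}
    sent = {name: 0 for name in true_rates}
    for _ in range(n_calls):
        name = choose_variant(stats, rng)
        sent[name] += 1
        successes, failures = stats[name]
        if rng.random() < true_rates[name]:  # simulated outcome of the call
            stats[name] = (successes + 1, failures)
        else:
            stats[name] = (successes, failures + 1)
    return sent


# Made-up true rates, chosen only to make the effect easy to see.
print(simulate({"A": 0.2, "B": 0.8}, n_calls=2000))
```

The weaker variant keeps receiving some traffic early on because its posterior is still wide. As evidence accumulates, its draws win less often and traffic shifts to the stronger variant.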

Experiment

exp-23 · Outbound voice opener: direct vs warm

Variant A

Direct opener. 'Hi, this is the agent calling about your enquiry. Got two minutes?'

Variant B

Warm opener. 'Hi, hope you're having a good day. Quick follow-up on the question you sent us yesterday.'

Primary metric

Continuation past 30 seconds (proxy for engagement)

Day 1

A: 38% · B: 41% · posterior overlap 0.82. Keep exploring.

Day 4

A: 40% · B: 47% · posterior overlap 0.31. B winning.

Day 6

A: 39% · B: 49% · posterior overlap 0.04. Auto-conclude.

Decision

B promoted at 1,247 calls. Production traffic moved fully to B; A archived for reference.
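
The page does not define "posterior overlap". A closely related quantity that is often used for this kind of decision is the posterior probability that B's true rate exceeds A's. The Python sketch below estimates it by Monte Carlo under uniform Beta(1, 1) priors. The per-variant counts, the 0.95 threshold, and both function names are hypothetical; the worked example above reports only percentages and the total number of calls.

```python
import random


def prob_b_beats_a(a_success, a_total, b_success, b_total, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + a_success, 1 + a_total - a_success)
        rate_b = rng.betavariate(1 + b_success, 1 + b_total - b_success)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


def should_conclude(p_b_better, threshold=0.95):
    """Hypothetical stop rule: conclude once either variant is clearly ahead."""
    return p_b_better >= threshold or p_b_better <= 1 - threshold


# Made-up counts: 39% vs 49% continuation on 1,000 calls per variant.
p = prob_b_beats_a(390, 1000, 490, 1000)
print(should_conclude(p))
```

With a ten-point gap on this many calls, the two posteriors barely overlap and the rule concludes. With identical counts for both variants the estimate sits near 0.5 and the test keeps running.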

Trace tree

When it goes wrong, the answer is one click away.

Each reply has a step-by-step record. See every model call, tool, source, and latency. Open the conversation and click a reply to see what happened.

illustrative trace · conv_C5xN1q · turn 14

trace

└── turn_14  · 740ms  · ok
    ├── intent_classify       · 110ms  · booking_enquiry · weekend
    │   └── route:core_local  · 92ms
    ├── knowledge_retrieve    · 240ms  · 2 chunks
    │   ├── policy.hours.weekend · 0.94
    │   └── policy.walkins        · 0.91
    ├── plan                   · 180ms  · answer + offer_booking
    │   ├── route:frontier_a   · 142ms  · 1.2k tokens
    │   └── tool_choice        · skip
    ├── compose                · 160ms  · 71 tokens out
    │   └── route:core_local   · 134ms
    └── emit_audit             · 50ms   · ok
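
A trace like this is naturally a tree of timed spans. The Python sketch below models the illustrative trace above and answers two typical debugging questions: which step was slowest, and how much time a step spent outside its child calls. The Span class and helper names are hypothetical and do not reflect an actual Fairshift export format. Retrieval chunks and the skipped tool choice are left out because the trace shows no latency for them.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    ms: int
    children: list["Span"] = field(default_factory=list)


# Latencies copied from the illustrative trace for turn 14.
turn_14 = Span("turn_14", 740, [
    Span("intent_classify", 110, [Span("route:core_local", 92)]),
    Span("knowledge_retrieve", 240),
    Span("plan", 180, [Span("route:frontier_a", 142)]),
    Span("compose", 160, [Span("route:core_local", 134)]),
    Span("emit_audit", 50),
])


def slowest_child(span):
    """Return the direct child that took the longest."""
    return max(span.children, key=lambda child: child.ms)


def self_time(span):
    """Time spent in the span itself, excluding its direct children."""
    return span.ms - sum(child.ms for child in span.children)


def render(span, depth=0):
    """Yield one indented line per span, depth first."""
    yield f"{'  ' * depth}{span.name}: {span.ms}ms"
    for child in span.children:
        yield from render(child, depth + 1)


print("\n".join(render(turn_14)))
print("slowest step:", slowest_child(turn_14).name)
```

In this trace the five top-level steps add up to the full 740ms, and retrieval is the slowest at 240ms.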

Boundaries

What Improve does. And what it does not.

Improve helps your team watch and test live agents. It is not a general AI test tool or a report for your customers.

Improve handles

Replays of every conversation across every channel
Eval suites with regression gates per agent version
Bayesian A/B experiments on prompt, tone, voice, escalation
Per-call traces: every model call, tool, retrieval, and latency
Insights: pattern detection across recent conversations
Multi-channel attribution reporting

Improve does not

Run generic LLM benchmarks (wrong tool)
Provide customer-facing dashboards for your end users
Score leads predictively (use Convert)
Do marketing analytics (use your existing tools)
Monitor your app's performance (use APM)
Replace your data warehouse

The other half

Better agents close more. See it on the pipeline.

See the conversion layer →

The agent gets better. You watch the results.

Thirty minutes. We walk through real replays, your eval suite, and a live experiment. You see the loop running before the call ends.

Book a Demo · Back to Platform