A Bayesian framework for A/B testing AI agents

Evaluating artificial intelligence (AI) agents is hard to get right. Agent builders, architects, engineers, and researchers see this in practice daily. Change a prompt or a dialog step, and the metrics shift. Sometimes, dramatically. Rerun the same simulation, and the results don’t always hold; it’s just the inherent nature of stochastic layered systems.
Without principled (statistical) ways to interpret these fluctuations, it’s hard to tell if a change truly improved your agent or if it just happened to look better this time.
That’s why we built a framework that brings statistical rigor back to evaluation.
TL;DR
We’re introducing a hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores into a single framework that accounts for variation across different groups.
We recently used this model to compare GPT-4.1 and GPT-4o in customer service simulations. And it didn’t just show which agent performed better overall; it revealed where their strengths differed across scenarios. Because it explicitly accounts for uncertainty and for variation across task types—such as customer intents and use cases—this model gives teams a more grounded basis for deciding which agent to deploy.
It’s still early days, but the indications are strong. In the coming months, we’ll keep pressure-testing the framework to ensure it holds up, and look at how we can bring it into our platform—so our partners can move faster and build with confidence.
The challenge of evaluating AI agents
A lot of AI agent evaluations—even in published research, as critiqued numerous times—still rely on single-point scores, without accounting for uncertainty or variability. That means no error bars, no confidence intervals, no sense of how results might change across different runs or scenarios. It flattens the data and hides important details, like where the agent performs well or struggles.
That kind of approach might work if you’re just checking whether an agent responded correctly on a basic test set, i.e., a predefined, limited set of example inputs and expected outputs. For example, a customer might ask, “What are your business hours?” and the expected output is a standard protocol response. Either the agent gives the right answer, or it doesn’t.
But in real-world deployments, that’s not enough, because users phrase things unexpectedly, switch topics, express emotion or urgency, and often require multi-turn logic.
Suppose you're evaluating a customer service agent with only a pass or fail checklist, or by manually reviewing a few sample conversations. In that case, you're only getting a glimpse of how real customer conversations unfold. You'll catch the obvious failures, but miss the harder questions:
How do you combine simple yes or no outcomes with more subjective, graded LLM scores, such as how natural or helpful the agent sounded?
What if the same agent performs differently depending on the type of conversation or user intent?
And how do you know if those performance differences are reliable—or just random noise?
These are critical questions for production systems. They impact how fast teams can ship new versions, how agent interactions feel to customers, and whether performance improvements are real. But today’s evaluation pipelines usually aren’t built to answer them.
We’re proposing a hierarchical Bayesian model
Traditional A/B tests treat every result equally, regardless of the context. That’s a problem when you’re evaluating AI agents, because performance can shift based on the type of conversation, customer, or scenario. And lumping all results together overlooks those differences.
Our method uses a hierarchical Bayesian model with partial pooling to capture both kinds of variation, which gives you a more realistic view of agent performance. We model:
Within-group differences (e.g., how an agent performs across similar scenarios)
Between-group differences (e.g., how performance shifts across different types of conversations)
Together, these feed into a unified framework that reflects real-world complexity.
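To make the two levels concrete, here is a minimal generative sketch in plain Python/NumPy (illustrative numbers, not our production code): scenario groups draw their underlying success rates from a shared population distribution (between-group variation), and individual conversations then succeed or fail around their group’s rate (within-group variation).

```python
import numpy as np

rng = np.random.default_rng(42)

# Between-group variation: each scenario group (e.g., billing, onboarding)
# has its own underlying success rate, drawn from a shared Beta distribution.
n_groups = 5
group_rates = rng.beta(a=8, b=2, size=n_groups)  # population centered near 0.8

# Within-group variation: individual conversations in a group succeed or fail
# around that group's underlying rate.
conversations_per_group = 40
outcomes = rng.binomial(n=1, p=group_rates[:, None],
                        size=(n_groups, conversations_per_group))

for g, (rate, obs) in enumerate(zip(group_rates, outcomes)):
    print(f"group {g}: true rate {rate:.2f}, observed rate {obs.mean():.2f}")
```

The hierarchical model described below essentially runs this process in reverse: given the observed outcomes, it infers both the group-level rates and how much they vary.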
“While this kind of modeling is common in academic statistics, it’s rarely used in LLM agent evaluation—especially in production settings.
Moreover, what’s new here is how we combine binary outcomes—like 'Did the agent confirm the customer’s name?'—with subjective scores from LLM judges, e.g., quality ratings from 1 to 7.”
The key shift is treating evaluation as a probabilistic problem, not just a win/loss count. In other words, instead of asking, “Which agent had the higher win rate?” we ask, “How confident are we that one agent performs better—and in what types of scenarios, across which metrics?”
Here’s what our model captures:
1. Binary checks
Some evaluation criteria are straightforward, like whether the agent asked for the customer’s name. That’s a simple yes or no. In basic situations, you might just count how often it happens and call it a day. But when conversations vary (say, across different use cases), that approach starts to fall apart.
Instead, we use a statistical model (specifically, Beta-Binomial) that estimates both how often an action happens and how confident we are in that rate, while factoring in different conversation types. That gives you a clearer picture of how consistent the agent is, not just the raw percentage.
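For a single conversation type, the Beta-Binomial update can be written in a few lines. The sketch below uses illustrative counts and SciPy; the hierarchical model extends the same idea across conversation types.

```python
from scipy import stats

# Illustrative counts for one conversation type: the agent asked for the
# customer's name in 37 of 50 simulated conversations.
successes, trials = 37, 50

# A uniform Beta(1, 1) prior updated with these counts gives the posterior
# over the true rate for this conversation type.
posterior = stats.beta(1 + successes, 1 + (trials - successes))

print(f"posterior mean rate: {posterior.mean():.3f}")
low, high = posterior.interval(0.94)  # 94% credible interval
print(f"94% credible interval: [{low:.3f}, {high:.3f}]")
```

The interval is the part a raw percentage hides: 37/50 and 370/500 give the same point estimate, but very different levels of certainty.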
2. LLM-judge scores
LLM judges are models that rate how well another model did. For example:
We might ask, “Did the agent greet the user?” and a judge like GPT might respond, “Yes.”
But that “yes” isn’t always as confident as it sounds. These models work by computing token probabilities. So even if the answer is “yes,” the model might assign near-equal likelihood to both “yes” and “no,” depending on phrasing or ambiguity.
Behind the scenes, the model might have been 51% “Yes” and 49% “No.” It gave a firm answer, but it wasn’t sure.
It’s a bit like asking someone for directions when they’re unsure. They might say, “I think it’s left,” but you can tell they’re only slightly more confident in that answer than the alternative. The words sound decisive, but the conviction behind them is weak.
Our framework doesn’t just take that answer at face value; it models that underlying uncertainty. That helps you spot when the agent might be skating by on shaky judgments, even if the score looks good on the surface.
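As a sketch of what modeling that uncertainty can look like in practice: most LLM APIs can return token log-probabilities alongside the generated answer, so instead of recording a hard “Yes,” you can keep the judge’s probability of “yes.” The snippet below is a simplified, provider-agnostic example (the exact way you retrieve logprobs depends on your API), not the full calibration we apply.

```python
import math

def yes_probability(top_logprobs: dict[str, float]) -> float:
    """Convert a judge's top-token logprobs into P('yes'), renormalized
    over the yes/no candidates we care about."""
    probs = {tok.strip().lower(): math.exp(lp) for tok, lp in top_logprobs.items()}
    p_yes = sum(p for tok, p in probs.items() if tok.startswith("yes"))
    p_no = sum(p for tok, p in probs.items() if tok.startswith("no"))
    total = p_yes + p_no
    return p_yes / total if total > 0 else 0.5

# A judge that "said yes" but was nearly on the fence:
print(yes_probability({"Yes": -0.67, "No": -0.71}))  # ~0.51, not 1.0
```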
3. Scenario groups
Some agents perform better in certain situations. Maybe one does great with billing but struggles with cancellations. But most evaluation systems either ignore that variation or silo it completely.
We do neither.
Our model groups conversations by scenario and uses “partial pooling” to share statistical strength across groups without conflating them. This way, we reduce the risk of overinterpreting results from small samples or idiosyncratic cases, so a model that performs well in one area but poorly in another isn’t unfairly boosted or penalized.
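To make that structure concrete, here is a minimal sketch of a partially pooled model for one binary metric, written in PyMC (referenced at the end of this post). The data, priors, and variable names are illustrative assumptions, not our production model: each scenario group gets its own effect drawn from a shared distribution, and a single agent effect is estimated across groups.

```python
import numpy as np
import pymc as pm

# Illustrative data: success counts per (scenario group, agent) for one binary metric.
group_idx = np.array([0, 0, 1, 1, 2, 2])   # which scenario group each cell belongs to
agent_idx = np.array([0, 1, 0, 1, 0, 1])   # 0 = agent A, 1 = agent B
successes = np.array([41, 45, 30, 36, 22, 21])
trials = np.array([50, 50, 40, 40, 30, 30])

with pm.Model() as model:
    # Partial pooling: group effects share a common prior, so small groups
    # borrow strength from the rest without being lumped together.
    group_sigma = pm.HalfNormal("group_sigma", sigma=1.0)
    group_effect = pm.Normal("group_effect", mu=0.0, sigma=group_sigma, shape=3)

    # Agent effect: how much agent B shifts the log-odds relative to agent A.
    baseline = pm.Normal("baseline", mu=0.0, sigma=1.5)
    agent_effect = pm.Normal("agent_effect", mu=0.0, sigma=0.5)

    # Success probability on the log-odds scale, then a Binomial likelihood.
    logit_p = baseline + group_effect[group_idx] + agent_effect * agent_idx
    p = pm.Deterministic("p", pm.math.invlogit(logit_p))
    pm.Binomial("obs", n=trials, p=p, observed=successes)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)
```

In the full framework, LLM-judge scores get their own likelihood alongside the binary checks, but the pooling structure is the same.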
Why partial pooling is key for fair evaluation
Without grouping scenarios, we risk drawing the wrong conclusions.
Take a model that fails to ask about the user’s role. In some cases, like setting up user permissions, that’s a critical omission. In others—say, a general product inquiry—it doesn’t matter. But if we treat all scenarios the same, we might penalize the model for a “failure” that’s only a failure in some contexts.
Think of it like comparing test scores from two very different schools:
One is well-funded, with small class sizes, laptops for every student, and AP tutors on speed dial.
The other is under-resourced, overcrowded, and doesn’t have access to the same infrastructure.
If a student from each school scores 85% on the same exam, a flat comparison says: “They performed equally well.” But context tells a different story. That 85% might be below expectations for the student from the well-funded school, and well above expectations for the student who’s had far less support.
A pooled model would treat both scores the same, whereas a hierarchical model with partial pooling would recognize the context and say:
“Given the resources available, the second student overperformed, and the first may have underdelivered.”
By modeling scenario groups explicitly, we:
Reduce false positives in model comparisons
Quantify variation across types of customer intent
Reflect the uncertainty baked into real-world conversations
This is what partial pooling does in LLM evaluation: It adjusts for scenario-level expectations instead of flattening everything into one global benchmark.
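A back-of-the-envelope way to see that adjustment, under simplified conjugate assumptions (illustrative numbers only): each group’s estimate becomes a weighted blend of its own data and the global rate, so small or noisy groups are pulled toward the overall mean more than large ones.

```python
# Partial pooling as a weighted blend (conjugate shortcut, for intuition only):
# estimate_g = (n_g * rate_g + kappa * global_rate) / (n_g + kappa)
global_rate = 0.80   # overall success rate across all scenarios
kappa = 20           # prior strength: how strongly groups are pulled together

groups = {
    "billing (n=200)":      (0.85, 200),   # large group: estimate barely moves
    "cancellations (n=10)": (0.50, 10),    # tiny group: pulled toward the mean
}

for name, (rate, n) in groups.items():
    pooled = (n * rate + kappa * global_rate) / (n + kappa)
    print(f"{name}: raw {rate:.2f} -> partially pooled {pooled:.2f}")
```

In the actual model, the amount of pooling (kappa here) isn’t hand-picked; it’s inferred from how much the groups really differ.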
What does the hierarchical Bayesian model look like in practice?
To test the framework, we ran a head-to-head comparison between GPT-4.1 and GPT-4o. Our goal wasn’t to crown a winner; we wanted to stress-test the framework and see how well the model handles complexity.
We used two real-world simulations:
One based on customer conversations from a European insurance company
Another based on prompt variations for a global travel platform’s call center agent
In both cases, we evaluated the agents using the same approach from earlier: combining pass or fail checks with LLM-judge scores. We also grouped conversations by scenario type (like billing, onboarding, or cancellations) to reflect how performance varies in practice.
GPT-4.1 vs. GPT-4o in insurance simulations
For the insurance simulation, we took 100+ actual customer conversations and created 10 variations of each, capturing natural variability in how users express the same intent.
We then used the Bayesian model to estimate the probability that GPT-4.1 outperformed GPT-4o across a range of specific tasks.
A value close to 1 means GPT-4.1 consistently outperformed GPT-4o on that task
A value near 0 means GPT-4o came out ahead
Scores around 0.5 suggest no meaningful difference
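Mechanically, these probabilities come straight from the posterior: they are the fraction of posterior draws in which one agent’s success rate exceeds the other’s. A minimal sketch with made-up samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for posterior samples of each agent's success rate on one metric
# (in practice these come from the fitted hierarchical model).
rate_gpt41 = rng.beta(90, 12, size=4000)
rate_gpt4o = rng.beta(80, 22, size=4000)

p_superior = np.mean(rate_gpt41 > rate_gpt4o)
print(f"P(GPT-4.1 better on this metric) ≈ {p_superior:.2f}")
```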
From there, the model did its job: It estimated the probability that GPT-4.1 outperformed GPT-4o, not just overall, but within each scenario. The table below summarizes how the two agents compared across key metrics.
This table shows both clear performance differences and areas where the data is genuinely uncertain. For example, GPT-4.1 consistently outperforms GPT-4o on tasks like asking for a policy number.
On the other hand, a metric like “messages to user contains no code” focuses on the cleanliness of the output. It flags when an agent accidentally includes unintended code or formatting artifacts in its response, like raw HTML tags, unrendered template variables, or system-level instructions.
These issues might seem small on the surface, but they break the customer experience. And in this case, the Bayesian model assigned a posterior probability of around 0.495, which means the model doesn’t lean one way or the other.
That ambiguity is useful, because instead of pretending there’s a clear winner, the framework surfaces when the evidence—for whether GPT-4.1 or GPT-4o is better at avoiding this issue—just isn’t strong enough yet.
Why does scenario grouping impact model certainty?
Because when we don’t group conversations—when we treat them all the same, as in a simple, non-hierarchical evaluation—the model gets overconfident. It might show a big difference between agents, even if that difference only shows up in one specific type of task.
Take the chart below: The model claims a very strong difference between agents when evaluating whether they asked for the caller’s role. The distributions for GPT-4.1 and GPT-4o barely overlap, implying near-certainty.
But that’s misleading, because this "clean split" is an artifact of ignoring scenario structure.
On the flip side, when we group by scenario (like “billing” vs. “onboarding”), the model becomes more cautious. You’ll see that in the next chart. The results for each agent are closer together, and there’s more overlap between them.
That overlap comes from the posterior distributions the model calculates for each agent’s performance. When those distributions overlap, it’s the model signaling uncertainty, and that’s good. It means we’re not jumping to conclusions based on one narrow slice of data.
Group and agent-level effects: Where does variation come from?
These plots help us separate where performance differences come from: Is it the scenario (group effect), the agent (model effect), or both?
The chart below shows how different conversation scenarios—like those that involve tool call flows—affect intermediate message performance, i.e., the agent’s responses in the middle of a task where it needs to guide the user or take action. Each dot represents a specific group, and the vertical bars show uncertainty around that group’s effect. The right-side histogram gives the overall distribution.
In short, this confirms that scenario context has a clear and measurable impact on intermediate message quality, reinforcing the need to model group-level effects.
Now, we quantify the difference between agents on each metric. The chart above shows the mean probability difference (e.g., “GPT-4.1 was 13.7% more likely to ask for a policy number”), with credible intervals. The bottom chart translates that into an effect size (Cohen’s d), to help visualize how large the difference is and how reliable it is.
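Both summaries can be computed directly from posterior draws. The sketch below uses hypothetical samples: the mean difference and a 94% credible interval come from the difference of the two agents’ rate samples, and a Cohen’s-d-style effect size rescales that difference by the outcome-level standard deviation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior samples of each agent's success probability on one metric.
rate_a = rng.beta(88, 14, size=4000)   # e.g., GPT-4.1
rate_b = rng.beta(74, 28, size=4000)   # e.g., GPT-4o

diff = rate_a - rate_b
low, high = np.percentile(diff, [3, 97])   # 94% credible interval
print(f"mean difference: {diff.mean():.3f}  (94% interval: {low:.3f} to {high:.3f})")

# Cohen's-d-style effect size for binary outcomes: scale the rate difference
# by sqrt(p * (1 - p)), the standard deviation of a single success/failure.
p_bar = (rate_a + rate_b) / 2
d = diff / np.sqrt(p_bar * (1 - p_bar))
print(f"effect size (d): {d.mean():.2f}")
```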
Together, these views show that modeling agent and group effects jointly gives a more trustworthy read on real-world performance, especially when decisions hinge on nuanced differences, not just raw win rates.
We validated the model with the same agent
To validate the framework, we also ran A/B tests on the same agent configuration, using data sampled from both within the same scenario groups and across different groups. We wanted to check for false positives.
In one test, we pulled samples from the same scenario groups. In another, we mixed in different groups. If the model’s working correctly, both tests should show no meaningful difference, since the agent didn’t change.
But that only held true with the hierarchical model, which accounts for group-level differences. The simple model, which assumes all data points are independent and identically distributed (i.i.d.), misread natural differences between scenario groups as evidence that the agent's behavior had changed—when it hadn't. It failed to account for group-level variation, which led to overly confident results that didn’t reflect reality.
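A tiny simulation illustrates that failure mode. The numbers are made up: the same agent is evaluated twice, but the two test sets happen to contain different mixes of easy and hard scenario groups. A pooled, i.i.d. view sees a change that never happened; looking within groups shows there is none.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same agent throughout; two scenario groups with different difficulty.
easy_rate, hard_rate = 0.90, 0.60

# Test A happens to contain mostly easy scenarios; test B mostly hard ones.
test_a = {"easy": rng.binomial(1, easy_rate, 80), "hard": rng.binomial(1, hard_rate, 20)}
test_b = {"easy": rng.binomial(1, easy_rate, 20), "hard": rng.binomial(1, hard_rate, 80)}

pooled_a = np.concatenate(list(test_a.values())).mean()
pooled_b = np.concatenate(list(test_b.values())).mean()
print(f"pooled (i.i.d.) view: {pooled_a:.2f} vs {pooled_b:.2f}  <- looks like the agent changed")

for group in ("easy", "hard"):
    print(f"within '{group}' scenarios: {test_a[group].mean():.2f} vs {test_b[group].mean():.2f}  <- no real change")
```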
Edge cases in travel agent evaluation
We tested an updated prompt designed to improve how a travel platform’s agent responds when a caller hangs up. The change came out of an internal iteration process, and we ran simulations to evaluate whether it made a measurable difference.
We used two test sets:
One with 30 grouped scenarios based on real “general” use cases, simulated 10 times each (30×10)
Another with 100 ungrouped simulations focused on “unclear intent” cases, where the caller’s needs are harder to interpret
The updated prompt led to noticeable improvements in hang-up handling for the general use cases, but didn’t move the needle in the unclear-intent scenario. That tracks with what we’d expect: When intent is ambiguous, even a stronger prompt can only go so far.
That said, the credible interval for the observed improvement in the general case wasn’t fully above zero. So while the model saw a likely benefit, it also signaled some uncertainty, which is a useful reminder not to overstate impact based on early results.
A cautious model by design
We designed the model to be cautious by default. That means it assumes most differences between agents are small—unless the data clearly says otherwise.
We apply strong shrinkage on agent effects, so the model pulls performance estimates toward the average unless there’s strong evidence to suggest otherwise.
We use wide priors on group-level variation, letting the data determine how much performance truly varies across different scenarios.
We treat LLM-evaluated scores cautiously, since they’re subjective and more prone to noise.
Together, these choices prevent the model from overreacting to random variation. If it says one agent is better, it’s because it’s very likely to be better, i.e., the evidence is strong—not just lucky on a small sample.
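In PyMC-style pseudocode, those design choices map directly onto the priors. The specific distributions and scales below are illustrative assumptions, not our production configuration:

```python
import pymc as pm

with pm.Model():
    # Strong shrinkage on the agent effect: differences between agents are
    # assumed to be small unless the data pushes them away from zero.
    agent_effect = pm.Normal("agent_effect", mu=0.0, sigma=0.3)

    # Wide prior on group-level variation: the data decides how much
    # performance really differs across scenario groups.
    group_sigma = pm.HalfNormal("group_sigma", sigma=2.0)

    # Extra noise term on LLM-judge scores, reflecting that they are
    # subjective and noisier than deterministic binary checks.
    judge_noise = pm.HalfNormal("judge_noise", sigma=1.0)
```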
Ongoing model evaluation as data accumulates
Our Bayesian model also supports incremental updates. As more simulation data comes in, the posterior distributions adjust; there’s no need to restart the test. This makes it ideal for production settings where model comparisons are ongoing and deployment risk has to be managed.
Think of it like a live scoreboard that gets more accurate as the game goes on. You don’t have to wait until the end to make a call, because you can see where things are trending and how confident you should be.
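For a single binary metric, that incremental behavior falls straight out of conjugate updating: yesterday’s posterior becomes today’s prior, so each new batch of simulations refines the estimate without restarting anything. A simplified sketch (the full hierarchical model is updated in the same spirit, just not in closed form):

```python
from scipy import stats

# Start from a weak Beta(1, 1) prior over the agent's success rate.
a, b = 1, 1

# Each new batch of simulations just adds to the counts; the posterior after
# one batch becomes the prior for the next.
for batch_successes, batch_trials in [(18, 25), (22, 25), (40, 50)]:
    a += batch_successes
    b += batch_trials - batch_successes
    posterior = stats.beta(a, b)
    low, high = posterior.interval(0.94)
    print(f"after {a + b - 2} conversations: rate ≈ {posterior.mean():.2f}, "
          f"94% interval [{low:.2f}, {high:.2f}]")
```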
We can express these results in operational terms. For example, a result might look like:
“Agent B has a 93% probability of outperforming Agent A on user guidance prompts. The difference in error rates is estimated to fall between 2.1% and 4.3% (credible interval).”
These are the kinds of answers engineering teams and product leaders need to make informed go-live calls.
Statistical humility is critical for responsible AI deployment
At Parloa, our ongoing research and development (R&D) and product enhancements are guided by a deep commitment to responsibility.
This latest evaluation framework is built on the premise that uncertainty, context, and restraint are features—not bugs—of a responsible evaluation process. Rather than drawing conclusions from preliminary signals or chance fluctuations, we infuse every layer of our analysis with caution.
With this Bayesian model, we want to give engineering teams and product leaders confidence, delivered in probabilities, not proclamations. Our vision is for every AI agent builder shipping models to see, at a glance:
How much stronger one approach is over another, in operational terms
Not just whether a change “works on average”, but where, why, and how reliably it does so
Where more data or testing is needed before rolling out change at scale
As we bake this methodology into our internal tooling, we’re setting a new standard for how agentic systems are evaluated and trusted in production environments.
Ultimately, we believe that by lifting the bar, we enable better, safer, and more human AI, ready to earn the trust of millions of users worldwide. And we look forward to collaborating with the broader AI community to shape, refine, and build upon this approach.
Thanks to the other contributors at Parloa for their input and feedback, including Elisabeth Reitmayr, Ciaran O’Reilly, Cooper Oelrichs, Anjana Vasan, and others. This work draws on recent research in hierarchical modeling, LLM evaluation calibration, and Bayesian statistics in applied machine learning contexts.
Gelman et al., "Bayesian Data Analysis"
PyMC Documentation on Hierarchical Models
Eric Jinks, "Logprobs for LLM Evaluation"
Prompthub, "Can You Use LLMs for Evaluation?"
Sellforte Blog on Hierarchical vs Non-Hierarchical Modeling