Behind the Build

What it takes to build and scale AI voice agents effectively

Anjana Vasan
Senior Content Marketing Manager
Parloa
4 August 2025 · 6 mins

In 2002, psychologists Leonid Rozenblit and Frank Keil ran a study that asked people to rate how well they understood ordinary objects, things like toilets, zippers, and helicopters. Predictably, most individuals gave themselves high scores. These were familiar tools, after all.

Then came part two: They had to explain, step by step, how these objects actually work. Suddenly, that self-assurance dissipated. People fumbled through half-remembered ideas and realized they didn’t know what they thought they knew.

The researchers weren’t out to embarrass anyone. They were studying a cognitive bias called the illusion of explanatory depth. It’s the tendency to overestimate how well we understand complex systems. The reality is that most of us know just enough to talk about something, not enough to explain or rebuild it. 

And this illusion shows up everywhere. But in tech—especially in artificial intelligence (AI)—it’s loudest. Ask around about AI agents and you’ll hear the word "agentic" tossed around like it means something definitive. But when you move past demos and the assumptions, the illusion falls apart. 

Voice AI agents, in particular, are still misunderstood. At CCW Las Vegas, Maik Hummel (Head of AI Strategy, Parloa) and Tomas Gear (Team Lead, Agent Integration Engineering) peeled back the curtain on what to know—and avoid—to make voice AI agents production-ready.

Voice is harder than it seems

The industry has already gone through two interface shifts: first clicking on websites, then tapping on mobile apps. Now we’re entering the third: talking to systems.

Voice feels natural, but it's anything but simple. It’s ambiguous and unpredictable. Something as basic as checking an order status changes drastically depending on the medium:

  • On a website: You fill out a form

  • On a mobile app: You tap through menus

  • With a voice agent: You say, “Hey, when are my shoes getting here?”

In the first two, the logic is structured and the input is controlled. But in the third, the system has to infer your intent, gather relevant context, and respond correctly in real time—sometimes, in a single turn. There’s no clear path to follow, and this shift breaks traditional systems.
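
To make that concrete, here’s a minimal sketch of that single-turn inference step: one LLM call that turns a free-form spoken utterance into a structured intent plus entities. The call_llm helper, the prompt, and the output shape are illustrative assumptions, not any specific product’s implementation.

```python
import json
from typing import Callable

def infer_intent(utterance: str, call_llm: Callable[[str], str]) -> dict:
    """Turn one free-form spoken turn into a structured intent + entities.

    `call_llm` is a stand-in for whatever model client you use: it takes a
    prompt string and returns the model's text completion.
    """
    prompt = (
        "Extract the caller's intent and any entities from this utterance.\n"
        'Respond only with JSON like {"intent": "...", "entities": {}}.\n\n'
        f"Utterance: {utterance}"
    )
    return json.loads(call_llm(prompt))

# A website form or an app menu never sees input like this:
# infer_intent("Hey, when are my shoes getting here?", call_llm)
# -> {"intent": "order_status", "entities": {"product": "shoes"}}
```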

Yet many teams try to handle voice interfaces with the same tools they used for chat: prompts.

Design like an engineer, not a prompter

IVRs were easy to control. You knew exactly what someone could do (press 1 for billing, 2 for support), and you could map out every possible path ahead of time. It wasn’t flexible, but it was predictable and safe.

That design logic stuck around for a long time, partly because it worked. If your goal was to keep calls short and avoid overloading agents, it did the job. And honestly, once people got used to the clunky menus, no one questioned it much.

But then came LLMs. And suddenly, the input isn’t a button press, it’s language. People don’t talk in neat sequences or follow your expected structure, and now the system has to figure out what to do with that. So teams default to giant prompts that try to control behavior, wording, and tone—and just hope it works. 

“Prompt engineering is more like a pseudoscience right now... even the base model providers like OpenAI and Anthropic, they don't know what works and what doesn't.

That's the reality we are in. We have estimates, we have guesses, we are trying things out, but the whole industry is currently figuring out how this works.”

Maik Hummel, Head of AI Strategy, Parloa

That uncertainty is exactly why text-only logic isn’t reliable in production. Engineering offers a different path. Instead of one prompt trying to do everything, you treat the agent like a set of components:

  • One part handles intent recognition

  • Another pulls in relevant data

  • A third applies rules or escalates to a human

You can inspect each one separately, and you can even update them without rewriting the whole thing. If one part fails, the rest of the system still works.
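
Here’s a rough sketch of what that component split can look like. The stage names and pipeline shape are illustrative, not Parloa’s implementation; the point is that each piece is small enough to inspect, test, and swap on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """State handed from component to component for a single user turn."""
    utterance: str
    intent: str | None = None
    context: dict = field(default_factory=dict)
    response: str | None = None
    escalate: bool = False

class IntentRecognizer:
    def run(self, turn: Turn) -> Turn:
        # Stand-in for an LLM or classifier call.
        turn.intent = "order_status" if "order" in turn.utterance.lower() else "unknown"
        return turn

class DataFetcher:
    def run(self, turn: Turn) -> Turn:
        # Stand-in for a real order-system lookup.
        if turn.intent == "order_status":
            turn.context["order"] = {"status": "shipped", "eta": "Friday"}
        return turn

class PolicyGate:
    def run(self, turn: Turn) -> Turn:
        # Apply business rules; hand off to a human when the agent is out of its depth.
        if turn.intent == "unknown":
            turn.escalate = True
        return turn

def handle(turn: Turn) -> Turn:
    # Each stage can be inspected, tested, and replaced without touching the others.
    for component in (IntentRecognizer(), DataFetcher(), PolicyGate()):
        turn = component.run(turn)
    return turn
```

If the policy gate misbehaves, you fix or swap that one class; intent recognition and data fetching keep working.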

You don’t really know your agent until it breaks

You watch a bunch of customer interactions, everything looks smooth, and you convince yourself the agent is production-ready. But that’s only because you haven’t stressed it yet.

Agents don’t fail in obvious ways. They don’t throw an error message and say, “Hey, I don’t know what to do here.” They go off-script in ways that are easy to miss: inventing policies that don’t exist, misclassifying tone, improvising answers that sound plausible but aren’t grounded in truth.

In other words, sometimes they hallucinate. And if you're not watching for it, you won’t catch it until customers do. 

That’s where simulation comes in as a guardrail. At Parloa, we use large-scale, synthetic simulation, a kind of exposure therapy for AI agents. Each simulated conversation is infused with noise: varied phrasing, accents, pacing, interruptions, emotional tones. Then, they’re scored not just for task success but for protocol adherence, fallback behavior, and escalation timing.
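
As a toy illustration of what “infused with noise” can mean, the sketch below perturbs a single opening turn with varied phrasing, interruptions, and tone, and tags the dimensions it will be scored on. The word pools and structure are invented for the example; a production simulator draws on far richer variation.

```python
import random

# Illustrative pools only; real simulations vary accents, pacing, and much more.
PHRASINGS = [
    "Where's my order?",
    "Hey, when are my shoes getting here?",
    "I ordered shoes a week ago and still nothing??",
]
INTERRUPTIONS = ["", " -- sorry, hold on -- ", " ...actually, wait... "]
TONES = ["neutral", "impatient", "upset"]

def synth_turn(seed: int) -> dict:
    """Build one noisy simulated opening turn plus the dimensions it gets scored on."""
    rng = random.Random(seed)
    phrasing = rng.choice(PHRASINGS)
    cut = rng.randrange(1, len(phrasing))  # splice an interruption mid-sentence
    return {
        "utterance": phrasing[:cut] + rng.choice(INTERRUPTIONS) + phrasing[cut:],
        "tone": rng.choice(TONES),
        "score_on": ["task_success", "protocol_adherence", "fallback_behavior", "escalation_timing"],
    }

simulations = [synth_turn(i) for i in range(1_000)]  # scale is the point
```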

To figure out how well an AI agent’s doing, we use an AI evaluator. It’s basically one AI checking the work of another, what we call “LLM as a judge.” A language model reads through each message in a simulated conversation and decides whether the agent hit the goals it was supposed to.

For instance, did it offer a voucher when someone tried to cancel an order? Did it try to reschedule a hotel booking instead of just accepting the cancellation? All that feedback gets handed off to an agent architect, a human who reviews the transcripts, spots edge cases, and makes sure everything’s on track before going live.
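
One way to frame that judging pass, purely as a sketch: a second model reads the transcript and returns a verdict per goal, and those verdicts are what the agent architect reviews. The prompt wording and the call_llm placeholder are assumptions made for illustration; the goals reuse the examples above.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are evaluating a customer-service voice agent.

Transcript:
{transcript}

For each goal below, decide whether the agent met it and give a one-line reason.
Goals:
{goals}

Respond only with JSON: [{{"goal": "...", "met": true, "reason": "..."}}, ...]"""

def judge(transcript: str, goals: list[str], call_llm: Callable[[str], str]) -> list[dict]:
    """Have one LLM grade another agent's transcript, goal by goal."""
    prompt = JUDGE_PROMPT.format(
        transcript=transcript,
        goals="\n".join(f"- {g}" for g in goals),
    )
    return json.loads(call_llm(prompt))

# verdicts = judge(transcript, [
#     "Offered a voucher when the caller tried to cancel the order",
#     "Tried to reschedule the hotel booking instead of accepting the cancellation",
# ], call_llm)
# The verdicts (plus the transcript) then go to the human agent architect for review.
```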

It’s that extra layer of human oversight that helps catch hallucinations or odd behavior before customers do.

The Swiss Army knife fallacy

At some point, every team building voice agents thinks: “Wouldn’t it be easier if we just had one agent handle everything?” One agent to change addresses, one to reset passwords, one to check refunds… rolled into a single mega-brain that does it all. It sounds tidy and centralized. Maybe even elegant.

But here’s the thing: Swiss Army knives only seem like a good idea until you try to make dinner with one. Yes, it technically has all the tools, but none of them are necessarily good.

That’s the same problem with all-in-one agents. When a single model tries to manage every task, it starts to lose shape. Worse, there’s no way to isolate the failure. Was it the logic? The data call? The misunderstanding of intent? It’s hard to say when it’s all packed into the same black box.

The better solution is specialization. And what that looks like, in practice, is a setup where you’ve got multiple agents—each one responsible for a single task. One figures out what the user wants. Another grabs the right data. Another decides if it needs to escalate. It's not one giant model guessing its way through the whole conversation. It's a group of smaller AI agents, each doing one job well, like a team.

This setup’s called a multi-agent architecture. And here’s why it matters: when something breaks, you can trace the failure to a single agent and actually debug it.
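
To make the pattern concrete, here is a hypothetical sketch: a thin orchestrator classifies the request and routes it to exactly one specialist, with an escalation agent as the fallback. Every class name is invented for the example; what matters is that a failure now points at one small, inspectable agent.

```python
class IntentAgent:
    def classify(self, utterance: str) -> str:
        return "refund_status"  # stand-in for a model call

class RefundAgent:
    def handle(self, utterance: str) -> str:
        return "Your refund was issued on Tuesday."  # stand-in for a data lookup + reply

class EscalationAgent:
    def handle(self, utterance: str) -> str:
        return "Let me connect you with a colleague who can help."

class Orchestrator:
    """Routes each request to exactly one specialist, so failures are traceable."""

    def __init__(self) -> None:
        self.intent = IntentAgent()
        self.specialists = {"refund_status": RefundAgent()}
        self.fallback = EscalationAgent()

    def handle(self, utterance: str) -> str:
        intent = self.intent.classify(utterance)
        return self.specialists.get(intent, self.fallback).handle(utterance)

print(Orchestrator().handle("Where's my refund?"))
```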

What gets measured (properly) gets shipped

An AI agent sounding good isn’t the same as doing the job.

So here’s what we did to validate ours: We ran our multi-agent setup through the Tau benchmark (τ-bench), an industry-specific evaluation designed to stress-test agents on real-world, multi-goal conversations. That means not just one-turn tasks, but messy flows like modifying an order, canceling a shipment, or processing a return. And as with any non-deterministic system, the more goals you stack in a single conversation, the more likely things are to fall apart. Tau gives you a decent stress test: not just “did it respond,” but “did it keep its head straight across the whole exchange?”
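
A deliberately simplified way to see why stacked goals are punishing: if each goal in a conversation succeeds independently with probability p, a conversation with k goals only fully succeeds about p^k of the time. The independence assumption and the numbers are made up for this back-of-the-envelope; real failure modes are messier.

```python
# Back-of-the-envelope: per-goal success rate p, k goals stacked in one conversation.
p = 0.9
for k in (1, 2, 3, 5):
    print(f"{k} goal(s): ~{p**k:.0%} of conversations fully succeed")
# 1 goal(s): ~90%  ...  5 goal(s): ~59%
```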

We ran a Tau test using a realistic retail scenario, starting with a human-prompted baseline: what a well-trained team could build in two to three weeks of manual configuration. Then we ran the same scenario using our multi-agent setup, combining meta-prompting with retrieval-augmented generation (RAG) and a compliance layer to keep responses aligned with policy.

The result? A 180% improvement over the baseline.

Even better, that setup scaled fast. We got one of these agents live in production traffic in five days. We wouldn’t recommend that timeline (for sanity reasons), but the system held up, and the speed didn’t come at the cost of stability.

We’re headed to an agentic web

Voice agents are just the beginning. What comes next is the agentic web, a new layer of the internet navigated not by typing URLs, but by describing goals. And that shift is already underway, as protocols like A2A (Agent2Agent) and MCP (Model Context Protocol) lay the groundwork by allowing agents to talk to each other and to external systems in standard, interpretable ways.

And with initiatives like Microsoft’s NLWeb, we’re moving toward a world where websites become machine-readable not just by crawlers, but by autonomous agents acting on our behalf. AI agents are already handling voice interactions. Soon, they’ll move across systems, traversing the web and pulling in context along the way.

For more insights from Tomas and Maik, check out their full CCW session.