The never-ending conversation: Measuring long-conversation performance in LLMs

At Parloa, our AI agents handle long conversations every day: tool-heavy dialogues where context grows fast and ambiguity grows faster. We wanted to know: does performance actually degrade when conversations get that long, or do today’s models hold steady?

10 December 2025
Author(s)

Stefan Ostwald

Co-Founder & CAIO

Mariano Kamp

Senior Principal AI Researcher

Long conversations are like editing a group project that never stops. Everyone keeps adding comments, making changes, and leaving half-finished thoughts. By the 40th comment, you’re not sure what still matters and what’s just leftover noise. But if you delete that noise, you risk losing relevant context that might be important later.

Now imagine that same problem, but in a customer conversation. Someone calls about a delayed order, forgets a detail halfway through, sends a receipt, gets transferred, or changes their mind mid-sentence. The AI agent has to keep track of all of it (every detail, every correction, every tool call) without losing context or history.

That’s the reality for AI agents in enterprise environments.  Customer conversations don’t unfold neatly; they sprawl. In a single support call, an AI agent might fetch order data from a CRM, pull product details from a knowledge base, and follow up with a shipping API. And that’s all while keeping track of what the human actually meant three clarifications ago.

At Parloa, our AI agents handle these interactions every day: long, tool-heavy dialogues where context grows fast and ambiguity grows faster. And we wanted to know: does performance actually degrade when conversations get that long, or do today’s models hold steady?

To answer that, we ran an experiment. 

As customers build longer and more complex interactions on our platform, we wanted to ensure the underlying models can sustain those experiences reliably. So we ran controlled simulations to see how current models hold up when conversations stretch far beyond typical durations. The results were surprising and encouraging.

The state of the long-conversation problem

Long-context performance has become one of the most discussed—and misunderstood—topics in AI research. Over the past year, models have grown context windows from a few thousand tokens to hundreds of thousands, and vendors have raced to prove they can “remember everything.” But the field still hasn’t settled on what that actually means in practice.

Most long-context benchmarks focus on static memory tests (often called needle-in-the-haystack evaluations): copying long documents into a prompt and checking whether the model can retrieve or reason over information buried inside. Those tests matter, but they don’t reflect how models behave in real use: conversations that unfold over time, full of clarifications and corrections that generate their own token noise.

In other words, we know how models handle long text. We know far less about how well they use long context in real-life customer-support situations.

That difference matters. When an AI agent runs for 40 minutes, juggling structured tool data and shifting user intent, the challenge isn’t just context length. It’s context stability: can the model stay coherent, relevant, and grounded as that context grows, without losing the thread or the task?

For contact centers, that stability translates directly to business impact. A single dropped context can mean repeating verification steps or losing a sale. It’s also where cost compounds as every redundant turn adds tokens, latency, and frustration. Put simply, long-conversation reliability is the difference between an efficient automated workflow and a human escalation that wipes out the savings.

That’s where the current research gap sits. Most evaluations stretch models horizontally: testing ever-longer inputs. However, the real challenge is vertical: maintaining successful goal-oriented conversations over time. Our goal was to close that gap by measuring whether performance drops as production conversations grow longer, and whether today’s top models can handle that context without decline.

Why we studied this

Long conversations happen whether we want them or not. In customer service, insurance, and other enterprise use cases, 30–40-minute sessions aren’t unusual. Agents collect facts, call APIs, loop back for clarifications, and often revisit earlier context.

For instance, think of a customer filing a complex insurance claim. Let’s say they start the conversation by describing what happened, then upload a document, then remember a missing detail halfway through. The AI agent pulls records from two systems, gets an error from one, retries, and confirms policy limits before proceeding. Each step builds on the last, layering more information, more context, and more opportunities for confusion.

By the time the claim is ready, the conversation has stretched across dozens of tool calls and corrections. That’s not a single prompt; it’s an evolving, messy dialogue.

Historically, research suggested that models struggled as context grew. And while newer systems now perform well on synthetic tests like needle-in-the-haystack, those tests have a built-in flaw: they stitch together texts from different distributions. The “needle” looks nothing like the “haystack,” so models can separate the two and retrieve the answer cleanly. Real-life conversations aren’t like that. All the context is in-distribution (user messages, tool outputs, corrections, and side notes) and the model has to keep track of everything as it unfolds. That’s the part today’s benchmarks still miss.

We wanted to test something closer to reality:

  • Natural back-and-forth dialogue, where the user sometimes gives short answers, sometimes rambles, or changes their mind mid-way.

  • Tool outputs that add token pressure, like long JSON responses or verbose partner API calls that flood the model’s memory with text.

  • Multi-step tasks, where the model had to complete an actual workflow (like resolving a support ticket or booking a flight), not just repeat or summarize information.

The guiding questions were simple, but each one served a purpose:

At what point, if any, do models start to degrade? 

This helps enterprise teams understand whether there’s a predictable limit to how long an AI agent can stay effective in a live conversation, like when handling a 45-minute insurance claim or a complex telecom troubleshooting call. Knowing where that limit sits (or if it exists at all) helps teams design safe conversation lengths, escalation rules, or memory reset policies.

Is degradation caused by length, noise, or linguistic indirection?

This is key because the fix depends on the cause. For instance, if breakdowns happen because of length, it’s a technical constraint, i.e., something that model architecture or infrastructure must solve. But if the issue comes from verbosity or repetition (like a customer asking, “Wait, did you mean this?”), that’s an orchestration problem, one that conversation design, prompt tuning, or intelligent routing can address.

Can summarization or message trimming prevent it? 

This gets to operational efficiency. Long conversations can become expensive fast, since every turn re-sends thousands of tokens. If strategies like compaction (summarizing or trimming old context) can keep sessions cheaper and faster without hurting accuracy, that’s a direct business win: lower compute cost, shorter response times, and smoother customer experiences.
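To make the idea concrete, here is a minimal sketch of what such a compaction step could look like, assuming a rough character-based token estimate and a summarize() helper supplied by the caller; the names and thresholds are illustrative, not our production implementation.

# Minimal compaction sketch: once the running transcript exceeds a token
# budget, fold the oldest turns into a short summary and keep the recent ones.
# estimate_tokens() and summarize() are illustrative placeholders.

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token; a real system would use the model's tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, summarize, budget=8000, keep_recent=10):
    """Summarize older turns once the conversation exceeds the token budget.

    Assumes messages[0] is the system prompt and `summarize` is a callable
    (for example an LLM call) that condenses a list of messages into a string.
    """
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent + 1:
        return messages
    system, rest = messages[0], messages[1:]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "system", "content": "Summary of earlier turns: " + summarize(old)}
    return [system, summary] + recent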

Together, these questions shaped the experiment to see where long conversations start to bend. And if they did, we wanted to know whether the fix would live in the model itself or in how we manage the conversation around it.

How we built a realistic long-conversation benchmark

Instead of starting from a new dataset, we expanded our internal multi-turn evaluation framework—the same one we use to assess AI agents on real customer-service workflows.

Those workflows already covered task domains like flight bookings and return handling, each structured around realistic back-and-forth exchanges between a user and an AI agent.

Every task had a clear definition of success: the model had to finish the job the same way a real agent would. That could mean issuing the right tool call to retrieve an order, updating a record correctly, or completing a booking with valid customer data.

To turn this into a long-conversation benchmark, we stretched those same workflows over more turns, tool interactions, and dialogue styles. Essentially, we scaled the conversations the way they naturally expand in production. The idea was to build pressure by recreating how context actually grows in real life: gradually, as small clarifications, retries, and tool results accumulate. And unlike needle-in-the-haystack tests, our context stays fully in-distribution because every turn comes from real customer workflows.

This gave us a controlled way to measure where and how performance might slip, for instance, whether accuracy drops after a certain number of turns, or whether reasoning starts to drift once the context becomes cluttered. In other words, we weren’t just testing if a model could remember more; we were testing if it could stay coherent and on-task as the conversation evolved.

1. Simulating longer, messier users

Once the benchmark was ready, we needed to test it against reality. 

Most users don’t respond in clean, predictable sentences. They hedge, repeat themselves, ask clarifying questions, or reveal information one fragment at a time. That’s what turns a tidy five-turn interaction into a 20-turn dialogue.

To capture that, we built a set of user simulators that varied how the “customer” communicated. Each simulator followed the same core task but changed the conversational style—from concise and factual to indirect and verbose.

Simulator | Behavior | Relative token growth
Standard | Factual, concise replies | 1.0×
Verbose | Adds filler, hedging, and restatement | 1.8×
Information drip | Reveals one piece of information per turn | 2.0×
Single clarification | Asks one clarifying question before answering | 2.3×
Double clarification | Two clarifying questions before each answer | 2.6×
Verbose double clarification | Combines verbosity + double clarification | 3.0×

For example:

Agent: “What’s your departure city?”
User: “Do you need the airport code or the city name?”
Agent: “City name is fine.”
User: “And just to confirm, you mean where I’m flying from, right?”
Agent: “Yes.”
User: “Berlin.”

Same meaning, three times the turns. This kind of variation creates semantic clutter: chains of meaning that grow longer and more dependent on earlier context, forcing the model to decide what matters and what doesn’t.
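As a rough sketch of how such simulators can be encoded, the snippet below expresses each style from the table above as a system-prompt variation for the simulated customer; the wording is illustrative, not the exact prompts used in our experiments.

# Hypothetical encoding of the user-simulator styles as prompt variations.
SIMULATOR_STYLES = {
    "standard": "Answer the agent's questions factually and concisely.",
    "verbose": "Answer correctly, but add filler, hedging, and restatements.",
    "information_drip": "Reveal only one piece of the required information per turn.",
    "single_clarification": "Ask one clarifying question before each answer.",
    "double_clarification": "Ask two clarifying questions before each answer.",
    "verbose_double_clarification": "Be verbose and ask two clarifying questions before each answer.",
}

def build_user_prompt(task_description: str, style: str) -> str:
    """Compose the system prompt for the simulated customer."""
    return (
        f"You are a customer with the following goal: {task_description}\n"
        f"Conversation style: {SIMULATOR_STYLES[style]}"
    )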

2. Polluting tool outputs to mimic real-world token load

Have you ever tried to find one important email buried in a long reply thread? Where half of it is filled with signatures, “thanks,” and quoted text from five messages ago? Technically, the information you need is there. But it takes effort to filter out the noise.

That’s essentially what language models face in long enterprise conversations. Most of the tokens they process don’t come from the user; they come from the tools. Every time an AI agent calls an API like “get order details,” it gets back a structured response full of metadata: version numbers, timestamps, nested metrics, and system diagnostics. All technically relevant, but not always useful for the task at hand.

The model still has to read every word. And as that clutter builds up, it becomes harder for the model to stay focused on the parts that matter. To test how this affects long conversations, we added what we called a pollution layer: extra, realistic but irrelevant metadata inside the tool outputs. 

We varied this in-distribution noise from clean to heavy—0, 3, or 20 extra layers of metadata—so we could see how well models handled context when most of it was unhelpful clutter. Our question was straightforward: could the model still focus on the right signal or would it start getting lost in the noise?
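For illustration, here is a minimal sketch of such a pollution layer, assuming JSON tool responses; the metadata fields (trace IDs, timestamps, version strings, latencies) are hypothetical stand-ins for the kind of clutter real APIs return.

# Pollution-layer sketch: wrap the useful payload in N blocks of realistic but
# irrelevant metadata before the tool response reaches the model.
import json
import random
import uuid
from datetime import datetime, timezone

def pollute(tool_response: dict, extra_layers: int) -> str:
    noise = {
        f"diagnostics_{i}": {
            "trace_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service_version": f"{random.randint(1, 9)}.{random.randint(0, 20)}.{random.randint(0, 99)}",
            "latency_ms": random.randint(5, 500),
        }
        for i in range(extra_layers)
    }
    return json.dumps({"payload": tool_response, **noise})

# Example: pollute({"order_id": "A123", "status": "delayed"}, extra_layers=20)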

3. Tracking the right signals

Once the noise was in place, we wanted to see how well models could cut through it. We tracked every run using our internal experiment logging system (built on MLflow)—automatically recording inputs, outputs, and performance metrics across hundreds of conversations.

Instead of looking at abstract benchmarks, we focused on signals that reflect production reality:

  • Task success rate: Did the model actually complete what it was supposed to do, e.g., book the flight correctly or return the right order details?

  • Token composition: How much of the model’s attention went to useful human dialogue versus technical noise from tool outputs? This tells us where the context load really comes from.

  • Turn count: How many back-and-forth exchanges it took to reach the goal. This works as a proxy for efficiency because shorter, coherent dialogues are easier to manage and cheaper to run.

  • Latency and cache effects: How the system slowed down or sped up as the conversation lengthened. These show us where the performance and cost curves start to climb.

Each variation ran on both GPT-4.1 and GPT-4.1 mini to test how model size shaped long-conversation stability. Together, these layers (realistic dialogue, noisy tool responses, and detailed run-level tracking) turned a simple benchmark into something much closer to a live enterprise simulation.
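As a rough illustration of that run-level tracking, here is a minimal MLflow logging sketch; the experiment name, parameter and metric names, and the result dictionary are assumptions made for this example, not our internal schema.

# Minimal MLflow sketch: one run per simulated conversation, logging the
# parameters and signals described above. Names are illustrative.
import mlflow

def log_conversation_run(model_name, simulator, noise_level, result):
    mlflow.set_experiment("long-conversation-benchmark")
    with mlflow.start_run():
        mlflow.log_params({
            "model": model_name,        # e.g. "gpt-4.1" or "gpt-4.1-mini"
            "simulator": simulator,     # e.g. "double_clarification"
            "noise_level": noise_level, # 0, 3, or 20 extra metadata layers
        })
        mlflow.log_metrics({
            "task_success": float(result["success"]),
            "turn_count": result["turns"],
            "tool_tokens": result["tool_tokens"],
            "user_tokens": result["user_tokens"],
            "agent_tokens": result["agent_tokens"],
            "latency_s": result["latency_s"],
        })
        # Keep the full transcript as an artifact for qualitative review.
        mlflow.log_dict(result["transcript"], "transcript.json")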

What the data showed

The findings were clearer than expected and surprisingly steady. The first two charts show how different user behaviors inflated conversation length and token load.

Figure 1. Token growth by conversation turn; colored lines show the baseline; grey lines show simulated user behaviors

Each colored line shows the baseline: short, direct answers with no extensions. The grey lines represent the different simulation types, such as verbose users, information drip, and clarification-heavy turns.

Those simulated behaviors increase both the number of turns (x-axis) and the token load (y-axis). In plain terms: the more a user hedges, rambles, or restates, the more the conversation balloons, often two or three times larger, even though the task itself hasn’t changed.

Figure 2. Extended conversation runs across simulators; colored lines show the verbose/simulator runs; grey lines show the baseline conversations.

Together, these visuals capture a familiar enterprise reality: most of the “bloat” in customer interactions doesn’t come from new information. It comes from human behavior such as hesitation, repetition, or re-asking the same question in slightly different words. For AI agents, that means there’s a lot more to read without any new signal to act on.

Figure 3. Model performance across simulation types (Airline domain)

Despite that ballooning input, GPT-4.1 didn’t flinch. Each bar here represents its task-success rate across simulation types, and the black lines mark the 95% confidence interval. The overlaps show that the differences aren’t statistically significant.

When conversations got 2–3× longer and messier, GPT-4.1 didn’t break—even when we tried.

That’s the key finding: verbosity inflated the workload but didn’t meaningfully hurt performance. The model stayed focused on what mattered.
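For reference, one common way to compute that kind of error bar for a success rate is the Wilson score interval; the sketch below is illustrative and not necessarily the exact method behind the chart.

# Wilson score interval for a binomial success rate (95% by default).
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# Example: wilson_interval(46, 50) -> approximately (0.81, 0.97)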

GPT-4.1 mini did degrade, but for a different reason

The smaller GPT-4.1 mini model told a different story. It handled short, straightforward exchanges just fine but began to struggle when conversations became repetitive or indirect.

In the chart below, each bar shows GPT-4.1 mini’s success rate under different levels of simulated “noise”—from clean tool outputs (0) to heavily cluttered ones (20). Performance stayed mostly steady but became less consistent as the noise increased. In other words, the more clutter in the conversation, the more likely the smaller model was to lose focus. But that wasn’t because it forgot earlier details; it just paid attention to the wrong ones.

Figure 4. Model success rate under increasing conversational noise

For instance, in “double clarification” scenarios, the simulated user asked two or more clarifying questions before providing any real information, as in the example we showed earlier:

Agent: “What’s your departure city?” 
User: “Do you want the city name or airport code?”
Agent: “City name is fine.” 
User: “And just to confirm, you mean where I’m flying from, right?”
Agent: "Yes, your departure city."User (Turn 3 - Answer): "Berlin."

The actual answer (“Berlin”) doesn’t appear until several turns later. Those back-and-forth clarifications don’t add new information—they just create more for the model to process. GPT-4.1 mini often mistook those polite detours for new instructions and tangled itself up trying to reconcile them.

That distinction matters in the real world. In customer service, people naturally hedge or restate themselves—“Just to be clear…”, “Wait, do you mean this or that?”—and smaller models can misread those moments as a shift in intent. When that happens, they don’t lose context; they misunderstand it.

GPT-4.1, by contrast, filtered through the chatter and stayed focused on what actually changed the task. It knew when to listen and when to move on—a quiet but crucial skill for AI agents working in long, messy, real-world conversations.

In business terms, GPT-4.1 behaves like a seasoned service agent who knows when a customer is simply thinking aloud versus making a real change request.

GPT-4.1 mini, on the other hand, takes every aside as an action item — efficient on short tasks, but overwhelmed by the messier, human reality of customer conversations.

These simulations might sound extreme, but they’re surprisingly common in enterprise settings. Customers trying to clarify policy language, verify a refund, or double-check an address all generate this kind of back-and-forth. That’s why we used these patterns: they represent the subtle, human friction that makes real conversations long and unpredictable.

What’s actually eating your tokens

After seeing how smaller models struggled, we broke down the token load to see how the context was actually distributed. It’s not surprising: most of it doesn’t come from customers at all; it comes from the tools. Every API call adds timestamps, IDs, logs, version numbers, and other metadata the model has to read, even if it doesn’t need it for the task. All of that still counts toward the model’s “memory,” eating up tokens and driving up cost.

Figure 5. What’s actually eating your tokens

In our analysis, these tool responses made up 60–70% of the entire conversation, while actual human messages were closer to 20–30%. The model’s own replies added only about 10–15%.

Component | Share of total tokens (high-pollution scenario)
Tool responses | 60–70%
User messages | 20–30%
Agent messages | 10–15%

For a business running thousands of concurrent conversations, every unnecessary field in an API response translates directly into higher latency and higher costs. So this finding matters for enterprise systems because it shows us where optimization really counts.

So, while training customers to “be concise” might seem helpful, the real savings might come from the backend: reducing how much unnecessary data the model has to process from tools and APIs. In practice, this could mean cleaning up or compressing tool responses before they reach the model, or introducing compaction strategies that drop irrelevant context over time.
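As a sketch of what that backend cleanup could look like, the snippet below keeps only an allow-list of fields per tool before the response enters the model’s context; the tool name and field names are hypothetical.

# Trim tool responses before they reach the model: keep only the fields the
# task needs. The allow-list would be defined per tool in a real system.
import json

ALLOWED_FIELDS = {
    "get_order_details": {"order_id", "status", "items", "delivery_estimate"},
}

def trim_tool_response(tool_name: str, raw_response: str) -> str:
    data = json.loads(raw_response)
    allowed = ALLOWED_FIELDS.get(tool_name)
    if allowed is None:
        return raw_response  # unknown tool: pass the response through untouched
    trimmed = {key: value for key, value in data.items() if key in allowed}
    return json.dumps(trimmed)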

Where do we go from here?

We tried to find the breaking point of long conversations. We didn’t find one. GPT-4.1 handled everything we threw at it—verbosity, noise, tool pollution—without a measurable drop in quality. Smaller models buckled under conversational indirection, but that’s a capacity issue, not a flaw in long-context reasoning.

In other words, the study showed that long conversations don’t, by themselves, cause large models to degrade. The current generation of large models can handle realistic, multi-turn, tool-heavy dialogues better than earlier research predicted.

But that also raises a more important question: when conversations get longer, are we spending those extra tokens productively?

The study showed that accuracy isn’t the whole story. Long sessions stayed stable in quality but grew costly because every new turn re-sent the entire conversation history, compounding latency and compute. In other words, the system was working harder, not necessarily smarter.
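A back-of-the-envelope calculation shows why: if every turn re-sends the full history and each turn adds a roughly constant number of new tokens, the total tokens processed over a session grows quadratically with turn count (absent prompt caching or compaction). The numbers below are illustrative, not measurements from our runs.

# Total tokens processed when the full history is re-sent on every turn.
def total_processed_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

# A 20-turn session processes ~105k tokens in total; doubling the length to
# 40 turns roughly quadruples that to ~410k.
print(total_processed_tokens(20), total_processed_tokens(40))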

What we still don’t know is whether those extra tokens add value. 

A session can end successfully on paper but still waste effort repeating information or looping politely. To see that, we’ll need better diagnostics: full transcripts, tool outputs, and qualitative signals like frustration, repetition, or stalled progress. That’s how we’ll get closer to understanding conversational efficiency.