What happens when calls never end?

15 December 2025
Author(s)

Stefan Ostwald

Co-Founder & CAIO

Mariano Kamp

Senior Principal AI Researcher

Every customer conversation has a rhythm: a start, a middle, and (generally) an end. But what happens when it doesn’t?

In contact centers, lengthy conversations are often the norm. Customers explain, clarify, change their minds, resend details, and ask, “Wait, did I already tell you that?” Forty minutes in, the AI agent is juggling order data, policy limits, and three tool calls while still trying to sound helpful.

That’s the unglamorous reality of long conversations. At Parloa, our AI agents handle these kinds of interactions every day. And behind the scenes, they rely on large language models (LLMs) to make sense of messy inputs and keep the dialogue on track.

So we asked ourselves a simple question: When conversations get long, does LLM performance drop, or can today’s systems keep up?

The invisible weight of context

When people talk about “long-context performance,” they usually mean how well an LLM remembers things over a long stretch of text. For example, reading a hundred-page document and still being able to pull out a detail from page 47. That’s what most benchmarks test.

But that’s not how customer conversations work. They don’t come as one long document; they unfold turn by turn, over time. A user asks something, changes their mind, rephrases, clarifies, or adds a new detail 10 minutes later. Each of those turns adds new information and a bit of noise, and the model has to keep track of it all without getting confused or losing focus.
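
To make that concrete, here is a minimal sketch; the turns are invented for illustration, and a crude whitespace word count stands in for real tokenization. The point is simply that the agent re-reads the entire history every time it replies:

```python
# Illustrative sketch (invented turns, whitespace word count as a stand-in
# for real tokenization): the history an agent must re-read on every reply.

conversation = []  # running history of the dialogue


def add_turn(role: str, text: str) -> int:
    """Append one turn and return the rough size of the whole history."""
    conversation.append({"role": role, "content": text})
    return sum(len(turn["content"].split()) for turn in conversation)


add_turn("user", "My parcel never arrived. Order 4711.")
add_turn("assistant", "Sorry to hear that. Can you confirm the delivery address?")
add_turn("tool", "order_lookup -> status: in_transit, eta: 2 days")
size = add_turn("user", "Wait, did I already tell you the address? It's Elm Street 5.")

print(f"After 4 turns the agent re-reads roughly {size} words on every new reply.")
```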

As that history grows, two things start happening:

  • The AI agent works harder every turn. Each new message adds more data to sift through. Think of it like a librarian trying to find one page in a stack of books that keeps doubling in size. Even if the librarian’s good, the search takes longer, costs more, and gets fuzzier over time.

  • The AI agent starts to lose its sense of focus. LLMs don’t have true memory; they have attention, and attention gets diluted. If the prompt holds thousands of tokens, most of them irrelevant logs or old replies, the model starts losing the thread while still having to decide what’s important. That’s where reasoning can drift: the AI agent doesn’t “forget” the facts, it just stops prioritizing the right ones; in effect, the quality of the tokens it attends to drops.

Why we tested it

Most benchmarks test whether a model can recall a detail buried in a long document. Customer conversations aren’t documents, though: they are long, messy, back-and-forth exchanges, and they happen constantly in customer service. Few benchmarks test for that; they measure memory, not dialogue.

Moreover, when context starts to pile up, there are business consequences: the customer waits longer for responses, and the business pays more for every turn of the conversation. If an AI agent loses context even once, it might repeat a verification step, give the wrong answer, or hand the call off to a human.

All of this can be very risky for enterprises, drive up token costs, and even result in customer churn.
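
To see how quickly those token costs climb, here is a back-of-the-envelope sketch; the turn size and price below are assumptions for illustration, not real figures. Because the model re-reads the whole history to produce each reply (ignoring prompt caching), cumulative input tokens grow roughly quadratically with the number of turns:

```python
# Back-of-the-envelope sketch with assumed numbers (not real figures): every
# reply re-processes the full history, so cumulative input tokens grow
# roughly quadratically with the number of turns.

TOKENS_PER_TURN = 300        # assumed average size of one user/tool/agent turn
PRICE_PER_1K_INPUT = 0.002   # hypothetical USD price per 1,000 input tokens

total_input_tokens = 0
for turn in range(1, 41):                    # a 40-turn call
    history_tokens = turn * TOKENS_PER_TURN  # everything said so far
    total_input_tokens += history_tokens

cost = total_input_tokens / 1000 * PRICE_PER_1K_INPUT
print(f"~{total_input_tokens:,} input tokens (~${cost:.2f}) for one 40-turn call")
```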

So we ran the experiment

We built a stress test that mirrors how conversations really unfold in enterprise settings—messy, indirect, and packed with tool output. Picture a customer filing an insurance claim: they start with the incident, remember a detail later, upload a document, get a clarification, and the agent has to call multiple systems along the way.

Each turn adds a little more complexity: more context, more corrections, more data. To simulate that, we ran hundreds of extended conversations using two models—GPT‑4.1 and GPT‑4.1 mini—tasked with real-world workflows like booking a flight or resolving a support ticket.

To make it realistic, we added noise: extra tool output, user hedging, repeated clarifications like “Just to confirm…” or “Wait, you meant…”—all the stuff that makes human dialogue hard to follow. The goal was to see: can the model stay focused, accurate, and coherent as the conversation grows?
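
For illustration, here is a simplified sketch of that kind of setup; the workflow turns, noise snippets, and function names are invented for this post rather than taken from our actual harness. The idea is to keep the underlying task fixed while dialing up the amount of noise interleaved between turns:

```python
import random

# Simplified sketch of a noisy multi-turn stress test (invented data, not our
# actual harness): a fixed scripted task gets padded with realistic noise.

CORE_WORKFLOW = [
    ("user", "I need to change my flight to Friday."),
    ("tool", "flight_search -> 3 options returned"),
    ("user", "Actually, Saturday works better."),
    ("tool", "rebooking -> confirmation PNR ABC123"),
]

NOISE_TURNS = [
    ("user", "Just to confirm, you got my booking reference, right?"),
    ("user", "Wait, you meant the outbound leg, not the return?"),
    ("tool", "diagnostic log: several thousand tokens of verbose output ..."),
]


def build_noisy_conversation(noise_per_turn: int) -> list[tuple[str, str]]:
    """Interleave the scripted workflow with randomly chosen noise turns."""
    conversation = []
    for turn in CORE_WORKFLOW:
        conversation.append(turn)
        conversation.extend(random.choices(NOISE_TURNS, k=noise_per_turn))
    return conversation


# The task never changes, so any accuracy drop across runs can be attributed
# to the added length and noise rather than to harder questions.
for noise_level in (0, 2, 5):
    print(f"noise={noise_level}: {len(build_noisy_conversation(noise_level))} turns")
```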


What we found

The big surprise? GPT-4.1 didn’t break. Even as conversations tripled in length and tool outputs ballooned, accuracy stayed stable—within 2% of baseline across hundreds of runs. The model kept its reasoning intact, no matter how verbose the user got.

GPT-4.1 mini, however, struggled. It didn’t forget information; it misread it, treating polite filler or repetition as new instructions. That distinction is crucial for real-world AI agents: the best systems don’t just remember everything; they know what to ignore.

The takeaway

For contact centers, reliability over time is everything. A model that remembers just enough (and filters out what doesn’t matter) means faster, cheaper, and more coherent conversations. It’s not about building infinite memory. It’s about teaching systems to use memory wisely.

For enterprises, the takeaway is clear: long conversations aren’t a breaking point—for the right models. GPT‑4.1 handled realistic, tool-heavy dialogues without measurable degradation. Smaller models didn’t fail because of length but because of conversational noise.

In our view, the next improvements will come from better context management, cleaner tool outputs, smarter summarization policies, and new ways to track when a conversation is drifting.
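
As one illustration of what such context management might look like (a simplified sketch with assumed names and thresholds, not Parloa’s production logic), older turns can be collapsed into a short summary while the system prompt and the most recent turns stay verbatim:

```python
# Simplified sketch of context compaction (assumed names and thresholds):
# keep the system prompt and the last few turns verbatim, and collapse
# everything older into a one-line summary so stale detail stops competing
# for the model's attention.

def compact_history(history: list[dict], keep_recent: int = 6) -> list[dict]:
    """Return system prompt + summary of old turns + the most recent turns."""
    if len(history) <= keep_recent + 1:
        return history  # nothing worth compacting yet

    system_prompt = history[0]
    old_turns = history[1:-keep_recent]
    recent_turns = history[-keep_recent:]

    # In a real system this summary would come from an LLM call; here it's a stub.
    summary = {
        "role": "system",
        "content": f"Summary of {len(old_turns)} earlier turns: "
                   + " / ".join(t["content"][:40] for t in old_turns[-3:]),
    }
    return [system_prompt, summary, *recent_turns]
```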

This research helps us refine how Parloa’s platform manages long, complex sessions, from compaction strategies that trim stale context to diagnostics that flag when conversations start looping. The goal: agents that stay clear, focused, and efficient, no matter how long the call lasts.

Because in enterprise AI, success isn’t just about keeping the conversation going; it’s about knowing when it’s worth continuing.