GPT-5.2 doesn’t just follow instructions, it follows through
There’s a specific kind of failure we like to track closely. It doesn’t show up in latency graphs or user feedback. It sounds like the system is doing the right thing. It looks like it’s doing the right thing. And yet: the action never happens.
The model says, “I’ll forward you now,” but never triggers the tool call. The backend sees no event; the customer hears a promise with no follow-through. And unless you know exactly what to look for, it slips through unnoticed.
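In an OpenAI-style tool-calling API, this failure has a recognizable shape: the assistant turn carries a handoff promise in its text, but no tool call comes with it. Here's a minimal sketch of a detector for that pattern; the phrase list and message structure are illustrative, not our production logic:

```python
# Minimal sketch: flag assistant turns that promise a handoff but never call a tool.
# The phrase list is an illustrative assumption, not an exhaustive check.
HANDOFF_PHRASES = ("forward you", "transfer you", "connect you")

def is_silent_dropout(message: dict) -> bool:
    """True if the turn sounds like a handoff but contains no tool call."""
    text = (message.get("content") or "").lower()
    promised = any(phrase in text for phrase in HANDOFF_PHRASES)
    acted = bool(message.get("tool_calls"))
    return promised and not acted

# The failure described above: a promise with no follow-through.
turn = {"role": "assistant", "content": "I'll forward you now.", "tool_calls": []}
assert is_silent_dropout(turn)  # the text sounds right, but the backend sees no event
```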
Here’s the thing: these aren't hallucinations or bad responses. They’re something harder to detect: alignment mismatches between language and execution. We’ve seen them in every frontier model so far, including the best ones. GPT-4.1. GPT-5.1. They understand the goal. They generate the right phrasing. But when it comes time to act, they hesitate or skip the step entirely.
That’s what made GPT-5.2 stand out.
What we set out to test
First off, we didn’t test it for better answers; we tested it for better decisions. Tool calls, specifically: whether the model recognized when to trigger one, and whether it did so at the right moment, without relying on a padded prompt or backup logic.
We ran 500 simulated calls in a real-world insurance customer-service use case, a straight comparison of GPT-5.2 alpha against our GPT-5.1 baseline. And what we saw wasn’t a marginal uptick; it was a behavioral shift.
Routing failures dropped from 18.2% to 4.0%. In the final-turn forwarding step—where failures are most damaging—that number fell from 16.8% to 3.6%. In other words, the model got more accountable.
How the model changed
In earlier models, the pattern was consistent: the agent understood the instruction but often lacked execution reliability. We saw routing decisions stall at the last moment, or succeed only with additional prompting. We touched on this previously: this wasn’t a comprehension problem. It was a timing and trust issue, especially in multi-turn tasks where the model had to choose between responding and acting.
[Figure: Bayesian posterior distributions showing GPT-5.2 (red) with a much lower routing failure probability than GPT-5.1 (blue)]
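For readers who want to reproduce this kind of comparison: a Beta-Binomial posterior over each model’s failure rate is enough to show whether the gap is real. The sketch below assumes 500 simulated calls per model, failure counts implied by the reported rates, and a uniform Beta(1, 1) prior; it’s an illustration, not our analysis pipeline.

```python
# Sketch of a Beta-Binomial comparison behind a plot like the one above.
# Assumes 500 calls per model and counts implied by the reported rates
# (91/500 ≈ 18.2%, 20/500 = 4.0%), with a uniform Beta(1, 1) prior.
import numpy as np

rng = np.random.default_rng(0)

def posterior_samples(failures: int, trials: int, n: int = 100_000) -> np.ndarray:
    """Draw samples from the Beta posterior over the routing failure probability."""
    return rng.beta(1 + failures, 1 + trials - failures, size=n)

p_51 = posterior_samples(failures=91, trials=500)  # GPT-5.1 baseline
p_52 = posterior_samples(failures=20, trials=500)  # GPT-5.2 alpha

# Posterior probability that GPT-5.2's routing failure rate is genuinely lower.
print(f"P(GPT-5.2 < GPT-5.1) = {np.mean(p_52 < p_51):.4f}")
```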
GPT-5.2 appears to resolve that tradeoff. It commits to the action more reliably, without getting caught in surface-level politeness. That sounds like a small fix on the surface. But operationally, it removes one of the most persistent sources of friction in agent workflows.
[Figure: Bayesian posterior distributions showing GPT-5.2 (red) with significantly improved routing-call timing compared to GPT-5.1 (blue)]
This change also reduces the need for prompt-engineering gymnastics. We didn’t have to scaffold the behavior with hand-coded logic or extra conditions. The model just picked the right moment and acted.
That said, it’s also worth noting what didn’t change
Error rates across the board—runtime, tool format, malformed responses—held steady at zero. That includes the typical edge-case pitfalls we’ve learned to expect when upgrading models. And while the tooling logic stayed intact, the model’s output got leaner. Message length dropped nearly 50%, from 183 to 93 characters.
We’ve seen verbose replies from earlier versions pad out conversations without adding clarity. That bloat adds latency, burns tokens, and sometimes confuses users. Essentially, GPT-5.2’s brevity wasn’t just stylistic. It made conversations faster and more focused, without losing intent.
What this unlocks
These gains directly affect how we build.
In production, every failure mode has to be mitigated, either in code or in prompt logic:
Tool call dropouts mean fallback flows.
Verbosity means post-processing. More importantly, in customer service, it also frustrates callers because they want fast resolution, not long-winded responses.
Lack of alignment means defensive prompting that bloats every turn.
Each workaround might seem small, but across thousands of interactions, they stack into significant overhead.
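To make that overhead concrete, here is the kind of guard we mean: a wrapper that re-prompts whenever the model promises a handoff without emitting the routing tool call. The model name, prompt text, and phrase check are illustrative; the point is that every line of this exists only to compensate for the failure mode described above.

```python
# Hypothetical fallback flow: this wrapper exists only because the model might
# promise a handoff without calling the routing tool. With a model that commits
# reliably, the whole layer becomes unnecessary.
HANDOFF_PHRASES = ("forward you", "transfer you", "connect you")

def call_with_routing_guard(client, messages, tools, max_retries=1):
    for _ in range(max_retries + 1):
        reply = client.chat.completions.create(
            model="gpt-5.1", messages=messages, tools=tools  # model name illustrative
        )
        msg = reply.choices[0].message
        promised = any(p in (msg.content or "").lower() for p in HANDOFF_PHRASES)
        if msg.tool_calls or not promised:
            return msg  # either it acted, or no handoff was promised
        # Defensive prompting: nudge the model to actually trigger the tool.
        messages = messages + [
            {"role": "assistant", "content": msg.content},
            {"role": "system", "content": "If you intend to forward the caller, call the routing tool now."},
        ]
    return msg
```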
With GPT-5.2, that overhead shifts. You can trust the model to carry more of the execution burden directly. You don’t need to write around its uncertainty. That opens up more than performance wins—it opens architectural headroom.
We’re seeing the beginning of what you might call procedural confidence: models that don’t just understand what needs to happen, but follow through with the right call, at the right moment, at the right layer of the stack.
It’s a baseline requirement for agents that operate at scale.
It also reframes what we optimize for. In the past, a lot of tuning work focused on refining the language: how the agent sounds, how clearly it explains, and how natural the interaction feels. That work still matters. But the bigger question now is: can you trust it to do the thing?
GPT-5.2 shifts that trust boundary.
What comes next
We’re continuing to validate it in different domains, with more complex flows and edge-case inputs. But the early signal is strong. It's a model that makes fewer critical mistakes, in exactly the places where those mistakes cost the most.
If you’re running agents at production scale, that tradeoff is worth more than any new feature. Because it’s what makes performance sustainable.
Preview support for GPT-5.2 will roll out soon in AMP. If you’re already running simulations and want to validate them against your own prompts and tools, reach out. Meanwhile, we’ll keep sharing what we find.

