GPT-5.1 is here. Naturally, we tested it

When a new LLM arrives, it rarely feels like a single moment. It shows up first as a sandbox, then as a few tests, then as a flood of observations. And when you work with AI agents that make decisions in real time, even small changes are worth paying attention to.
We’ve been quietly testing OpenAI’s GPT-5.1, running it through a series of internal real-world benchmarks, especially for tool calling and instruction following, which are foundational to how customer service agents operate inside Parloa’s AI agent management platform.
Below, you’ll get a quick look at what we’re seeing in our earliest experiments, where the differences show up, and what comes next as we prepare to make GPT-5.1 available for preview inside our own platform.
Why real-world benchmarks matter for us
Most people associate new models with better text: accurate answers, fewer hallucinations, faster responses. And to be fair, GPT-5.1 delivers on that promise in many places. However, when you’re building production AI agents that answer millions of calls and perform actions on behalf of the caller, there is no room for error.
It’s not a flashy use case, but it’s unforgiving. For instance, let’s say the agent gets a question. It has to decide whether to answer, escalate, or forward. And that decision triggers a specific tool call, with a specific structure, under specific rules.
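To make that concrete, here is a minimal sketch of how an answer/escalate/forward decision can be expressed as tool definitions for the OpenAI Chat Completions API. The tool names, parameters, system prompt, and model identifier below are illustrative assumptions, not Parloa’s actual schema.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tools: the model must pick one of these (or answer directly).
tools = [
    {
        "type": "function",
        "function": {
            "name": "forward_to_human",  # hypothetical tool name
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {"type": "string", "description": "Target queue, e.g. 'claims'."},
                    "reason": {"type": "string", "description": "Why the call is being transferred."},
                },
                "required": ["department", "reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_ticket",  # hypothetical tool name
            "description": "Open an escalation ticket for a specialist to follow up.",
            "parameters": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5.1",  # assumed model identifier; the name exposed by the API may differ
    messages=[
        {"role": "system", "content": "Answer directly when you can; otherwise call exactly one tool."},
        {"role": "user", "content": "I'd like to cancel my insurance policy."},
    ],
    tools=tools,
)

# The model either answers in plain text or requests one of the tools above.
message = response.choices[0].message
print(message.tool_calls or message.content)
```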
Before a model goes live, it has to earn our trust
For example, imagine a customer calls to ask about canceling their insurance policy. If the model replies, "I'll connect you to an agent who can help with that," but fails to actually initiate the transfer tool, the conversation stalls.
The customer assumes action is being taken when nothing is happening on the backend. That kind of mismatch is invisible to the customer until it causes friction. And in high-volume environments, those small failures add up fast.
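This is exactly the kind of mismatch an automated check can flag in simulation. Below is a minimal sketch, assuming each assistant turn is stored as a plain dict; the field names, transfer phrases, and tool name are illustrative, not Parloa’s tooling.

```python
import re

# Phrases that signal the assistant has promised a transfer (illustrative list).
TRANSFER_PHRASES = re.compile(r"\b(connect you|transfer you|forward you)\b", re.IGNORECASE)

def promised_but_missing_transfer(turn: dict) -> bool:
    """Flag turns where the assistant promises a transfer but emits no matching tool call."""
    said_transfer = bool(turn.get("content") and TRANSFER_PHRASES.search(turn["content"]))
    called_transfer = any(
        call["function"]["name"] == "forward_to_human"  # hypothetical tool name
        for call in (turn.get("tool_calls") or [])
    )
    return said_transfer and not called_transfer

# This turn would be flagged: the promise has no tool call behind it.
stalled_turn = {"content": "I'll connect you to an agent who can help with that.", "tool_calls": []}
assert promised_but_missing_transfer(stalled_turn)
```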
So, we ran a simulation. We took one of our enterprise deployments, an insurance use case, and re-ran thousands of simulated calls, comparing our existing baseline against the new GPT-5.1. The plan was simple: hold everything constant (same prompt, same routing rules, same environment) and see how GPT-5.1 behaved compared to its predecessors.
From there, we adapted the prompts step by step, not just to measure performance, but to understand the migration complexity between models. Throughout this process, we’ve been collaborating closely with OpenAI’s engineers.
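As a rough illustration of the hold-everything-constant setup, here is a minimal sketch of a comparison loop: same system prompt, same tools, same scenarios, only the model changes. The scenarios, tool, expected outcomes, and model names are hypothetical placeholders, not the production harness or real benchmark data.

```python
from openai import OpenAI

client = OpenAI()

# A single hypothetical tool, mirroring the earlier sketch.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "forward_to_human",
        "description": "Transfer the caller to a human agent.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

# Hypothetical scenarios: each pairs a caller utterance with the tool call we
# expect for it (None means a plain answer is expected). A real suite covers
# thousands of simulated calls, not two.
SCENARIOS = [
    {"utterance": "I'd like to cancel my insurance policy.", "expected_tool": "forward_to_human"},
    {"utterance": "What documents do I need to file a claim?", "expected_tool": None},
]

SYSTEM_PROMPT = "Answer directly when you can; otherwise call exactly one tool."  # held constant

def first_tool_called(model: str, utterance: str) -> str | None:
    """Run one simulated turn and return the name of the tool the model called, if any."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": utterance},
        ],
        tools=TOOLS,
    )
    calls = response.choices[0].message.tool_calls
    return calls[0].function.name if calls else None

def tool_accuracy(model: str) -> float:
    """Fraction of scenarios where the model's tool decision matches the expectation."""
    hits = sum(first_tool_called(model, s["utterance"]) == s["expected_tool"] for s in SCENARIOS)
    return hits / len(SCENARIOS)

for model in ("gpt-4.1", "gpt-5.1"):  # illustrative model names
    print(model, tool_accuracy(model))
```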
GPT‑5.1 is sharper at following instructions, with some new tradeoffs
GPT‑5.1 sticks more closely to the instructions it’s given. That’s great news when you’re building something from scratch. But in enterprise setups, where prompts evolve over time, that precision can be a double-edged sword.
Take one of our insurance agents, for example. It had been running smoothly for months, but when we swapped in GPT‑5.1, things changed in an unexpected way. The original instruction said: “Always respond with a question to the caller.” That made sense for most cases, but not when forwarding the caller to a human. The model started asking, “Would you like to be forwarded?”, which added friction and didn’t match our intent.
The fix was simple: we updated the prompt to say, “Always respond with a question to the caller, except when you forward the caller to a human.” And this experience highlighted something important: GPT‑5.1’s precision surfaces prompt flaws that older models simply worked around. That’s an opportunity, but also a cost when migrating.
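In code terms, the fix is a one-line change to the system prompt. A minimal sketch, where the two rule strings are the ones quoted above and the surrounding framing is an illustrative placeholder, not Parloa’s production prompt:

```python
# The rule that older models quietly worked around when forwarding.
OLD_RULE = "Always respond with a question to the caller."

# The rule GPT-5.1 needs, with the forwarding case called out explicitly.
NEW_RULE = (
    "Always respond with a question to the caller, "
    "except when you forward the caller to a human."
)

# Hypothetical framing around the rule; only the rule wording comes from this post.
SYSTEM_PROMPT = "\n".join([
    "You are a voice agent for an insurance hotline.",
    NEW_RULE,
])

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I'd like to cancel my insurance policy."},
]
```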
OpenAI’s new prompting guidelines for GPT‑5.1 reflect a lot of what we’ve learned the hard way: how small wording tweaks can change model behavior, how temperature and reasoning settings interact, and how system prompts shape long-term performance. As those insights become public, we’ll be folding them into our own best practices and baking them directly into AMP, our AI agent management platform, so teams can build on proven patterns that are tested, shared, and ready for production.
It’s easy to overinterpret early results, and we’re certainly not doing that
This test isn’t a model review, nor is it a performance verdict. GPT-5.1 is still new, and we expect significant improvements from OpenAI over the next few weeks.
But that’s exactly why this work matters to us and our customers. Benchmarking now helps us understand where the model’s behavior is different and adjust our setup accordingly. Whether that’s rewriting prompts, tuning temperature, or rethinking how we structure instructions, testing gives us the time to adapt before deploying at scale.
And we've already seen that GPT-5.1 is capable. If you’re a Parloa customer and want early access to test GPT-5.1 inside your AMP environment, reach out. Preview support will be available soon, and we’re excited to see what you find.
Contact our team.