GPT-5.1 is here. Naturally, we tested it

When a new LLM arrives, it rarely feels like a single moment. It shows up first as a sandbox, then as a few tests, then as a flood of observations. And when you work with AI agents that make decisions in real time, even small changes are worth paying attention to.
We’ve been quietly testing OpenAI’s GPT-5.1, running it through a series of internal real-world benchmarks, especially for tool calling and instruction following, which are foundational to how customer service agents operate inside Parloa’s AI agent management platform.
Below, you’ll get a quick look at what we’re seeing in our earliest experiments, where the differences show up, and what comes next as we prepare to make GPT-5.1 available for preview inside our own platform.
Why real-world benchmarks matter for us
Most people associate new models with better text: accurate answers, fewer hallucinations, faster responses. And to be fair, GPT-5.1 delivers on that promise in many places. However, when you’re building production AI agents that answer millions of calls and perform actions on behalf of the caller, there is no room for error.
It’s not a flashy use case, but it’s unforgiving. For instance, let’s say the agent gets a question. It has to decide whether to answer, escalate, or forward. And that decision triggers a specific tool call, with a specific structure, under specific rules.
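To make that concrete, here is a minimal sketch of how an answer/escalate/forward decision can be expressed as tool definitions for the OpenAI Chat Completions API. The tool names, parameters, system prompt, and model identifier below are illustrative assumptions, not Parloa’s actual schema.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tools: the model must pick one of these (or answer directly).
tools = [
    {
        "type": "function",
        "function": {
            "name": "forward_to_human",  # hypothetical tool name
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {"type": "string", "description": "Target queue, e.g. 'claims'."},
                    "reason": {"type": "string", "description": "Why the call is being transferred."},
                },
                "required": ["department", "reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_ticket",  # hypothetical tool name
            "description": "Open an escalation ticket for a specialist to follow up.",
            "parameters": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5.1",  # assumed model identifier; the name exposed by the API may differ
    messages=[
        {"role": "system", "content": "Answer directly when you can; otherwise call exactly one tool."},
        {"role": "user", "content": "I'd like to cancel my insurance policy."},
    ],
    tools=tools,
)

# The model either answers in plain text or requests one of the tools above.
message = response.choices[0].message
print(message.tool_calls or message.content)
```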
Before a model goes live, it has to earn our trust
For example, imagine a customer calls to ask about canceling their insurance policy. If the model replies, "I'll connect you to an agent who can help with that," but fails to actually initiate the transfer tool, the conversation stalls.
The customer assumes action is being taken when nothing is happening on the backend. That kind of mismatch is invisible to the customer until it causes friction. And in high-volume environments, those small failures add up fast.
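This is exactly the kind of mismatch an automated check can flag in simulation. Below is a minimal sketch, assuming each assistant turn is stored as a plain dict; the field names, transfer phrases, and tool name are illustrative, not Parloa’s tooling.

```python
import re

# Phrases that signal the assistant has promised a transfer (illustrative list).
TRANSFER_PHRASES = re.compile(r"\b(connect you|transfer you|forward you)\b", re.IGNORECASE)

def promised_but_missing_transfer(turn: dict) -> bool:
    """Flag turns where the assistant promises a transfer but emits no matching tool call."""
    said_transfer = bool(turn.get("content") and TRANSFER_PHRASES.search(turn["content"]))
    called_transfer = any(
        call["function"]["name"] == "forward_to_human"  # hypothetical tool name
        for call in (turn.get("tool_calls") or [])
    )
    return said_transfer and not called_transfer

# This turn would be flagged: the promise has no tool call behind it.
stalled_turn = {"content": "I'll connect you to an agent who can help with that.", "tool_calls": []}
assert promised_but_missing_transfer(stalled_turn)
```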
So, we ran a simulation. We took one of our enterprise deployments, an insurance use case, and re-ran thousands of simulated calls, comparing our existing baseline against the new GPT-5.1. The plan was simple: hold everything constant (same prompt, same routing rules, same environment) and see how GPT-5.1 behaved compared to its predecessors.
From there, we adapted the prompts step by step, not just to measure performance, but to understand the migration complexity between models. Throughout this process, we’ve been collaborating closely with OpenAI’s engineers.
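As a rough illustration of the hold-everything-constant setup, here is a minimal sketch of a comparison loop: same system prompt, same tools, same scenarios, only the model changes. The scenarios, tool, expected outcomes, and model names are hypothetical placeholders, not the production harness or real benchmark data.

```python
from openai import OpenAI

client = OpenAI()

# A single hypothetical tool, mirroring the earlier sketch.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "forward_to_human",
        "description": "Transfer the caller to a human agent.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

# Hypothetical scenarios: each pairs a caller utterance with the tool call we
# expect for it (None means a plain answer is expected). A real suite covers
# thousands of simulated calls, not two.
SCENARIOS = [
    {"utterance": "I'd like to cancel my insurance policy.", "expected_tool": "forward_to_human"},
    {"utterance": "What documents do I need to file a claim?", "expected_tool": None},
]

SYSTEM_PROMPT = "Answer directly when you can; otherwise call exactly one tool."  # held constant

def first_tool_called(model: str, utterance: str) -> str | None:
    """Run one simulated turn and return the name of the tool the model called, if any."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": utterance},
        ],
        tools=TOOLS,
    )
    calls = response.choices[0].message.tool_calls
    return calls[0].function.name if calls else None

def tool_accuracy(model: str) -> float:
    """Fraction of scenarios where the model's tool decision matches the expectation."""
    hits = sum(first_tool_called(model, s["utterance"]) == s["expected_tool"] for s in SCENARIOS)
    return hits / len(SCENARIOS)

for model in ("gpt-4.1", "gpt-5.1"):  # illustrative model names
    print(model, tool_accuracy(model))
```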
GPT‑5.1 is sharper at following instructions, with some new tradeoffs
GPT‑5.1 sticks more closely to the instructions it’s given. That’s great news when you’re building something from scratch. But in enterprise setups, where prompts evolve over time, that precision can be a double-edged sword.
Take one of our insurance agents, for example. It had been running smoothly for months, but when we swapped in GPT‑5.1, things changed in an unexpected way. The original instruction said: “Always respond with a question to the caller.” That made sense for most cases, but not when forwarding the caller to a human. The model started asking, “Would you like to be forwarded?”, which added friction and didn’t match our intent.
The fix was simple: we updated the prompt to say, “Always respond with a question to the caller, except when you forward the caller to a human.” And this experience highlighted something important: GPT‑5.1’s precision surfaces prompt flaws that older models simply worked around. That’s an opportunity, but also a cost when migrating.
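In code terms, the fix is a one-line change to the system prompt. A minimal sketch, where the two rule strings are the ones quoted above and the surrounding framing is an illustrative placeholder, not Parloa’s production prompt:

```python
# The rule that older models quietly worked around when forwarding.
OLD_RULE = "Always respond with a question to the caller."

# The rule GPT-5.1 needs, with the forwarding case called out explicitly.
NEW_RULE = (
    "Always respond with a question to the caller, "
    "except when you forward the caller to a human."
)

# Hypothetical framing around the rule; only the rule wording comes from this post.
SYSTEM_PROMPT = "\n".join([
    "You are a voice agent for an insurance hotline.",
    NEW_RULE,
])

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I'd like to cancel my insurance policy."},
]
```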
OpenAI’s new prompting guidelines for GPT‑5.1 reflect a lot of what we’ve learned the hard way: how small wording tweaks can change model behavior, how temperature and reasoning settings interact, and how system prompts shape long-term performance. As those insights become public, we’ll be folding them into our own best practices and baking them directly into AMP, our AI agent management platform, so teams can build on proven patterns that are tested, shared, and ready for production.
It’s easy to overinterpret early results, and we’re certainly not doing that
This test isn’t a model review, nor is it a performance verdict. GPT-5.1 is still new, and we expect significant improvements from OpenAI over the next few weeks.
But that’s exactly why this work matters to us and our customers. Benchmarking now helps us understand where the model’s behavior is different and adjust our setup accordingly. Whether that’s rewriting prompts, tuning temperature, or rethinking how we structure instructions, testing gives us the time to adapt before deploying at scale.
And we've already seen that GPT-5.1 is capable. If you’re a Parloa customer and want early access to test GPT-5.1 inside your AMP environment, reach out. Preview support will be available soon, and we’re excited to see what you find.
Contact our team.