Research
Insights into the core challenges and opportunities of agentic AI.
Research archive
We got early access to GPT-5.1 Thinking. Naturally, we tested it
We’ve been quietly testing OpenAI’s GPT-5.1, running it through a series of internal real-world benchmarks with a focus on tool calling and instruction following, the capabilities foundational to how customer service agents operate inside Parloa’s AI agent management platform.
The never-ending conversation: Measuring long-conversation performance in LLMs
At Parloa, our AI agents handle long conversations every day: tool-heavy dialogues where context grows fast and ambiguity grows faster. We wanted to know: does performance actually degrade when conversations get that long, or do today’s models hold steady?
A Bayesian framework for A/B testing AI agents
We’re introducing a hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores in a single framework that accounts for variation across groups.
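To make the idea concrete, here is a minimal sketch of a hierarchical model in this spirit, written in PyMC on synthetic data. It is an illustration under assumptions, not the model from the post: the grouping structure, the Bernoulli likelihood for the binary metric, and the Beta likelihood tying the judge score to the same latent quality are all choices made for this sketch, and every variable name is a hypothetical placeholder.

```python
import numpy as np
import pymc as pm

# Toy data standing in for per-conversation A/B results; every name here
# (resolved, judge, group_idx) is a hypothetical placeholder, not Parloa's schema.
rng = np.random.default_rng(0)
n_obs, n_groups = 200, 4
group_idx = rng.integers(0, n_groups, size=n_obs)   # e.g., intent category
variant = rng.integers(0, 2, size=n_obs)            # 0 = agent A, 1 = agent B
resolved = rng.integers(0, 2, size=n_obs)           # deterministic binary metric
judge = rng.uniform(0.01, 0.99, size=n_obs)         # LLM-judge score in (0, 1)

with pm.Model() as model:
    # Population-level effect of variant B over A, on the logit scale.
    mu_effect = pm.Normal("mu_effect", 0.0, 1.0)
    sigma_group = pm.HalfNormal("sigma_group", 1.0)
    # Partial pooling: each group's effect is drawn from the shared distribution,
    # which is what lets the model account for variation across groups.
    group_effect = pm.Normal("group_effect", mu_effect, sigma_group, shape=n_groups)
    baseline = pm.Normal("baseline", 0.0, 1.0, shape=n_groups)

    logit_quality = baseline[group_idx] + group_effect[group_idx] * variant

    # Binary metric: Bernoulli likelihood on the shared latent quality.
    pm.Bernoulli("resolved_obs", logit_p=logit_quality, observed=resolved)

    # Judge score: Beta likelihood whose mean is the same latent quality,
    # so both signals inform a single posterior.
    quality = pm.math.sigmoid(logit_quality)
    kappa = pm.HalfNormal("kappa", 10.0)
    pm.Beta("judge_obs", alpha=quality * kappa, beta=(1.0 - quality) * kappa,
            observed=judge)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# The posterior for mu_effect summarizes the overall A/B difference, while
# group_effect shows where the variants diverge group by group.
```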