In the fast-paced world of customer support, technology advancements are revolutionizing how organizations respond to and engage with customers. OpenAI’s Realtime API stands among the latest developments pushing these boundaries, offering real-time, multimodal capabilities that hold transformative potential for customer experience and call center operations. Its ability to integrate seamlessly into customer service interactions introduces a new level of immediacy and human-like responsiveness, poised to further elevate customer experience across industries.
While still in its preview stage, OpenAI’s Realtime API has generated considerable excitement—especially for applications in phone-first customer support environments, where quick, accurate responses are critical. As an OpenAI partner, we work closely with the OpenAI team as an early adopter of new rollouts like this.
As part of that relationship, we decided to conduct some early evaluations of the Realtime API. We ran the API through the first stage of our rigorous testing suite, typically used for internal development and quality control before Parloa’s AI agents go live. The results were quite interesting, and they offer a preview of the new environment the Realtime API is set to create for AI-based communication.
What is OpenAI’s Realtime API?
As its name suggests, OpenAI’s Realtime API enables real-time audio-to-audio communication between two parties, typically a person speaking with an AI agent.
Current audio-to-audio communication typically occurs over several stages:
1. A person speaks to the AI agent.
2. The AI agent converts the person’s audio into text, a process called speech-to-text (STT).
3. The AI agent generates its response in text form, and then converts that text into audio in a process called text-to-speech (TTS).
4. The AI agent’s audio is sent back to the person.
Steps 1–4 repeat for every exchange between the person and the AI agent, and happen virtually instantaneously. Nevertheless, having multiple steps in the process still leaves room for improvement. OpenAI’s Realtime API removes the need for STT and TTS, enabling both parties to communicate directly through audio.
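For comparison, here is a minimal sketch of the traditional multi-step pipeline using the standard OpenAI Python SDK’s transcription, chat, and speech endpoints. The model names, voice, and system prompt are placeholders chosen for illustration; the point is the three separate hops that the Realtime API collapses into a single audio-to-audio exchange.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def handle_turn(audio_path: str) -> bytes:
    """One conversational turn via the classic STT -> LLM -> TTS pipeline."""
    # Step 2: speech-to-text (STT)
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 3a: generate the agent's reply as text
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful airline support agent."},
            {"role": "user", "content": transcript.text},
        ],
    )

    # Step 3b: text-to-speech (TTS)
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )

    # Step 4: audio bytes to play back to the caller
    return speech.read()
```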
The introduction of OpenAI’s Realtime API marks a significant leap forward in enabling natural, dynamic customer interactions over voice channels. As call centers and support teams increasingly embrace digital transformation, real-time voice-based AI capabilities like this will be crucial for customer satisfaction and broader adoption of AI communication:
- Reduced Latency: The Realtime API delivers responses at nearly imperceptible speeds, creating a more fluid and human-like interaction experience. By eliminating the need for interim text transcriptions, audio-to-audio processing cuts down on response time and creates a more seamless conversation.
- Enhanced Functionality: With capabilities like voice activity detection (VAD) and end-of-speech recognition, interactions are more attuned to natural speech, further narrowing the gap between AI and human agents (a connection sketch follows this list).
- Increased Context: Audio-to-audio exchanges capture dimensions of a conversation that text alone can’t encompass. Tone, emotion, emphasis, speed, and other ‘human’ elements of spoken conversation now have the opportunity to influence and better personify exchanges with AI agents.
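As a rough illustration rather than a production recipe, the sketch below shows how a Realtime session might be opened with server-side voice activity detection enabled. It assumes the preview’s WebSocket endpoint, the `gpt-4o-realtime-preview` model, the `session.update` event with a `turn_detection` block, and the third-party `websockets` package; since the API is still in preview, exact URLs, event names, and fields may change.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Endpoint and model name follow the preview documentation and may change.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"


async def open_session() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Older websockets releases call this parameter extra_headers;
    # newer ones use additional_headers.
    async with websockets.connect(REALTIME_URL, extra_headers=headers) as ws:
        # Configure the session: audio output plus server-side VAD, so the
        # API decides when the caller has finished speaking.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "alloy",
                "turn_detection": {"type": "server_vad"},
                "instructions": "You are a friendly airline support agent.",
            },
        }))
        # From here, microphone audio would be streamed in with
        # input_audio_buffer.append events and the reply read back as
        # audio deltas; both are omitted for brevity.
        first_event = json.loads(await ws.recv())
        print("server event:", first_event.get("type"))


if __name__ == "__main__":
    asyncio.run(open_session())
```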
Parloa’s Evaluation of the Realtime API
The Realtime API is promising, but that doesn’t mean it’s ready to venture into call centers just yet. Implementing this technology in a customer-ready format requires extensive testing and refinement. That said, we certainly congratulate OpenAI on launching its Realtime API and are quite confident that they will rapidly improve on this model.
As part of our commitment to offering high-quality, reliable customer service tools, Parloa has developed its own testing suite to ensure that our AI agents are ready to interact with live customers. This process entails multiple stages of increasing rigor, and we ran OpenAI’s Realtime API through our first stage to see how it would fare.
Putting the Realtime API to the Test: The KronosJet Case
In testing the Realtime API, Parloa built a fictional airline scenario, ‘KronosJet,’ to evaluate how well the API could handle typical customer interactions in a simulated environment. The objective was to assess how accurately and efficiently the API could interpret requests and execute tool calls to find flight details, book flights, and answer frequently asked questions.
During the KronosJet tests, Parloa used assertion-based testing, where agents were given specific tasks with clearly defined expected outcomes. This method ensures that the AI agent’s responses are systematically verified against a “correct” answer. While the Realtime API showed promise, it often struggled with complex tool calls, such as correctly interpreting family member details in a booking request. This resulted in minor inaccuracies, such as booking flights for one person instead of two.
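To make the idea of assertion-based testing concrete, here is a simplified, hypothetical sketch of what such a check can look like. The scenario, tool name, and the `run_agent_turn` helper are illustrative placeholders rather than Parloa’s internal tooling; the essential pattern is comparing the tool call the agent actually made against a predefined expected outcome.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class TestCase:
    utterance: str        # what the simulated customer says
    expected_tool: str    # tool the agent must call
    expected_args: dict   # arguments the call must contain


def assert_tool_call(actual: ToolCall, case: TestCase) -> list[str]:
    """Return a list of assertion failures (an empty list means the case passed)."""
    failures = []
    if actual.name != case.expected_tool:
        failures.append(f"expected tool {case.expected_tool!r}, got {actual.name!r}")
    for key, value in case.expected_args.items():
        if actual.arguments.get(key) != value:
            failures.append(
                f"argument {key!r}: expected {value!r}, got {actual.arguments.get(key)!r}"
            )
    return failures


# Example: a booking request for two passengers must produce a
# book_flight call with passenger_count=2, not 1.
case = TestCase(
    utterance="I'd like to book the 9am flight to Berlin for me and my daughter.",
    expected_tool="book_flight",
    expected_args={"destination": "BER", "passenger_count": 2},
)

# run_agent_turn() is a hypothetical harness that plays the utterance to the
# agent and returns the tool call it produced:
# failures = assert_tool_call(run_agent_turn(case.utterance), case)
```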
Here are some of our key findings:
Tool Calling Capabilities: A Make-or-Break Feature
In customer service, tool-calling abilities allow AI agents to interact with databases, retrieve relevant information, and respond accurately to complex customer queries. This feature is essential for efficient problem-solving, especially in enterprise settings where agents may need to access various information systems.
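To illustrate what tool calling means in practice, the sketch below defines a hypothetical flight-lookup tool in the JSON-schema style used by OpenAI’s function-calling interfaces (the flat layout the Realtime preview’s session configuration appears to use; the nested Chat Completions form differs slightly). The `find_flight` name and its fields are invented for this example, not taken from the KronosJet setup.

```python
# A hypothetical flight-lookup tool. The model decides when to call it and
# with which arguments; the application executes the actual lookup.
find_flight_tool = {
    "type": "function",
    "name": "find_flight",
    "description": "Look up available flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code of the departure airport"},
            "destination": {"type": "string", "description": "IATA code of the arrival airport"},
            "date": {"type": "string", "description": "Departure date in YYYY-MM-DD format"},
            "passenger_count": {"type": "integer", "description": "Number of travellers"},
        },
        "required": ["origin", "destination", "date"],
    },
}
```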
Parloa’s tests of the Realtime API indicate that its tool-calling functionality shows potential but remains inconsistent. In the KronosJet case, we simulated hundreds of calls to gauge the API’s accuracy in using the right tools at the right time. In approximately 50% of cases, the Realtime API successfully initiated a tool call, but it frequently missed cues to retrieve specific information, such as passenger details.
Temperature settings proved challenging as well: the Realtime API offered only limited support for this parameter, which controls response creativity. With a default temperature of 0.6, the API showed a tendency to generate creative or tangential responses, which can increase the likelihood of ‘hallucinations,’ or responses in which the AI agent fabricates information.
Parloa’s tests prioritized settings to keep responses focused on solving the immediate customer query without unnecessary deviations, but a 50% success rate and a tendency to make up information are clear red flags.
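By way of illustration, temperature sits in the same session configuration shown in the earlier connection sketch, alongside instructions that keep the agent on task. This is a minimal sketch assuming the preview’s `session.update` event accepts a `temperature` field; the field name and the permitted range may differ as the preview evolves.

```python
# Sent on an already-open Realtime WebSocket, e.g. with
# await ws.send(json.dumps(session_update)).
session_update = {
    "type": "session.update",
    "session": {
        # Preview default; lower values were not reliably supported in our tests.
        "temperature": 0.6,
        "instructions": (
            "Answer only the customer's current request. "
            "Do not volunteer information you have not retrieved via a tool."
        ),
    },
}
```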
Reducing Response Latency with Audio-to-Audio Processing
For many customer support operations, reducing response time is paramount. Audio-to-audio processing allows AI agents to engage directly in voice conversations, circumventing the need for text transcriptions. This streamlined approach can significantly decrease latency, making responses almost instantaneous and adding a human-like quality to interactions.
In Parloa’s evaluation, direct audio-to-audio processing was one of the standout benefits of the Realtime API. By removing intermediate processing steps, the API minimizes latency, which is especially valuable in high-volume call centers where seconds can make a difference in customer satisfaction.
Issues with the API’s voice activity detection (VAD) did arise, however. While VAD is intended to recognize when a customer has finished speaking, it occasionally misfired, resulting in interruptions or delayed responses.
We temporarily disabled VAD to keep our test results consistent, a workaround that signals further refinement in this area could improve future implementations.
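For completeness, here is a rough sketch of how that workaround looks in practice: turn detection is switched off in the session configuration, and the application then signals end-of-turn explicitly by committing the audio buffer and requesting a response. The event names follow the preview documentation and may change; the snippet assumes an open Realtime WebSocket like the one in the earlier connection sketch.

```python
# 1. Disable server-side VAD so the API no longer guesses when the caller stops.
disable_vad = {
    "type": "session.update",
    "session": {"turn_detection": None},
}

# 2. Stream audio chunks as usual with input_audio_buffer.append events, then
#    explicitly close the turn and request a reply once the caller is done.
end_of_turn = [
    {"type": "input_audio_buffer.commit"},
    {"type": "response.create"},
]

# await ws.send(json.dumps(disable_vad))
# ...stream audio...
# for event in end_of_turn:
#     await ws.send(json.dumps(event))
```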
Fine-Tuning for Practical Customer Applications
Another critical part of Parloa’s testing involved setting up simulations with agents handling multiple interactions to explore the Realtime API’s performance under varied conditions. These simulations included random customer inquiries, from simple FAQs to more intricate service requests. However, the AI agents frequently misinterpreted inquiries, suggesting that additional training or adjustments to temperature settings could enhance the API’s performance.
Our tests also revealed that, although the Realtime API performed adequately in simple, straightforward requests, it occasionally hallucinated answers in more complex interactions. For instance, in response to general inquiries about airline reward points or online check-in, the API sometimes advised users to contact customer service rather than providing the correct tool-based response. While manageable in a test environment, these inaccuracies would need correction in a live setting to meet Parloa’s quality standards.
Implications for the Future of Realtime API in Customer Support
Unsurprisingly, our tests concluded that OpenAI’s Realtime API isn’t ready for large-scale deployment…yet. As we’ve seen over the past two years, generative AI continues to evolve at rapid speed, and we expect the Realtime API’s full arrival in audio communication to be imminent.
With improvements in tool calling reliability, response accuracy, and temperature settings, we anticipate that the Realtime API could become a valuable addition to Parloa’s autonomous AI agents.
We’re continuing to monitor and test developments like OpenAI’s Realtime API, in keeping with our focus on optimizing agent performance and ensuring that the customer experience is as quick and intuitive as possible. We pride ourselves on this combination of experimentation, rigorous testing protocols, and customer-centered design: a formula that’s established Parloa as a leader in the next generation of customer support solutions.