
The latency paradox: Why voice AI speed is a budget, not a target

28 May 2026

Author(s)

Kevin Boyer

Vice President, Product Marketing


Voice AI engineers have spent years optimizing for speed: faster transcription, lower time-to-first-token, snappier synthesis. It’s easy to understand why. In natural conversation, the gap between one speaker finishing and the next starting averages only about 200ms. The assumption, then, is that every millisecond shaved off is a win.

At Parloa, we believe this assumption misses important nuance. The right amount of pause depends on what was just said in the conversation, the type of question asked, the rhythm of the language spoken, and even background noise. Collapsing all of these into a single benchmark number ("aim for 300ms") discards the most actionable information available. For this reason, we believe that latency should be treated as a budget, not a target.

The human benchmark

Research from the Proceedings of the National Academy of Sciences (PNAS) found that in natural conversation, the average response offset is 0–200ms. That speed comes from humans listening and formulating a response concurrently, before the other person has finished their turn. Parloa’s on-connect streaming and call data service capabilities mimic this reality. Relevant information is sent to the LLM before turn-taking completes, allowing the model to begin reasoning in parallel with the caller’s utterance.

That 0–200ms benchmark is a population average across idealized conditions, however, and it does not reveal the intricacies that define the latency of each individual conversation.

In automated production, the tolerable budget shifts substantially depending on context:

Speech act type: Yes/no answers arrive 100–500ms faster than a decline to engage (e.g. “I can’t answer that”). Answers to factual questions come faster than “I don’t know” responses.
Question type: For requests, a pause exceeding 600–800ms reads as unwillingness. For where/which questions, a longer pause signals credibility.
Language and culture: Japanese conversational norms expect transitions under 10ms. Danish and other Nordic languages tolerate pauses around 400ms.
Prosody: Falling intonation at the end of a request tightens the tolerable pause window, while rising intonation loosens it. Background noise inflates perceived pause duration.[1]
Modality: In contexts with visual responses (head nods, on-screen indicators), users tolerate longer audio gaps. In voice-only interactions, where the audio response carries the entire cognitive load to indicate that something is happening, shorter gaps are necessary.

With these contextual points in mind, an agent that works only towards a set 200ms target could end up sounding more artificial in the process.
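To make the idea of a context-dependent budget concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of Parloa's implementation: the `TurnContext` fields, the baseline values, and every adjustment size are hypothetical. Only the 600ms starting point for requests is taken from the lower end of the 600–800ms range mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnContext:
    is_request: bool          # caller asked the agent to do something
    voice_only: bool          # no visual channel to show activity
    falling_intonation: bool  # the turn ended on a falling pitch
    noisy: bool               # background noise inflates perceived pauses


def pause_budget_ms(ctx: TurnContext) -> int:
    """Return an illustrative upper bound on silence before the agent speaks."""
    # 600 follows the request threshold quoted in the text; 1000 is assumed.
    budget = 600 if ctx.is_request else 1000
    if ctx.falling_intonation:
        budget -= 100  # assumed adjustment: falling intonation tightens the window
    if ctx.noisy:
        budget -= 100  # assumed adjustment: pauses feel longer in noise
    if not ctx.voice_only:
        budget += 200  # assumed adjustment: visual cues buy extra time
    return max(budget, 200)  # never go below the human conversational average


tight = pause_budget_ms(TurnContext(True, True, True, True))
loose = pause_budget_ms(TurnContext(False, False, False, False))
print(tight, loose)
```

The point of the sketch is the shape of the function: the output is a per-turn ceiling derived from context rather than a single constant shared by every turn.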

The architecture of "the wait"

Understanding where latency lives in the agent tech stack is a prerequisite for managing it intelligently. The process has four major stages, each with a characteristic latency profile and a different lever for improvement, plus safety checks that run alongside them:

Audio ingestion

Before the AI “hears” anything, the raw audio must traverse analog-to-digital conversion, packetization, and jitter buffer management. This adds 10–50ms of irreducible floor. No model optimization touches it.

Streaming STT and voice activity detection

Modern STT systems stream audio chunks in real time, using aggressive voice activity detection (VAD) to determine end-of-turn without waiting for silence. With streaming, the LLM can start processing the information before the user finishes speaking, so the context of the conversation is already available when the answer is generated.

LLM reasoning and tool calling

When an agent must call an API or query a database to produce a response, the latency is a byproduct of accuracy. This is the stage where perceived latency management matters most. Techniques like “partial results,” where the LLM begins generating a response while a data fetch is still completing, can significantly compress the experience gap without sacrificing accuracy. Additionally, intermediate messages, such as the agent saying “I am checking that” before calling an external service but after the LLM fetches some information, give the impression of a shorter wait. On-connect streaming (pre-loading caller context before turn completion) reduces the effective cost of this stage by starting the clock earlier.
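The intermediate-message technique can be sketched with `asyncio`. This is a minimal illustration, assuming hypothetical `fetch` and `say` coroutines supplied by the caller and an assumed 300ms threshold. The holding message is spoken only when the lookup is slow enough to need it.

```python
import asyncio


async def respond_with_filler(fetch, say, filler_after_s=0.3):
    """Await fetch(); if it is slow, speak a holding message first."""
    task = asyncio.create_task(fetch())
    # Give the lookup a short head start before deciding to fill the gap.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await say("I am checking that.")
    await say(await task)


async def demo():
    spoken = []

    async def say(text):
        spoken.append(text)

    async def slow_lookup():
        await asyncio.sleep(0.05)  # stands in for an API or database call
        return "Your order ships tomorrow."

    await respond_with_filler(slow_lookup, say, filler_after_s=0.01)
    return spoken


print(asyncio.run(demo()))
```

A fast lookup skips the filler entirely, which avoids padding turns that were already inside the budget.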

Streaming TTS

Low-latency voice requires a shift from cascading TTS (synthesize the full response, then play it) to streaming TTS (synthesize and play from the first token). The remaining optimization occurs when synthesis starts at a natural phrase boundary rather than waiting for a full sentence.
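A minimal sketch of phrase-boundary chunking follows. It assumes the LLM output arrives as an iterable of text tokens and that the TTS engine accepts one phrase at a time. The punctuation set and the `min_chars` threshold are illustrative choices, and a production system would need to handle cases such as decimals and abbreviations.

```python
from typing import Iterable, Iterator

PHRASE_END = (",", ";", ":", ".", "?", "!")


def phrase_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed tokens into phrase-sized chunks for a TTS engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.rstrip()
        # Flush at punctuation, but only once the chunk is long enough to
        # sound natural when synthesized on its own.
        if len(stripped) >= min_chars and stripped.endswith(PHRASE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


tokens = ["Sure", ",", " I", " can", " help", " with", " that", ",",
          " one", " moment", "."]
print(list(phrase_chunks(tokens)))
```

Here "Sure," is too short to flush on its own, so synthesis can start at the first comma that closes a reasonably sized phrase instead of waiting for the sentence to end.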

Parallel safety checks

Strong providers run guardrails in parallel with response generation, accepting that occasionally a brief fragment of ungated audio may reach the customer in exchange for a system that doesn't make customers wait for safety checks on every turn. This works only if the system can also interrupt and recover gracefully when a guardrail does flag something.
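The trade-off described here can be sketched as follows. This is an illustration only: `check`, `play`, and `recover` are hypothetical coroutines, and the timings in the demo are arbitrary. The guardrail runs as a background task while audio chunks play, and playback stops before the next chunk once the guardrail flags the response.

```python
import asyncio


async def speak_with_guardrail(chunks, check, play, recover):
    """Play chunks while check() runs concurrently.

    check() resolves to True when the response is safe. Returns True if the
    whole response was played, False if playback was cut short.
    """
    guard = asyncio.create_task(check())
    for chunk in chunks:
        # Stop before the next chunk as soon as the guardrail has flagged.
        if guard.done() and not guard.result():
            await recover()
            return False
        await play(chunk)
    # Playback finished first: wait for the verdict before closing the turn.
    if not await guard:
        await recover()
        return False
    return True


async def demo(safe):
    heard = []

    async def check():
        await asyncio.sleep(0.02)  # guardrail latency (arbitrary)
        return safe

    async def play(chunk):
        heard.append(chunk)
        await asyncio.sleep(0.04)  # playback duration (arbitrary)

    async def recover():
        heard.append("<interrupt and rephrase>")

    delivered = await speak_with_guardrail(
        ["Sure,", "here is", "the answer."], check, play, recover)
    return delivered, heard


print(asyncio.run(demo(safe=False)))
```

In the flagged case one chunk reaches the listener before the interruption, which is the "brief fragment of ungated audio" the paragraph accepts in exchange for not blocking every turn.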

Designing for perceived latency

The conventional framing of conversational fillers such as “uh-huh” and “let me look into that” treats them as latency-masking hacks. In actuality, these fillers are the start of the next response.

Back-channeling signals like “uh-huh” and “okay” function as continuers and turn-claims in conversation theory. They occupy the social slot of a response without yet delivering content, making the listener assume the agent’s turn has begun and resetting the latency clock as a result.[2][3][4]

Back-channeling signals should be emitted immediately on end-of-turn detection, while reasoning catches up. An audible inhalation similarly extends the listener’s tolerance budget, because it signals the agent’s commitment to respond.

Two other strategies can reduce the experienced delay without reducing actual processing time:

Auditory anchoring: Contextually appropriate ambient sounds (subtle keyboard activity, a soft processing tone) signal active resolution.
Consistency over raw speed: A stable 800ms response time is more trustworthy than a jittery average that fluctuates between 200ms and 2 seconds.
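The consistency point is easy to check numerically. The sketch below uses made-up latency samples, chosen so that both series share the same mean, and reports the spread with Python's standard `statistics` module.

```python
import statistics


def jitter_report(latencies_ms):
    """Summarize a series of response latencies (milliseconds)."""
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "stdev_ms": statistics.pstdev(latencies_ms),
        "range_ms": max(latencies_ms) - min(latencies_ms),
    }


stable = [800, 790, 810, 800]    # illustrative samples, not measurements
jittery = [200, 2000, 300, 700]  # same mean, very different experience

print(jitter_report(stable))
print(jitter_report(jittery))
```

Both series average 800ms, so a dashboard that tracks only the mean cannot tell them apart. The standard deviation and range can.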

Benchmarking naturalness

Time to First Token (TTFT) has become the dominant latency KPI because it’s easy to measure and roughly correlated with user experience under controlled conditions. Yet, as explained above, a system with great TTFT can still feel broken. That’s why, within our engineering work, we also measure latency via a composite metric that weights response speed against the appropriateness of the pause given the speech act type and prosodic context.
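As a rough illustration of what such a composite could look like, the sketch below blends a raw speed score with a "fit" score that measures how well the pause matches an expected window for the turn. The formula, the weights, and the example window are assumptions made for this example and are not the metric used by Parloa.

```python
def composite_latency_score(latency_ms, window_ms, weight_speed=0.4, max_ms=2000):
    """Blend raw speed with how well the pause fits the expected window.

    window_ms is a (low, high) range of appropriate pauses for this turn.
    Returns a score between 0 and 1, where higher is better.
    """
    low, high = window_ms
    speed = max(0.0, 1.0 - latency_ms / max_ms)
    if low <= latency_ms <= high:
        fit = 1.0
    else:
        # Penalize distance from the window relative to the window's width.
        gap = low - latency_ms if latency_ms < low else latency_ms - high
        fit = max(0.0, 1.0 - gap / (high - low))
    return weight_speed * speed + (1 - weight_speed) * fit


window = (400, 800)  # assumed appropriate pause for a hypothetical turn type
too_fast = composite_latency_score(100, window)
on_time = composite_latency_score(600, window)
print(round(too_fast, 2), round(on_time, 2))
```

With these assumed weights, the 600ms response outscores the 100ms one because it lands inside the window, which is the behavior a pure TTFT metric cannot express.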

Additionally, companies should consider tracking a “frustration index” for their customers. User barge-ins (“Hello?”, “Are you still there?”) and unprompted repetitions are the clearest signals that perceived latency has crossed the abandonment threshold.
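One simple way to approximate such an index from transcripts is sketched below. The check-in phrases, the normalization, and the definition (share of caller turns that are check-ins or repeats) are all assumptions for illustration. A real system would also need to exclude greetings at the start of a call.

```python
import re

# Hypothetical phrase list; extend per language and use case.
CHECK_IN = re.compile(r"\b(hello|are you (still )?there|anyone there)\b",
                      re.IGNORECASE)


def frustration_index(caller_turns):
    """Fraction of caller turns that are check-ins or repeats of earlier turns."""
    if not caller_turns:
        return 0.0
    seen = set()
    flagged = 0
    for turn in caller_turns:
        # Lowercase and drop punctuation so "flight" and "flight." match.
        normalized = re.sub(r"[^a-z0-9 ]", "", turn.lower()).strip()
        if CHECK_IN.search(turn) or normalized in seen:
            flagged += 1
        seen.add(normalized)
    return flagged / len(caller_turns)


turns = [
    "I need to change my flight",
    "Hello?",
    "I need to change my flight.",
    "Are you still there?",
]
print(frustration_index(turns))
```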

Data benchmarks for engineering teams

Component	Target Latency (ms)	User perception
Audio capture / Encode	10–50	Inevitable floor
Streaming STT	100–300	Essential for "active listening"
LLM reasoning (TTFT)	200–800	The "thinking" phase
Streaming TTS	100–400	The "speaking" phase
Total mouth-to-ear	600–1,200	The "Goldilocks" zone
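Treating the table as a budget means checking how the per-stage ranges add up against the total. The sketch below does that arithmetic with the figures from the table. The dictionary keys, the function names, and the sample measurements are illustrative.

```python
# Per-stage target ranges in milliseconds, copied from the table above.
STAGES_MS = {
    "audio_capture_encode": (10, 50),
    "streaming_stt": (100, 300),
    "llm_ttft": (200, 800),
    "streaming_tts": (100, 400),
}
TOTAL_TARGET_MS = (600, 1200)


def summed_range(stages):
    """Add up the best-case and worst-case ends of every stage."""
    best = sum(low for low, _high in stages.values())
    worst = sum(high for _low, high in stages.values())
    return best, worst


def headroom_ms(measured, total_max=TOTAL_TARGET_MS[1]):
    """Milliseconds left before a turn exceeds the upper total target."""
    return total_max - sum(measured.values())


print(summed_range(STAGES_MS))
# Hypothetical measurements for one turn:
print(headroom_ms({"audio": 30, "stt": 200, "llm": 500, "tts": 250}))
```

The stage ranges sum to 410–1,550ms, so the total row is not a plain sum of the rows above it. In particular, every stage cannot sit at its upper bound at once without exceeding 1,200ms, which is why time spent in one stage has to be recovered from another.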

The latency reality

The goal of voice AI latency work is not to minimize a number. It’s to build an agent that knows when to be fast, when to take a breath, and how to signal “I’m here and working” in the moments between. An agent that modulates its response timing based on speech act, prosody, and conversational context will outperform a faster agent with a fixed latency profile. So, while speed is table stakes in voice AI, context-awareness remains the real differentiator.

References:

[1] Kohtz, L. S., & Niebuhr, O. (2022). "How long is too long? How pause features after requests affect the perceived willingness of affirmative answers." Proceedings of Speech Prosody 2022, ISCA.

[2] https://arxiv.org/html/2507.22352v1

[3] Yngve, V. H. (1970). "On getting a word in edgewise." Papers from the Sixth Regional Meeting, Chicago Linguistic Society. (Original work on back-channels.)

[4] https://arxiv.org/abs/2507.22352