Cutting AI voice latency from 1.5s to 200ms: measure time-to-first-byte, not total time

Three levers — streaming, the flash model, and sentence chunking — with the real TTFB numbers behind each.

AI Alleyway is an independent online publication reviewing AI tools for creators, founders, marketers, and developers. We publish long-form, no-hype reviews and comparisons with a named editor and a clear editorial firewall between our verdicts and the affiliate links that fund the site — no sponsored placements. Coverage spans AI voice, writing, image, video, automation, and developer tools. Readers get a fast, trustworthy answer to "is this tool worth paying for?" before they commit.

There is one number that decides whether a voice agent feels alive or broken, and most people benchmark the wrong one. They measure total generation time. The number users actually feel is time-to-first-byte: how long the agent sits silent before the first audio comes out.

I run ElevenLabs behind a pipeline that generates speech, and when I started caring about realtime, my first benchmark said "about 1.5 seconds per reply." That sounded fine on a spreadsheet and felt awful in a conversation. Here are the three levers that took the felt latency from ~1.5 seconds to under 250 milliseconds, with the real measurements behind each.

Measure time-to-first-byte, not total time

Both convert and stream hand you an iterator of bytes. The difference is when the first byte shows up. So time that, not the loop's end:

import time
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

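def ttfb(make_iterator):
    """Return seconds from call start to the first audio chunk (None if no audio arrives)."""
    t0 = time.perf_counter()
    first = None
    for _chunk in make_iterator():   # the full response is still drained; only the first arrival is timed
        if first is None:
            first = time.perf_counter() - t0   # the moment audio could start playing
    return first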

Everything below is measured with that harness: median of 3 runs, same voice, output format mp3_44100_128.
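The "median of 3 runs" step might look roughly like this. It is a minimal sketch, not the exact script behind the numbers: first_byte_seconds, median_ttfb and fake_stream are hypothetical names, and the fake iterator stands in for a real client.text_to_speech.stream call so the snippet runs without an API key.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator, Optional


def first_byte_seconds(make_iterator: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Seconds until the first chunk arrives, or None if nothing arrives."""
    t0 = time.perf_counter()
    first = None
    for _chunk in make_iterator():
        if first is None:
            first = time.perf_counter() - t0
    return first


def median_ttfb(make_iterator: Callable[[], Iterable[bytes]], runs: int = 3) -> float:
    """Median time-to-first-byte over several fresh calls."""
    samples = [first_byte_seconds(make_iterator) for _ in range(runs)]
    timed = [s for s in samples if s is not None]
    if not timed:
        raise ValueError("no run produced any audio")
    return statistics.median(timed)


def fake_stream(delay: float = 0.02, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS stream: waits, then yields dummy chunks."""
    time.sleep(delay)  # simulated wait before the first byte
    for _ in range(chunks):
        yield b"\x00" * 1024


if __name__ == "__main__":
    print(f"median TTFB: {median_ttfb(lambda: fake_stream(0.02)) * 1000:.0f} ms")
```

With a real client, the lambda would wrap the API call instead, so that every run issues a fresh request.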

Lever 1: stream, don't convert

convert hits the non-streaming endpoint, which generates the whole clip before it returns. Its time-to-first-byte is basically its total time. stream starts handing you bytes while the rest is still being generated:

# waits for the entire clip before the first byte
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Hold on, let me check that for you.",
    output_format="mp3_44100_128",
)

# first bytes arrive early; the rest streams in behind them
stream = client.text_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Hold on, let me check that for you.",
    output_format="mp3_44100_128",
)

On a one-sentence reply, convert's first byte landed at ~830 ms and stream's at ~670 ms. On a full paragraph the gap widens slightly, ~1,470 ms versus ~1,270 ms, because convert is still waiting on the whole thing. Streaming is the free win, but on its own it is a modest one.
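To see why a buffered call's time-to-first-byte tracks its total time, it can help to record both numbers side by side. The sketch below is an illustration under stated assumptions: first_and_total, fake_buffered and fake_streamed are hypothetical names, and the two fake iterators only simulate the timing shape of a non-streaming and a streaming endpoint. They do not call ElevenLabs.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_and_total(make_iterator: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (seconds to first chunk, seconds to last chunk)."""
    t0 = time.perf_counter()
    first = None
    for _chunk in make_iterator():
        if first is None:
            first = time.perf_counter() - t0
    total = time.perf_counter() - t0
    if first is None:
        raise ValueError("iterator produced no audio")
    return first, total


def fake_buffered(chunks: int = 4, per_chunk: float = 0.02) -> Iterator[bytes]:
    """Simulated non-streaming call: all the work happens before byte one."""
    time.sleep(chunks * per_chunk)
    for _ in range(chunks):
        yield b"\x00"


def fake_streamed(chunks: int = 4, per_chunk: float = 0.02) -> Iterator[bytes]:
    """Simulated streaming call: each chunk is released as soon as it is ready."""
    for _ in range(chunks):
        time.sleep(per_chunk)
        yield b"\x00"


if __name__ == "__main__":
    for name, fn in (("buffered", fake_buffered), ("streamed", fake_streamed)):
        first, total = first_and_total(fn)
        print(f"{name}: first={first * 1000:.0f} ms total={total * 1000:.0f} ms")
```

In the simulation both calls finish at about the same time, but only the streamed one gives the player something to start on early.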

Lever 2: the model is the real lever

Switching the model did far more than switching the call. The same stream call on the low-latency eleven_flash_v2_5:

call                        one sentence    one paragraph
convert, multilingual_v2    ~830 ms         ~1,470 ms
stream, multilingual_v2     ~670 ms         ~1,270 ms
stream, flash_v2_5          ~180 ms         ~235 ms

That is the headline: on a paragraph, first audio arrives roughly 5× sooner than with the same stream call on multilingual_v2, and about 6× sooner than with the convert baseline, just by picking the realtime model. Notice the flash row barely moves between one sentence and one paragraph. First audio is essentially flat regardless of length, which is exactly what you want for an agent.
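For reference, the speedups fall straight out of the paragraph column of the table. A small sketch: the dictionary only restates the rounded medians above, and speedup is a hypothetical helper name.

```python
# Paragraph-length time-to-first-byte, in milliseconds, copied from the table.
paragraph_ttfb_ms = {
    "convert, multilingual_v2": 1470,
    "stream, multilingual_v2": 1270,
    "stream, flash_v2_5": 235,
}


def speedup(baseline: str, candidate: str, table: dict = paragraph_ttfb_ms) -> float:
    """How many times sooner the candidate delivers first audio than the baseline."""
    return table[baseline] / table[candidate]


if __name__ == "__main__":
    for baseline in ("convert, multilingual_v2", "stream, multilingual_v2"):
        ratio = speedup(baseline, "stream, flash_v2_5")
        print(f"flash vs {baseline}: {ratio:.1f}x")
    # flash vs convert, multilingual_v2: 6.3x
    # flash vs stream, multilingual_v2: 5.4x
```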

The catch is the usual quality-versus-speed trade. The flash model is tuned for latency; the multilingual model is a notch richer for narration and long-form, where a couple hundred milliseconds does not matter. Which model belongs on which job is a real decision, and I went deep on how the ElevenLabs models and tiers actually differ in a full hands-on review. Short version for builders: use flash for anything interactive, keep the heavier model for pre-rendered narration.

Lever 3: chunk the first sentence (only when you need the slower voice)

Once you are on flash at ~235 ms, you are done — chunking adds overhead for no gain, and I measured exactly that. But sometimes the agent's first impression has to use the higher-quality voice. In that case, do not stream the whole paragraph. Stream the first short sentence, start playing it, and generate the rest behind it:

import re

def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stream_chunked(client, voice_id, text, model_id="eleven_multilingual_v2"):
    sents = sentences(text)
    for i, s in enumerate(sents):
        chunk_stream = client.text_to_speech.stream(
            voice_id=voice_id,
            model_id=model_id,
            text=s,
            output_format="mp3_44100_128",
            previous_text=sents[i - 1] if i > 0 else None,   # keep prosody continuous
            next_text=sents[i + 1] if i + 1 < len(sents) else None,
        )
        for chunk in chunk_stream:
            yield chunk

The previous_text / next_text arguments matter. They are ElevenLabs' request-stitching parameters: synthesize each sentence in isolation and the model has no idea what came before or after, so intonation can reset at every boundary. Feeding it the neighbouring sentences is what those arguments are for — the model conditions on them and carries prosody across the seam.
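To make the stitching concrete, here is what each per-sentence request would carry for a three-sentence reply. This is a sketch with no API call in it: stitching_kwargs is a hypothetical helper that mirrors the neighbour selection in stream_chunked, and the sample sentences are made up.

```python
from typing import Dict, List, Optional


def stitching_kwargs(sents: List[str]) -> List[Dict[str, Optional[str]]]:
    """Build text / previous_text / next_text for each sentence, in order."""
    requests = []
    for i, s in enumerate(sents):
        requests.append({
            "text": s,
            # the sentence before this one, if any
            "previous_text": sents[i - 1] if i > 0 else None,
            # the sentence after this one, if any
            "next_text": sents[i + 1] if i + 1 < len(sents) else None,
        })
    return requests


if __name__ == "__main__":
    reply = ["Hold on.", "Let me check that for you.", "Found it."]
    for req in stitching_kwargs(reply):
        print(req)
```

The first request has no previous_text and the last has no next_text, so only the interior seams get context on both sides.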

On the multilingual voice, sending the whole paragraph in one request put first audio at ~1,270 ms streamed (~1,470 ms with convert). Chunking the first sentence dropped it to roughly 850 ms to 1 s across runs, about a fifth to a third faster than the streamed paragraph, because the user only waits for one short sentence, not the whole reply.
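One way to keep generating the rest while the first sentence is already playing is to move synthesis into a background thread and hand chunks over through a queue. The following is a minimal sketch of that idea, not the setup the measurements above were taken with. prefetch_chunks is a hypothetical name, and synthesize is an assumed callable that would wrap something like client.text_to_speech.stream for one sentence.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_DONE = object()  # sentinel: the worker has finished every sentence


def prefetch_chunks(
    sentence_list: List[str],
    synthesize: Callable[[str], Iterable[bytes]],
    max_buffered: int = 64,
) -> Iterator[bytes]:
    """Yield audio chunks in sentence order while a worker synthesizes ahead."""
    buf = queue.Queue(maxsize=max_buffered)

    def worker() -> None:
        try:
            for s in sentence_list:
                for chunk in synthesize(s):
                    buf.put(chunk)  # blocks when the buffer is full
        except Exception as exc:  # pass failures to the consumer
            buf.put(exc)
        finally:
            buf.put(_DONE)

    # Daemon thread: if the consumer stops early, the worker may stay blocked
    # on put() until the process exits. Acceptable for a sketch only.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

The consumer can start playback on the first yielded chunk. Because later sentences are requested without waiting for playback, the gap between sentences should shrink, at the cost of holding up to max_buffered chunks in memory.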

The rule of thumb I settled on

  • Interactive / agent: stream + eleven_flash_v2_5. First audio lands around 200 ms and stays flat with length.
  • Need the richer voice live: stream the higher-quality model, but chunk by sentence with previous_text/next_text so first audio is one short sentence away, not a whole paragraph.
  • Pre-rendered narration (not realtime): convert is fine — nobody is waiting, and you skip the streaming plumbing.
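The three rules above can be written down as a small decision helper. This is only a sketch of the rule of thumb: TtsPlan and choose_plan are hypothetical names, and the model IDs are the two used in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsPlan:
    method: str              # "stream" or "convert"
    model_id: str
    chunk_by_sentence: bool  # send sentence-sized requests with neighbour context


def choose_plan(realtime: bool, needs_richer_voice: bool = False) -> TtsPlan:
    """Map the rule of thumb onto a call method, a model, and a chunking choice."""
    if not realtime:
        # Pre-rendered narration: nobody is waiting, so the simple call is fine.
        return TtsPlan("convert", "eleven_multilingual_v2", False)
    if needs_richer_voice:
        # Live, but quality first: stream and chunk so first audio is one sentence away.
        return TtsPlan("stream", "eleven_multilingual_v2", True)
    # Interactive agent: lowest time-to-first-byte, no chunking needed.
    return TtsPlan("stream", "eleven_flash_v2_5", False)


if __name__ == "__main__":
    print(choose_plan(realtime=True))
    print(choose_plan(realtime=True, needs_richer_voice=True))
    print(choose_plan(realtime=False))
```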

Measure time-to-first-byte from day one. Total-time benchmarks will tell you everything is fine right up until the conversation feels broken.

What is the lowest realtime TTS latency you have gotten in production, and on which model? I am curious where the floor actually is.