AI Alleyway

A bot in the room vs a tap on your speakers: the two architectures behind AI meeting notes

AI Alleyway — Mon, 15 Jun 2026 07:52:26 GMT

Every AI notetaker has to solve the same problem before it can do anything clever: it has to get the audio. There are really only two ways to do that, and which one a tool picks turns out to decide far more than you'd expect — who sees a recorder on the call, whether you get speaker labels, which platforms it runs on, and, in a lot of companies, whether you're even allowed to use it.

I ran the same meeting through Otter and Granola, the two tools that sit at opposite ends of this choice, to watch the consequences play out. The transcription quality barely separated them. The architecture separated them completely.

Architecture A: put a bot in the meeting

Otter's approach is to send a participant. OtterPilot connects to your calendar and joins your Zoom, Meet, or Teams call as a visible attendee — there's a literal "Otter.ai" notetaker sitting in the participant list, and everyone on the call can see it.

That design decision is the root of Otter's whole feature set:

You get clean speaker labels. A bot that joins through the meeting platform can get a per-participant audio feed, so separating "who said what" is tractable. In my test Otter produced a verbatim, cleanly speaker-labeled transcript — a word-for-word record of each speaker's turns.
It auto-joins. Because it's calendar-connected, it shows up to your meetings without you doing anything.
It runs everywhere. A bot living in the cloud isn't tied to your laptop's OS, so Otter ships on web, iOS, Android, and as browser extensions.

The speaker-labeled, searchable, auto-captured archive that Otter is known for is not a pile of independent features — it's what naturally falls out of having a bot in the room with access to the meeting's streams.
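To make the "falls out of the architecture" point concrete, here is a minimal sketch of why per-participant feeds make labeling easy. It assumes each participant's audio has already been transcribed into timestamped segments; the `Segment` type, the `merge_labeled` helper, and the sample data are hypothetical illustrations, not Otter's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start: float  # seconds from the start of the call
    text: str


def merge_labeled(feeds: Dict[str, List[Segment]]) -> List[str]:
    """Interleave per-participant segments into one speaker-labeled transcript."""
    tagged = [
        (seg.start, speaker, seg.text)
        for speaker, segments in feeds.items()
        for seg in segments
    ]
    tagged.sort(key=lambda item: item[0])  # order turns by when they began
    return [f"{speaker}: {text}" for _, speaker, text in tagged]


# Made-up feeds: the speaker name comes for free because each feed is one person.
feeds = {
    "Alice": [Segment(0.0, "Can we ship in Q3?"), Segment(6.5, "Good, thanks.")],
    "Bob": [Segment(3.1, "Yes, if QA signs off.")],
}
transcript = merge_labeled(feeds)
```

The label is just the key of the feed a segment came from, so no voice analysis is needed in this model.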

Architecture B: tap the system audio

Granola does the opposite. It's bot-free: instead of joining the call, it captures the audio your device is already playing — the kind of thing you'd build on top of a system-audio loopback (capturing the OS's output stream locally rather than dialing into the meeting). I'm describing the category of approach here, not Granola's internals, but the consequences are visible and consistent with it:

Nothing joins the call. There's no entry in the participant list. The other people on the call never see a recorder, because from the meeting platform's point of view there isn't one.
You lose reliable speaker separation. A system-audio tap gives you one mixed stream — everyone's voice already blended into the output. So on an ad-hoc capture Granola's transcript carries no speaker labels; it can't cleanly tell two voices apart the way a per-participant bot feed can.
It's tied to the OS. Capturing system audio means platform-specific code, which is why Granola is Mac, Windows, and iPhone only — there's no web version, because a browser tab can't tap your system output.

Same trade made in reverse: Granola gives up the labeled-transcript machinery to buy total invisibility on the call.

The accuracy test was almost a tie

I fed both tools the same 80-second synthetic meeting — two distinct voices, with names, numbers, and jargon planted so I knew the exact right answer. Otter produced the more literal, verbatim transcript and labeled both speakers cleanly, though it dropped the quarter off one figure, turning "Q3" into "Q." Granola garbled a line in its raw transcript, but raw capture isn't its game: it merges what you typed during the call with what it heard, and that enhanced summary was the most complete write-up of the meeting either tool produced.
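A planted-term test like this is easy to score mechanically. The sketch below checks whether each planted term survives as a whole token in a transcript; the `planted_term_hits` helper, the terms, and the transcript string are invented for illustration and are not the script or output from my test.

```python
import re
from typing import Dict, List


def planted_term_hits(transcript: str, planted: List[str]) -> Dict[str, bool]:
    """Report, per planted term, whether it appears as a whole token in the transcript."""
    results = {}
    for term in planted:
        # Lookarounds instead of \b so terms that end in digits or symbols still match cleanly.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        results[term] = re.search(pattern, transcript, flags=re.IGNORECASE) is not None
    return results


# Hypothetical example: the quarter lost its digit, the jargon survived.
planted = ["Q3", "churn"]
transcript = "We expect churn to drop in Q."
hits = planted_term_hits(transcript, planted)
```

A check like this catches dropped figures such as "Q3" becoming "Q", which a casual read of the transcript can miss.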

Notice what didn't decide it: word-for-word accuracy. Both were close. The thing that actually distinguishes these tools is upstream of accuracy — it's the capture architecture and everything it forces downstream.

The architecture can decide for you — and not as a preference

Here's the part that turns this from a design-trivia post into a buying decision. In a two-party-consent jurisdiction, a visible bot in the participant list is arguably a feature — it makes the recording obvious to everyone. But plenty of security teams flatly ban third-party bots from joining internal calls, and some clients will not tolerate an unknown participant on the line. If you're in one of those environments, the bot-based tool isn't a worse option — it's a non-option, because there is no version of it that captures a meeting without the bot in the room. The bot-free architecture is the only one you can run at all. I went through how that plays out for each tool, plan by plan, in this head-to-head of Otter and Granola, and the compliance constraint flips the "which is better" question for a surprising number of teams.

The diarization gap works the same way. You don't get speaker labels from Granola not because its engineers forgot to build them, but because a single mixed system-audio stream doesn't contain the per-speaker separation a bot's multi-stream feed does. It's a property of the architecture, not a missing checkbox — which means it's not something a future update is likely to "fix" without changing how the tool captures audio in the first place.
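A toy sketch of why the mixed stream is the problem: mixing is a sample-by-sample sum, and different "who said what" scenarios can sum to the identical output. The `mix` helper and the sample values are illustrative assumptions, not how any particular tool handles audio.

```python
from typing import List


def mix(*tracks: List[int]) -> List[int]:
    """Sum equal-length PCM tracks sample by sample, clamped to the 16-bit range."""
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*tracks)]


# Two different scenarios (toy sample values): the speakers swap who talks when...
alice_a, bob_a = [100, 200, 0, 0], [0, 0, 300, 400]
alice_b, bob_b = [0, 0, 300, 400], [100, 200, 0, 0]

# ...yet the mixed output stream is identical in both cases.
mixed_a = mix(alice_a, bob_a)
mixed_b = mix(alice_b, bob_b)
```

Once the tracks are summed, the stream itself no longer records which speaker contributed which samples, so any labels would have to be inferred from the audio rather than read off the feed.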

The takeaway for anyone choosing (or building) one

Pick the capture architecture first, then the features — because the features are mostly consequences of the architecture, not independent choices:

Need labeled transcripts, a searchable archive, calendar auto-join, and web/Android coverage? That's the bot-in-the-room model. The visible participant is the price of admission.
Need silence in the participant list, work under a no-bots security policy, or take sensitive client calls? That's the system-audio model. The lost speaker labels and the OS-specific platform list are the price.
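The two bullets above can be read as a small decision procedure in which the hard constraint is checked first. This is only a sketch of that ordering; the `Requirements` fields and the returned strings are names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    bots_banned: bool = False           # security policy or client forbids meeting bots
    needs_speaker_labels: bool = False
    needs_web_or_android: bool = False


def pick_architecture(req: Requirements) -> str:
    """Apply the hard constraint first, then the feature preferences."""
    if req.bots_banned:
        # Not a preference: the bot model cannot run at all here.
        return "system-audio"
    if req.needs_speaker_labels or req.needs_web_or_android:
        return "bot-in-the-room"
    return "either"


banned = pick_architecture(Requirements(bots_banned=True, needs_speaker_labels=True))
labels = pick_architecture(Requirements(needs_speaker_labels=True))
open_choice = pick_architecture(Requirements())
```

Note that a no-bots policy wins even when speaker labels are wanted; in that case the labels are simply the price.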

Trying to choose on transcription accuracy alone is choosing on the one axis that barely moved. The real fork is how the audio gets in — and if you've shipped a meeting-capture tool that threads this needle a third way, I'd genuinely like to hear how in the comments.

Cutting AI voice latency from 1.5s to 200ms: measure time-to-first-byte, not total time

AI Alleyway — Tue, 09 Jun 2026 07:47:51 GMT

There is one number that decides whether a voice agent feels alive or broken, and most people benchmark the wrong one. They measure total generation time. The number users actually feel is time-to-first-byte: how long the agent sits silent before the first audio comes out.

I run ElevenLabs behind a pipeline that generates speech, and when I started caring about realtime, my first benchmark said "about 1.5 seconds per reply." That sounded fine on a spreadsheet and felt awful in a conversation. Here are the three levers that took the felt latency from ~1.5 seconds to under 250 milliseconds, with the real measurements behind each.

Measure time-to-first-byte, not total time

Both convert and stream hand you an iterator of bytes. The difference is when the first byte shows up. So time that, not the loop's end:

import time
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

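def ttfb(make_iterator):
    """Return seconds until the first chunk arrives (None if nothing is yielded)."""
    t0 = time.perf_counter()
    first = None
    for chunk in make_iterator():   # keep draining so the request completes normally
        if first is None:
            first = time.perf_counter() - t0   # the moment audio could start playing
    return first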

Everything below is measured with that harness, median of 3 runs, same voice, mp3_44100_128.
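For the "median of 3 runs" part, a wrapper along these lines would do. It is a sketch, not the exact script behind the numbers: `first_chunk_latency` is a variant of the harness that stops at the first chunk instead of draining the stream, `median_ttfb` and `fake_stream` are hypothetical names, and the fake generator stands in for a real API call so the example runs offline.

```python
import statistics
import time
from typing import Callable, Iterable, Optional


def first_chunk_latency(make_iterator: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Seconds until the first chunk arrives, or None if the iterator yields nothing."""
    t0 = time.perf_counter()
    for _chunk in make_iterator():
        return time.perf_counter() - t0  # stop timing at the first chunk
    return None


def median_ttfb(make_iterator: Callable[[], Iterable[bytes]], runs: int = 3) -> float:
    """Median time-to-first-byte over several fresh iterators."""
    samples = [first_chunk_latency(make_iterator) for _ in range(runs)]
    valid = [s for s in samples if s is not None]
    if not valid:
        raise ValueError("no run produced any audio")
    return statistics.median(valid)


def fake_stream():
    """Stand-in for a TTS stream: 20 ms of 'generation' before the first chunk."""
    time.sleep(0.02)
    yield b"\x00" * 1024
    yield b"\x00" * 1024


result = median_ttfb(fake_stream)
```

Against the real API you would pass a zero-argument callable that builds a fresh request each time, for example a lambda wrapping the stream call. Returning at the first chunk abandons the rest of the response, which is fine for a benchmark but not something to copy into playback code.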

Lever 1: stream, don't convert

convert hits the non-streaming endpoint, which generates the whole clip before it returns. Its time-to-first-byte is basically its total time. stream starts handing you bytes while the rest is still being generated:

# waits for the entire clip before the first byte
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Hold on, let me check that for you.",
    output_format="mp3_44100_128",
)

# first bytes arrive early; the rest streams in behind them
stream = client.text_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Hold on, let me check that for you.",
    output_format="mp3_44100_128",
)

On a one-sentence reply, convert's first byte landed at ~830 ms and stream's at ~670 ms. On a full paragraph the gap widens a little, ~1,470 ms versus ~1,270 ms (about 200 ms apart instead of 160 ms), because convert is still waiting on the whole thing. Streaming is the free win, but on its own it is a modest one.

Lever 2: the model is the real lever

Switching the model did far more than switching the call. The same stream call on the low-latency eleven_flash_v2_5:

call	one sentence	one paragraph
`convert`, multilingual_v2	~830 ms	~1,470 ms
`stream`, multilingual_v2	~670 ms	~1,270 ms
`stream`, flash_v2_5	~180 ms	~235 ms

That is the headline: roughly 5× to 6× faster to first audio on a paragraph (~1,270 ms streamed or ~1,470 ms converted on multilingual_v2, versus ~235 ms on flash), just by picking the realtime model. Notice the flash row barely moves between one sentence and one paragraph: first audio is essentially flat regardless of length, which is exactly what you want for an agent.
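The ratios can be checked straight from the table. The snippet below only re-does that arithmetic on the figures above; the dictionary layout and the `speedup` helper are just a convenient way to hold them.

```python
# Time-to-first-byte in milliseconds, copied from the table above.
ttfb_ms = {
    ("convert", "multilingual_v2"): {"sentence": 830, "paragraph": 1470},
    ("stream", "multilingual_v2"): {"sentence": 670, "paragraph": 1270},
    ("stream", "flash_v2_5"): {"sentence": 180, "paragraph": 235},
}


def speedup(slow, fast, length):
    """How many times sooner `fast` delivers first audio than `slow`."""
    return ttfb_ms[slow][length] / ttfb_ms[fast][length]


flash = ("stream", "flash_v2_5")
vs_convert = speedup(("convert", "multilingual_v2"), flash, "paragraph")
vs_stream = speedup(("stream", "multilingual_v2"), flash, "paragraph")

# How much first audio grows on flash when the reply gets longer.
flash_growth_ms = ttfb_ms[flash]["paragraph"] - ttfb_ms[flash]["sentence"]
```

Against convert the paragraph speedup is about 6.3×, against the same stream call on the heavier model about 5.4×, and flash itself only grows by 55 ms from a sentence to a paragraph.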

The catch is the usual quality-versus-speed trade. The flash model is tuned for latency; the multilingual model is a notch richer for narration and long-form, where a couple hundred milliseconds does not matter. Which model belongs on which job is a real decision, and I went deep on how the ElevenLabs models and tiers actually differ in a full hands-on review. Short version for builders: use flash for anything interactive, keep the heavier model for pre-rendered narration.

Lever 3: chunk the first sentence (only when you need the slower voice)

Once you are on flash at ~235 ms, you are done — chunking adds overhead for no gain, and I measured exactly that. But sometimes the agent's first impression has to use the higher-quality voice. In that case, do not stream the whole paragraph. Stream the first short sentence, start playing it, and generate the rest behind it:

import re

def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stream_chunked(client, voice_id, text, model_id="eleven_multilingual_v2"):
    sents = sentences(text)
    for i, s in enumerate(sents):
        chunk_stream = client.text_to_speech.stream(
            voice_id=voice_id,
            model_id=model_id,
            text=s,
            output_format="mp3_44100_128",
            previous_text=sents[i - 1] if i > 0 else None,   # keep prosody continuous
            next_text=sents[i + 1] if i + 1 < len(sents) else None,
        )
        for chunk in chunk_stream:
            yield chunk

The previous_text / next_text arguments matter. They are ElevenLabs' request-stitching parameters: synthesize each sentence in isolation and the model has no idea what came before or after, so intonation can reset at every boundary. Feeding it the neighbouring sentences is what those arguments are for — the model conditions on them and carries prosody across the seam.

On the multilingual voice, streaming the whole paragraph put first audio at ~1,270 ms (~1,470 ms with convert). Chunking the first sentence dropped it to roughly 850 ms to 1 s across runs, about a fifth to a third faster than the whole-paragraph stream, because the user only waits for one short sentence, not the whole reply.
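One caveat on "generate the rest behind it": a plain generator like stream_chunked only requests the next sentence when the consumer pulls more chunks, so if playback blocks, generation pauses too. A possible way around that is to synthesize ahead on a worker thread and hand chunks over through a queue. This is a sketch under assumptions: `stream_prefetched` and `synth` are hypothetical names, and `synth` stands for any callable that wraps the stream call and returns an iterable of bytes (a fake is used here so the example runs offline).

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List, Optional

# (sentence, previous_text, next_text) -> audio chunks
Synth = Callable[[str, Optional[str], Optional[str]], Iterable[bytes]]


def stream_prefetched(sents: List[str], synth: Synth, max_buffered: int = 64) -> Iterator[bytes]:
    """Yield chunks in sentence order while a worker thread keeps synthesizing ahead."""
    buf: "queue.Queue[Optional[bytes]]" = queue.Queue(maxsize=max_buffered)

    def worker() -> None:
        try:
            for i, s in enumerate(sents):
                prev = sents[i - 1] if i > 0 else None
                nxt = sents[i + 1] if i + 1 < len(sents) else None
                for chunk in synth(s, prev, nxt):
                    buf.put(chunk)  # blocks when the buffer is full (backpressure)
        finally:
            buf.put(None)  # sentinel: no more audio, even if synthesis failed

    threading.Thread(target=worker, daemon=True).start()
    while True:
        chunk = buf.get()
        if chunk is None:
            return
        yield chunk


calls = []


def fake_synth(sentence, prev, nxt):
    """Offline stand-in for a TTS call: records its arguments, returns one chunk."""
    calls.append((sentence, prev, nxt))
    return [sentence.encode()]


audio = b"".join(stream_prefetched(["Hold on.", "Let me check."], fake_synth))
```

The sketch keeps error handling minimal: if `synth` raises, the stream just ends early, so production code would want to pass the exception through the queue instead.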

The rule of thumb I settled on

Interactive / agent: stream + eleven_flash_v2_5. First audio lands around 200 ms and stays flat with length.
Need the richer voice live: stream the higher-quality model, but chunk by sentence with previous_text/next_text so first audio is one short sentence away, not a whole paragraph.
Pre-rendered narration (not realtime): convert is fine — nobody is waiting, and you skip the streaming plumbing.

Measure time-to-first-byte from day one. Total-time benchmarks will tell you everything is fine right up until the conversation feels broken.

What is the lowest realtime TTS latency you have gotten in production, and on which model? I am curious where the floor actually is.