When we say "Sageio translates speech in under two seconds, end-to-end" — that number is the result of three loud mistakes I'd rather forget but probably shouldn't.
For founders building anything that goes through a multi-stage pipeline (speech-to-text → translation → display), these are the traps we walked into so you don't have to.
Mistake 1: Treating STT as "set it and forget it"
The first prototype piped audio into the first STT vendor we could integrate, took the text out the other end, and called it done. End-to-end latency: 8 seconds. Quality: acceptable on English, terrible on Asian languages.
What we missed: many STT services are optimized for batch transcription, not streaming. They wait for a 5-10 second audio buffer before returning anything. That's fine for a recording. It's hostile to a live meeting.
We rebuilt around a streaming-first STT provider that returns partial results within ~300ms of receiving audio. The "interim transcript" gets refined every 200ms as more context arrives, then locked when the speaker pauses.
Lesson: in any real-time pipeline, the latency profile of each stage matters more than its accuracy on benchmarks.
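To make the interim-then-locked flow concrete, here is a minimal sketch. It assumes a hypothetical provider callback that delivers `(text, is_final)` pairs; the `TranscriptBuffer` name and the event shape are illustrative, not any vendor's actual SDK.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Holds locked segments plus the one interim segment still being refined."""

    locked: list[str] = field(default_factory=list)
    interim: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        # Assumed event shape: each callback carries the full text of the
        # current segment plus a flag saying whether it is final.
        if is_final:
            # The speaker paused: lock the segment and clear the interim slot.
            self.locked.append(text)
            self.interim = ""
        else:
            # A newer partial replaces the previous one; it is not appended.
            self.interim = text

    def display_text(self) -> str:
        # What the UI would render: every locked segment, then the live partial.
        parts = self.locked + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

The point of the design is that partials overwrite each other, so the display never accumulates stale guesses. Only a final result grows the locked history.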
Mistake 2: Translating final transcripts only
Once STT was streaming, the next bottleneck was translation. We were waiting for the STT engine to mark a transcript "final" before sending it to the translator. That added another 1-2 seconds.
The fix sounds obvious in retrospect: translate the interim transcript with a cheap-and-fast translator, then re-translate the final transcript with a slower-but-better translator when it locks. The user sees something within ~600ms of the speaker pausing, and the higher-quality translation overwrites it ~1 second later. The overwrite is barely perceptible because the wording is usually 90% the same.
Lesson: latency-critical UIs benefit from two-pass strategies. Cheap-and-fast first, slow-and-good second. The user doesn't notice the overwrite if the first pass is "good enough".
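The two-pass idea can be sketched roughly as below. Everything here is an assumption for illustration: `fast` and `slow` stand in for whichever translators you use, `show` is a hypothetical UI callback keyed by segment id, and the guard against late interim results is one possible way to keep a slow first pass from clobbering the better second pass.

```python
import asyncio
from typing import Awaitable, Callable

Translator = Callable[[str], Awaitable[str]]


class TwoPassCaptioner:
    """Show a fast translation of interim text, then overwrite it with a better one."""

    def __init__(
        self,
        fast: Translator,
        slow: Translator,
        show: Callable[[int, str], None],
    ) -> None:
        self.fast = fast
        self.slow = slow
        self.show = show
        self._finalized: set[int] = set()

    async def on_interim(self, segment_id: int, text: str) -> None:
        translated = await self.fast(text)
        # A fast-pass result that arrives after the final one is stale: drop it.
        if segment_id not in self._finalized:
            self.show(segment_id, translated)

    async def on_final(self, segment_id: int, text: str) -> None:
        translated = await self.slow(text)
        # Mark the segment finalized before displaying, so later interim
        # results for the same segment cannot overwrite it.
        self._finalized.add(segment_id)
        self.show(segment_id, translated)
```

In a real pipeline the two handlers would run as concurrent tasks, which is exactly when the stale-result guard matters.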
Mistake 3: Trusting that the connection never breaks
The first prototype failed silently in its first real-world meeting. The STT WebSocket dropped after about 45 minutes. No error, no reconnect, no callback, just zero new transcripts arriving.
We had assumed the SDK handled reconnection. It didn't. We had assumed the meeting platform would surface the bot's silence as a problem. It didn't. We had assumed the host would notice. They didn't — until they tried to search the transcript later and saw the gap.
Now we have:
- A heartbeat on the STT connection that triggers a reconnect on any 30-second silence
- A "translation pulse" indicator in the UI that turns from green to amber if no captions have arrived in 10 seconds
- Internal monitoring that surfaces transcript gaps to us before the host notices
Lesson: in any long-running connection, assume it will silently fail. Build the failure signal yourself.
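Both the heartbeat and the pulse indicator reduce to the same primitive: a timer that resets on activity and reports when it has gone stale. Here is a minimal sketch, assuming "silence" means no messages received on the connection. `SilenceWatchdog` and `pulse_color` are hypothetical names, and the injectable clock is only there to make the logic testable without waiting.

```python
import time
from typing import Callable


class SilenceWatchdog:
    """Reports staleness once no activity has been seen for timeout_s seconds."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def mark_activity(self) -> None:
        # Call this on every message (STT watchdog) or caption (UI watchdog).
        self.last_seen = self.clock()

    def is_stale(self) -> bool:
        return self.clock() - self.last_seen >= self.timeout_s


def pulse_color(captions: SilenceWatchdog) -> str:
    # Green while captions keep arriving, amber once they have stopped.
    return "amber" if captions.is_stale() else "green"
```

A periodic task would poll `is_stale()` on a 30-second watchdog for the STT socket and trigger a reconnect, while the UI polls a 10-second watchdog through `pulse_color`. Using a monotonic clock avoids false alarms when the wall clock is adjusted mid-meeting.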
What the lessons add up to
Sub-2-second latency isn't a magic number we hit by picking the best vendors. It's the result of forcing each stage of the pipeline to be honest about what it could promise — and building around those honest answers.
If you're building anything similar — voice agents, live transcription, multilingual customer support — happy to compare notes. Email me.
— Ming