If a meeting tool gives each participant captions in their own language, at the same time and from the same conversation, it has solved the multilingual-meeting problem. If it picks one target language for the whole room, it hasn't. That single distinction, per-person captions, is the feature that separates a tool built for cross-language meetings from one that merely "supports" multiple languages. Here's what it means, and why it's the first thing to check.
What "per-person" actually means
In a mixed-language meeting, the wrong question is "which language do we translate into?" Any single answer leaves most of the room out. Per-person captions reframe the problem: each participant picks the language they read in, and everyone follows the same discussion simultaneously, each in the language they chose.
Concretely: five people on a call, four languages. One reads Korean, one reads German, one reads Japanese, two read English — all generated live from whatever each speaker actually says. Nobody is asked to switch to a shared language, and nobody is stuck reading a language they're weak in because the tool only translates "into English."
That's the difference between translation that serves a meeting and translation that serves one direction of it.
Why most tools don't do it
A lot of tools that list "20+ languages" still translate the meeting into one language at a time — fine for a one-way webinar with a known audience, useless for a working meeting where the languages are mixed and change speaker to speaker. Same-language captions have the same gap: text appears, but in whatever language is being spoken, so every non-native reader is still translating in their own head — the exact effort you were trying to remove.
Per-person is harder to build because it means running translation into several target languages at once, keeping each stream in step with live speech, and letting each participant control their own view. It's also the only model that matches how cross-language teams actually meet.
How it works under the hood
The pipeline is the same as any real-time meeting translation, with one property that makes it per-person:
- A bot joins the call and captures the audio.
- Speech-to-text transcribes each speaker in the language they're speaking.
- That text is translated into each participant's chosen language in parallel — not one shared target.
- Every participant sees captions in their own language, about two seconds behind speech.
- Afterward, the transcript and summary can be read in each person's language too.
The thing to notice is the translation step: the work scales with the number of distinct languages the room reads in, not with one fixed output. That's what lets one meeting serve everyone at once.
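To make that fan-out concrete, here is a minimal Python sketch of the translation step, assuming an asynchronous translation call. `Participant`, `translate`, and `captions_for_segment` are hypothetical names for illustration, not Sageio's API, and the stub translator only tags the text instead of translating it. What it shows: each transcribed segment triggers one translation job per distinct caption language in the room, run concurrently, and each participant is then routed the result for the language they chose.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    name: str
    caption_lang: str  # the language this person chose to read in


async def translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a machine-translation call.

    A real system would call a translation service here; this stub only
    tags the text so the fan-out is visible.
    """
    if source == target:
        return text  # the reader already reads the spoken language
    await asyncio.sleep(0)  # placeholder for the service round trip
    return f"[{source}->{target}] {text}"


async def captions_for_segment(
    text: str, source_lang: str, participants: list[Participant]
) -> dict[str, str]:
    """Translate one transcribed segment into every chosen language at once."""
    # One job per distinct target language, not per participant.
    targets = sorted({p.caption_lang for p in participants})
    results = await asyncio.gather(
        *(translate(text, source_lang, t) for t in targets)
    )
    by_lang = dict(zip(targets, results))
    # Route each participant the caption in their own language.
    return {p.name: by_lang[p.caption_lang] for p in participants}


if __name__ == "__main__":
    # Hypothetical room matching the five-person, four-language example.
    room = [
        Participant("Minji", "ko"),
        Participant("Jonas", "de"),
        Participant("Yuki", "ja"),
        Participant("Alex", "en"),
        Participant("Sam", "en"),
    ]
    captions = asyncio.run(captions_for_segment("Let's ship on Friday.", "en", room))
    for name, line in captions.items():
        print(f"{name}: {line}")
```

For an English speaker in the five-person example above, this produces four caption streams for five people: three translated ones (Korean, German, Japanese) plus the original English, which the two English readers share. Grouping by language rather than by person is the design choice that keeps the cost tied to how many languages the room reads in.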
Why it's the feature to check first
When you evaluate a tool, "number of languages supported" is the wrong headline metric — a tool can support fifty languages and still only translate into one of them per meeting. The questions that actually matter:
- Can each participant choose their own caption language, independently?
- Do those languages render simultaneously, from the same speech?
- Is the transcript afterward available per-person too, not just in one language?
If the answer to all three is yes, the meeting works for everyone in the room. If it's no, you've got captions, not cross-language translation. (For the full evaluation, here's what to look for in a meeting translation tool.)
How Sageio does it
Add bot@sageio.net to your Google Meet calendar invite and it joins on its own — nothing for anyone to install. Each participant picks their own caption language, and translations appear in about two seconds across 20+ languages, with Asian languages treated as first-class. A single meeting can run multiple caption languages at once — up to 3 on Professional and up to 7 on Teams — so a genuinely mixed room reads along together. Within about five minutes of the call ending, a searchable, translated transcript and an AI summary arrive, shared at the host's discretion. (Today this runs on Google Meet; Zoom and Microsoft Teams support is coming soon.)
Frequently asked questions
What are per-person captions?
Live meeting captions where each participant chooses the language they read in, and everyone reads the same conversation simultaneously, each in their own language, rather than the whole meeting being translated into one shared target language.
How is that different from regular live captions?
Regular captions show speech as text in the language it's spoken. Per-person captions translate that speech into each reader's chosen language, live, so someone who doesn't read the spoken language can still follow.
How many languages can one meeting use at once?
With Sageio, a single meeting can run multiple caption languages simultaneously (up to 3 on Professional and up to 7 on Teams), drawn from 20+ supported languages, with Asian languages treated as first-class.
Does each person also get the transcript in their language?
Yes. The post-meeting transcript and summary are translated, so people who couldn't attend or couldn't follow live can read them in their own language.
Is it private?
Sageio doesn't use your meeting content to train AI models, and its AI vendors are contractually restricted from doing the same. Audio is processed in memory and discarded; only the encrypted text transcript and summary are kept, in the region you choose (US, EU, or APAC).
The fastest way to see whether a tool is really per-person is to put two or three languages in front of it on one real call and check that each person can read along in their own. Add the bot to your next meeting and watch the room follow the same discussion in several languages at once.