Yoruba meeting translation: the pitch is the word, not the decoration

In Yoruba, pitch is lexical: the same string of letters is a different word depending on the tone you say it with, so a tool has to hear the melody, not just the consonants and vowels. On top of that, the tone marks that would write the difference down are routinely dropped in chat and casual typing, so models trained on text never really learned to hear tone and end up guessing from context. A line like "supports Yoruba" on a feature list tells you almost nothing. Here's what actually decides whether a Yoruba meeting comes back usable.

The pitch is the word, not the emphasis

Yoruba has three tones: high (´), mid (unmarked), and low (`). They are not stress or intonation the way English uses pitch; they're part of the word. Change the tone and you change the meaning, while the sentence stays fluent and grammatical. The textbook example is igba: said with two level mid tones it's "two hundred"; igbá (mid-high) is "calabash"; ìgbà (low-low) is "time" or "season." Same four letters, three unrelated words, told apart only by the melody. So when someone says a word in a Yoruba meeting, the tool isn't choosing between a right and a misspelled version. It's choosing between several real words, and the only thing that distinguishes them is pitch it has to actually have heard. Get the tone wrong and you don't get gibberish; you get a different, perfectly valid word in the wrong place.
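To make "the tone is part of the word" concrete, here is a minimal Python sketch that reads the tone pattern off a fully marked spelling. It is an illustration only: the function name tone_pattern and the toy IGBA table are hypothetical, and it simplifies by treating every vowel as one tone-bearing unit and by counting only tone-marked syllabic nasals.

```python
import unicodedata

# Tone marks as Unicode combining characters (visible after NFD decomposition).
TONE_MARKS = {"\u0301": "H", "\u0300": "L"}  # acute = high, grave = low
VOWELS = set("aeiou")

def tone_pattern(word: str) -> str:
    """Return one letter per tone-bearing unit: H, M (unmarked), or L."""
    pattern = []
    base = ""
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in TONE_MARKS:
            if base in VOWELS:
                pattern[-1] = TONE_MARKS[ch]    # overwrite the default mid
            elif base in ("m", "n"):
                pattern.append(TONE_MARKS[ch])  # tone-marked syllabic nasal
        elif not unicodedata.combining(ch):
            base = ch                           # dots below are skipped, letters are kept
            if ch in VOWELS:
                pattern.append("M")             # mid until a mark says otherwise
    return "".join(pattern)

# Toy lexicon (hypothetical structure): same letters, keyed by tone pattern.
IGBA = {"MM": "two hundred", "MH": "calabash", "LL": "time / season"}

for spelling in ["igba", "igbá", "ìgbà"]:
    print(spelling, tone_pattern(spelling), IGBA[tone_pattern(spelling)])
```

The letters alone never select an entry in the table; only the pattern does. A speech tool has to recover that pattern from the audio, because nobody in the meeting is typing tone marks for it.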

The tone marks fall off in writing

Here's the part that quietly breaks tools. Yoruba is written with tone marks and with dots below certain letters: ẹ, ọ, ṣ are distinct from e, o, s. But in everyday typing (chat, email, social, quick notes) most writers drop the diacritics. People write igba and let you work out from context whether they mean two hundred, a calabash, or time; they write e jowo for ẹ jọ̀wọ́ ("please"). That's normal, and a human reader handles it. But it means much of the Yoruba text a model can train on is un-toned, and a model trained mostly on text has never been forced to learn what the tones sound like. It learned to recover the word from surrounding context instead. That's a fine strategy for reading; it's the wrong strategy for live audio, where the speaker actually produced the tone and the tool should be hearing it rather than inferring it. A tool built for Yoruba has to listen to pitch as a first-class signal, not reconstruct it after the fact from un-toned text patterns.
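A short Python sketch shows how much information casual typing throws away. It uses only the standard unicodedata module; the helper name strip_diacritics is made up for this example, and it approximates casual typing by removing every combining mark.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop tone marks and under-dots, roughly what casual typing does."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

marked = ["igba", "igbá", "ìgbà"]
typed = {strip_diacritics(word) for word in marked}

print(typed)                            # three distinct words, one surviving spelling
print(strip_diacritics("ẹ jọ̀wọ́"))     # the casual spelling of "please"
```

Three words go in and one spelling comes out. Any model that only ever saw the stripped spelling has to guess which word was meant, which is exactly the habit that fails on live audio.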

Fast speech runs together, and the office speaks two languages

Two more things stack on top of the tones. First, in fast Yoruba, vowels elide and contract — adjacent vowels merge and words run into each other, so the clean boundaries you'd find in a dictionary aren't there in real speech, and a recognizer that expects tidy gaps mis-segments the stream. Second, Nigerian professional speech is normally Yoruba and English woven together in one sentence: "A máa deploy feature yìí ní sprint tó ń bọ̀" — "we'll deploy this feature next sprint" — is one ordinary line, English content words carried inside Yoruba grammar. A tool that decides the sentence is "Yoruba" may leave the English mangled; one that decides it's "English" drops the Yoruba. Each reader needs a complete sentence rebuilt in their own language — not a half-translated lump with the code-switched half left as it was spoken. The elision and the code-switching are how the room actually talks, and handling them cleanly is part of the job. This is the same family of under-served-language problem behind Swahili meeting translation.
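One way to picture the code-switching problem is to label language per word instead of per sentence. The Python sketch below is a toy under stated assumptions: ENGLISH_TERMS is a hypothetical three-word vocabulary, and a real system would identify language from the audio and a far larger model, not a lookup set.

```python
# Hypothetical toy vocabulary; a real system would not use a fixed word list.
ENGLISH_TERMS = {"deploy", "feature", "sprint"}

def tag_tokens(sentence: str):
    """Label each whitespace-separated token as English ("en") or Yoruba ("yo")."""
    return [(tok, "en" if tok.lower() in ENGLISH_TERMS else "yo")
            for tok in sentence.split()]

def language_spans(tagged):
    """Merge neighbouring tokens that share a language label into spans."""
    spans = []
    for tok, lang in tagged:
        if spans and spans[-1][0] == lang:
            spans[-1][1].append(tok)
        else:
            spans.append((lang, [tok]))
    return [(lang, " ".join(toks)) for lang, toks in spans]

line = "A máa deploy feature yìí ní sprint tó ń bọ̀"
spans = language_spans(tag_tokens(line))
for lang, text in spans:
    print(lang, text)
```

The one sentence breaks into five alternating spans. A single sentence-level label of "Yoruba" or "English" is wrong for part of the line whichever way it falls, which is why the whole sentence has to be rebuilt for each reader.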

Why this specifically stresses real-time captioning

Live translation lives on a tension between latency and committing too early, and tone makes that tension sharper. The faster a tool shows you a caption, the less of the word and its surrounding context it has heard — and if it's guessing the tone from context rather than hearing it, the early guess is exactly where it goes wrong. Worse, once it prints a caption it has effectively committed to a tonal reading; print igba as "two hundred" and the next clause reveals it meant "time," and now the line has to be retracted and re-rendered on screen, or left wrong. A tool built for Yoruba has to take the tone from the audio, hold the caption until the pitch is resolved, and land it once — not flash a context-guess and revise it. A fluent caption that quietly substitutes one real word for another is more dangerous than an obvious error, because nobody stops to question a sentence that reads perfectly well. For why these distinctions are easy to lose at speed, see how accurate is AI meeting translation.
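The trade-off between showing a caption early and showing it once can be sketched in a few lines of Python. This is an illustration of a generic "commit when stable" policy, not a description of how Sageio or any particular product works; the function names, the stable_for parameter, and the sample partial hypotheses are all assumptions.

```python
def count_revisions(partials):
    """Eager display: how many times text already on screen gets replaced."""
    return sum(1 for prev, cur in zip(partials, partials[1:]) if prev != cur)

def commit_when_stable(partials, stable_for=2):
    """Hold the caption until the same reading repeats `stable_for` times in a row.

    Falls back to the last hypothesis if nothing ever stabilises.
    """
    last, run = None, 0
    for hypothesis in partials:
        run = run + 1 if hypothesis == last else 1
        last = hypothesis
        if run >= stable_for:
            return hypothesis
    return last

# Hypothetical successive readings of one spoken word as more audio arrives.
partials = ["two hundred", "time", "time"]

print(count_revisions(partials))     # the eager caption had to be retracted once
print(commit_when_stable(partials))  # the held caption lands on the settled reading
```

A larger stable_for means fewer retractions and more delay. Hearing the tone directly is what lets a tool keep that number small, because the reading stops changing sooner.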

How to do it with Sageio

Add bot@sageio.net to your Google Meet calendar invite. It joins on its own — no extension, nothing to install.
Each participant picks their caption language. The Lagos team reads clean Yoruba, a colleague elsewhere reads clean English — both from the same spoken Yoruba, at the same time. (Sageio translates into 20+ languages.)
Everyone speaks naturally — full tone, fast elision, Yoruba-English code-switching and all. Translated captions appear in about two seconds.
Afterward, a searchable transcript and an AI summary arrive within about five minutes, shared at the host's discretion.

(Today this runs on Google Meet; Zoom and Microsoft Teams support is coming soon.)

How to test any tool in five minutes

Say the tone triplet in context — igba ("two hundred"), igbá ("calabash"), ìgbà ("time") — in three short sentences and check the captions pick the right word each time instead of repeating one reading. Then say a normal code-switched line ("A máa deploy feature yìí ní sprint tó ń bọ̀" — "we'll deploy this feature next sprint") and see whether it keeps the English words whole while rendering the Yoruba correctly. Finally, watch whether the captions land once or flash a guess and revise it after the next clause. If it collapses the tones to one word, garbles the English, or keeps retracting lines, the tool wasn't built for spoken Yoruba.
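If you want to keep score while running the tone check, a few lines of Python are enough. This is a hypothetical helper, not part of any tool: you paste in the three English captions you saw, in the order you spoke the words, and it reports which readings came back correctly. The substring match is deliberately crude.

```python
# The tone triplet and the English gloss each caption should contain.
CHECKS = [("igba", "two hundred"), ("igbá", "calabash"), ("ìgbà", "time")]

def score_tone_test(captions):
    """Return (word, expected gloss, found?) for each caption, in spoken order."""
    return [(word, gloss, gloss in caption.lower())
            for (word, gloss), caption in zip(CHECKS, captions)]

def passed(captions):
    """True only if all three readings showed up in their own captions."""
    results = score_tone_test(captions)
    return len(results) == len(CHECKS) and all(found for _, _, found in results)

# Example captions typed in by hand after a test call (made up for illustration).
good = ["We sold two hundred units", "Pass me the calabash", "It is time to start"]
collapsed = ["We sold two hundred units", "Pass me the two hundred", "It is two hundred to start"]

print(passed(good))
print(passed(collapsed))
```

The second set is what a tone-deaf tool tends to produce: one reading repeated three times, every sentence still looking like English.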

Is it private?

For anything that joins your meetings: Sageio doesn't use your meeting content to train AI models, and its AI vendors are contractually restricted from doing the same. Audio is processed in memory and discarded — only the text transcript and summary are kept, encrypted, in the region you choose (US, EU, or APAC). Enterprise customers can self-host the entire stack.

Frequently asked questions

Why would a Yoruba caption show the wrong word that still reads fine? Yoruba is a tone language: pitch is part of the word, not emphasis. Igba ("two hundred"), igbá ("calabash"), and ìgbà ("time") are the same letters distinguished only by tone. A tool that doesn't hear the pitch picks one real word over the others and produces a fluent, grammatical sentence that simply means something different.

Why do dropped tone marks matter? In everyday typing, Yoruba tone marks and the dots below ẹ, ọ, ṣ are usually left off, so most written Yoruba is un-toned. A model trained mainly on that text learned to guess the word from context rather than hear the tone — which is the wrong instinct for live audio, where the speaker actually produced the pitch and the tool should be listening for it.

Does it handle Yoruba-English code-switching? Yes, and a real call is the place to check it. Nigerian professional speech mixes English content words into Yoruba grammar, like "A máa deploy feature yìí." A tool that assumes one language per sentence translates only half; correct handling keeps the English whole and rebuilds a full sentence in each target language.

How fast are the translated captions? About two seconds, fast enough to keep a live conversation moving, with a searchable transcript and summary within about five minutes after the call.

What does it cost to try? Every plan starts with a free 60-minute trial, no credit card required. After that, Professional is $49/month and Teams is $99 per seat/month (annual billing includes 2 months free); Enterprise is custom-priced.

If your team works in Yoruba, the honest test is whether a native speaker reads the live captions and hears the actual meeting — the tones landing on the right words, the code-switching kept whole. Add the bot to your next call and let them judge.