At a high level, following along with each word as the narrator speaks it is pretty straightforward: know the timestamp at which each word is spoken, then split the text on those timestamps and apply a visual effect to whichever word is currently playing.
We use ElevenLabs to produce the voice for Tavern of Azoth's AI Game Master, and at this time, the ElevenLabs REST API does not return character timestamps. To obtain this granular data, you must use the WebSocket endpoint:
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_monolingual_v1
The WebSocket payload contains audio data chunks and character alignment data, which gives the start time and duration, in milliseconds, of each character in your prompt text:
"normalizedAlignment": {
"char_start_times_ms": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
"chars_durations_ms": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
"chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
},
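To make the flow concrete, here is a minimal sketch of consuming that stream from a Node.js client using the `ws` package. The field names follow the payload above; the voice ID is a placeholder, and the authentication handshake and outgoing text messages are omitted.

import WebSocket from "ws";

// Shape of the alignment payload shown above.
interface Alignment {
  chars: string[];
  char_start_times_ms: number[];
  chars_durations_ms: number[];
}

// Hypothetical voice ID; auth handshake and outgoing text
// messages are omitted for brevity.
const VOICE_ID = "your-voice-id";
const ws = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream-input?model_id=eleven_monolingual_v1`
);

const audioChunks: Buffer[] = [];
const alignments: Alignment[] = [];

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());

  // A message can carry a base64 audio chunk, alignment data, or both.
  if (msg.audio) {
    audioChunks.push(Buffer.from(msg.audio, "base64"));
  }
  if (msg.normalizedAlignment) {
    alignments.push(msg.normalizedAlignment);
  }
});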
In our case, we want to highlight words, not characters, so we store the character data and scan it for spaces (" "). We use each space as the marker for the start of a new word.
The WebSocket streams this data in chunks, so we must also offset each chunk's timestamps by the total duration of the chunks that came before it.
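Putting those two steps together, the word splitting might look like the sketch below. It reuses the `Alignment` interface from the previous snippet and takes each chunk's duration to be the end time of its last character; treat it as an illustration rather than our exact implementation.

interface WordTimestamps {
  words: string[];
  start_times: number[]; // ms from the start of the full clip
}

// Walk every character across all chunks, treating spaces as word
// boundaries and offsetting each chunk by the total duration of the
// chunks before it.
function toWordTimestamps(chunks: Alignment[]): WordTimestamps {
  const words: string[] = [];
  const start_times: number[] = [];
  let offsetMs = 0;
  let word = "";
  let wordStart = 0;

  for (const chunk of chunks) {
    chunk.chars.forEach((ch, i) => {
      if (ch === " ") {
        // A space closes out the current word.
        if (word) {
          words.push(word);
          start_times.push(wordStart);
        }
        word = "";
      } else {
        if (!word) wordStart = offsetMs + chunk.char_start_times_ms[i];
        word += ch;
      }
    });

    // Advance the offset by this chunk's duration: the end time of
    // its last character.
    const last = chunk.chars.length - 1;
    if (last >= 0) {
      offsetMs +=
        chunk.char_start_times_ms[last] + chunk.chars_durations_ms[last];
    }
  }

  // Flush the final word, which has no trailing space.
  if (word) {
    words.push(word);
    start_times.push(wordStart);
  }
  return { words, start_times };
}

Running it over the "Hello world" alignment above yields words ["Hello", "world"] with start times [0, 13].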
Once we've calculated our word-level timestamps, we can store the audio file and return our own Audio Data model for our frontend client to act upon.
The data model I use looks like this:
{
  "id": "5b492e71-3c40-4480-a006-bdba0572ba0d",
  "url": "http://localhost/api/audio/5b492e71-3c40-4480-a006-bdba0572ba0d.mp3",
  "text": "Welcome brave travellers.",
  "timestamps": {
    "words": ["Welcome", "brave", "travellers."],
    "start_times": [46, 441, 685]
  }
}
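In TypeScript terms, one way to type that response is:

// One possible typing of the Audio Data model above.
interface AudioData {
  id: string;
  url: string;
  text: string;
  timestamps: {
    words: string[];
    start_times: number[]; // ms from the start of the clip
  };
}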
From there, applying the effect visually on your frontend client is just a matter of knowing the current playback position of the audio clip and splitting the text based on that position.
I chose to use a CSS fade animation on the text color, but you could just as easily use bold or underline.
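As an illustrative sketch (assuming the `AudioData` shape above, with one `<span>` created per word), the highlight can be driven off the audio element's `timeupdate` event:

// Render one span per word, then highlight every word whose start
// time has already been reached during playback.
function playWithHighlights(data: AudioData, container: HTMLElement) {
  const spans = data.timestamps.words.map((word) => {
    const span = document.createElement("span");
    span.textContent = word + " ";
    span.style.transition = "color 0.2s ease"; // the CSS fade
    container.appendChild(span);
    return span;
  });

  const audio = new Audio(data.url);
  audio.addEventListener("timeupdate", () => {
    const nowMs = audio.currentTime * 1000;
    data.timestamps.start_times.forEach((startMs, i) => {
      spans[i].style.color = nowMs >= startMs ? "goldenrod" : "inherit";
    });
  });
  void audio.play();
}

Note that `timeupdate` only fires a few times per second; if the fade looks choppy, polling `audio.currentTime` from a `requestAnimationFrame` loop gives smoother updates.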
For those interested in a working example, I've created a public repo with a simple API that lets you generate audio from the text-to-speech endpoint and see the effect applied just as we use it. Check out my GitHub repo here:
https://github.com/TasikBeyond/Echo
Happy coding,
Cheers 🍻