A compelling voice doesn’t just convey information; it adds personality, emotion, and a sense of connection. In our work with digital humans in the form of game characters, we have seen how a voice brings them to life and how it can completely change both their personality and the way you connect with them.
However, producing high-quality voice work traditionally involves significant time, resources, and cost. Professional voice actors are invaluable, but their services aren't always accessible for smaller teams or frequent content production. Enter computer-generated voices, better known as text-to-speech (TTS).
Up until very recently, text-to-speech systems were known for their robotic, monotone delivery. Functional? Sure. But not exactly engaging. They felt unnatural and disconnected (think of Stephen Hawking's synthesized voice), and they lacked the emotional depth to hold a listener's attention.
This was the best you could hope for in 2022. Still used in Google Translate in 2024.
Not only do these voices feel artificial on their own, they also suck the life out of any otherwise great video game character or chatbot you interact with!
Luckily, a whole lot has changed in the last few years. AI-powered voices can now express emotion, adjust tone, and speak with a more human-like cadence. We're getting to the point where you can create voice lines that are virtually indistinguishable from recorded voice actors, in a matter of milliseconds.
Multilingual, natural-sounding voices from OpenAI.
All of this combined gives you the feeling of an actual human being who believes in what they are saying, instead of a text being read out loud by a machine.
It’s not difficult to see how this can be used in a wide range of applications. Combined with language models and on-the-fly generation and streaming, it becomes all the more powerful. ChatGPT’s advanced voice mode blew people’s minds when it came out because it really felt like you were conversing with a real person, and that’s without any visual representation at all.
Imagine interacting with a virtual assistant that doesn’t just respond, but feels like it understands you. Imagine AI-generated video where you feel a connection to the made-up characters. Imagine game characters that adapt their speech dynamically, enhancing immersion like never before.