November 19, 2024

Bringing Content to Life: The Power of Lifelike AI Voices


A compelling voice doesn’t just convey information; it adds personality, emotion, and a sense of connection. In our work with digital humans in the form of game characters, we have seen how a voice brings a character to life and completely changes both its personality and how you connect with it.

However, producing high-quality voice work traditionally involves significant time, resources, and cost. Professional voice actors are invaluable, but their services aren't always accessible for smaller teams or frequent content production. Enter computer-generated voices, better known as text-to-speech (TTS).

The good old times (aka 2 years ago)

Up until very recently, text-to-speech systems were known for their robotic, monotone delivery. Functional? Sure. But not exactly engaging. They felt unnatural and disconnected (think Stephen Hawking), and they lacked the emotional depth to hold a listener's attention.

This was the best you could hope for in 2022. Still used in Google Translate in 2024.

Not only do these voices feel artificial on their own - they also suck the life out of any otherwise great video game character or chatbot you interact with!

The better times (aka now)

Luckily, a whole lot has changed in recent months and years. AI-powered voices can now express emotions, adjust tone, and deliver a more human-like cadence. We're getting to the point where you can create voice lines that are virtually indistinguishable from those of recorded voice actors, in a matter of milliseconds.

Multilingual, natural-sounding voices from OpenAI.

What’s changed?

  • Cadence and tone are now applied automatically based on the content. The system “understands” what it’s reading.
  • Realistic pauses, “hum” sounds, breathing and other non-verbal parts of speech are kept intact. This avoids the “uncanny valley” where a voice sounds almost human, but not entirely there.
  • You can clone voices easily, with as little as 30 seconds of audio, or “design” a voice for a person that doesn’t yet exist.
  • The voices work across a range of languages, without requiring training data for each language they will speak.

All of this combined gives you the feeling of an actual human being who believes in what they are saying, instead of a text being read out loud by a machine.

That’s awesome - what’s next?

It’s not difficult to see how this can be used in a wide range of applications. Combined with language models and on-the-fly creation and streaming, it gets all the more powerful. ChatGPT’s advanced voice mode blew people’s minds when it came out, because it really felt like you were conversing with a real person - and that’s without any visual representation at all.
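
To make the streaming idea a bit more concrete, here is a minimal sketch in Python, written under our own assumptions. A language model typically streams its reply in small text fragments; grouping those fragments into whole sentences lets a speech engine start talking before the full reply is ready. The function name `sentences_from_stream` is hypothetical and for illustration only, the punctuation-based split is deliberately naive, and the actual text-to-speech call is left out because it depends on the provider you choose.

```python
import re
from typing import Iterable, Iterator

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
# (Assumption for illustration; abbreviations like "e.g." would be split too.)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(fragments: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences.

    Hypothetical helper: each yielded sentence could be sent to a
    text-to-speech engine right away instead of waiting for the full reply.
    """
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be incomplete, so keep it in the buffer.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    streamed = ["Hello the", "re! How are", " you? I am", " fine."]
    for sentence in sentences_from_stream(streamed):
        print(sentence)
```

Each sentence can be handed to a speech engine as soon as it is complete, which is one simple way to keep the perceived delay low in a conversation.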

Imagine interacting with a virtual assistant that doesn’t just respond, but feels like it understands you. Imagine AI-generated video where you feel a connection to the made-up characters. Imagine game characters that adapt their speech dynamically, enhancing immersion like never before.

Anders Heivoll
CTO & co-founder
Anders Heivoll is the CTO and co-founder of Apprendly. Anders is a skilled software developer who studied computing at Kent University. He created his first website at the age of 12 and started his first company when he was 18. In his free time you might find him learning new languages, playing the guitar and …Age of Empires 2. He recharges his batteries through hiking, running, and exploring nature.