Voice Generation Use Cases
Voice generators from companies like ElevenLabs, Wellsaid Labs, and Play.ht, are quickly being adopted by creators as a way to to grow their business. Creators use generated speech to make their voice sound more professional, or to translate their voice into other languages. Movie studios, game studios, and virtual agents are replacing human voices with generated ones to make it easier to add and modify prompts, as no studio time needs to be scheduled to record prompt changes.
Other vendors, like Tomato.ai (our company), are focused on more real-time use cases, for example, providing call center agents with a way to sound clearer and more trustworthy by changing their accents as they speak. This helps agents improve their productivity, and overtime their pay. Another use case is to help anyone with a foreign accent sound clearer when speaking in online meetings.
Companies and individuals considering adopting voice generator solutions need a way to judge the quality of the solution. Below is a list of key dimensions that can be used to compare such offerings.
Prioritizing Call Center AI Solutions
Framework call center leaders can use to prioritize which AI solutions to implement first
6 Ways to Judge the Quality of Voice Generators
-
Naturalness
The most common way to determine the quality of a voice generator is to evaluate how human-like, or natural, the AI generated voice sounds. If the voice sounds robotic the listener is unlikely to engage as much as otherwise. For real-time use cases, such as live agents or meetings, one aspect to look out for is whether the prosody, or emotional features of the speaker, in the original voice comes through, or is the generated voice flat.
-
Acoustic Quality
Acoustic quality is a measure of noises, also known as artifacts, introduced in error when generating the voice. It can be distracting to have various sounds in and around the generated voice. The better the acoustic quality the more likely listeners are to continue listening.
-
Level of Accent
The real-time use cases, like ones helping call center agents, can generate a voice with a softened accent or an alternate accent. In both scenarios the aim is to localize the accent based on where the listener is based. When judging a generated accented voice, if the preference is to soften the accent, then elements of the original accent (e.g. Indian or Filipino) should still exist, but also elements of the desired accent. If the goal is to have an alternate accent, then the aim is to have the accent match the typical accent at the desired location.
-
Latency
The creator use cases are typically done offline, where latency for generating the voice is not a key factor. On the other hand the agent or meetings use cases are done in real-time. As a person is speaking the generated voice is heard by the listener. So for those use cases the time it takes to generate the voice is critical and needs to be as low as possible. If the delay is too long the listener might speak over the other person’s voice.
-
Voice Similarity
With some use cases keeping the original voice of the speaker is important. For example, a creator translating their content might care about keeping the translated versions on brand, with their voice used across languages. Also, call center agents might prefer that customers hear a voice that sounds like theirs.
-
Quality of Voice Actors
Some use cases work better when a person sounds like someone else, for example a celebrity, or professional voice actor. In those cases a comprehensive list of high caliber voice actor options is helpful to select the best voice.
Take Away
When selecting a voice generator there are a number of factors to consider. Depending on the use case there are different dimensions that matter more. For offline creator use cases naturalness, acoustic quality matter a lot. For real-time use cases, like the live agent and meetings ones, latency and accent are also important. With some use cases voice similarity matters more, with others having many voice actor options to choose from is more important.