Understanding the Science Behind Accent Softening Technology

AI-enabled accent softening makes communication easier while keeping the speaker's identity. But how does it work? Learn the science here.

By James Fan in Product 01/09/25


What is Accent Softening?

In our society, people from every corner of the globe, speaking a kaleidoscope of languages, molded by distinct cultures and regions, find themselves in a melting pot of accents. These accents, rich and full of heritage, sometimes become barriers to clear communication, leading to misunderstandings and frustration in both personal and professional contexts.

AI-enabled accent softening is a real-time, speech-to-speech solution that uses deep learning to dynamically modify speech audio, softening the accent so the speech is easier to understand while preserving the speaker’s identity.

[Image: A multi-colored sound wave coming from a virtual call center agent’s mouth, depicting Tomato.ai’s accent softening technology in action.]

The Challenges of Accent Softening

This is an emerging technology that only became possible with recent advances in generative AI for speech. To fully implement an AI-enabled accent softening solution, one needs to overcome a number of challenges:

Latency

For the technology to improve communication without impeding the natural flow of conversation, the solution must soften the accent in real time with minimal latency. Any delay is easily noticeable in a conversation and will negate the improvements in user experience gained from accent softening.
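As a rough illustration, the latency budget of a frame-based streaming system can be estimated from its frame parameters. Every number below (sample rate, hop size, lookahead, compute time) is an assumption made up for this sketch, not a measurement of any particular system:

```python
# Illustrative latency budget for a frame-based streaming pipeline.
# Every number here is an assumption for the sake of the example.

SAMPLE_RATE = 16_000     # Hz
HOP_LENGTH = 160         # samples between frames (10 ms per frame)
LOOKAHEAD_FRAMES = 12    # future frames buffered before emitting output
MODEL_COMPUTE_MS = 30.0  # assumed per-chunk inference time

frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
algorithmic_ms = LOOKAHEAD_FRAMES * frame_ms  # delay from buffering alone
total_ms = algorithmic_ms + MODEL_COMPUTE_MS

print(f"frame duration:      {frame_ms:.1f} ms")       # 10.0 ms
print(f"algorithmic latency: {algorithmic_ms:.1f} ms")  # 120.0 ms
print(f"total latency:       {total_ms:.1f} ms")        # 150.0 ms
```

Keeping the total delay below the point where a pause becomes noticeable in conversation is what makes streaming, rather than whole-utterance processing, a hard requirement.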

Naturalness

One common complaint about generated speech is that it often sounds robotic. Think of C-3PO in Star Wars, who is fluent in millions of languages yet still carries a mechanical cadence and precision in articulation that make him unmistakably robotic. The problem is much worse if the AI’s modification of accented speech makes it sound less natural and strips away the emotion conveyed in the original speech.

Complexity of Accent

An accent is not simply a matter of pronouncing words differently; it also involves different prosody (i.e., the melody of the speech). If the accent softening solution replaces both the pronunciation and the prosody, the generated speech may sound very different from the original speaker.

Lack of Training Data

Supervised machine learning typically produces the best results, but for the accent softening problem it is difficult to obtain the labeled or parallel data that supervised approaches require: it is nearly impossible to find both accented and unaccented speech from the same speaker.

The Naive Way to Implement an Accent Softening Solution

One naive way to implement an accent softening solution is to chain a speech-to-text system and a text-to-speech system: the input speech is transcribed to text, which is then synthesized into speech with no accent.
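A minimal sketch of this cascade, where transcribe() and synthesize() are hypothetical stand-ins for any off-the-shelf speech-to-text and text-to-speech systems:

```python
# Naive accent "softening" as a speech-to-text -> text-to-speech cascade.
# transcribe() and synthesize() are hypothetical stand-ins; no specific
# STT or TTS library is implied.

def transcribe(audio: bytes) -> str:
    """Speech-to-text: return a transcript of the input audio."""
    ...

def synthesize(text: str) -> bytes:
    """Text-to-speech: render the text in a neutral, accent-free voice."""
    ...

def naive_accent_softening(accented_audio: bytes) -> bytes:
    text = transcribe(accented_audio)  # must finish before synthesis starts
    return synthesize(text)            # prosody and speaker identity are lost
```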

While this approach seems simple, it fails to meet most of the challenges outlined above:

- Latency. Because the text-to-speech system needs the transcript from the speech-to-text system, the overall latency is too large to be acceptable in a real-time use case.

- Naturalness and complexity of accent. Because the text-to-speech system does not receive the original speaker’s prosody, the generated audio lacks the speaker’s emotional tone and sounds less natural. The synthesized voice may also sound very different from the speaker, failing to preserve the speaker’s identity.

- Cascading errors. If there is a mistake in the speech-to-text transcript, it cascades into the text-to-speech output, which can be quite misleading.

The Components of an Accent Softening Solution

A speech-to-speech accent softening solution consists of the following components:

  1. Audio processing
  2. Accent softening
  3. Vocoder

We’ll look at each of these components in more detail.
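Before diving into each stage, here is a hedged end-to-end sketch of how the three components might be chained; audio_to_mel, soften_accent, and vocode are hypothetical placeholders for the stages described in the sections below:

```python
import numpy as np

# Hypothetical end-to-end pipeline; each function is a placeholder for the
# component described in the corresponding section below.

def audio_to_mel(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Audio processing: waveform -> mel spectrogram."""
    ...

def soften_accent(mel: np.ndarray) -> np.ndarray:
    """Accent softening model: accented mel -> accent-softened mel."""
    ...

def vocode(mel: np.ndarray, sr: int) -> np.ndarray:
    """Vocoder: mel spectrogram -> audible waveform."""
    ...

def accent_softening_pipeline(waveform: np.ndarray, sr: int) -> np.ndarray:
    mel = audio_to_mel(waveform, sr)
    softened_mel = soften_accent(mel)
    return vocode(softened_mel, sr)
```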

Audio Processing

The audio processing component converts an audio waveform into a mel spectrogram, and it serves as the first step in accent softening. This process involves applying a short-time Fourier transform (STFT) to the waveform to capture frequency and time information, followed by mapping the resulting spectrogram to the mel scale, which aligns more closely with human auditory perception. The mel spectrogram emphasizes perceptually relevant frequency bands, creating a compact, informative representation of sound. This transformation enables efficient input for deep learning models by highlighting essential audio features in a manageable format.
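As a concrete sketch, the transformation can be computed with the librosa library; the STFT and mel parameters below are common defaults assumed for illustration, not values from any particular production system:

```python
import librosa
import numpy as np

# Load audio, resampled to 16 kHz (the rate is an assumption).
waveform, sr = librosa.load("accented_speech.wav", sr=16_000)

# STFT followed by mapping onto the mel scale.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sr,
    n_fft=1024,      # STFT window size
    hop_length=256,  # step between successive frames
    n_mels=80,       # number of mel bands (80 is a common choice)
)

# Log compression better matches human loudness perception.
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames)
```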

Accent Softening

An accent softening model that processes a mel spectrogram of accented speech to produce a spectrogram of the same speech without the accent—while retaining the original speaker’s identity—is an innovative application of deep learning in speech synthesis and accent conversion. The challenge lies in preserving the unique voice characteristics of the speaker (known as speaker identity) while altering the accent characteristics.

The input to this model is a mel spectrogram which represents the accented speech, and the goal of the model is to produce a modified mel spectrogram that corresponds to accent-neutralized speech. This accent softening process involves modifying specific phonetic and prosodic features associated with accents while preserving the qualities that define the speaker’s unique voice, such as pitch, tone, and timbre.

To achieve this transformation, deep learning models, particularly sequence-to-sequence models or generative adversarial networks (GANs), are employed. Sequence-to-sequence models like Transformer-based architectures have demonstrated impressive performance in capturing temporal dependencies in speech and are well-suited for transforming one sequence (accented speech) into another (accent-softened speech). By training the model on paired examples of accented and neutralized speech, the model learns which aspects of the mel spectrogram correspond to accent features and which are intrinsic to the speaker’s voice. During inference, the model can then selectively adjust the accent-related features without affecting the identity-specific features of the speech.
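A minimal sketch of such a sequence-to-sequence mapping in PyTorch, using a Transformer encoder to map an accented mel spectrogram to a softened one. The architecture, dimensions, and the residual connection (assumed here so identity-specific features can pass through unchanged) are illustrative choices, not a description of any production model:

```python
import torch
import torch.nn as nn

class AccentSoftener(nn.Module):
    """Toy mel-to-mel model: accented spectrogram in, softened out."""

    def __init__(self, n_mels: int = 80, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        h = self.encoder(self.in_proj(mel))
        # Predict a residual: the model learns the accent-related changes,
        # letting the speaker's voice characteristics pass through.
        return mel + self.out_proj(h)

model = AccentSoftener()
accented = torch.randn(1, 200, 80)  # fake 200-frame mel spectrogram
softened = model(accented)          # same shape: (1, 200, 80)

# With (hypothetical) paired data, training would minimize e.g. an L1 loss
# between the model output and the accent-neutral target spectrogram:
target = torch.randn(1, 200, 80)
loss = nn.functional.l1_loss(softened, target)
```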

Training an accent softening model requires a large, diverse dataset of accented and neutralized speech samples across many speakers. These datasets should cover a range of accents, speech patterns, and individual voice characteristics so the model generalizes well to new speakers and accents. Because parallel recordings of the same speaker with and without an accent are nearly impossible to collect (as noted above), unpaired training approaches such as CycleGANs are one way to work around this constraint.

Vocoder

A vocoder, or “voice coder,” is a crucial element in speech synthesis systems, responsible for converting a mel spectrogram into an audible speech signal. Modern vocoders are essential for accent softening, as they make synthetic speech sound more natural and human-like.

Vocoders are implemented using a variety of techniques, from traditional signal-processing methods like Linear Predictive Coding (LPC) and STRAIGHT to advanced deep learning-based approaches.
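For illustration, a traditional (non-neural) inversion can be performed with the Griffin-Lim algorithm via librosa; the parameters must match those used to compute the mel spectrogram, so the values below reuse the assumptions from the earlier audio processing sketch:

```python
import librosa
import soundfile as sf

# `mel` is a power mel spectrogram, e.g. the output of the earlier sketch
# (before log compression). Griffin-Lim estimates the phase iteratively,
# which is simple and fast but audibly less natural than neural vocoders.
waveform = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=16_000,
    n_fft=1024,
    hop_length=256,
)

sf.write("reconstructed.wav", waveform, 16_000)
```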

Deep learning-based vocoders, such as WaveNet, WaveGlow, and HiFi-GAN, utilize neural networks to model the waveform directly, which allows them to generate high-fidelity speech with intricate details. WaveNet, for example, uses autoregressive models to generate each audio sample conditioned on previous samples, achieving exceptional quality but at the cost of computational intensity. WaveGlow and HiFi-GAN, which leverage non-autoregressive techniques, offer faster synthesis speeds while still delivering high-quality output.
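To see why autoregressive vocoders are computationally intensive, consider this toy generation loop: one network call per audio sample, so a single second of 16 kHz audio needs 16,000 sequential model evaluations. The predict_next_sample function is a hypothetical stand-in, not WaveNet’s actual interface:

```python
# Toy illustration of autoregressive waveform generation.

def predict_next_sample(history: list[float], mel_frame) -> float:
    """Hypothetical stand-in for a network that conditions on all
    previously generated samples plus the current mel frame."""
    return 0.0  # a real model would predict the next sample here

def generate(mel_frames: list, sr: int = 16_000,
             seconds: float = 1.0) -> list[float]:
    samples: list[float] = []
    for i in range(int(sr * seconds)):  # 16,000 sequential steps per second
        frame = mel_frames[min(i // 256, len(mel_frames) - 1)]
        samples.append(predict_next_sample(samples, frame))
    return samples

# Non-autoregressive vocoders such as WaveGlow and HiFi-GAN instead emit
# all samples in parallel, which is why they synthesize much faster.
```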

Different vocoder approaches present trade-offs. Traditional vocoders are more computationally efficient and often suitable for real-time applications, but they may lack naturalness in synthesized speech. In contrast, deep learning-based vocoders produce more lifelike audio but require significant computing power and optimization for real-time processing, posing a challenge in resource-constrained environments.

Vocoder technology has progressed significantly, allowing neural vocoders to achieve highly realistic, expressive speech synthesis. However, designing a deep learning-based vocoder remains challenging due to the need to capture complex speech dynamics like pitch, tone, and articulatory nuances.

Conclusion and Outlook

An accent softening system aims to enable smooth communication among people with different accents and cultural backgrounds, but reaching this goal requires overcoming many challenges. Recent advances in neural networks have made accent softening increasingly effective, with Transformer-based architectures and CycleGANs showing promise in real-time applications. The result is a powerful tool that enables smoother cross-linguistic communication, preserves speaker individuality, and enhances the clarity and accessibility of speech-based applications.
