
Inworld Voice 2.0: Improving TTS voice capabilities for increased realism in AI agents

Alex Huang
March 20, 2024

Inworld Voice 2.0 capabilities

Over the last few months, we’ve made significant improvements to Inworld’s voice offering for both Inworld Studio and custom voices. These voices enhance character interactions with emotional depth and realism through improved latency, rhythm, intonation, pitch variation, and natural pausing. Our goal is to streamline character creation for Inworld customers by offering a wide selection of premium, ready-to-use voices.

Sample voices include The Mercenary (commanding, gaming), British Narration (soothing, warm), American Narration (energetic, positive), and Educational (informative, precise).

Reduced latency for real-time conversations

We've made significant advancements in reducing latency for both our Inworld Studio and cloned voices, giving you more options to generate natural-sounding conversations. Our new voices reach a 250 ms median (50th-percentile) end-to-end latency for approximately 6 seconds of generated audio. To achieve the right balance between latency, quality, and throughput, we applied the following approaches:

  • Hyperparameter tuning enabled us to optimize quality and latency by running automatic quality evaluations while searching over inference hyperparameters. This iterative search-and-evaluate loop is a standard part of machine learning practice and was key to reaching optimal model performance.
  • Streaming generation allowed us to improve the real-time factor by producing audio in smaller segments. One segment is spoken while the next is generated, resulting in uninterrupted and continuous speech.
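To illustrate the streaming idea, here is a minimal Python sketch of overlapping synthesis and playback. The segment splitter, `synthesize_segment`, and `play` callback are hypothetical placeholders rather than Inworld's actual API; the point is only that one segment plays while the next is being generated.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_segments(text: str) -> list[str]:
    # Naive sentence split; a real system would segment on prosodic boundaries.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def synthesize_segment(segment: str) -> bytes:
    # Hypothetical stand-in for the TTS model call that renders one segment to audio.
    raise NotImplementedError

def stream_speech(text: str, play) -> None:
    """Play each audio segment while the next one is still being synthesized."""
    segments = split_into_segments(text)
    if not segments:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize_segment, segments[0])
        for nxt in segments[1:]:
            audio = pending.result()                        # wait for the current segment
            pending = pool.submit(synthesize_segment, nxt)  # start generating the next one
            play(audio)                                     # playback overlaps generation
        play(pending.result())                              # final segment
```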

Diverse voices as part of our platform pricing

Our regular platform pricing now includes 48 Studio voices, offering a variety of TTS options for different character types. The voice pipeline is included with the Inworld Platform at no extra cost. For pricing information on custom cloned voices, please reach out to us.

Training our Inworld Voice 2.0 model

The improvements that we've made to our voices are the result of a number of factors:

  • New architecture: By implementing a lightweight diffusion-based TTS architecture, we were able to achieve more refined voice generation. Carefully curating the data used to train this model allowed us to strike a balance between quality and latency.
  • Data cleaning: The data we used to train these voices came from Creative Commons licensed datasets, specifically LibriTTS-R, SLR83, and Hifi-TTS. We used approximately 20 hours of additional licensed audio, recorded specifically for our platform by professional voice actors, to enhance the pitch, pace, and emotional range of our model. To ensure reliable audio quality and transcription accuracy, we implemented a thorough cleaning pipeline that identifies and removes flawed samples, addressing issues like missing words and unintended punctuation. We also trained a set of ML classifiers to help us automatically identify low-quality, noisy, and game-inappropriate recordings. We further enhanced our data cleaning process by manually listening to the audio to identify and remove imperfections like background noise and samples that sounded slow, monotone, or robotic.
  • Model distillation: We used a much larger TTS model to augment the expressiveness of our voices. We generated audio with patterns not found in the training data, emphasizing text that results in expressive, dialogue-like speech. Using some of the same methods for data cleaning, we carefully filtered generations to ensure that only high-quality augmented samples were added.
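As a rough illustration of the kind of filtering described above (and reused to screen distilled generations), the sketch below keeps a sample only if an ASR transcript closely matches its label and a quality classifier accepts it. The `transcribe` and `quality_score` callables and the thresholds are assumptions for illustration, not Inworld's actual pipeline.

```python
import jiwer  # word-error-rate utilities

def keep_sample(audio_path: str, transcript: str,
                transcribe, quality_score,
                max_wer: float = 0.1, min_quality: float = 0.5) -> bool:
    """Return True if an (audio, transcript) pair passes the automatic checks.

    `transcribe` is any ASR callable returning text for an audio file, and
    `quality_score` is a classifier scoring recording quality in [0, 1];
    both are placeholders here.
    """
    hypothesis = transcribe(audio_path)
    # Drop samples where the spoken audio diverges from the transcript
    # (missing words, unintended punctuation read aloud, etc.).
    if jiwer.wer(transcript.lower(), hypothesis.lower()) > max_wer:
        return False
    # Drop samples flagged as noisy, monotone, or otherwise low quality.
    return quality_score(audio_path) >= min_quality
```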

Our methodology for benchmarking high-quality voices

Selecting the right prompt is a crucial part of high-quality voice generation. The quality of a generated voice is greatly influenced by the nature of the prompt, whether it’s dull or exciting. To ensure optimal voice generation, we employed a custom set of evaluation criteria to select the most suitable prompts and measure the quality of our model.

The standard method for evaluating TTS model output is the Mean Opinion Score (MOS), where voice generations are rated on a scale of 1 to 5 by crowdsourced raters. However, this approach is slow, expensive, and impractical for the frequent model and data changes we make during development.

Instead, we developed an automated evaluation framework to assess prompt and generated voice quality across five categories: text generation accuracy, speaker similarity, prosody evaluation, talking speed, and expressiveness. Minimal code sketches of these checks follow the list below.

  • Text generation accuracy: We verified the accuracy of our model's generated text by comparing it to the requested text using an Automatic Speech Recognition (ASR) model.
  • Speaker similarity evaluation: To measure voice similarity between the generation and prompt, we employed a speaker verification model. This pretrained model converts the audio from the generation and prompt into dense embedding representations, allowing us to calculate the cosine similarity between the embeddings. The closer the embeddings are, the greater the similarity between the voices.
  • Prosody evaluation: The Mel Cepstral Distortion (MCD) metric helped us quantify the spectral similarity between synthesized speech and a reference sample. It calculates the average Euclidean distance between their Mel-frequency cepstral coefficients (MFCCs). A lower MCD value indicates greater spectral similarity, offering insights into prosodic similarity. The MFCC representation partially captures prosodic features such as rhythm and intonation.
  • Talking speed: Measuring phonemes per second is a reliable indicator of speech naturalness and helps avoid excessively slow or robotic voices. Striking a balance in the range of phonemes per second ensured our voices maintained an optimal tempo and pace while sounding natural.
  • Expressiveness: We used Pitch Estimating Neural Networks (PENN) to measure pitch in the generated audio and assess expressiveness by analyzing the standard deviation of pitch. Our goal was to optimize for natural levels of pitch variation that align with voice expressiveness, avoiding monotone voices.
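Here is a minimal sketch of the first two checks in the list above, using off-the-shelf stand-ins (Whisper for ASR, Resemblyzer for speaker embeddings). The post does not name the models we actually used, so treat the specific libraries as illustrative assumptions.

```python
import numpy as np
import jiwer
import whisper                                         # stand-in ASR model
from resemblyzer import VoiceEncoder, preprocess_wav   # stand-in speaker-verification model

asr = whisper.load_model("base")
encoder = VoiceEncoder()

def text_accuracy(generated_wav: str, requested_text: str) -> float:
    """Word error rate between the requested text and the ASR transcript of the generation."""
    transcript = asr.transcribe(generated_wav)["text"]
    return jiwer.wer(requested_text.lower(), transcript.lower())

def speaker_similarity(generated_wav: str, prompt_wav: str) -> float:
    """Cosine similarity between speaker embeddings of the generation and the prompt."""
    gen = encoder.embed_utterance(preprocess_wav(generated_wav))
    ref = encoder.embed_utterance(preprocess_wav(prompt_wav))
    return float(np.dot(gen, ref) / (np.linalg.norm(gen) * np.linalg.norm(ref)))
```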
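And a similarly hedged sketch of the remaining three checks. The MCD here is a simplified frame-wise version (production implementations typically time-align the sequences with DTW), the phoneme count is assumed to come from a separate grapheme-to-phoneme step, and librosa's pyin stands in for PENN as the pitch estimator.

```python
import numpy as np
import librosa

def mel_cepstral_distortion(gen_wav: str, ref_wav: str, n_mfcc: int = 13) -> float:
    """Frame-wise MCD between generation and reference; lower means closer spectra/prosody."""
    y_gen, sr = librosa.load(gen_wav, sr=22050)
    y_ref, _ = librosa.load(ref_wav, sr=22050)
    m_gen = librosa.feature.mfcc(y=y_gen, sr=sr, n_mfcc=n_mfcc)[1:]  # drop the energy coefficient
    m_ref = librosa.feature.mfcc(y=y_ref, sr=sr, n_mfcc=n_mfcc)[1:]
    frames = min(m_gen.shape[1], m_ref.shape[1])                     # crude alignment by truncation
    diff = m_gen[:, :frames] - m_ref[:, :frames]
    scale = 10.0 / np.log(10.0) * np.sqrt(2.0)                       # conventional MCD scaling (dB)
    return scale * float(np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))

def phonemes_per_second(wav_path: str, phoneme_count: int) -> float:
    """Talking-speed proxy; phoneme_count would come from a G2P tool applied to the transcript."""
    y, sr = librosa.load(wav_path, sr=None)
    return phoneme_count / (len(y) / sr)

def pitch_std(wav_path: str) -> float:
    """Standard deviation of F0 as an expressiveness proxy (avoids monotone voices)."""
    y, sr = librosa.load(wav_path, sr=22050)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    return float(np.nanstd(f0))
```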

What's next

As part of our ongoing development, we are working on bringing cloned voices into the Inworld Studio and adding multilingual support to our Voice offering. We are also focused on adding contextual awareness to our voices, striving for more emotionally connected interactions. Stay tuned for upcoming announcements as we continue to develop our voice technology.

Test our new voices

Check out the Inworld Studio to see the new voices for yourself. 
