Highlights:

  • Wav2vec 2.0 is a self-supervised learning algorithm that lets models learn useful speech representations from audio alone, without relying on labeled training data.
  • In a direct comparison with OpenAI LP’s Whisper speech recognition model, Meta’s researchers found that models trained on MMS data achieved roughly half the word error rate.

The artificial intelligence research group at Meta Platforms Inc. announced that it has open-sourced Massively Multilingual Speech, a new project that aims to address the challenges of building accurate and reliable speech recognition models.

AI models that can recognize and respond to human speech hold great potential, especially for people who rely on voice access to get information. However, developing high-quality models typically requires a vast amount of data: thousands of hours of audio along with transcriptions of the spoken words. For many languages, particularly less widely spoken ones, that data simply does not exist.

Meta’s MMS project sidesteps that requirement by combining a self-supervised learning algorithm called wav2vec 2.0 with a new dataset that provides labeled data for over 1,100 languages and unlabeled data for almost 4,000 languages.

To overcome the lack of data for some languages, Meta’s researchers turned to the Bible, which, unlike most other books, has already been translated into thousands of languages. Its translations are frequently studied in text-based language translation research, and for many of them, audio recordings of people reading the texts are freely available.

Meta’s researchers added, “As part of this project, we created a dataset of readings of the New Testament in over 1,100 languages, which provided on average 32 hours of data per language.”

Thirty-two hours of data is far from enough to train a conventional supervised speech recognition model, which is why Meta employed wav2vec 2.0, a self-supervised learning algorithm that lets models learn useful representations of speech from audio alone, without labeled training data.
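To illustrate the idea: wav2vec 2.0 masks spans of the latent speech representation and trains the model to pick out the true quantized latent for each masked step from a set of distractors, so no transcripts are needed. The toy sketch below, in Python with PyTorch, shows that kind of contrastive objective; the shapes, the temperature value, and the random tensors are illustrative assumptions, not Meta’s implementation.

```python
# Toy sketch of a wav2vec 2.0-style contrastive objective: for each masked time step,
# the model must identify the true quantized latent among K distractors, which requires
# only raw audio, not transcripts. All shapes and values here are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, temperature=0.1):
    """context: (T, D) transformer outputs at masked steps,
    targets: (T, D) true quantized latents, distractors: (T, K, D) negatives."""
    candidates = torch.cat([targets.unsqueeze(1), distractors], dim=1)    # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(sims.size(0), dtype=torch.long)                  # true latent is index 0
    return F.cross_entropy(sims, labels)                                  # (K+1)-way classification

# Dummy example: 5 masked steps, 256-dimensional latents, 10 distractors per step.
T, D, K = 5, 256, 10
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D), torch.randn(T, K, D))
print(loss.item())
```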

That makes it possible to train speech recognition models with far less labeled data. The MMS project trained multiple self-supervised models on approximately 500,000 hours of speech in over 1,400 languages, then fine-tuned them for specific speech tasks such as multilingual speech recognition and language identification.
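As a rough idea of what running one of the fine-tuned recognition models could look like, the sketch below performs greedy CTC decoding through the Hugging Face transformers library. The checkpoint name "facebook/mms-1b-all" and the placeholder audio are assumptions made for illustration, not details from Meta’s announcement.

```python
# Minimal sketch: transcribing audio with a multilingual wav2vec 2.0 CTC checkpoint
# via Hugging Face transformers. The model ID is an assumed public MMS checkpoint.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"                      # assumption: released MMS ASR model
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Placeholder input: one second of silence at 16 kHz; replace with real speech samples.
audio = np.zeros(16_000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                   # per-frame scores over the vocabulary

predicted_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])
```

Language identification would follow the same pattern with a classification head in place of the CTC head, predicting one of the supported language labels per clip.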

According to Meta, the resulting models outperform existing speech recognition models on standard benchmarks such as FLEURS.

Meta’s researchers explained, “We trained multilingual speech recognition models on over 1,100 languages using a 1B parameter wav2vec 2.0 model. As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4% but increases the language coverage by over 17 times.”

In Meta’s comparison, models trained on MMS data achieved roughly half the word error rate of OpenAI LP’s Whisper speech recognition model. The researchers said, “This demonstrates that our model can perform very well compared with the best current speech models.”
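For context, word error rate counts the word-level substitutions, insertions, and deletions needed to turn a model’s transcript into the reference, divided by the number of reference words; the character error rate quoted earlier is the same computation over characters. A small, self-contained sketch of the metric:

```python
# Word error rate: edit distance over words, normalized by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words ≈ 0.17
```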

Meta said it is now sharing its MMS dataset and the tools used to develop and train its models so that other members of the AI research community can build on the work. Meta’s goals for MMS include expanding support to more languages and improving how it handles dialects, which remains a significant challenge for current speech technologies.

The researchers added, “Our goal is to make it easier for people to access information and to use devices in their preferred language. We also envision a future where a single model can solve several speech tasks for all languages. While we trained separate models for speech recognition, speech synthesis, and language identification, we believe that in the future, a single model will be able to accomplish all these tasks and more, leading to better overall performance.”