
Alexa AI scientists reduce speech recognition errors by up to 22% with semi-supervised learning




Scientists in Amazon’s Alexa Speech group today announced that they have trained an acoustic model on what they believe to be one of the largest unlabeled data sets ever assembled, improving the intelligent assistant’s ability to understand the human voice.

Using semi-supervised learning, a method that combines human and machine labeling of training data, Amazon scientists reduced speech recognition error rates by 10% to 22% compared with approaches that rely solely on supervised learning. The largest gains were seen on noisy audio.

The acoustic model was trained with 7,000 hours of labeled data, followed by 1 million hours of unannotated, or unlabeled, data. Acoustic models are one of several AI systems that power automatic speech recognition, which converts a voice command into action by a computer.

“We are currently working to integrate the new model into Alexa, with a projected release date of later this year,” said Alexa senior applied scientist Hari Parthasarathi in a blog post.


The work will be presented next month at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in Brighton, United Kingdom.

These advances in Alexa’s ability to understand the human voice were achieved through a teacher-student training method built on long short-term memory (LSTM) networks. The “teacher” is trained to classify 30-millisecond chunks of audio, and it then transfers some of that understanding to a “student” network that is trained on the unlabeled data.
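
To make the teacher-student idea concrete, the sketch below shows a minimal version of the setup in PyTorch: a teacher LSTM standing in for a model already trained on labeled data produces soft labels for unlabeled audio frames, and a student LSTM is trained to match them. The network sizes, feature dimensions, and training details are illustrative assumptions, not Amazon’s published configuration.

```python
# Minimal teacher-student (knowledge distillation) sketch on unlabeled audio frames.
# Sizes, feature dimensions, and the 30 ms framing are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SENONES = 3000   # output classes (sound clusters), per the article
FEAT_DIM = 80        # e.g. log-mel features per 30 ms frame (assumption)

class AcousticLSTM(nn.Module):
    def __init__(self, hidden=512, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, NUM_SENONES)

    def forward(self, x):            # x: (batch, time, FEAT_DIM)
        h, _ = self.lstm(x)
        return self.out(h)           # frame-level logits

teacher = AcousticLSTM().eval()      # stands in for a teacher already trained on labeled data
student = AcousticLSTM()
optim = torch.optim.Adam(student.parameters(), lr=1e-4)

unlabeled_batch = torch.randn(8, 100, FEAT_DIM)   # stand-in for unannotated audio features

with torch.no_grad():
    soft_targets = F.softmax(teacher(unlabeled_batch), dim=-1)   # teacher's "soft labels"

log_probs = F.log_softmax(student(unlabeled_batch), dim=-1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")  # student mimics teacher
loss.backward()
optim.step()
```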

A number of additional techniques were applied to optimize or speed up model training, such as processing the student model’s audio once instead of twice, interleaving the human-labeled and machine-labeled data, and storing only the 20 highest-probability teacher model outputs during training rather than the full distribution over 3,000 sound clusters. The student model must then match as many of those 20 probabilities as accurately as possible.
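
The top-20 trick can be sketched in a few lines: instead of storing the teacher’s full distribution over roughly 3,000 classes for every frame, only the 20 highest-probability entries and their indices are kept, and the student’s loss is computed against that truncated distribution. The tensor shapes and the renormalization step below are assumptions for illustration.

```python
# Hypothetical sketch: keep only the teacher's top-20 outputs per frame.
import torch
import torch.nn.functional as F

K = 20
teacher_logits = torch.randn(8, 100, 3000)            # (batch, frames, senones), stand-in values

# Store just the 20 largest probabilities and their class indices for each frame.
probs = F.softmax(teacher_logits, dim=-1)
top_p, top_idx = probs.topk(K, dim=-1)                # each (8, 100, 20)
top_p = top_p / top_p.sum(dim=-1, keepdim=True)       # renormalize the truncated distribution (assumption)

# During student training, gather the student's log-probabilities at those
# same indices and match the truncated teacher distribution.
student_logits = torch.randn(8, 100, 3000, requires_grad=True)
student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, top_idx)
loss = -(top_p * student_logp).sum(dim=-1).mean()     # cross-entropy against the top-20 targets
loss.backward()
```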

“The 7,000 hours of annotated data are more accurate than the machine-labeled data, so while training the student, we interleave the two. Our intuition was that if the machine-labeled data began to steer the model in the wrong direction, the annotated data could provide a course correction,” the post reads.
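
One simple way to picture that interleaving is to alternate batches from the two sources during training, so the small human-labeled set keeps “course correcting” the student. The mixing scheme below is an illustrative assumption; the article does not specify the exact ratio Amazon used.

```python
# Hypothetical sketch: alternate human-labeled and machine-labeled batches
# so the small annotated set is revisited throughout training.
from itertools import cycle

def interleaved_batches(labeled_batches, machine_labeled_batches):
    """Yield (batch, source) pairs that alternate between the two data sources."""
    labeled = cycle(labeled_batches)              # the annotated set is much smaller, so cycle it
    for machine_batch in machine_labeled_batches:
        yield next(labeled), "human"              # e.g. cross-entropy against human transcriptions
        yield machine_batch, "machine"            # e.g. loss against the teacher's soft labels

# Tiny demo with stand-in batches:
for batch, source in interleaved_batches(["h1", "h2"], ["m1", "m2", "m3", "m4"]):
    print(source, batch)                          # human h1, machine m1, human h2, machine m2, ...
```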

Today’s news follows a February announcement of a 20% reduction in speech recognition error rates using other semi-supervised learning methods, as well as advances, announced earlier this week, that make a two-microphone array more effective than a seven-microphone array.
