DeepSpeech 0.6: Mozilla’s Speech-to-Text Engine Gets Fast, Lean, and Ubiquitous

The Machine Learning team at Mozilla continues work on DeepSpeech, an automatic speech recognition (ASR) engine which aims to make speech recognition technology and trained models openly available to developers. DeepSpeech is a deep learning-based ASR engine with a simple API. We also provide pre-trained English models.

Our latest release, v0.6, offers the highest-quality, most feature-packed model so far. In this overview, we’ll show how DeepSpeech can transform your applications by enabling client-side, low-latency, and privacy-preserving speech recognition capabilities.

Consistent low latency

DeepSpeech v0.6 includes a host of performance optimizations designed to make it easier for application developers to use the engine without having to fine-tune their systems. The largest improvement comes from our new streaming decoder, which gives DeepSpeech consistent low latency and memory utilization regardless of the length of the audio being transcribed. Application developers can obtain partial transcripts without worrying about big latency spikes.

DeepSpeech is composed of two main subsystems: an acoustic model and a decoder. The acoustic model is a deep neural network that receives audio features as inputs, and outputs character probabilities. The decoder uses a beam search algorithm to transform the character probabilities into textual transcripts that are then returned by the system.

In a previous blog post, I discussed how we made the acoustic model streamable. With both systems now capable of streaming, there’s no longer any need for carefully tuned silence detection algorithms in applications. dabinat, a long-term volunteer contributor to the DeepSpeech code base, contributed this feature. Thanks!
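To give a concrete sense of the streaming API, here is a minimal sketch of feeding audio in chunks and requesting intermediate transcripts. It follows the v0.6 Python bindings; the model path, beam width, and chunk source are placeholders, and exact names and signatures may differ in other releases.

```python
import numpy as np
import deepspeech

# Load the acoustic model; the path and beam width (500) are placeholders.
model = deepspeech.Model('output_graph.tflite', 500)

stream = model.createStream()

def feed_chunk(chunk_bytes):
    """Feed a chunk of 16-bit mono PCM and print a partial transcript."""
    audio = np.frombuffer(chunk_bytes, dtype=np.int16)
    model.feedAudioContent(stream, audio)
    print('partial:', model.intermediateDecode(stream))

# ...feed chunks as they arrive from the microphone, then:
# print('final:', model.finishStream(stream))
```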

In the following diagram, you can see the same audio file being processed in real time by DeepSpeech, before and after the decoder optimizations. The program requests an intermediate transcription roughly every second while the audio is being transcribed. The dotted black line marks when the program receives the final transcription. The distance from the end of the audio signal to the dotted line therefore represents how long a user must wait after they’ve stopped speaking before the final transcript is computed and the application can respond.

This diagram compares the latency of DeepSpeech before and after the decoder optimizations.

In this case, the latest version of DeepSpeech provides the transcription 260ms after the end of the audio, which is 73% faster than before the streaming decoder was implemented. This difference would be even larger for a longer recording. The intermediate transcript requests at seconds 2 and 3 of the audio file are also returned in a fraction of the time.

Maintaining low latency is crucial for keeping users engaged and satisfied with your application. DeepSpeech enables low-latency speech recognition services regardless of network conditions, as it can run offline, on users’ devices.

TensorFlow Lite, smaller models, faster start-up times

We have added support for TensorFlow Lite, a version of TensorFlow optimized for mobile and embedded devices. This reduced the DeepSpeech package size from 98 MB to 3.7 MB, and our English model size from 188 MB to 47 MB. We did this via post-training quantization, a technique that compresses model weights after training is done. Although TensorFlow Lite targets mobile and embedded devices, we found that for DeepSpeech it is even faster on desktop platforms, so we’ve made it available on Windows, macOS, and Linux as well as Raspberry Pi and Android. DeepSpeech v0.6 with TensorFlow Lite runs faster than real time on a single core of a Raspberry Pi 4.
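As a rough illustration of post-training quantization in general, not our exact export pipeline, this is how the TensorFlow Lite converter quantizes the weights of an already-trained model; the SavedModel path is a placeholder.

```python
import tensorflow as tf

# Convert an already-trained model to TensorFlow Lite with weight quantization.
# The SavedModel path is a placeholder; DeepSpeech's real export script differs.
converter = tf.lite.TFLiteConverter.from_saved_model('exported_model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization

tflite_model = converter.convert()
with open('output_graph.tflite', 'wb') as f:
    f.write(tflite_model)
```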

The following diagram compares the start-up time and peak memory utilization for DeepSpeech versions v0.4.1, v0.5.1, and our latest release, v0.6.0.

This bar graph compares start-up time and peak memory utilization for the last three DeepSpeech versions: v0.4.1, v0.5.1, and v0.6.0

We now use 22 times less memory and start up over 500 times faster. Together with the optimizations we’ve applied to our language model, a complete DeepSpeech package including the inference code and a trained English model is now more than 50% smaller.

Confidence value and timing metadata in the API

In addition, the new decoder exposes timing and confidence metadata, providing new possibilities for applications. The API now offers an extended set of functions that return not just the textual transcript, but also timing information for each character in the transcript and a per-sentence confidence value.

The example below shows the timing metadata extracted by DeepSpeech from a sample audio file. The per-character timing returned by the API is grouped into word timings.
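As a small sketch of how an application might consume this metadata, the snippet below groups the per-character items returned by the v0.6 Python bindings into word start times; the field names follow that release and may differ in other versions.

```python
def words_with_times(metadata):
    """Group per-character timing metadata into word start times (v0.6 field names)."""
    words, current, start = [], '', None
    for item in metadata.items:
        if item.character == ' ':
            if current:
                words.append({'word': current, 'start_time': start})
            current, start = '', None
        else:
            if not current:
                start = item.start_time
            current += item.character
    if current:
        words.append({'word': current, 'start_time': start})
    return words

# metadata = model.sttWithMetadata(audio)   # audio: 16-bit PCM as a NumPy int16 array
# print(metadata.confidence, words_with_times(metadata))
```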

Te Hiku Media are using DeepSpeech to develop and deploy the first Te reo Māori automatic speech recognizer. They have been exploring the use of the confidence metadata in our new decoder to build a digital pronunciation helper for Te reo Māori. Recently, they received a $13 million NZD investment from New Zealand’s Strategic Science Investment Fund to build Papa Reo, a multilingual language platform. They are starting with New Zealand English and Te reo Māori.

Windows/.NET support

DeepSpeech v0.6 now offers packages for Windows, with .NET, Python, JavaScript, and C bindings. Windows support was a much-requested feature that was contributed by Carlos Fonseca, who also wrote the .NET bindings and examples. Thanks Carlos!

You can find more details about our Windows support by looking at the WPF example (pictured below). It uses the .NET bindings to create a small UI around DeepSpeech. Our .NET package is available in the NuGet Gallery. You can install it directly from Visual Studio.

This image shows a screenshot of the WPF example.

The WPF example is available in our repository. It contains code demonstrating transcription from an audio file, as well as from a microphone or other audio input device.

Centralized documentation

We have centralized the documentation for all our language bindings in a single website, deepspeech.readthedocs.io. You can find the documentation for C, Python, .NET, Java and NodeJS/Electron packages. Given the variety of language bindings available, we wanted to make it easier to locate the correct documentation for your platform.

Improvements for training models

With the upgrade to TensorFlow 1.14, we now leverage the CuDNN RNN APIs for our training code. This change gives us around 2x faster training times, which means faster experimentation and better models.

Along with faster training, we now also support online feature augmentation, as described in Google’s SpecAugment paper. This feature was contributed by Iara Health, a Brazilian startup providing transcription services for health professionals. Iara Health has used online augmentation to improve their production DeepSpeech models.
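For readers unfamiliar with the technique, the sketch below illustrates the core idea of SpecAugment: masking random frequency bands and time steps of the input spectrogram during training. It is a minimal NumPy illustration of the idea, not DeepSpeech’s actual implementation, and the mask sizes are arbitrary.

```python
import numpy as np

def spec_augment(spectrogram, max_freq_mask=8, max_time_mask=20, rng=np.random):
    """Apply one frequency mask and one time mask to a (time, features) spectrogram."""
    augmented = spectrogram.copy()
    n_frames, n_features = augmented.shape

    f = rng.randint(0, max_freq_mask + 1)          # width of the frequency mask
    f0 = rng.randint(0, max(1, n_features - f))    # where the mask starts
    augmented[:, f0:f0 + f] = 0.0

    t = rng.randint(0, max_time_mask + 1)          # width of the time mask
    t0 = rng.randint(0, max(1, n_frames - t))
    augmented[t0:t0 + t, :] = 0.0

    return augmented
```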

The video above shows a customer using the Iara Health system. By using voice commands and dictation, the user instructs the program to load a template. Then, while looking at results of an MRI scan, they dictate their findings. The user can complete the report without typing. Iara Health has trained their own Brazilian Portuguese models for this specialized use case.

Finally, we have also removed all remaining points where we assumed a known sample rate of 16kHz. DeepSpeech is now fully capable of training and deploying models at different sample rates. For example, you can now more easily train and use DeepSpeech models with telephony data, which is typically recorded at 8kHz.
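For example, an application working with 8kHz telephony audio can query the loaded model for its expected sample rate and resample before feeding audio. The sketch below assumes the v0.6 Python bindings expose the rate via sampleRate() and uses SciPy for resampling; both are illustrative choices.

```python
import numpy as np
from scipy.signal import resample_poly  # one of several possible resamplers

def prepare_audio(audio, model, source_rate=8000):
    """Resample audio (int16 NumPy array) to the rate the loaded model expects."""
    target_rate = model.sampleRate()  # assumes the v0.6 bindings expose this
    if source_rate == target_rate:
        return audio
    resampled = resample_poly(audio.astype(np.float32), target_rate, source_rate)
    return resampled.astype(np.int16)

# text = model.stt(prepare_audio(telephony_audio, model))
```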

Try out DeepSpeech v0.6

The DeepSpeech v0.6 release includes our speech recognition engine as well as a trained English model. We provide binaries for six platforms and, as mentioned above, have bindings to various programming languages, including Python, JavaScript, Go, Java, and .NET.
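If you want to try it from Python, a minimal transcription script looks roughly like the following. The package is installed with pip install deepspeech, and the model and audio file names are placeholders based on the v0.6 model download.

```python
import wave
import numpy as np
import deepspeech

# Minimal batch transcription with the v0.6 Python package (file names are placeholders).
model = deepspeech.Model('deepspeech-0.6.0-models/output_graph.pbmm', 500)
model.enableDecoderWithLM('deepspeech-0.6.0-models/lm.binary',
                          'deepspeech-0.6.0-models/trie', 0.75, 1.85)

with wave.open('audio.wav', 'rb') as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```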

The included English model was trained on 3816 hours of transcribed audio from Common Voice English, LibriSpeech, Fisher, and Switchboard. The training data also includes around 1700 hours of transcribed WAMU (NPR) radio shows. The model achieves a 7.5% word error rate on the LibriSpeech test-clean benchmark, and is faster than real time on a single core of a Raspberry Pi 4.

DeepSpeech v0.6 includes our best English model yet. However, most of the data used to train it is American English. For this reason, it doesn’t perform as well as it could on other English dialects and accents. A lack of publicly available voice data in other languages and dialects is part of why Common Voice was created. We want to build a future where a speaker of Welsh or Basque or Scottish English has access to speech technology with the same standard of quality as is currently available for speakers of languages with big markets like American English, German, or Mandarin.

Want to participate in Common Voice? You can donate your voice by reading small text fragments, or validate existing recordings in 40 different languages, with more to come. Common Voice is currently the world’s largest public domain transcribed voice dataset, with nearly 2,400 hours of voice data across 29 languages, including English, French, German, Spanish, and Mandarin Chinese, as well as, for example, Welsh and Kabyle.

The v0.6 release is now available on GitHub as well as on your favorite package manager. You can download our pre-trained model and start using DeepSpeech in minutes. If you’d like to know more, you can find detailed release notes in the GitHub release, and installation and usage instructions in our README. If that doesn’t cover what you’re looking for, you can also use our discussion forum.

About Reuben Morais

Reuben Morais is a Senior Research Engineer working on the Machine Learning team at Mozilla. He is currently focused on bridging the gap between machine learning research and real world applications, bringing privacy preserving speech technologies to users.



19 comments

  1. Wojtek

    Hi, this looks really awesome. Is there somewhere an online demo of the new version?

    December 5th, 2019 at 02:19

    1. Reuben Morais

      We don’t have an online demo, as the focus has been on client-side recognition. We experimented with some options to run it on the browser but the technology wasn’t there yet.

      December 5th, 2019 at 14:13

      1. Jonathan Beri

        Have you experimented with tensorflow.js or WebAssembly? Wasm has experimental support for threads and SIMD in some browsers. https://github.com/mozilla/DeepSpeech/issues/2233

        December 5th, 2019 at 16:59

        1. Reuben Morais

          We tried it a long time ago but it was still very rough, we couldn’t get anything working. I should take a look at it some time.

          December 9th, 2019 at 13:14

          1. Mendy

            Would really want to see this! Thanks for all the awesome work you do!

            December 13th, 2019 at 07:53

  2. MUN

    fantastic

    December 5th, 2019 at 14:01

  3. Olaf

    Hey,

    thanks a lot for doing this! Your git repo lists cuda as the only GPU backend. AFAIK there is also an AMD version for tensorflow and it seems to work quite well ( people claim a Radeon VII being about as fast as 2080ti c.f. https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/362). Did you have the chance to test it with DeepSpeech?

    December 8th, 2019 at 08:46

    1. Reuben Morais

We don’t explicitly target CUDA, it’s just a consequence of using TensorFlow. In addition, our native client is optimized for low latency. The use case we optimize for is the software running locally on the user’s machine and transcribing a single stream of audio (likely from a microphone) while it’s being recorded. Our model is already faster than real time on CPUs, so there’s no need to do extensive GPU optimization. We build and publish GPU packages so people can experiment and so we don’t accidentally break GPU support, but there’s no major optimization push happening there.

      December 9th, 2019 at 13:13

  4. TViT

    Hello Reuben Morais. Tell me where you can read in detail about the principles of recognition on which Deep Speech is based. Maybe there is a video where it is told in detail in steps. For example, I am developing my own project for voice recognition on a small microcontroller with 16kB RAM – ERS VCRS. And in my video everything is shown from beginning to end.

    December 9th, 2019 at 13:08

    1. Reuben Morais

      DeepSpeech is not applicable to that hardware, the model is too big for 16kB of RAM. You can read more about it here: https://arxiv.org/abs/1412.5567

      December 9th, 2019 at 13:12

  5. Joe

When you speak about client-side capabilities, it’s not yet runnable client side in JavaScript in web browsers, right?

    December 11th, 2019 at 06:14

    1. Reuben Morais

      We’re working towards Firefox integration, but nothing concrete to share yet. People have deployed it client-side interacting with a web front-end, but currently it requires an additional component running on the machine.

      December 12th, 2019 at 03:12

  6. ida

    hi,
    i’m really glad to see a graphical interface being built so also less technical users can start using deepSpeech (as opposed to google and apple products etc).
    however, even after 3 hours of googling and trying out, i couldn’t understand how to make the DeepSpeechWPF run. i found this code https://deepspeech.readthedocs.io/en/v0.6.0/DotNet-contrib-examples.html and this repo https://github.com/mozilla/DeepSpeech/tree/v0.6.0/examples/net_framework/DeepSpeechWPF but PLEASE, publish some instructions that are understandable to less technical users, as i am assuming that we are who need the graphical interface most.
    best wishes
    ida

    December 12th, 2019 at 03:07

    1. Reuben Morais

      Hello,
      The WPF example is not meant for less technical users, it’s meant for Windows developers to have an example that uses frameworks they’re familiar with. I don’t know of any graphical interfaces for DeepSpeech that target less technical users. It’d be good to have something like that, I agree.

      December 12th, 2019 at 03:09

  7. Paul

    My primary interest in DeepSpeech is to use it in an open source home automation system that doesn’t require my voice data to leave my local network / create potential security issues.

    Have you done anything with DeepSpeech to integrate it into programs like MQTT?

Since various open source solutions can easily use MQTT as a gateway into multiple other systems, I am wondering if there is any intention of trying to create a simple interface between DeepSpeech and MQTT.

    December 20th, 2019 at 19:36

    1. Reuben Morais

      I don’t know of any MQTT integration.

      December 21st, 2019 at 02:17

      1. Paul

The integration may not be all that hard if an intermediate application was written to take the output of DeepSpeech and pipe it into MQTT. Might even be able to work that one out myself.

Is there any way to have DeepSpeech listen to you without the need for converting it to an audio file first? Along the lines of Alexa with a keyword to trigger it. Audio files just add another layer of complexity on the input side that makes using DeepSpeech less useful than some of the cloud solutions.

        December 21st, 2019 at 03:12

        1. Reuben Morais

          DeepSpeech has no dependency on audio files. The API receives audio samples as input, they can come from a file or a microphone or a network stream, we don’t care.

          December 21st, 2019 at 04:01

  8. Paul

    Cool!

    December 21st, 2019 at 04:05
