Google today open-sourced the speech engine that powers Live Transcribe, its Android speech recognition and transcription tool. The company hopes doing so will let any developer deliver captions for long-form conversations. The source code is available now on GitHub.
Google released Live Transcribe in February. The tool uses machine learning algorithms to turn audio into real-time captions. Unlike Android’s upcoming Live Caption feature, Live Transcribe is a full-screen experience, uses your smartphone’s microphone (or an external microphone), and relies on the Google Cloud Speech API. Live Transcribe can caption real-time spoken words in over 70 languages and dialects. You can also type back into it — Live Transcribe is really a communication tool. The other main difference: Live Transcribe is available on 1.8 billion Android devices. (When Live Caption arrives later this year, it will only work on select Android Q devices.)
Working around the cloud
Google’s Cloud Speech API does not currently support infinitely long audio streams. Relying on the cloud also introduces potential problems with network connectivity, data costs, and latency.
As a result, the speech engine closes and restarts streaming requests prior to hitting the timeout, including restarting the session during long periods of silence and closing whenever there is a detected pause in the speech. In between sessions, the speech engine also buffers audio locally and sends it upon reconnection. Google thus avoids truncated sentences or words, and reduces the amount of text lost mid-conversation.
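The session handling described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the `RotatingStreamer` class, its parameters, and the use of a plain list as a stand-in for a streaming request are all hypothetical. It closes a session on a detected pause or shortly before an assumed time limit, and holds audio in a local buffer while no session is open.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class RotatingStreamer:
    """Hypothetical sketch: keep each streaming session short, lose no audio."""

    def __init__(self, open_session: Callable[[], List[bytes]],
                 max_session_s: float, margin_s: float = 5.0) -> None:
        # open_session is an assumed factory; a "session" here is just a list
        # that collects the chunks that would be sent to the server.
        self._open_session = open_session
        self._limit_s = max_session_s - margin_s  # restart before the timeout
        self._session: Optional[List[bytes]] = None
        self._session_age_s = 0.0
        self._pending: Deque[bytes] = deque()  # audio buffered locally
        self.sessions_opened = 0

    def _close(self) -> None:
        self._session = None
        self._session_age_s = 0.0

    def feed(self, chunk: bytes, chunk_s: float,
             pause_detected: bool = False, connected: bool = True) -> None:
        self._pending.append(chunk)
        if not connected:
            # No network: drop the session but keep buffering audio.
            self._close()
            return
        if self._session is None:
            self._session = self._open_session()
            self.sessions_opened += 1
        # Flush everything buffered, oldest first, so ordering is preserved.
        while self._pending:
            self._session.append(self._pending.popleft())
        self._session_age_s += chunk_s
        # Close on a pause in speech, or when the session nears its limit.
        if pause_detected or self._session_age_s >= self._limit_s:
            self._close()
```

Closing at a pause rather than at an arbitrary moment is what keeps a sentence from being split across two requests; the time limit acts only as a fallback.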
To reduce bandwidth requirements and costs, Google also evaluated different audio codecs: FLAC, AMR-WB, and Opus. FLAC (a lossless codec) preserves accuracy but doesn’t save much data and has noticeable codec latency. AMR-WB saves a lot of data but delivers worse accuracy in noisy environments. Opus, meanwhile, allows data rates many times lower than most music streaming services while still preserving the important details of the audio signal. Google also uses speech detection to close the network connection during extended periods of silence. Overall, the team was able to achieve “a 10x reduction in data usage without compromising accuracy.”
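To make the silence handling concrete, here is a minimal sketch of how a speech detector might decide when to keep the connection open. It assumes a simple energy threshold with a hangover period; the article does not say which detector Live Transcribe uses, and the function names, threshold, and frame format below are hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def gate_frames(frames: Iterable[Sequence[float]],
                threshold: float = 0.02,
                hangover_frames: int = 3) -> List[bool]:
    """For each frame, return True if the connection should stay open.

    Assumed policy: stay open while speech is present and for fewer than
    `hangover_frames` consecutive quiet frames afterwards, so that short
    gaps between words do not close the connection.
    """
    decisions: List[bool] = []
    silent_run = hangover_frames  # start closed until speech is heard
    for frame in frames:
        if rms(frame) >= threshold:
            silent_run = 0
        else:
            silent_run += 1
        decisions.append(silent_run < hangover_frames)
    return decisions
```

The hangover is the design choice that matters here: without it, every brief gap between words would trigger a reconnect.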
To reduce latency even further than the Cloud Speech API already does, Live Transcribe uses a custom Opus encoder. The encoder increases bitrate just enough so that “latency is visually indistinguishable to sending uncompressed audio.”
Live Transcribe speech engine features
Google lists the following features for the speech engine (the engine supports speaker identification, but a speaker identifier itself is not included):
- Infinite streaming.
- Support for 70+ languages.
- Robust to brief network loss (for example, when traveling and switching between a mobile network and Wi-Fi). Text is not lost, only delayed.
- Robust to extended network loss. Will reconnect again even if network has been out for hours. Of course, no speech recognition can be delivered without a connection.
- Robust to server errors.
- Opus, AMR-WB, and FLAC encoding can be easily enabled and configured.
- Contains a text formatting library for visualizing ASR confidence, speaker ID, and more.
- Extensible to offline models.
- Built-in support for speech detectors, which can be used to stop ASR during extended silences to save money and data.
- Built-in support for speaker identification, which can be used to label or color text according to speaker number.
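As a rough illustration of what formatting by ASR confidence and speaker number could look like, the sketch below labels each turn by speaker and marks low-confidence words. It does not use the library's actual formatting API; the `Word` type, the `format_transcript` function, and the 0.5 confidence cutoff are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical recognizer output for a single word."""
    text: str
    confidence: float  # 0.0 (unsure) to 1.0 (certain)
    speaker: int       # speaker number assigned by a speaker identifier


def format_transcript(words: List[Word], low_confidence: float = 0.5) -> str:
    """Start a new labeled line per speaker turn; flag uncertain words."""
    lines: List[str] = []
    current_speaker: Optional[int] = None
    for word in words:
        # Bracket words the recognizer is unsure about.
        uncertain = word.confidence < low_confidence
        token = f"[{word.text}?]" if uncertain else word.text
        if word.speaker != current_speaker:
            lines.append(f"Speaker {word.speaker}: {token}")
            current_speaker = word.speaker
        else:
            lines[-1] += " " + token
    return "\n".join(lines)
```

In a real UI the brackets and labels would more likely be rendered as colors or text styles, but the grouping logic would be similar.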
The documentation states that the libraries are “nearly identical” to those running in the production Live Transcribe application. Google has “extensively field tested and unit tested” them, but the tests themselves were not open-sourced. Google does, however, offer APKs so that you can try out the library without building any code.