Speech-to-text is something we take for granted a lot of the time. It’s a complicated process, though — so much so that the heavy lifting is actually done remotely, and the end result is sent back to our devices. But Google has worked out a way to shrink the process to the point that it can be performed locally, and the fruits of that labor are coming to Gboard.
Uncompressed, the models Gboard traditionally uses for speech recognition take up about two gigabytes. That’s impractically large to store on a smartphone, so when you tap the microphone icon, your recorded speech is sent to Google’s servers to be converted into text, and that text is sent back. Google was able to train a smaller, similarly effective model using recurrent neural network transducer (RNN-T) technology. That model can run on-device with the same accuracy as the server-based one, but it still takes up 450 megabytes of storage space — not quite small enough to store locally on most smartphones.
Through a process called model quantization, Google was able to further reduce the size of the model, leading to a package that only takes up about 80 megabytes. This also increases the speed of transcription. The new model works at a character level, too, so transcribed text will appear letter-by-letter, rather than whole words at a time as it does now.
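To give a rough sense of why quantization shrinks a model so dramatically, here is a minimal, illustrative sketch (not Google's actual pipeline) of post-training weight quantization: mapping 32-bit float weights onto 8-bit integers cuts storage roughly 4x, and Google reports further gains from its specific scheme.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Linearly map float32 weights onto the int8 range [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 weights."""
    return q.astype(np.float32) * scale

# A made-up weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.standard_normal((1000, 1000)).astype(np.float32)

q, scale = quantize_int8(w)
print(f"float32: {w.nbytes} bytes, int8: {q.nbytes} bytes")
print(f"compression: {w.nbytes / q.nbytes:.0f}x")  # 4x: 4 bytes -> 1 byte per weight
```

Each weight is stored as one byte instead of four, at the cost of a small, bounded rounding error per weight; real quantization schemes are more sophisticated, but the storage arithmetic is the same idea.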
You can see a comparison of server-side and on-device transcription below:
The enhanced speech-to-text functionality will initially be limited to Pixel devices and American English, although there’s currently no indication of when it’s coming. Google is “hopeful” it’ll be available “in more languages and across broader domains of application” soon after.
You can read a much more detailed explanation of the project on the Google AI Blog.