At the NVIDIA-sponsored GTC 2015 (GPU Technology Conference) held last week, voice recognition demonstrations appear to have made the category a whole lot more interesting. It is safe to say that voice recognition and cognition have not changed much in tech users' daily routines. My theory has always been, and continues to be, that until users can truly rely on voice recognition, they won't use it beyond experimentation. 95% reliability doesn't cut it; as a user interface, voice needs the near-perfect reliability of a mouse or keyboard. This has been especially true when it comes to using mobile voice recognition for searches or command and control.
Certain companies like Nuance have found a strong niche creating their own voice recognition IP to supplement other companies' solutions, but the truth is that even their solution is far from perfect. It's a very hard problem to solve. Dragon NaturallySpeaking's vertical approach to transcription in the legal and medical communities has become an invaluable and reliable tool, but its accuracy is really driven by a limited data dictionary and the fact that it only has to do transcription, not cognition or actions like Siri, Cortana, or Google Now. Google, Apple, Microsoft, and even Baidu have spent vast sums of money to improve their voice recognition and to understand the context of the searches and tasks that people perform on their devices. Recent developments in hardware and software architectures have enabled entirely new levels of voice recognition accuracy, which are needed to deliver the right experience. At NVIDIA GTC 2015, we saw many advancements that give me hope for a more voice-enabled world.
Using GPUs to accelerate speech
At the NVIDIA GTC 2015 technology conference, Baidu's Andrew Ng talked about how the company is building its own platform for deep learning applied to speech. Ng helped spread the use of deep learning at companies like Google and has since brought his expertise to Baidu. There are many ways to apply machine learning and neural networks to accomplish deep learning; Baidu's approach utilizes NVIDIA GPUs in order to create more neural network connections. According to Baidu, with an HPC solution that combines x86 CPUs and NVIDIA GPUs, they can get upwards of 100 billion connections in a single neural network.
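The reason GPUs fit this workload so well is that a neural network's core computation is large matrix multiplications, an operation that parallelizes naturally across thousands of GPU cores. A minimal sketch of that core operation, in plain NumPy on the CPU (the sizes here are hypothetical toy values; a GPU library would accelerate exactly this kind of matmul):

```python
import numpy as np

# Toy fully connected layer: the forward pass of a neural network is
# dominated by matrix multiplication, the operation GPUs parallelize well.
rng = np.random.default_rng(0)

batch, n_in, n_out = 64, 1024, 1024      # hypothetical layer sizes
x = rng.standard_normal((batch, n_in))   # a batch of input vectors
W = rng.standard_normal((n_in, n_out)) * 0.01
b = np.zeros(n_out)

# One layer: matmul (the expensive, GPU-friendly part) + nonlinearity
h = np.maximum(0.0, x @ W + b)           # ReLU activation

print(h.shape)  # (64, 1024)
```

Scale the layer sizes up toward Baidu's numbers and the matmul cost explodes, which is why the CPU-plus-GPU HPC setup matters.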
Baidu's approach to improving speech involves increasing the amount of data fed into its Deep Speech platform, starting with a data set of 7,000 hours of voice data. Baidu did not stop there: they added another 100,000 hours of synthesized data, giving their neural networks an absolutely massive data set to process. Simply having a massive amount of data was not enough, though, so Baidu also applied neural networks to the entire speech pipeline rather than to just one part of it. They broke speech down into individual characters and their phonemes, but each frame of data also needed to be connected to the data that came before and after it. The result is the bi-directional recurrent neural network that powers Baidu's Deep Speech.
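A bi-directional recurrent network simply runs one recurrent pass left-to-right and a second right-to-left, so every time step sees both past and future context. Here is a hypothetical minimal sketch in NumPy using simple tanh cells; this illustrates the idea only and is not Baidu's actual Deep Speech implementation:

```python
import numpy as np

def birnn(x, Wf, Uf, Wb, Ub):
    """Toy bi-directional RNN: one pass left-to-right, one right-to-left,
    outputs concatenated so each time step sees past AND future context."""
    T = x.shape[0]
    H = Wf.shape[1]
    hf, hb = np.zeros((T, H)), np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                       # forward direction
        h = np.tanh(x[t] @ Wf + h @ Uf)
        hf[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):             # backward direction
        h = np.tanh(x[t] @ Wb + h @ Ub)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)  # shape (T, 2*H)

rng = np.random.default_rng(1)
T, D, H = 50, 13, 8                          # e.g. 13 audio features per frame
x = rng.standard_normal((T, D))
out = birnn(x,
            rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1,
            rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1)
print(out.shape)  # (50, 16)
```

The recurrence is what makes this "connected with the previous and following data": each output row depends on every frame before it (forward pass) and every frame after it (backward pass).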
This approach, according to Baidu, gives them the fewest errors possible. In fact, Andrew Ng claimed that Baidu's deep-learning-powered speech solution is better than those of Apple, Bing, Facebook, and Google. Those tests were run back in December against public APIs; in clean environments the systems were "roughly neck and neck," while in noisy environments Baidu was ahead. Baidu's error rates were as low as 6.56% in clean environments and 19.06% in noisy ones. These are bold claims that I have not researched fully with all the data, nor have I used the system they claim is 99% accurate. Needless to say, they saw huge improvements.
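Figures like 6.56% and 19.06% are conventionally reported as word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the reference length. A minimal sketch of how such a number is computed (assuming the standard edit-distance definition, not Baidu's exact scoring tooling):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via standard Levenshtein edit distance over words."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(r)][len(h)] / len(r)

# Two word errors ("on"->"off", "lights"->"light") out of four reference words
print(word_error_rate("turn on the lights", "turn off the light"))  # 0.5
```

By this measure, a 6.56% WER means roughly one word in fifteen comes out wrong, even in a quiet room.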
The race is not yet over
While 94-95% accuracy is useful for many scenarios, the reality is that speech needs to be even more accurate to become a true method of interfacing with devices. Andrew Ng said it quite eloquently: "Most people don't understand the difference between 95% and 99% accuracy. 99% is game changing." This is because at 99% accuracy the system can compensate for non-optimal conditions and the user experience does not suffer, regardless of conditions. Unlike others in this area, Ng "gets it," and this is something he and I agree on wholeheartedly. I can't tell you how many companies have tried to explain to me that 90-95% is "good enough." They don't understand UI and human psychology. Can you imagine five out of every 100 mouse clicks being rejected or invoking the wrong response? We wouldn't use mice. Why did we all hate and reject passive touch screens? Because we had to tap many times to hit what we intended.
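One way to see why the jump from 95% to 99% is "game changing" is that per-word accuracy compounds across an utterance. Under a simplifying assumption that word errors are independent, a ten-word command survives with zero errors only about 60% of the time at 95% per-word accuracy, but about 90% of the time at 99%:

```python
# Probability an n-word utterance is recognized with zero errors,
# assuming (simplistically) independent errors at per-word accuracy p.
def utterance_success(p, n_words):
    return p ** n_words

for p in (0.95, 0.99):
    print(f"{p:.0%} per-word accuracy -> "
          f"{utterance_success(p, 10):.1%} chance a 10-word command is perfect")
```

Real errors are not independent, but the compounding effect is why a few points of per-word accuracy make the difference between a toy and a usable interface.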
While Baidu is not yet a competitor in the global market, or in any market outside of Chinese-speaking countries, it appears they already have a fairly powerful and accurate speech and image recognition platform powered by deep learning. In fact, what Baidu is showing here may even be a preview of their ambitions in the US market. Even if they decide not to enter the US market in the near future, I hope their work and research pushes competitors like Google, Microsoft, and Apple to improve and get better than Baidu. (Assuming accurate benchmarks here.)
Right now, speech recognition is not where it needs to be to fulfill the promises of Microsoft, Google, and Apple that speech is our future, but surprisingly, Baidu appears on the surface to have brought us the closest we've ever been to speech being good enough to be a reliable interface. I'm very impressed by how Baidu has leveraged GPUs to do this, and we have to give NVIDIA credit as well.