Microsoft's AI as good as humans at voice recognition

Posted Aug 23, 2017 by James Walker
Microsoft has announced a new breakthrough in the development of voice recognition technology. The company's AI is now as accurate at transcribing conversations as teams of professional human transcribers. There are still challenges ahead for real-world use, though.
Cortana on an Android phone
Last year, Microsoft's AI attained parity with human transcribers in the industry-standard Switchboard test. Switchboard is a collection of real telephone conversations covering a wide range of subjects. To complete the task, the AI has to accurately transcribe the conversations without having previously heard them.
In Microsoft's original test, it measured human transcribers to have a word error rate of 5.9 percent. The company's AI then successfully transcribed the conversations with the same error rate, suggesting it was as proficient at the task as the trained humans.
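The 5.9 percent figure is a word error rate (WER), the standard metric in speech recognition: the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the reference word count. A minimal sketch of how it is computed (this is the generic metric, not Microsoft's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six -> WER of 1/6, about 16.7 percent.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A 5.1 percent WER means roughly one word in twenty is wrong, which matches how often trained human teams disagree with the reference transcript.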
Since then, other researchers have replicated Microsoft's work. They found that humans working as a team have an error rate of only 5.1 percent. In a blog post earlier this week, Microsoft announced its AI has reached parity with this figure too.
The company said it attained the new milestone by making improvements to the AI's acoustic and language models. It tweaked the way in which the AI handles acoustic modelling and word prediction. The language element was also overhauled to offer more context, allowing the AI to use the entire conversation history to predict the words likely to come next.
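The idea of using prior conversation to predict the next word can be illustrated with a toy bigram model. This is not Microsoft's actual architecture (which uses far richer neural models); it is only a hypothetical sketch of why accumulated context improves word prediction:

```python
from collections import Counter, defaultdict

def build_bigram_counts(history: str):
    """Count which word follows which across the conversation so far."""
    counts = defaultdict(Counter)
    words = history.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev_word: str):
    """Return the word most often seen after prev_word, if any."""
    candidates = counts.get(prev_word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# The longer the history, the more evidence the model has to rank candidates.
history = "we should book a flight and then book a hotel for the trip"
counts = build_bigram_counts(history)
print(predict_next(counts, "book"))  # "a" has followed "book" twice so far
```

Real systems condition on much longer spans than a single preceding word, but the principle is the same: earlier speech narrows down what is acoustically plausible next.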
The development is significant for voice recognition tech. It demonstrates machines can recognise voices as accurately as humans, something which will be more important as digital assistants and voice-controlled interfaces develop.
Microsoft acknowledged there are further tasks ahead though. The test was completed in ideal conditions which don't represent the real world. Voice recognition tech in actual operation has to deal with noisy background environments and several styles and accents of speech. Accuracy can suffer dramatically as a result. Microsoft is now turning its attention to improving the AI's word error rate under these conditions.
"While achieving a 5.1 percent word error rate on the Switchboard speech recognition task is a significant achievement, the speech research community still has many challenges to address, such as achieving human levels of recognition in noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available," said Microsoft.
Real-time voice translation in PowerPoint
The research is already having an impact on Microsoft's products. The company's voice recognition technology has been integrated into its cloud-based Cognitive Services toolkit. It also powers the Cortana digital assistant and has been built into PowerPoint to translate presentations in real time for multilingual audiences.
According to Microsoft, human levels of speech recognition could unlock new ways of interacting with computers and completing work. The next stage is to train AI to interpret the meaning in different conversations, allowing machines to understand intentions and expressions. Microsoft said its current studies are just the gateway to this kind of system.
"We have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent," said Microsoft. "Moving from recognizing to understanding speech is the next major frontier for speech technology."