Microsoft’s Artificial Intelligence and Research Unit earlier this week reported that its speech recognition technology had surpassed the performance of human transcriptionists.
The team last month published a paper describing its system’s accuracy, said to be superior to that of IBM’s famed Watson artificial intelligence.
The error rate for humans on the widely used NIST 2000 test set is 5.9 percent for the Switchboard portion of the data, and 11.3 percent for the CallHome portion, the team said.
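For readers unfamiliar with the metric, the error rates quoted here are word error rates: the number of word substitutions, deletions and insertions needed to turn a transcript into the reference, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the NIST scoring tool itself. The function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes plain whitespace tokenization; real scoring tools also
    normalize case, punctuation and spelling variants.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of five reference words.
    print(word_error_rate("i will call you tomorrow", "i will call tomorrow"))
```

On this scale, a 5.9 percent error rate corresponds to roughly six word errors per 100 reference words.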
The team improved on the conversational recognition system that outperformed IBM’s by about 0.4 percent, it reported.
That improvement is important, noted Anne Moxie, senior analyst at Nucleus Research.
While speech recognition provides an easier way for humans to interact with technology, “it won’t see adoption until it has extremely low error rates,” she told TechNewsWorld.
Google, IBM and Microsoft are among the companies working on speech recognition systems, but Microsoft is the closest to overcoming the error rate issue, Moxie said. “Therefore, its technology’s the most likely to see adoption.”
Testing the Technology
The team’s progress resulted from the careful engineering and optimization of “convolutional and recurrent neural networks.” The basic structures have long been well known, but “it is only recently that they have emerged as the best models for speech recognition,” the team’s report states.
To measure human performance, the team leveraged an existing pipeline in which Microsoft data is transcribed weekly by a large commercial vendor performing two-pass transcription — that is, a human transcribes the data from scratch, and then a second listener monitors the data to perform error correction.
The team added NIST 2000 CTS evaluation data to the worklist, giving the transcribers the same audio segments as provided to the speech recognition system: short sentences or sentence fragments from a single channel.
For the speech recognition technology, the team used three convolutional neural network (CNN) variants.
One used the VGG architecture, which, compared with earlier image recognition networks, employs smaller filters, is deeper, and applies up to five convolutional layers before pooling.
The second was modeled on the ResNet architecture, which adds a linear transform of each layer’s input to its output. The team also applied batch normalization to the activations.
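The idea of adding a linear transform of a layer's input to its output can be sketched in a few lines. This is an illustration only, using small dense layers rather than the convolutional layers in the team's models. The function names, the placement of the ReLU nonlinearity and the omission of batch normalization are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def residual_layer(x, layer_weights, skip_weights):
    """Return relu(layer_weights @ x) + skip_weights @ x.

    The second term is the shortcut: a linear transform of the layer's
    input that is added to the layer's ordinary output.
    """
    transformed = relu(matvec(layer_weights, x))
    shortcut = matvec(skip_weights, x)
    return [t + s for t, s in zip(transformed, shortcut)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(residual_layer([1.0, -2.0], identity, identity))
```

Because the input reaches the output through the shortcut as well as through the layer itself, information and gradients have a direct path through the network, which is generally what makes very deep stacks of such layers trainable.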
The third CNN variation is the LACE (layer-wise context expansion with attention) model. LACE is a time delay neural network (TDNN) variant.
The team also trained a fused model consisting of a combination of a ResNet and a VGG model at the senone posterior level. Senones, which are states within context-dependent phones, are the units for which observation probabilities are computed during automated speech recognition (ASR).
Both base models were independently trained and the score fusion weight then was optimized on development data.
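A rough sketch of what posterior-level fusion with a tuned weight might look like follows. It assumes a simple linear interpolation of the two models' senone posteriors and a grid search that picks the weight giving the highest log-likelihood on held-out development frames. The article does not specify the team's combination rule or optimization criterion, so both choices, along with every function name, are hypothetical.

```python
import math


def fuse(posteriors_a, posteriors_b, weight):
    """Interpolate two posterior vectors defined over the same senones."""
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(posteriors_a, posteriors_b)]


def dev_log_likelihood(frames_a, frames_b, labels, weight):
    """Sum of log fused posteriors of the correct senone on each frame."""
    total = 0.0
    for post_a, post_b, label in zip(frames_a, frames_b, labels):
        fused = fuse(post_a, post_b, weight)
        total += math.log(max(fused[label], 1e-12))  # floor avoids log(0)
    return total


def tune_weight(frames_a, frames_b, labels, steps=20):
    """Grid-search the fusion weight on development data."""
    candidates = [k / steps for k in range(steps + 1)]
    return max(candidates,
               key=lambda w: dev_log_likelihood(frames_a, frames_b, labels, w))


if __name__ == "__main__":
    # Toy development set: two frames, two senones, correct senone is 0.
    model_a = [[0.9, 0.1], [0.8, 0.2]]
    model_b = [[0.4, 0.6], [0.3, 0.7]]
    print(tune_weight(model_a, model_b, labels=[0, 0]))
```

Tuning the weight on development data rather than on the evaluation set keeps the reported test results from being flattered by the tuning step.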
The team also used a six-layer bidirectional LSTM, applying spatial smoothing to improve its accuracy.
“Our system’s performance can be attributed to the systematic use of LSTMs for both acoustic and language modeling as well as CNNs in the acoustic model, and extensive combination of complementary models,” the report states.
The Microsoft Cognitive Toolkit
All neural networks in the final system were trained with the Microsoft Cognitive Toolkit (CNTK) on a Linux-based multi-GPU server farm.
CNTK is an open source deep learning toolkit that allows for flexible model definition while scaling very efficiently across multiple GPUs and multiple servers, the team said.
Microsoft earlier this year released CNTK on GitHub, under an open source license.
“Voice dictation is no longer just being used for composing text,” said Alan Lepofsky, a principal analyst at Constellation Research.
“As chat-centric interfaces become more prevalent, core business processes such as ordering items, entering customer records, booking travel, or interacting with customer service records will all be voice-enabled,” he told TechNewsWorld.
To illustrate his point, Lepofsky noted that he had composed his response and emailed it to TechNewsWorld “simply by speaking to my iPad