Chatting with a digital assistant is about as much fun as trying to reason with a stubborn child. If you've ever found yourself yelling at your Xbox or swearing at Siri, you may have already lost hope.
But researchers say recent breakthroughs in speech recognition and artificial intelligence will soon make gadgets dramatically better at understanding people. This new breed of highly competent machines, able not only to hear us but to understand context and nuance, is just a year or two away, says Johan Schalkwyk, a distinguished engineer at Google.
Schalkwyk (pronounced "skulk-vick," though Siri calls him "shaw-quick") is working on an ambitious research project at Google to create speech systems that plug into the company's vast amounts of data. A project currently being tested in the lab allows computers to hear and essentially "think" about what people say into Google's digital ear, Schalkwyk says.
Recent inventions in the field of speech and machine learning should lead to major changes in how we murmur, shout, question and interrogate our devices. One of the brains behind Siri says engineers are feverishly working toward speech recognition that's smart enough to engage in authentic conversations with users. "All areas of spoken language understanding have made a lot of progress," says William Mark, a vice president at SRI International, which developed the fundamental technology behind Siri before it was acquired by Apple. "This kind of conversational interaction is where the leading edge is right now."
Tim Tuttle has been waiting a long time for this. He earned a Ph.D. from the Massachusetts Institute of Technology in 1997 and worked at its A.I. Lab. He spent the last decade making the rounds at various companies in Silicon Valley before founding his startup, Expect Labs, in 2010. Tuttle's company began working last year on a system to add complex voice commands to mobile apps, which might allow users to walk into a store and ask their phone which aisle the brooms are in.
"A year ago, we were doing a benchmarking, and our conclusion was it was not yet possible to do that. That's all changed, and our company has doubled down entirely around voice, primarily because of these improvements we've seen," Tuttle says. "You're going to see speech recognition systems that have human or better-than-human accuracy become commercialized."
But first, a quick history lesson: Two and a half years ago, researchers from Google and the University of Toronto published an influential paper about using "deep neural networks" to model speech in computers, and followed this up several months later with another paper resulting from a collaboration with Microsoft and IBM. This led to what Google engineer Jeff Dean describes as the "biggest single improvement in 20 years of speech research."
The findings resurrected a decades-old invention: digital neural networks. The technology tested well in the 1980s at predicting and analyzing large fields of data, but performance was hindered by the wimpy speed of computers at the time. Neural networks only recently became a viable option, following a massive speed-up in computer processing and the development of new software approaches.
Google's lab project builds on this research. Six months ago, the team moved away from an older method, feed-forward neural networks, in favor of recurrent neural networks. The switch allows the system to store more information and process longer, more complex sequences. Google's breakthrough results from a simplification of the underlying code that lets its software hold more ideas and concepts within the same system, making it easier to ask complicated questions and get sensible answers. "System complexity can hurt your long-term growth," Schalkwyk says.
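The difference between the two approaches can be sketched in a few lines of toy Python. This is an illustration of the general idea, not Google's code: a feed-forward network treats each input frame independently, while a recurrent network carries a hidden state forward, so earlier sounds in a sentence can influence how later ones are interpreted. All weights and inputs here are invented for the demo.

```python
# Toy contrast between feed-forward and recurrent processing of a sequence.
# Real speech systems use learned weight matrices; these scalars are stand-ins.

def feed_forward_step(x, weight=0.5):
    # Output depends only on the current input frame.
    return weight * x

def recurrent_step(x, h, w_in=0.5, w_rec=0.9):
    # Output depends on the current input AND the carried-over state h,
    # which is what lets the network remember earlier parts of the sequence.
    return w_in * x + w_rec * h

sequence = [1.0, 0.0, 0.0, 0.0]  # a single "event" at the start, then silence

ff_outputs = [feed_forward_step(x) for x in sequence]

rnn_outputs = []
h = 0.0
for x in sequence:
    h = recurrent_step(x, h)
    rnn_outputs.append(h)

print(ff_outputs)   # the feed-forward net forgets the first frame immediately
print(rnn_outputs)  # the recurrent net's memory of it decays but persists
```

The feed-forward outputs go to zero the moment the input does, while the recurrent state keeps echoing the first frame, which is the property that makes recurrent networks a better fit for long, context-dependent utterances.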
Google's system currently uses context, physical location and certain other things it knows about the speaker to make assumptions about where a conversation is going and what it all means—just like humans do. Google's new network technology should do this so efficiently that it can process larger amounts of data than ever before, allowing it to answer more complex requests.
To explain how the future of voice recognition should work, Schalkwyk likes to reference a high-end Vietnamese eatery located a few miles from Google's headquarters in Mountain View, California. Xanh Restaurant poses a challenge to typical speech recognition systems because its name—pronounced "zahn"—is "very difficult to recognize," says Schalkwyk. "If I can map it and say, 'It's a restaurant, and this restaurant is in California,' then the list of restaurants suddenly gets a lot smaller," he says. "Using that semantic knowledge, we can significantly improve quality."
It sounds trivial, but for a computer, hearing a word, recognizing the context from the sentence and then layering that information over geography is extremely difficult and takes time. Today, Google voice search can recognize the restaurant correctly—perhaps, one wonders, because its creators are regular patrons. In the future, Google will be able to handle many other equally ambiguous questions, Schalkwyk says.
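The reranking idea Schalkwyk describes can be sketched in miniature. Everything below is hypothetical—the candidate words, scores and boost factor are invented—but it shows the shape of the trick: the acoustic model's raw guesses for an ambiguous sound are rescored against what is plausible nearby, so a sound-alike name like "Xanh" can beat more common words.

```python
# Hypothetical sketch of semantic reranking (not Google's actual pipeline).

# Raw acoustic-model guesses for a word that sounds like "zahn", with scores.
acoustic_hypotheses = {"John": 0.6, "Zen": 0.15, "Xanh": 0.25}

# Context: the user asked about a restaurant near Mountain View, California.
# Assume a places index tells us which candidates name nearby restaurants.
nearby_restaurants = {"Xanh", "Zen"}

def rerank(hypotheses, nearby, boost=3.0):
    # Boost the acoustic score of any candidate that names a nearby
    # restaurant; the list of plausible answers "suddenly gets a lot smaller."
    return max(hypotheses,
               key=lambda w: hypotheses[w] * (boost if w in nearby else 1.0))

print(rerank(acoustic_hypotheses, nearby_restaurants))  # prints "Xanh"
```

Without the geographic boost, "John" would win on acoustics alone; layering in what the system knows about the speaker's location flips the answer.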
Inside Google, there's been an "unprecedented" number of technological evolutions in speech recognition, says Schalkwyk. While Google's big step forward is still another year or two away from showing up on your phone, the project is already yielding techniques that are making their way into other appendages of Google's mammoth brain. "You build something to go to the moon, and in the meantime, you develop a hundred other technologies that are useful," Schalkwyk says.
Three years ago, Google's voice recognition could recognize just three out of four words coming out of your mouth, Schalkwyk says. Thanks to an accelerated pace of innovation, the Google apps on your phone right now can correctly guess 12 out of every 13 words. Pretty soon, according to Tuttle, "We're going to live in a world where devices don’t have keyboards."
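Restated as error rates, the figures above amount to a steep drop: "three out of four" words right is a 25 percent word error rate, while "12 out of every 13" is roughly 7.7 percent. A couple of lines make the arithmetic explicit:

```python
# Word error rate implied by the article's "correct words" figures.

def word_error_rate(correct, total):
    return (total - correct) / total

then = word_error_rate(3, 4)    # three of four words right
now = word_error_rate(12, 13)   # twelve of thirteen words right

print(f"then: {then:.1%}, now: {now:.1%}")  # then: 25.0%, now: 7.7%
```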