Computers Are Finally Learning To Listen

When Victor W. Zue speaks, his computer listens. And lately, the Massachusetts Institute of Technology researcher has been more and more demanding, asking the machine to buy plane tickets and provide street directions. When Zue commands: "Show me the Chinese restaurant nearest MIT," his computer calls up a city street map and sketches a route in blue between his office and Royal East on Main Street.

That's a leap from what's now available on the market--and a preview of what may lie ahead. Many of today's commercial speech-recognition systems, such as one from Dragon Systems Inc. in Newton, Mass., serve as nothing more than elaborate voice typewriters, changing spoken words into printed ones. Others can respond to preset commands, such as "open file," but even those have no grasp of semantics.

Zue and other researchers are out to prove that speech systems have a higher calling--that they aren't limited to taking dictation but can "understand" and act on spoken words as well. Machines that do that could be a godsend to technophobes, who could simply ask for what they wanted--say, 100 shares of Chrysler Corp.--without fussing with computer keyboards, phone push buttons, and the like. On the other hand, such sophisticated systems could put a lot of clerks and customer-service representatives out of work. In one way or another, says Zue, speech-understanding systems "will touch hundreds of millions of people."

So real is the potential that American Telephone & Telegraph Co. recently augmented its own work by licensing technology that operates in French, German, and other languages from Belgium-based Lernout & Hauspie Speech Products. Such systems are essential for people who don't have touch-tone phone service and thus can't respond to prompts such as "press 1 for more options." While 80% of U.S. households have touch-tone service, the proportion is as low as 25% to 35% in countries such as Italy and Belgium. Says Jim Craig, data networking director at AT&T Network Systems: "Speech-recognition technology will be required on a lot of information services happening globally."

If the new systems do span the globe, much of the credit will go to the pairing of speech recognition with a branch of artificial intelligence (AI) that fell from favor nearly a decade ago. The field, natural language processing, was intended to enable computers to be programmed in English instead of arcane computer languages. But research faded in the mid-1980s because of disappointing results. Early attempts to adapt AI stumbled because human speech was simply too ungrammatical for the computer to follow.

The answer was to abandon the Queen's English and create a real-world grammar derived from the hems, haws, and sentence fragments of actual speech. The new systems don't understand everything they're told. But they can take the right action based on a partial understanding--and can even ask questions to clarify something. Since March, MIT researchers have been linked by voice with American Airlines' Eaasy Sabre reservation system to purchase travel tickets. One prototype at Carnegie Mellon University even works through mortgage-interest calculations with a live customer.

"WORD SPOTTING." That's quite an advance from the frankly stupid speech systems of only a few years ago. Those required speakers to pronounce words perfectly and to leave spaces between them. These first-generation systems analyzed sounds to identify each word and applied statistics to uncover the most likely two- or three-word combinations. The closest they came to speech understanding was so-called "word spotting," which is the ability to pluck a word or phrase out of a sentence and act on it. That's the concept behind Microsoft's Sound System. Similarly, researchers at Japan's Nippon Telegraph & Telephone Corp. have begun using speech recognition to give bank-account balances.

In the drive to make computers understand, not just recognize, one hot research area is systems that grab and retain fragments of speech, then use them to solicit still more pieces. Like a resourceful tourist who understands just enough English to deduce a sentence's meaning from a few words, MIT's system rephrases questions to verify its word selections and seek more details from the speaker. The idea is to use new conversation to arrive at a better understanding of what was said earlier.
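That clarify-and-confirm loop is easy to caricature in code. Below is a toy sketch built around a hypothetical restaurant-finding task like Zue's; the slot names and the scripted user replies are invented to keep the example self-contained.

```python
from collections import deque

# Toy sketch of a clarify-and-confirm dialogue: keep the fragments the
# recognizer is confident of, echo them back, and ask for what's missing.
# The slot names and scripted replies are hypothetical.

def clarify_dialogue(heard, scripted_replies):
    slots = {"cuisine": None, "location": None}
    slots.update({k: v for k, v in heard.items() if k in slots})
    replies = deque(scripted_replies)  # stand-ins for the user's next utterances
    for name in slots:
        if slots[name] is None and replies:
            known = ", ".join(f"{k}: {v}" for k, v in slots.items() if v)
            print(f"I understood {known or 'only part of that'}. Which {name}?")
            slots[name] = replies.popleft()
    print(f"Finding a {slots['cuisine']} restaurant near {slots['location']}.")

# The recognizer caught only the cuisine; dialogue fills in the rest.
clarify_dialogue({"cuisine": "Chinese"}, ["MIT"])
```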

A side benefit of such dialogue is that humans forgive the errors that inevitably occur. Instead of walking away frustrated, people typically respond to such systems as they would to a child's struggle with language. "If you can get the machine to understand even pieces, so long as you're maintaining a dialogue that's moving forward, the human will stick with it," says Lawrence R. Rabiner, a prominent speech researcher at AT&T Bell Laboratories in Murray Hill, N.J.

The next frontier is to get systems to learn on the job. Researchers at Bolt Beranek & Newman, MIT, and IBM are coaxing their systems to automatically build a reservoir of understanding from the way people ordinarily speak. Instead of having scientists write the rules for grasping context and meaning, the computer would derive its own rules from the relationships it finds in recorded speech.

These systems could adapt to the dynamics of language, says researcher Madeleine Bates at BBN in Cambridge, Mass. For instance, they could add a new definition of "bad" once they realize that the word has become slang for "cool" or "hip." "We have to have spontaneous learning systems because language can always surprise us," says Frederick Jelinek, a former manager of IBM's continuous speech recognition research.

One approach being pursued by IBM and others uses statistical probability to determine the correct meaning. It's already used in recognizing words: such systems give higher probability to common combinations such as "icy cold" than to unlikely ones such as "I is cold." Now, researchers are betting the same approach can pick out the most appropriate of several meanings for the same phrase. For instance, a travel-reservation model that contains the phrase "morning flight" might give a higher probability to the meaning "leave before noon" than to "arrive before noon."
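A bare-bones sketch of that statistical bet, with invented counts and probabilities: score each candidate transcription by how probable its word pairs (bigrams) are, then apply the same trick to competing meanings.

```python
from math import log

# Bigram scoring with invented counts: a recognizer prefers whichever
# transcription has the more probable word pairs.

BIGRAM_COUNTS = {("icy", "cold"): 50, ("i", "is"): 1,
                 ("i", "am"): 80, ("is", "cold"): 30}
TOTAL = sum(BIGRAM_COUNTS.values())

def score(words):
    """Log-probability of a word sequence under the toy model, with
    add-one smoothing so unseen pairs aren't judged impossible."""
    return sum(log((BIGRAM_COUNTS.get(pair, 0) + 1) / (TOTAL + 1))
               for pair in zip(words, words[1:]))

print(score("icy cold".split()))   # higher score: a common combination
print(score("i is cold".split()))  # lower: "i is" is rare in the counts

# The same bet applied to meanings (probabilities invented): a travel
# model reads "morning flight" as "leave before noon" more often than
# "arrive before noon", so the likelier meaning wins.
MEANINGS = {"leave before noon": 0.8, "arrive before noon": 0.2}
print(max(MEANINGS, key=MEANINGS.get))  # -> "leave before noon"
```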

One of the scientists' tricks has been to use sentence-charting techniques as a way to derive and maintain a likely context. Charting the sentence "The boy has left" and storing "the boy" as a noun phrase allows the computer to identify the subject of a subsequent sentence that begins "He" as the boy. Such techniques enable computers "to hang on to specific phrases, then patch them together to figure out meaningful phrases," says Raj Reddy, dean of Carnegie Mellon's School of Computer Science.
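A minimal illustration of that bookkeeping, using a crude pattern match as a stand-in for a real chart parser: store each noun phrase as it is found, then resolve a later pronoun to the most recently stored one.

```python
import re

# Sketch of the charting idea: record noun phrases as sentences are
# processed, then resolve a later pronoun to the most recently stored
# one. The regex "parser" is a stand-in for a real chart parser.

noun_phrases = []  # discourse memory, most recent phrase last

def chart_sentence(sentence):
    """Store "the/a <noun>" phrases as candidate referents."""
    noun_phrases.extend(re.findall(r"\b(?:the|a) \w+", sentence.lower()))

def resolve_pronoun(pronoun):
    """Naive strategy: a pronoun refers to the most recent noun phrase."""
    return noun_phrases[-1] if noun_phrases else pronoun

chart_sentence("The boy has left.")
print(resolve_pronoun("He"))  # -> "the boy"
```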

TIMING. Another effort, at SRI International, attempts to use pitch, loudness, and timing the way punctuation is used in text, to help set context. SRI researchers hope the technique can help computers grasp the intent of sentences that now confuse them. For instance, does "I don't think I know" express befuddlement or impatience?
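SRI's actual method isn't described here, but a speculative sketch shows the flavor: treat coarse prosodic features, such as whether the pitch rises and how long the speaker paused, as punctuation-like cues to intent. The features and thresholds below are invented.

```python
# Speculative sketch in the spirit of the SRI effort: coarse prosodic
# features serve as punctuation-like cues to intent. The features and
# thresholds are invented, not SRI's actual method.

def classify_intent(pitch_rises, pause_before_sec, loudness):
    """Guess the intent behind "I don't think I know" from prosody."""
    if pitch_rises and pause_before_sec > 0.5:
        return "befuddlement"  # hesitant and rising: genuinely unsure
    if loudness > 0.7 and pause_before_sec < 0.2:
        return "impatience"    # quick and emphatic: brushing it off
    return "ambiguous"

print(classify_intent(pitch_rises=True, pause_before_sec=0.8, loudness=0.4))
print(classify_intent(pitch_rises=False, pause_before_sec=0.1, loudness=0.9))
```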

Of course, getting these new systems to work in the home or office under normal conditions will require vast performance improvements--and cheaper computers. Lab versions now use $100,000 workstations and sophisticated microphones, not home PCs or telephones. Software developer Bolt Beranek's system runs on a workstation with 96 megabytes of memory--24 times that of a good office PC.

Despite their enormous power, today's experimental systems must still be limited to a single subject area, so as not to overwhelm them with choices. That's a problem if, say, a person asking about flights to L.A. suddenly wants to know the weather there. MIT researchers delight in crashing their direction-finding system, which knows streets and buildings, by asking it: "Where's my dog?"

Those problems should gradually diminish as cheaper computing power gives the systems extra brainpower and AI allows them to increase their wordpower as they work. Speech understanding could be just the ticket to ride the information highway without putting your hands on a computer.