Can Speech-Recognition Software Work in Mandarin?

Photograph by Feng Li/Getty Images

In anticipation of Apple’s introduction of Siri in Chinese this year, I decided to try Dragon, a line of smartphone voice apps by Nuance Communications—the company behind the speech-recognition technology that powers Siri. It comes in a number of languages, including Mandarin Chinese.

On a basic level, Dragon in Chinese can be pretty amazing, although like most speech-recognition software, it is not 100 percent accurate. First, though, a note about why Chinese is particularly challenging for speech recognition. Mandarin has only about 400 distinct syllables, and tone is what tells otherwise identical syllables apart. What does that mean?

The words for mother (妈 mā), scold (骂 mà), and horse (马 mǎ), for example, all sound like “ma” but with different intonation. Developing software that can understand the sentence “Mother scolds the horse” (妈妈骂马 māmā mà mǎ) is no easy task.
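For readers who think in code, the point can be made with a toy sketch. It assumes a three-entry lexicon built only from the examples above; the names `LEXICON`, `strip_tone`, and `candidates` are invented for illustration and have nothing to do with how Dragon actually works. Once the tone marks are stripped, all three words collapse into the same bare syllable.

```python
import unicodedata

# Toy lexicon from the article's examples: pinyin with tone mark -> (character, gloss).
# Hypothetical and tiny; a real system's lexicon is far larger.
LEXICON = {
    "mā": ("妈", "mother"),
    "mà": ("骂", "scold"),
    "mǎ": ("马", "horse"),
}

def strip_tone(pinyin):
    """Remove tone diacritics, leaving the bare syllable (e.g. "mǎ" -> "ma")."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def candidates(bare_syllable):
    """Every character in the toy lexicon whose toneless form matches."""
    return [char for pinyin, (char, _gloss) in LEXICON.items()
            if strip_tone(pinyin) == bare_syllable]

print(candidates("ma"))  # ['妈', '骂', '马']
```

A listener, human or machine, who misses the tone is left choosing among all three.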

While Mandarin is the national language of China, the country has seven language groups, many dialects, and countless accents. Says Jim Wu, Nuance’s vice president for Dragon research: “Within mainland China, everyone has a different accent, and one of the challenges is making sure the system works for people who speak Mandarin with a slight accent.”

Nuance, a Burlington (Mass.) company with $1.4 billion in revenue in fiscal 2011, launched two free Mandarin Chinese apps in March 2011, shortly after the English versions made their debuts. (The company added Cantonese and Taiwanese versions in June.) Dragon Dictation transcribes speech for texts, e-mail, Facebook, and Twitter. Dragon Search is for Internet search. Unlike Siri, Dragon does not talk back to users.

To use the apps, users press a virtual “button” to start and stop recording. All the processing is handled on servers, so the dictation is streamed to a server, which listens, records, and sends text back to the device, says Peter Mahoney, Nuance’s chief marketing officer.

I dictated basic sentences, such as “Where is the nearest Starbucks?” and “I am going to the market,” without a problem. But for many users, the real fun is testing the device’s cognitive limits, seeing just how far the technology can match human speech—and thought.

So I challenged it with a Chinese tongue twister: “Mother rides a horse. The horse is slow. Mother scolds the horse” (妈妈骑马,马慢,妈妈骂马). It sounds like this: Māmā qí mǎ, mǎ màn, māmā mà mǎ.

Dragon mostly captured the right sounds but didn’t always pick the right words. The result: “Mother at least (Māmā qǐ mǎ 妈妈起码). Mother mother ? ? (Māmā māmā ma? ma? 妈妈妈妈吗?吗?).” (The sound indicating a question in Chinese is also “ma.”)
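A rough sense of why the right sounds can still yield the wrong words: the sketch below assumes, purely for illustration, a recognizer that hears only the bare syllable “ma” and must pick among the four “ma” characters mentioned in this article. The names `MA_CHARACTERS` and `transcriptions` are hypothetical, and nothing here reflects Nuance’s actual method.

```python
from itertools import product

# The four "ma" characters that appear in the article, with their glosses.
MA_CHARACTERS = {
    "妈": "mother",
    "骂": "scold",
    "马": "horse",
    "吗": "question particle",
}

def transcriptions(syllable_count):
    """Every character string that could spell that many bare "ma" syllables."""
    return ["".join(chars)
            for chars in product(MA_CHARACTERS, repeat=syllable_count)]

options = transcriptions(4)      # four syllables, as in 妈妈骂马
print(len(options))              # 4 ** 4 = 256
print("妈妈骂马" in options)      # True: the intended phrase is one of 256
```

Tone and context are what narrow those candidates to one, and a slip in either sends the software to a different string.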

Although not yet fully accurate, Dragon is built to learn and improve, according to Mahoney. Since all the processing happens on servers, Dragon collects user speech data to learn how the language works and how words fit together. It also gets used to each user’s regional accent, so the more it’s used, the more accurate it becomes. Mahoney adds: “These recordings are kept, and we can analyze results using automated tools. Sometimes you have people using recordings to see how they can make it better.”
