Google Seeks Help with Recognition

In its quest to create a vast online library, the search titan has released character-recognition scanning software to the open-source crowd

Google does not shy away from gargantuan projects. The search giant is known, after all, for indexing the World Wide Web and mapping the entire planet in three dimensions. But the company's latest endeavor may be too big for even the Internet goliath to complete alone.

Google (GOOG) wants to index all the world's printed material for inclusion in a comprehensive online library. To that end, it launched Google Books, a service featuring free-to-download classics and excerpts from copyrighted works, and Google Scholar, a database of academic and scientific research (see, 8/31/06, "Google Offers Classics for Free").

On Sept. 6, it added another service to further its goal: Google News Archive. The new application allows computer users to search back issues of various publications, such as The Washington Post (WPO), The New York Times (NYT), and The Wall Street Journal (DJ). Articles from some journals date as far back as 200 years and, in some cases, must be purchased from the original publication (see, 9/06/06, "Google Digs Into the Archives").

Despite its relentless release of virtual library material, however, Google is asking the greater engineering community for help developing the technology it needs to index and archive all published works.


On Aug. 30, Google's "über tech lead" Luc Vincent announced that the company was turning to the tech community for help improving Optical Character Recognition (OCR) technology, which enables computers to decipher words in scanned texts. The first step: Google debugged an old Hewlett-Packard (HPQ) OCR engine, named Tesseract, that HP had released to university researchers in Nevada. Before that, the application had sat idle at HP since 1995, when the company decided to leave the OCR business and concentrate on its line of home office products, computers, printers, and cameras.

Google then released the cleaned version to the open-source community. Bdale Garbee, chief technologist at Hewlett-Packard, says the company is pleased others will build off its efforts. "We're happy to see good code being put to good use," he says, "and we look forward to seeing where the community takes this technology in the future."

Google is hoping they take the technology far beyond its current capabilities, says Chris DiBona, Google's open-source program manager. OCR technology is central to Google's cause because it enables search engines to "read" documents. Without OCR, the computer sees a scanned page of print only as an image and cannot find keywords or phrases in the text. In the search world, OCR means the difference between being able to find a book only if you know the complete title and being able to find it if all you know is a few key quotes.
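To make the difference concrete, here is a minimal Python sketch of phrase search over pages whose text has already been recognized. It is an illustration only, not Google's method: the sample pages, the function names, and the simple word index are all assumptions, and the OCR step itself is taken as given (the text is supplied as plain strings).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(pages):
    """Map each word to the set of page ids whose recognized text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def find_phrase(phrase, pages, index):
    """Return the ids of pages that contain the exact phrase, word for word."""
    words = tokenize(phrase)
    if not words:
        return []
    # First narrow down to pages that contain every word of the phrase.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    hits = []
    for page_id in sorted(candidates):
        tokens = tokenize(pages[page_id])
        n = len(words)
        # Then confirm the words appear consecutively and in order.
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(page_id)
    return hits

# Hypothetical OCR output: (title, page number) -> recognized text.
pages = {
    ("Moby-Dick", 1): "Call me Ishmael. Some years ago, never mind how long precisely",
    ("Pride and Prejudice", 1): "It is a truth universally acknowledged",
}
index = build_index(pages)
print(find_phrase("call me ishmael", pages, index))
```

A reader who remembers only the quote can now be pointed to the right book and page. Without the recognized text, there would be nothing to put in the index.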

Because it was essentially abandoned, the program's capabilities badly lag the standards of current commercial OCR engines. Tesseract has trouble reading grayscale images and text on a colored background, for example. Google, however, sees promise that the technology community, by tinkering with Tesseract's formerly proprietary code, will be able to come up with solutions to problems that plague even the paid technology.


DiBona says the OCR engines out there are 99.5% accurate at reading Latin characters, but still have some trouble with other languages, handwriting, highly stylized fonts, and unique layouts. In the past, Google has had problems with blurry or off-center scans, which can confuse OCR engines. A poorly scanned book with blurry characters, for example, could prevent the engine from deciphering the letters and words on a page, and that page would then not be properly indexed for search (see, 12/22/05, "Google's Great Works in Progress").
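A quick back-of-the-envelope calculation shows what a 99.5% figure can mean in practice. The Python sketch below assumes the figure is a per-character rate, that errors fall independently of one another, and that a page holds about 2,000 characters. None of those assumptions comes from DiBona, and the function names are illustrative.

```python
def expected_errors(chars_on_page, char_accuracy):
    """Expected number of misread characters on a page."""
    return chars_on_page * (1.0 - char_accuracy)

def clean_word_probability(word_length, char_accuracy):
    """Chance that every character of a word is read correctly,
    assuming errors are independent from character to character."""
    return char_accuracy ** word_length

ACCURACY = 0.995    # figure quoted in the article, read here as per-character
PAGE_CHARS = 2000   # assumed page size, for illustration only

errors = expected_errors(PAGE_CHARS, ACCURACY)
clean = clean_word_probability(5, ACCURACY)
print(f"Expected misread characters per page: {errors:.1f}")
print(f"Chance a five-letter word is read cleanly: {clean:.1%}")
```

Under those assumptions, a page would carry about 10 misread characters, and a five-letter word would come through intact roughly 97.5% of the time. That is enough to make the occasional keyword unsearchable even on a clean scan.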

"If you look at OCR over the past 10 years, not much has happened. There are some programs out there that are pretty good, but we wanted to see if by putting OCR out there we could improve it," DiBona says, adding it would be "really good if OCR gets better for everybody."

As more offices began moving from paper to digital, they needed OCR technology to help computers recognize the text in their scanned documents and allow them to edit the new digital versions. Over the past three years, search engines and other online companies expanded the use of OCR by applying the technology to search, says Robert Weideman, senior marketing vice-president for Nuance Communications (NUAN), maker of one of the market-leading OCR engines, OmniPage. Both Google and Amazon (AMZN), for example, use OCR technology to match search phrases with specific passages in books.


But who uses OCR outside of online search and commerce? Increasingly, everybody who has ever scanned a document or read one. "When you think about who touches on our OCR technology, it is literally millions of people worldwide…any industry that deals with paper uses OCR," Weideman says. Nuance's OCR and digital imaging business, which includes PDF conversion software, grew 8% and brought in more than $70 million in revenue last year, Weideman says.

Those revenues may seem surprising for a technology that, at first, didn't seem to have many practical applications. When inventor Raymond Kurzweil created the first OCR system in 1974, he struggled to find a use for it (see, 5/02/01, "How Ray Kurzweil Keeps Changing the World"). The mass-market answer eventually came in the form of a scanner.

Nuance has provided OCR technology to Google, though a confidentiality agreement keeps it from saying whether its OCR systems power Google's current book search. The company has also supplied Microsoft (MSFT) with OCR technology for its upcoming XPS program—a PDF competitor to Adobe's (ADBE) Acrobat—which will be included in the new Vista operating system.


Now OCR is built into many scanners and comes standard on some computers. It has even been incorporated into cell phones, via software that lets people take pictures of text, such as business cards, and then index the pertinent words in their address books.

Whether Google's open-source release will lead to better OCR technology and more future users is uncertain. Analysts say Nuance has built such a big lead in the industry that competitors remain also-rans, unlikely to contribute major advances any time soon. "For Nuance, they are leaps and bounds ahead of any of their competitors in the space," says Daniel Ives, an equity analyst at Friedman, Billings, Ramsey & Co., a Web-based investment bank based in Arlington, Va. "Nuance dominates the industry," adds Jeff Van Rhee, an equity research analyst at Craig-Hallum Capital in Minneapolis. "I don't see [Tesseract] being a big impact."

Over time, however, Ives sees the demand for OCR and related technology increasing as more people seek to switch effortlessly between the paper and digital worlds. "There's no doubt that we believe it is going to be a very fertile market area," he says.

So fertile, in fact, that Google is turning to OCR for help moving the world's print into a digital archive of everything.
