How Google Is Teaching Computers to See

Project Glass: Google+ Photography Conference, San Francisco, May 23, 2012 Courtesy Project Glass/Google

Google is attempting to teach computers to recognize human faces without telling the algorithms which images contain faces. It’s a machine-learning problem made for this era of unstructured data and easy access to large computer clusters. Solving it could help the search giant make huge strides toward the next big opportunity in tech: enabling computers to “see.”

A Google research paper prepared for the upcoming International Conference on Machine Learning explains how Google trained a system to pick out human faces from unlabeled images and, building on what it learned, to recognize objects across roughly 20,000 categories with 15.8 percent accuracy, using 1,000 machines with 16,000 cores and an image repository.

It also can recognize cats and body parts, elements chosen because they were so common in the YouTube stills that made up the image database researchers used to train the algorithm. The accuracy rate may not seem impressive to us (it labels roughly four out of 25 images correctly), but this constitutes a 70 percent improvement over previous efforts.
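
The two figures fit together with a little arithmetic. The sketch below assumes the 70 percent is a relative improvement (the new score divided by the old one, minus one); the function name is made up for illustration, and the earlier accuracy it prints is implied by the quoted numbers rather than taken from the article.

```python
def baseline_from_relative_gain(new_accuracy, relative_gain):
    """Back out the earlier accuracy implied by a relative improvement."""
    return new_accuracy / (1.0 + relative_gain)

new_accuracy = 0.158   # 15.8 percent, as quoted
relative_gain = 0.70   # 70 percent improvement, assumed to be relative

baseline = baseline_from_relative_gain(new_accuracy, relative_gain)

print(f"4 out of 25 as a rate: {4 / 25:.1%}")
print(f"implied earlier accuracy: {baseline:.1%}")
```

Read this way, the earlier systems would have been right a bit less than one time in 10.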

The net result is that Google can take thousands of images, clean them up, and then learn how to group similar images into categories such as “faces” or “cats.” This has been possible for a while by using state-of-the-art systems, but these required tagged images, as well as a long learning period. Google experimented with unlabeled images and threw a lot of computing at the learning process to reduce the time it took to train the algorithm from weeks to just three days.
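
Grouping unlabeled items by similarity can be illustrated with something far simpler than the system Google built. The sketch below uses k-means clustering, swapped in purely for illustration and not Google's method. The two-number "images," the `kmeans` function, and its parameters are all hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group unlabeled points into k clusters; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its nearest center (squared distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Step 2: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

# Hypothetical stand-ins for images: each is just two made-up feature values.
images = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
groups = kmeans(images, k=2)
print(groups)
```

No label such as “face” or “cat” is ever supplied. The first three points end up in one group and the last three in another simply because they sit close together.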

Google gets somewhat profound in its paper, noting that if machines can learn like this, it might also be how humans learn: “This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the ‘grandmother neuron’ could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.”

Scientists are still trying to understand the origins of language and how people learn to classify objects, so Google may be on to something philosophers and anthropologists can debate in alcohol-fueled conversations at university cafés (or machine-learning conferences). But from a practical perspective, throwing a lot of computing at unstructured data to give computers the ability to see could prove to be a gateway for Google to build a platform for the next big thing in tech.

In devices such as Microsoft’s Kinect, Google’s Project Glass, and other gadgets that use gesture recognition, getting computers to see the world is a complex undertaking. It’s possible to teach computers to recognize gestures that are pre-programmed into their software. Even some touch-based systems actually rely on cameras to see and interpret different gestures. Getting a computer to truly “see” is far more complicated.

People use their eyes and brains to see. Our eyes are sensors detecting gradations in light, dark, color, and so forth. That information is conveyed to the brain, where it is interpreted. The brain plays all sorts of tricks with the actual world, though. It fills in blanks, ignores the mundane, and can be tricked via optical illusions.

Computers have cameras and a variety of sensors that can act as eyes, but the brain aspect is a challenge. To train a computer to “see,” programmers have to train machines and tell them how to behave in any given scenario or gesture combination. Google has shown two things: training time can be cut by throwing a ton of computers at the problem, and image recognition needs less hand-fed specificity, because computers can learn to recognize images if they see enough of them and have enough processing power.

The Google researchers note that the company’s neural network is one of the largest reported to date. It has 1 billion trainable parameters, about 100 times (two orders of magnitude) more than other large networks reported in the literature, which have around 10 million. Still, even the Google network pales in comparison to the human visual cortex, which is a million times larger in terms of its total neurons and synapses.
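
To put those scales side by side, here is a rough back-of-the-envelope sketch. It takes the quoted figures at face value; the variable names are made up for illustration, and the cortex number is only what the million-fold comparison implies, not a measured count.

```python
import math

google_params = 1_000_000_000   # 1 billion trainable parameters, as quoted
earlier_params = 10_000_000     # 10 million, the earlier large networks
cortex_factor = 1_000_000       # "a million times larger," as quoted

ratio = google_params / earlier_params
orders_of_magnitude = math.log10(ratio)
implied_cortex_scale = google_params * cortex_factor

print(f"Google vs. earlier networks: {ratio:.0f}x "
      f"({orders_of_magnitude:.0f} orders of magnitude)")
print(f"implied visual-cortex scale: {implied_cortex_scale:.0e}")
```

By these numbers, the gap between Google's network and the visual cortex (six orders of magnitude) is three times as wide, in powers of ten, as the gap between Google's network and its predecessors.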

Teaching computers to see would be huge (especially if, along the way, computers can be taught how to learn). Imagine if your smartphone could “see” an object—a building, a piece of art, or a meal—and then classify it. The device could then access the rich trove of data it has about that object and deliver information to you or an app. Right now we have to enter that information, in many cases using a tiny keypad on a mobile device or snapping a picture and relying on a much-less-robust database of learned visual elements, as Google Goggles does. There are also privacy concerns because governments could put the networked computation of 1,000 Googles behind surveillance efforts.

At home, potential entertainment benefits include Kinect-like devices that could see and interpret a user’s actions, not just in the confines of a game but also in the free-flowing world of everyday television. For example, allowing a child to interact with Sesame Street, via a Kinect, is something people at Microsoft are trying to develop. On a smartphone or in the home, this is pretty heady stuff.
