Cut-rate painkillers! Unclaimed riches in Nigeria!! Most of us quickly identify such e-mail messages as spam. But how would you teach that skill to a machine? David Heckerman needed to know. Early this decade, Heckerman was leading a spam-blocking team at Microsoft Research. To build their tool, team members meticulously mapped out thousands of signals that a message might be junk. An e-mail featuring "Viagra," for example, was a good bet to be spam--but things got complicated in a hurry.
If spammers saw that "Viagra" messages were getting zapped, they switched to V1agra, or Vi agra. It was almost as if spam, like a living thing, were mutating.
This parallel between spam and biology resonated for Heckerman, a physician as well as a PhD in computer science. It didn't take him long to realize that his spam-blocking tool could extend far beyond junk e-mail, into the realm of life science. In 2003, he surprised colleagues in Redmond, Wash., by refocusing the spam-blocking technology on one of the world's deadliest, fastest-mutating conundrums: HIV, the virus that leads to AIDS.
Heckerman was plunging into medicine--and carrying Microsoft (MSFT) with him. When he brought his plan to Bill Gates, the company chairman "got really excited," Heckerman says. Well versed on HIV from his philanthropy work, Gates lined up Heckerman with AIDS researchers at Massachusetts General Hospital, the University of Washington, and elsewhere.
Since then, the 50-year-old Heckerman and two colleagues have created their own biology niche at Microsoft, where they build HIV-detecting software. These are research tools to spot infected cells and correlate the viral mutations with the individual's genetic profile. Heckerman's team runs mountains of data through enormous clusters of 320 computers, operating in parallel. Thanks to smarter algorithms and more powerful machines, they're sifting through the data 480 times faster than a year ago. In June, the team released its first batch of tools for free on the Internet.
A new industry for the behemoth to conquer? Not exactly. Heckerman's nook in Redmond represents just one small node in a global AIDS research effort marked largely by cooperation. "The Microsoft group has a different perspective and a good statistical background," says Bette Korber, an HIV researcher at Los Alamos National Laboratory. The key quarry they all face is the virus itself, which is proving wilier than any of Microsoft's corporate foes. While Heckerman has high hopes that his tools will lead to vaccines that can be tested on humans within three years, his research sits outside of Microsoft's business plan. "It has nothing to do with Microsoft," he says, "except that we can help." From the company's perspective, the sums invested in HIV research amount to a rounding error--only a couple million dollars per year in a research and development budget of $7 billion. The potential payoff would be to contribute to the holy grail of AIDS research, successful vaccines. In the optimal scenario, drug companies would distill such research into targeted varieties of vaccines, which would help defend millions around the world from the scourge. The business payoff? Well, if helping to conquer a plague doesn't justify the effort--and burnish Microsoft's image--it might just be that a virus-sniffing tool could drive spam into submission.
If it seems strange that spam-blockers would end up studying nucleic acids, it shouldn't. Research is growing increasingly quantitative. Nearly everything these days, from atoms and cells on up, is described in data. When the work involves finding statistical relationships in mountains of bits, two things happen: First, mathematicians and computer scientists gain sway, which means an expanding role in research for powerhouses such as Microsoft and IBM (IBM). Second, as researchers find common patterns, they start jumping from one discipline to the next.
The battle against HIV draws loads of such jumpers. Several scientists at Los Alamos, for example, were teaching machines to recognize patterns in satellite imagery. This led them to HIV, where they're building tools along the lines of Microsoft's. And many of the 800 researchers at Microsoft cross disciplines every which way. One of them, Michael Cohen, started out building software to stitch photos into a panorama. Now he's piecing thousands of brain scans into 3D models for scientists.
For Heckerman, the connections between spam and HIV boil down to mathematics. He analyzes both scourges by studying statistical relationships among their ever-changing features. Consider the word "Viagra." Sometimes it shows up in legitimate e-mails. Often it appears in spam. If researchers study thousands of e-mails, they can calculate the percentage of e-mails with that word that are spam. That's one clue. But the spam-filtering machine needs to know more than that. What other features in an e-mail signal that it's spam? Are certain fonts particularly spammy? What about e-mail addresses or types of punctuation? The trick is to figure out which combinations of these features identify an e-mail as spam. Each decision can involve thousands of variables and millions of different calculations.
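The arithmetic described above can be sketched in a few lines of Python. This is a toy illustration only, not Microsoft's filter: the six sample messages, the feature names (such as "all_caps_subject"), and the choice of a simple naive Bayes rule for combining clues are all assumptions made for the example. The sketch also looks only at features that are present in a message, which a real filter would not limit itself to.

```python
import math
from collections import Counter

# Made-up training data: (features seen in a message, is it spam?).
TRAINING = [
    ({"viagra", "all_caps_subject"}, True),
    ({"viagra", "many_exclamations"}, True),
    ({"viagra"}, False),
    ({"many_exclamations", "all_caps_subject"}, True),
    ({"meeting"}, False),
    ({"meeting", "many_exclamations"}, False),
]

def spam_fraction(feature, data):
    """The single clue: share of messages containing `feature` that are spam."""
    labels = [is_spam for feats, is_spam in data if feature in feats]
    return sum(labels) / len(labels)

def train(data):
    """Count how often each feature appears in spam and in legitimate mail."""
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for feats, is_spam in data:
        if is_spam:
            n_spam += 1
            spam_counts.update(feats)
        else:
            n_ham += 1
            ham_counts.update(feats)
    return spam_counts, ham_counts, n_spam, n_ham

def spam_probability(feats, model):
    """Combine several clues with naive Bayes (an assumed, simplified rule)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    log_odds = math.log(n_spam / n_ham)  # prior odds of spam
    for f in feats:
        # Add-one smoothing keeps unseen features from zeroing out the odds.
        p_given_spam = (spam_counts[f] + 1) / (n_spam + 2)
        p_given_ham = (ham_counts[f] + 1) / (n_ham + 2)
        log_odds += math.log(p_given_spam / p_given_ham)
    return 1 / (1 + math.exp(-log_odds))

model = train(TRAINING)
```

Calling spam_fraction("viagra", TRAINING) gives the one-word clue on its own, while spam_probability({"viagra", "all_caps_subject"}, model) shows how a second feature shifts the verdict. With thousands of features instead of four, the same bookkeeping quickly grows into the millions of calculations the researchers describe.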
From Heckerman's perspective, HIV is like a cagey spammer. After attacking a cell, it injects its own genetic material and proceeds (much like a spam jockey who has commandeered an unprotected computer) to manufacture thousands of copies of the virus. It's a notoriously sloppy copier, but that adds to its vigor. Each mistake launches mutant viruses into the system. Many fail. Some, though, survive--and resist the drugs.
One challenge for HIV researchers is to find the variables that point to an infected cell. Ordinarily, the first clues--the cellular equivalent of the variations in fonts and words that Heckerman has discovered in his spam research--are bits of protein that sit atop each cell. These communicate to passersby, including armies of antibodies, what's going on inside the cell. For years, researchers have been striving to single out the combinations of protein that point to an HIV-infected cell. Once they do, the next step is to package those bits of protein into a vaccine. In theory, this would introduce a person's immune system to an entire gang of undesirables, so that it could recognize and attack those cells.
The trouble? Complexity and mutations. HIV-infected cells often wear mutated nameplates that immune systems haven't learned to read. In this sense, vaccines have been like faulty spam filters, the ones that block e-mails promoting "Viagra" while letting ads for "V1agra" scoot through. This leads some researchers to throw up their hands. "We've thrown billions down the black hole of AIDS vaccines," laments Leroy Hood, co-founder of the Institute for Systems Biology in Seattle.
But Heckerman is upbeat. He argues that by revving up the computing power and blending thousands of new variables--including dizzying genetic differences in each patient--researchers are making progress. One key, he says, is to map the patterns of mutation and incorporate them into medicine. These mutations, he says, appear to vary according to a person's immune system. If researchers can find the patterns, they'll be closer to making effective vaccines. Yet if they conclude that the mutations are utterly random, then "we're in big trouble," says Heckerman.
The hunt goes on. No one is betting on miracles from Microsoft. But in a research community desperate for answers, the hum of those computers churning in Redmond is a welcome sound.
By Stephen Baker and Jay Greene