Why are search engines so fast? They farm out the job to multiple processors. Each task is a team effort, some of them involving hundreds, or even thousands, of computers working in concert. As more businesses and researchers shift complex data operations to clusters of computers known as clouds, the software that orchestrates that teamwork becomes increasingly vital. The state of the art is Google's computing platform, known as MapReduce. But Google (GOOG) is keeping that gem in-house. An open-source version of MapReduce known as Hadoop is shaping up to become the industry standard.
This means that the two leading software platforms for cloud computing could end up being two flavors of Google, one proprietary and the other, Hadoop, open source. And their battle for dominance could occur even within Google's own clouds. Here's why: MapReduce is so effective because it works exclusively inside Google, and it handles a limited menu of chores. Its versatility is an open question. If Hadoop attracts a large community of developers, it could develop into a more versatile tool, handling a wide variety of work, from scientific data-crunching to consumer marketing analytics. And as it becomes a standard in university labs, young computer scientists will emerge into the job market with Hadoop skills.
The growth of Hadoop creates a tangle of relationships in the world of megacomputing. The core development team works inside Google's rival, Yahoo! (YHOO). This means that as Google and IBM (IBM) put together software for their university cloud initiative, announced in October, they will work with a Google clone developed largely by a team at Yahoo. The tool is already gaining fans. Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site, says Hadoop founder Doug Cutting, who now works at Yahoo.
In early November, for example, the tech team at The New York Times (NYT) rented computing power on Amazon's (AMZN) cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They finished in a single day a job that would otherwise have taken months.
Cutting, a 44-year-old search veteran, started developing Hadoop 18 months ago while running the nonprofit Nutch Foundation. After he joined Yahoo, he says, the Hadoop project (named after his son's stuffed elephant) was just "sitting in the corner." But in short order, Yahoo saw Hadoop as a tool to enhance the operations of its own search engine and to power its own computing clouds.
Cutting says the company's existing software "is stretched thin" and demands lots of engineering attention. What's more, it's "not easy to make big changes," he says. "It's not a flexible system."
Hadoop promises relief. But it is more likely to thrive, Cutting says, if the development community grows outside of Yahoo. He says that while he and about 10 others in Yahoo work on Hadoop, only five or six people outside of the company contribute regularly. "It's dominated by Yahoo," he says. "It would be great for the project to have a more balanced team."
The Hadoop team is not likely to get loads of help from Google. While the search giant provides certain information about MapReduce to open-source developers, it takes great care to prevent the secrets in its proprietary code from leaking into Hadoop. Cutting says Google prohibits any developers who work with MapReduce from participating in Hadoop. "They assign interns to it," he says. Google does not comment on the policy.