Google Just Made Big Data Expertise Much Tougher to Fake

Google I/O 2014 in San Francisco. Photograph by Jeff Chiu/AP Photo

For the last five years or so, it’s been pretty easy to pretend you knew something about Big Data. You went to the cocktail party—the one with all the dudes—grabbed a drink and then said “Hadoop” over and over and over again. People nodded. Absurdly lucrative job offers rolled in the next day. Simple.

Well, Google officially put an end to the good times this week. During some talks at the company’s annual developer conference, Google executives declared that they’re over Hadoop. It’s yesterday’s buzzword. Anyone who wants to be a true Big Data jockey will now need to be conversant in Flume, MillWheel, Google Cloud Dataflow, and Spurch. (Okay, I made the last one up.)

Here’s the deal. About a decade ago, Google’s engineers wrote some papers detailing a new way to analyze huge stores of data. They described the method as MapReduce: Data was spread in smallish chunks across thousands of servers; people asked questions of the information; and they received answers a few minutes or hours later. Yahoo! led the charge to turn this underlying technology into an open-source product called Hadoop. Hundreds of companies have since helped establish Hadoop as more or less the standard of modern data analysis work. (Much has been written on this topic.) Such startups as Cloudera, Hortonworks, and MapR have their own versions of Hadoop that companies can use, and just about every company that needs to analyze lots of information has its own Hadoop team.
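For the curious, the MapReduce idea can be shown in miniature. What follows is only a toy sketch on a single machine, not Google’s or Hadoop’s actual code: the function names and the sample search queries are made up for illustration. Each chunk is “mapped” into (word, 1) pairs, the pairs are grouped by word, and each group is “reduced” to a count.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)

def map_reduce(chunks):
    # On a real cluster each chunk would be mapped on a different server;
    # here everything runs in one process, one chunk after another.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical search queries standing in for the "smallish chunks".
chunks = ["world cup scores", "world cup schedule", "hadoop jobs"]
counts = map_reduce(chunks)
```

The real systems earn their keep by running the map and reduce steps across thousands of machines and surviving the failures that come with that scale, none of which this sketch attempts.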

Google probably processes more information than any company on the planet and tends to have to invent tools to cope with the data. As a result, its technology runs a good five to 10 years ahead of the competition. This week, it is revealing that it abandoned the MapReduce/Hadoop approach some time ago in favor of some more flexible data analysis systems.

One of the big limitations around Hadoop was that you tended to have to do “batch” operations, which means ordering a computer to perform an operation in bulk and then waiting for the result. You might ask a mainframe to process a company’s payroll as a batch job, or in a more contemporary example, analyze all the search terms that people in Texas typed into Google last Tuesday.

According to Google, its Cloud Dataflow service can do all this while also running data analysis jobs on information right as it pours into a database. One example Google demonstrated at its conference was an instantaneous analysis of tweets about World Cup matches. You know, life-and-death stuff.
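The difference between the two styles is easy to see in a few lines of code. This is a rough sketch of the general idea only and does not use the Cloud Dataflow API; the function names and the list of team mentions are invented for the example. The batch function answers once, after all the data is in, while the streaming one hands back a fresh tally as each record arrives.

```python
from collections import Counter

def batch_count(records):
    """Batch style: wait for the complete data set, then answer once."""
    return Counter(records)

def streaming_count(records):
    """Streaming style: yield an updated tally as each record arrives."""
    running = Counter()
    for record in records:
        running[record] += 1
        # Copy the tally so each snapshot is independent of later updates.
        yield dict(running)

# Hypothetical team mentions pulled from incoming tweets.
tweets = ["BRA", "GER", "BRA"]

batch_result = batch_count(tweets)
snapshots = list(streaming_count(tweets))
```

Both approaches end with the same totals. The streaming version simply has something to report after the very first tweet.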

Google has taken internal tools, those funky-named ones such as Flume and MillWheel, and bundled them into the Cloud Dataflow service, which it plans to start offering to developers and customers as a cloud service. The promise is that other companies will be able to deal with more information more easily and quickly than ever before.

While Google has historically been a very secretive company, it is opening up its internal technology as a competitive maneuver. Google is proving more willing than, say, Amazon to hand over the clever things built by its engineers to others. It’s an understandable move, given Amazon’s significant lead in the cloud computing arena.

As for the Hadoop clan? You would think that Google flat-out calling it passé would make it hard to keep hawking Hadoop as the hot, hot thing your company can’t live without. And there’s some truth to this being an issue.

That said, even the biggest Hadoop fans such as Cloudera have been moving past the technology for some time. Cloudera leans on a handful of super-fast data analysis engines like Spark and Impala, which can grab data from Hadoop-based storage systems and torture it in ways similar to Google’s.

The painful upshot, however, is that faking your way through the Big Data realm will be much harder from now on. Try keeping your Flume and Impala straight after a couple of gin and tonics.