The Big Data Dump: How Info-Hoarding Can Overwhelm Startups, Spy Agencies
When it comes to big data projects, there are none bigger than the National Security Agency's massive surveillance programs that were exposed by former contractor Edward Snowden a year ago. In internal documents, the agency crowed about the scope of its mission, which was encapsulated in one phrase: "Collect it all."
Yet the same documents caught staffers at the spy agency complaining that the NSA's data-collection practices are "outpacing our ability to ingest, process and store" data, and that the organization "collects far more content than is routinely useful." In other words, it hasn't always been intelligent intelligence gathering.
Even for a mighty agency like the NSA, which has an annual budget of more than $10 billion, managing and making sense out of big data can be a big headache. Businesses are increasingly straining under the too-much-information load. And that's forcing the industry to rethink its approach to the much-hyped technology.
The problem is that organizations tend to collect data on the rainy-day principle, said Jim Adler, an industry veteran who has testified before Congress on big data issues and is vice president of products for data-analytics startup Metanautix. The prevailing mentality is "better to have it and not need it than need it and not have it," he said. Such an approach can cause a data overdose that hinders our ability to glean meaningful insights, whether it's deciphering customer behavior or detecting the next terrorist attack.
A History of Too Much
Concerns about information overload aren't new. In the 1500s, the rise of the printing press in Europe raised fears among the educated class that this new technology would dilute literature, flood the market with low-quality books and distract readers from worthy writings. A few hundred years later, the dawn of the Internet revived worries that quality information would be buried in a sea of meaningless digital content. The jury's still out on that one.
The universe of data being generated and collected today is orders of magnitude larger than ever before. Companies are combing content that's online (blogs and social media) and offline (DMV and criminal records), as well as the growing volume of bits being spewed by the billions of Internet-connected devices (smartphones and thermostats). Computer-storage maker EMC estimates the amount of digital information in the world in 2020 will swell to 50 times what it is today.
Meanwhile, it's never been easier to accumulate all that digital flotsam. The cost for a gigabyte of storage has fallen from hundreds of thousands of dollars in the 1980s to less than 10 cents today. Of course, hoarding reams of data is one thing -- actually doing something with it is another.
Data Dumping Grounds
Many retailers treat their data warehouses as "dumping grounds" for information, such as sales records, that are never analyzed, said Nathan Smith, chief executive of Sysrepublic Americas, which sells fraud-detection software to retailers. Such information could be used to help companies track inventory of key items - such as milk on grocery-store shelves - to ensure they are restocked before supplies run out. Instead, the data can sit for ages, untouched, past their sell-by date.
Why wouldn't businesses put their big data to work? Some companies, especially older ones, aren't in the habit of making big decisions quickly and frequently, which encourages a slower collect-and-wait approach to data, said Billy Bosworth, chief executive of DataStax, a database-software company that worked with Netflix on its movie-streaming system.
Other companies aren't going on the offensive because they're playing defense with their data. Drug makers, for instance, are prone to produce and collect data they often don't use. To appease regulators and prevent lawsuits, the companies create billions of records while running clinical trials, said Glen de Vries, co-founder of Medidata Solutions.
Somewhere in that heap of data are some real finds. In the case of drug companies, patterns could be discovered, such as how much money they were wasting on the trials and how many patients they were over-testing.
De Vries' software company sponsored a Tufts University study that examined data from 15 companies' clinical trials and found that a quarter of all procedures may be unnecessary for purposes of the research, costing the industry as much as $5 billion per year and subjecting patients to needless mammograms and lab tests.
"Every time you collect another piece of data in a clinical trial, it probably means you're poking a patient another time, drawing a tube of blood, giving them an MRI," de Vries said.
Big Data's Sick Day
Even when companies do put their data to work, the answers don't always come easily.
Google, which processes terabytes of information every day and sits on one of the biggest sets of data around, was criticized by academics earlier this year for its Google Flu Trends. The system, which relies on search terms related to the illness, periodically overestimated and underestimated the number of U.S. influenza cases in recent years.
ImportIO, which makes a tool for scraping data from Web sites, culled information religiously from its servers and had detailed reports prepared that were filled with minutiae about its technology's performance. But the startup, based in London and San Francisco, missed key signals about glitches and other problems because it was generating more data than it could analyze.
"While we were storing data and graphing it, we had no real insight into what was important and issues could have been buried in a lot of visually difficult-to-understand graphs," Chris Alexander, an operations engineer, said via e-mail.
Less Is More?
Cases like these are forcing some companies to question the collect-it-all mentality of the $44.5 billion big data industry.
"How much data is enough data?" de Vries said. "It's a lot less than what a lot of people think it is."
In fact, some companies and analysts are pushing an idea that could be the next phase in the evolution of big data: small data. By that, they mean information that's not only more manageable, but accessible, immediate and can be acted upon today, not months down the road after a lengthy analysis. It's what Internet companies are accustomed to doing -- making many small, quick decisions based on real-time data.
The Snowden Effect
Last year, ImportIO began using software made by Numenta, a Redwood City, California-based company founded by Palm's founder Jeff Hawkins and former CEO Donna Dubinsky. Their approach: Instead of focusing on data collection and retrospective looks at information piling up in a warehouse, focus on analyzing data streams in real time to spot anomalies.
One use for Numenta's technology is identifying suspicious behaviors by employees who have access to sensitive networks - "the Snowden effect," Hawkins said. Spotting small changes in their behavior, such as downloading files they normally wouldn't or making subtle changes to their computers' settings, might be time-sensitive signs of trouble that retrospective analysis would be too late to identify, he said.
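The idea of watching a stream as it arrives, rather than digging through a warehouse later, can be illustrated with a very small sketch. This is not Numenta's technology; it is a generic rolling-average check, and the class name, window size, threshold and sample numbers are all assumptions made up for illustration. It flags a reading that strays far from what the recent past looked like.

```python
from collections import deque
from math import sqrt


class RollingAnomalyDetector:
    """Flags values that deviate sharply from a sliding window of recent values.

    Hypothetical illustration only; not the algorithm any vendor in the
    article uses.
    """

    def __init__(self, window=20, threshold=3.0, min_history=5):
        self.history = deque(maxlen=window)  # only the recent past is kept
        self.threshold = threshold           # how many std deviations counts as odd
        self.min_history = min_history       # wait for some data before judging

    def observe(self, value):
        """Return True if `value` looks anomalous, then add it to the window."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            variance = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = sqrt(variance)
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly


# Made-up example: files downloaded per day by one employee.
downloads = [10, 12, 11, 9, 10, 11, 12, 10, 250, 11]

detector = RollingAnomalyDetector()
flagged = [day for day, count in enumerate(downloads) if detector.observe(count)]
print(flagged)
```

Because the detector keeps only a short window, it stores almost nothing and can raise a flag the moment an unusual reading arrives, which is the contrast with retrospective analysis that Hawkins describes.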
Still, there are those who haven't given up on the collect-it-all approach, despite its burdens.
The NSA is building a one million-square-foot data center facility in Utah, in part to stockpile encrypted information that cannot be broken with current technology. So even though the data is unusable, it's sitting there ready for the day when someone can crack the code.
Too Many 'Things'
It's a similar situation with the flood of data coming from the so-called Internet of Things. Washing machines, power meters and industrial equipment are increasingly being connected to the Internet, which is creating massive amounts of digital sludge that at the moment has no clear use.
For example, car makers are awash in information from braking systems to navigation devices, but the companies haven't yet figured out a way to use the data, said Niall Murphy, founder and chief executive of EVRYTHNG. The London-based company makes software for managing connected devices, and is working with an automaker to make sense of it all.
Of course, the big data problem in many ways is self-serving to the industry. The more devices that are connected, and the more data that is collected, the more problems arise in managing all of it. The more problems, the more money to be made. EVRYTHNG was one of three companies in this space that Cisco, the leading network-equipment seller, funded in April.
Searching for Sepsis
Amara Health Analytics, which makes predictive software for hospitals, is also big on gathering all the data it can.
Its technology is used for detecting early warning signs of sepsis, a potentially fatal complication resulting from infections. Multiple data points are useful in making that determination - including when tests are ordered and medications administered - so everything from a patient's medical record is kept, said Steve Nathan, chief executive of the San Diego, California-based company.
"We have a big vision for a lot that can be done with that data," he said. "The data we're collecting now, we're not throwing any of it away."
The Art of Subtraction
A24, an independent film company in New York, had also gathered a "tremendous" amount of data and spent hundreds of thousands of dollars on analytics, said co-founder Daniel Katz. But in the end, its ads on social networks for new movies weren't boosting ticket sales as much as the company wanted.
So the studio tried a different strategy by using Quantifind, a startup that has worked with the Central Intelligence Agency on hostage-taking and money laundering cases. Its approach? Separate the good stuff and jettison the junk, especially from social media.
Surprisingly, Quantifind often disregards posts from people who gush about movie trailers because it finds them to be a poor indicator of who will actually buy a ticket. Instead, the focus turns to keywords like "babysitter" and "girls' night out," which are stronger signals of paying customers.
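In its simplest form, that kind of subtraction amounts to keeping only the posts that carry a signal of intent and discarding the rest. The sketch below is a toy illustration, not Quantifind's method: the two phrases come from the examples above, while the function name and the sample posts are invented.

```python
# The two phrases cited in the article; a real system would presumably
# learn its signals from data rather than hard-code them.
INTENT_PHRASES = ("babysitter", "girls' night out")


def has_intent_signal(post):
    """Return True if the post mentions one of the intent phrases."""
    text = post.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return any(phrase in text for phrase in INTENT_PHRASES)


# Invented sample posts for illustration.
posts = [
    "That trailer was AMAZING, best thing I've seen all year!!!",
    "Need a babysitter for Friday so we can finally see a movie",
    "Girls' night out this weekend, who's in?",
    "Trailer gave me chills",
]

signals = [post for post in posts if has_intent_signal(post)]
print(len(signals), "of", len(posts), "posts kept")
```

The enthusiastic trailer posts are thrown away even though they mention the movie most directly, which mirrors the counterintuitive choice described above.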
It seems that regardless of whether you're hoarding in the digital or physical world, the trick is knowing what to throw away.