Darpa Wants to Save Us From Our Own Dangerous Data

Illustration by 731

The author William Gibson once confessed to writing up a terrorist plot so plausible that he didn’t want to publish it. He cut it out of his book Idoru. “It seemed so workable and media-efficient an idea,” he told the Village Voice in 2003, “that I didn’t feel like I could let it out.” Gibson refused further comment.

If he were of a mind to help, Gibson is the sort of person who could extend a hand to the U.S. Defense Advanced Research Projects Agency, or Darpa. The agency, with its characteristic spirit of paranoid generosity that gave us killer robots and the Internet, recently asked researchers to submit proposals for research projects that investigate whether freely available, open data could be weaponized.

“Could a modestly funded group,” asks the RFP, “deliver nation-state type effects using only public data?” Meaning data from geographic information systems, marketing databases, Facebook, Twitter, the open Web—or any one of millions of new data systems that have come online in the last couple of decades. Could a few malicious actors with ordinary computers and access to the Web kill a lot of Americans without hacking—just by making use of what’s out there and available?

It’s a funny time for this RFP to be out there. Foreign Policy issued an “Irony Alert,” contrasting the Darpa request with the recently released leaks about the National Security Agency and spying. “The military,” wrote Shane Harris on FP’s Killer Apps blog, “is worried that Russia or al-Qaeda is going to wreak nationwide havoc after combing through people’s personal records.”

Darpa isn’t necessarily wrong to worry. Public data can absolutely be used to nefarious ends—we’ve already seen what happens after people get “doxxed,” when Internet vigilantes find their addresses and publicize them. In China there have been hundreds of cases where a “human flesh search engine”—basically mass doxxing—came together to create huge, shared dossiers about various moral outrages, corruption issues, or adultery cases. Things go up a notch with “swatting,” wherein pranksters pretend to be their victim, call 911, and phrase their calls so that the protocol results in a visit by a police SWAT team. Celebrities are common targets: Diddy has been swatted.

So you see the formula emerging: Take public data and extract something meaningful from it, like an address. Take knowledge of law enforcement protocols that lead to SWAT team deployments. Combine the two and you’ve created a dangerous, high-risk situation for all. There’s no “hacking” involved, in the sense of breaking into computers. It’s just social engineering. Then again, the same instinct that drives human flesh search engines is at work in collating and editing Wikipedia. These impulses can be turned to lasting benefit.

This is also the lifeblood of modern marketing. Take one pile of personal data, combine it with some other database of purchasing patterns, and voilà: you extrapolate who’s pregnant. As a culture we have been basically comfortable with this information being used for boilerplate commercial (targeted marketing) and political ends (redlining and gerrymandering). But the NSA leaks have made clear it goes much deeper. We are constantly generating a signal, and it is constantly being ingested by various digital leviathans.

And now Darpa is talking about “nation-state effects.” Traditionally that involved explosions and guns, but increasingly it involves data. It’s fairly easy to engage in toxic speculation: Let’s say a country or terrorist organization wanted to seed massive discord and, pretending to be a large public company, sent out “sample pack” fruit snacks with ricin in them to thousands of kids around the U.S. Given the reach of our national media, the fact that the U.S. Postal Service takes pictures of all of our mail, and so forth, it’s possible that such an act could cause widespread panic and illness. But it’s also quite likely the tragedy would be minimized by the systems of surveillance and control we’ve put in place.

Or take a long view: Imagine the U.S. is constantly spying on its own citizens via the NSA, which it is doing. Someone in China purchases a list of a few million Americans, and all of these people begin to receive e-mails in Chinese. Most of these messages are just classified as spam. Over time, however, the NSA’s plucky investigators and their statistical-analysis tools observe that these messages are not just about amazing deals on printer toner but also include surprising information about various urban centers, plus dates and times, information about air travel, and the like. If you were going to send coded messages, dropping them into spam is a great way to do it. How could you find the intended recipient among those millions of people? Of course, this is also a great way to fake out your enemies. All of the data could be fake; there might be no spies at all. If you created such a honey pot and made people think your spam campaign was actually a spy network, you could generate a freakout on a national scale.

Nation-state effects! And nary a shot is fired.

In the past, databases of public information have been released and used to nefarious or journalistic ends. AOL released “anonymized” search logs in 2006 that made it trivial to find out who was doing the searching. Netflix released a large amount of usage data, and researchers figured out who had watched which movie. Statistics are a powerful tool, and this is what the Darpa proposal specifically addresses: measuring the “doxxability” of databases. The agency is seeking the “creation of tools, techniques, and methodologies to measure the vulnerabilities in a given set of public data.” Darpa wants the researchers to come up with a system for analyzing a given blob of data and finding out if it’s vulnerable or not:

“To what extent could a non-state actor collect, process, and analyze a portfolio of purchased and open source data to reconstruct an organizational profile, fiscal vulnerabilities, location of physical assets, work force pattern-of-life, and other information, in order to construct a deliberate attack on a specific capability.”
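The AOL- and Netflix-style failures above share one mechanism: joining an “anonymized” release against a named public dataset on quasi-identifiers such as ZIP code, birth year, and gender. Here is a minimal, purely illustrative sketch of that linkage attack; all records, names, and field names are invented for the example, not drawn from any real dataset.

```python
# Illustrative linkage attack: re-identify "anonymous" records by joining
# them against a named public dataset on shared quasi-identifiers.
# All data below is fabricated for illustration.

anonymized_release = [  # e.g. a published search or viewing log
    {"zip": "44101", "birth_year": 1975, "gender": "F",
     "activity": "searched: divorce lawyers"},
    {"zip": "44102", "birth_year": 1988, "gender": "M",
     "activity": "watched: obscure documentary"},
]

public_records = [  # e.g. a voter roll or purchasable marketing list
    {"name": "Jane Roe", "zip": "44101", "birth_year": 1975, "gender": "F"},
    {"name": "John Doe", "zip": "44102", "birth_year": 1988, "gender": "M"},
]

def link(release, records):
    """Match each 'anonymous' record to named people who share its
    quasi-identifiers; a unique match means the record is re-identified."""
    hits = []
    for row in release:
        key = (row["zip"], row["birth_year"], row["gender"])
        matches = [p["name"] for p in records
                   if (p["zip"], p["birth_year"], p["gender"]) == key]
        if len(matches) == 1:  # exactly one candidate: re-identified
            hits.append((matches[0], row["activity"]))
    return hits

print(link(anonymized_release, public_records))
```

Neither dataset is sensitive on its own; the join is what produces the exposure, which is exactly the combination the RFP wants to quantify.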

The big problem is the framing. Because it’s basically too late. Your average citizen has unparalleled war-making powers thanks to Google Maps, weather prediction, spreadsheets, and text messages. D-day would have been a breeze to prep with an iPad—you’ve got GPS, maps, satellite views, and weather prediction built in. Add to this that we live in a world of big data, meaning you can target individuals in a way that used to take all the efforts of spy networks. Your average human comes pre-spied nowadays, no espionage needed.

So is Darpa wrong to look for monsters under big data’s bed? Researchers should apply because the government is giving out money. If there’s a meaningful statistical method that uncovers how to discover vulnerabilities in a large data set, then discovering that will be a net gain for society. Even if it’s classified. Science first.

But, yes, it remains that turning the open Web into a monster factory is a bad idea. Because where does it end? How can you be sure? You can’t combine every data source with every other data source preventively. (Really, even the NSA doesn’t have that kind of computing power.) Not only that, but this combination of data sources is what gives the Web its economy-moving power. An attacker might look at a list of people who live around Cleveland and think about how to attack their water supply. Groupon takes the same data and offers people coupons. We’re building the new economy on cheap access to once-private data. Facebook and Twitter are both tools for manufacturing such data ever more cheaply, so that they can offer ever more efficient advertising products. They’re bigger than a lot of nation-states themselves, of course.

What Darpa is really asking data scientists to do is come up with a metric that indicates how far a given genie is out of a given bottle—to put a number on a huge national screw-up, a sort of Pandora Probability. Then you can scan tons of databases to find out exactly how much evil they could spew into the world. You could measure the overall potential for evil of the entire Internet!
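“Pandora Probability” is the article’s coinage, not an established metric, but one plausible toy formalization is the share of records that are unique on their quasi-identifiers (the records that violate even 2-anonymity, and so can be singled out by a linkage attack). A minimal sketch under that assumption:

```python
from collections import Counter

def uniqueness_score(rows, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique
    in the dataset -- a crude stand-in for a 'Pandora Probability'.
    Records sharing a combination with others are harder to single out."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

people = [
    {"zip": "44101", "birth_year": 1975, "gender": "F"},
    {"zip": "44101", "birth_year": 1975, "gender": "F"},  # shares a group
    {"zip": "44102", "birth_year": 1988, "gender": "M"},  # unique: exposed
]
print(uniqueness_score(people, ["zip", "birth_year", "gender"]))
```

A real vulnerability measure would also have to account for which outside datasets an attacker could join against, which is what makes the problem so much harder than scoring one database in isolation.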

Which sounds kind of awesome. But frame the problem a little differently: The U.S. has clearly lost all sense of civic responsibility about private data. There’s no national sense of privacy. The only solution is to turn off the pipeline, but as with our oil supplies, there’s really no chance of that happening. The things we’d need to do—stop using unique identifiers like Social Security numbers willy-nilly, stop tracking people with “third-party” cookies, stop tracking location—are hardly popular options. And the ultimate solution—to make everything opt-in and anonymous-by-default, and to securely encrypt all traffic between computers—while feasible, would shock the increasingly important chunk of our economy that relies on constant customer surveillance to keep up its margins.

It’s too bad. People should have ways to share all sorts of information without worrying that it will be sold cheaply to just about any bidder, without the risk of their data being turned into knowledge for some bad actor. So we end up in this situation where nebulous nation-states could steal our open secrets, combine them with other open secrets, and use that to wage war.

The real problem is that with this Darpa RFP the government is funding analytical Band-Aids instead of thinking about how to protect its citizens. Once the Pandora Probability can be established, what then? Internet threat levels? Databases locked down? I could keep a list of every address in the U.S. on a cheap hard drive and still have enough room for a couple hundred pirated movies.

And this is the real worry. Once the government has a Pandora Probability, does it try—once again—to fence in scary parts of the Internet, or at least the parts that represent risk? Does it work backward to the source, to try to protect its citizens from giving away large, valuable parts of their own lives to their ever-pinging mobile phones? Given our choice of two impossible tasks, wouldn’t it be better to keep the risky data from being created in the first place? Because knowledge has always been dangerous. And it always will be.
