
Searching Private Data, and Ensuring It Stays Private

The National Security Agency has your data. Is there a way to use it that won't further violate your privacy?
This undated photo provided by the National Security Agency shows its headquarters in Fort Meade, Maryland. (Photo: NSA/Getty Images)


The National Security Agency landed in hot water in recent years for collecting basically everyone's phone records. The agency justified its actions by saying it needed such information to find terrorists. There are plenty of arguments about why the NSA shouldn't have all that data and power, but new research asks a somewhat more practical question: Given that the agency already has the data, is there a way for it to effectively balance privacy and national security? The answer is surprisingly simple: All it takes is injecting a little randomness into how the data is searched.

First things first: The researchers don't take a position on whether the NSA (or the Centers for Disease Control and Prevention, for that matter) ought to have your data. Rather, they're following up on a recent National Academies report that concluded there was no existing substitute for bulk data collection when it comes to finding terrorists (or unknowing carriers of a dangerous disease). The new study aims to move toward something better by "trying to inject some science into the debate," says lead author Michael Kearns, a professor of computer and information science at the University of Pennsylvania.

The research team was also interested in a particularly narrow form of privacy, namely, keeping private the identities of your social contacts, assuming you are not a "target"—a terrorist or disease carrier. In fact, their analysis is even narrower than that. What they're really interested in is differential privacy—roughly speaking, the concept that someone shouldn't be able to infer something about you personally based on more general conclusions made from large data sets. In this case, the idea is that people shouldn't be able to infer your secrets based on your social contacts.


Here's an example: Say you have a rare, stigmatized disease, and you've talked to your uncle, who's a specialist on the disease, as well as your own small-town doctor. If both get arrested for Medicare fraud related to the stigmatized disease, your neighbors might wonder what exactly connected your uncle to the small-town doctor. In fact, they might infer you're the link—and that you have the disease.

With those issues in mind, Kearns and his colleagues Aaron Roth, Steven Wu, and Grigory Yaroslavtsev wanted to see if they could find a way to search for unknown but "targeted" individuals in social network data while keeping others' social ties and sensitive information private.

To see how their solution works, first consider two extremes: At one extreme, you can start the search from known targets and build outward, but as in the "stigmatized disease" example, that could inadvertently reveal information that ought to stay private; at the other extreme, you could investigate people at random—that actually preserves privacy in the team's sense, but isn't exactly practical.

The solution lies in between: Prioritize investigations into people with the strongest social ties to known targets, but randomize the list of priorities a bit. Simulations and mathematical arguments show doing so would allow agencies to find and investigate targets efficiently without violating other people's privacy.
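To make the idea concrete, here is a minimal sketch of a randomized priority list. It is not the researchers' actual algorithm and carries no formal privacy guarantee. It assumes "strength of social ties" means the number of known targets a person is directly connected to, and it blurs that count with Laplace noise, a standard tool in differential privacy. The function names, the toy network, and the `noise_scale` parameter are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def tie_strength(graph, targets):
    # Assumption: a person's score is the number of known targets
    # among their direct contacts. Known targets are not scored.
    return {
        person: len(contacts & targets)
        for person, contacts in graph.items()
        if person not in targets
    }


def noisy_priorities(graph, targets, noise_scale, seed=None):
    # Rank non-targets by tie strength plus random noise, highest first.
    # A larger noise_scale shuffles the list more; a tiny one leaves it
    # in almost exactly the true order.
    rng = random.Random(seed)
    scores = tie_strength(graph, targets)
    noisy = {p: s + laplace_noise(noise_scale, rng) for p, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)


# Hypothetical toy network: each person maps to a set of contacts.
network = {
    "ann": {"t1", "t2", "t3"},
    "bob": {"t1"},
    "cy": set(),
    "t1": {"ann", "bob"},
    "t2": {"ann"},
    "t3": {"ann"},
}
known_targets = {"t1", "t2", "t3"}

print(noisy_priorities(network, known_targets, noise_scale=1.0))
```

With the noise turned nearly off, the list collapses to the "build outward from known targets" extreme. With very large noise, it approaches investigating people at random. The in-between setting is the trade-off the study explores.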

There are a number of caveats, Kearns says. Notably, the method doesn't guarantee that the innocent (or disease-free) won't be investigated, and it doesn't prevent other kinds of information from being revealed—the only thing that's sure is that most people's lists of social contacts will remain private.

"It's a very interesting model," says Yves-Alexandre de Montjoye, a data privacy researcher at Harvard University (and author of several recent papers suggesting privacy is more illusory than we'd like), noting that it's a smart extension of some other work in the field of differential privacy to the world of social networks. Still, the work remains theoretical and doesn't address the multitude of issues involved with data privacy. "It's not a complete solution," de Montjoye says. "It's a theoretical work, and it's very careful" to acknowledge as much.

Meanwhile, privacy activists may take issue with the paper's underlying perspective.

"Collection of data, as the authors admit, is itself intrusive. They acknowledge that both for the initial network data and for ... investigations," writes Lee Tien, a senior staff attorney for the Electronic Frontier Foundation. "That’s a lot of intrusion." The analysis also sidesteps the possibility that agencies like the NSA might misuse the data. After all, just because the NSA has a privacy-preserving terrorist search algorithm doesn't mean the agency will use it. "At the end of the day, one of the most important issues in privacy is requiring the government to justify its privacy-implicating actions in a publicly accountable way," Tien writes.

Kearns says that he and his team largely tried to avoid such issues. Publicly stating his own opinions on bulk data collection, for example, "would detract" from the paper's goal of advancing scientists' understanding of data privacy, he says.