Spy Agency Seeks Digital Mosaic to Divine Future

The U.S. intelligence community wants to mine lots and lots of the tidbits bopping around on the Internet to suss out trends before they make the news.

Governments have been caught off guard a lot lately: by revolutions, by riots, even by unemployment rates (or, to go back even further, by events like 9/11).

In the information age — where there’s no limit to publicly available data on everything from political chatter to gas prices — it seems policymakers should be better at predicting major societal shifts and events than at any point in history. Shouldn’t all these little pieces of information be telling us something big? Shouldn’t they be telling us about where the next mass migration will come from or where the next riot will be?

The government’s Office of the Director of National Intelligence is betting this is the case. Its Intelligence Advanced Research Projects Activity (IARPA, which sounds like a less menacing cousin to DARPA) is rolling out a new R&D project to test tools that would mine publicly available data to predict political and humanitarian crises, disease outbreaks, mass violence and instability. The project, the Open Source Indicators Program, is premised on the idea that big events are preceded by population-level changes, and that those population-level changes should be identifiable if we just look in the right places.

THE IDEA LOBBY
Miller-McCune’s Washington correspondent Emily Badger follows the ideas informing, explaining and influencing government, from the local think tank circuit to academic research that shapes D.C. policy from afar.

The concept isn’t new. Allied intelligence officers did something similar during World War II, for example, mining letters to the editors of local newspapers and radio transmissions for clues as to what was going on inside Nazi Germany.

“The difference, the big difference — and the thing this initiative is picking up on — is that whereas we used to have to do this with a relatively small sample of radio shows and newspapers and stuff like that, we’re now drinking from the fire hose of the Internet,” said Philip Schrodt, a political scientist at Penn State.

An IARPA public affairs officer said officials could not discuss the program while the government is still soliciting proposals. But Schrodt and several other researchers affiliated with academic teams that may eventually wind up working on the project, alongside private contractors, offered a look into an expanding, multidisciplinary field of real-time data analysis that the government hopes may allow it to “beat the news.”

This experiment — which will test methods in Latin America, not the U.S. — dovetails with broader trends in computer and social science toward mining large-scale sets of social data. Twitter, for instance, is a jackpot: The network can be scoured both for the content of messages and the connections that are revealed between people writing and reading them.

About half of the populations in most major Western countries are now on Facebook (and a surprising amount of data on Facebook remains accessible to the public). Economic data can be collected from e-commerce sites like Amazon. There are also open-source indicators embedded in Internet news sites, message boards, unemployment data, Web search queries, traffic Web cams and financial markets.

Think of what researchers could learn about the economic mood of consumers if they suddenly discovered hundreds of people (each anonymously identified) all trying to sell their best jewelry on Craigslist tomorrow. Other open-source indicators could provide a trove of information on a topic social scientists have been weighing a lot lately: the effect of price changes — whether for housing, gas or food — on social stability.

And because all this data can be collected in real time and by automated systems, it’s more up-to-date, it’s larger in sheer quantity, and it’s becoming more cost-effective and realistic than ever to analyze.

• • • • • • • • • • • • • • •

The easiest place to understand the potential of all this data is in public health, where live analysis is already helping to identify and track disease epidemics at a speed that was never possible before the Internet. Google Flu Trends pioneered the technique. The tool measures the frequency of certain search terms commonly associated with the flu to identify outbreaks down to the city level. Now it’s doing the same with dengue fever.
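A minimal sketch of the search-frequency idea follows. It is an illustration, not Google's actual method: the term list, the `(city, week, query_text)` input format, the per-city baseline and the doubling threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical term list; the article does not say which terms are tracked.
FLU_TERMS = {"flu", "fever", "cough"}


def flu_query_share(queries):
    """Return, per (city, week), the fraction of queries containing a flu term.

    `queries` is an iterable of (city, week, query_text) tuples (assumed format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for city, week, text in queries:
        key = (city, week)
        totals[key] += 1
        # A query counts as a hit if any of its words is a tracked term.
        if FLU_TERMS & set(text.lower().split()):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def flag_spikes(shares, baseline, ratio=2.0):
    """Flag (city, week) pairs whose share is at least `ratio` times the city's baseline.

    Cities with no baseline are never flagged.
    """
    return sorted(
        key
        for key, share in shares.items()
        if share >= ratio * baseline.get(key[0], float("inf"))
    )
```

In this toy version, a city is flagged for a given week when flu-related queries make up at least twice their usual share of that city's searches. A real system would need to account for seasonality and for news coverage that drives searches on its own.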

Researchers at Harvard developed a similar tool five years ago called HealthMap, which offers a prototype of the model IARPA envisions testing beyond public health. HealthMap continuously scrapes the Web for health-related keywords and deletes noise, then classifies the data, geocodes it, filters it into different categories and maps the results. The project is particularly useful in identifying early disease indicators in countries that have no public health capacity and weak or opaque data reporting (because these are often the same countries where sick people are unlikely to Google their symptoms on a home computer, HealthMap also taps into SMS messages and smartphone data).
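The steps described above (discard noise, classify, geocode, group) can be sketched roughly as below. This is a toy illustration under stated assumptions, not HealthMap's code: the keyword table, the two-city gazetteer with approximate coordinates, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only.
DISEASE_KEYWORDS = {"dengue": "dengue", "cholera": "cholera",
                    "influenza": "flu", "flu": "flu"}
# Approximate (latitude, longitude) pairs.
GAZETTEER = {"lima": (-12.05, -77.04), "bogota": (4.71, -74.07)}


def tally_reports(snippets):
    """Count mentions per (disease, place) across short text snippets."""
    counts = Counter()
    for text in snippets:
        words = {w.strip(".,!?").lower() for w in text.split()}
        # Classify: map matched keywords to a disease category.
        diseases = {DISEASE_KEYWORDS[w] for w in words if w in DISEASE_KEYWORDS}
        # Geocode: keep only place names the gazetteer knows.
        places = {w for w in words if w in GAZETTEER}
        # Drop noise: a snippet needs both a disease and a known place.
        if not diseases or not places:
            continue
        for disease in diseases:
            for place in places:
                counts[(disease, place)] += 1
    return counts


def map_points(counts):
    """Turn the tallies into records a map layer could plot."""
    return [
        {"disease": d, "place": p, "lat": GAZETTEER[p][0],
         "lon": GAZETTEER[p][1], "mentions": n}
        for (d, p), n in sorted(counts.items())
    ]
```

Exact word matching is the weakest part of this sketch. Misspellings, other languages and place names shared by several towns would all defeat it, which is presumably where much of the real engineering effort goes.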

“These discussions are taking place by the minute, and we’re caching those in real time,” said John Brownstein, the director and co-founder of HealthMap. “Our goal is as soon as anybody’s talking about an outbreak on the Web, within the hour it ends up on HealthMap.”

In theory, similar methods might track contagious ideas, economic unrest or resource shortages — and in a way that would actually move faster than news reports.

Automated systems could certainly be more comprehensive than traditional media. The BBC and Associated Press don’t have correspondents in every small town in Mexico, but an automated computer program could analyze in real time reports from every local news site in the country. Or, better yet, it could analyze social network chatter before it even gets to the local newsroom.

“If everybody is suddenly upset about the drought in Texas or something like that,” said Schrodt, “we should be able to pick that up immediately without having some reporter go out and talk to some farmer whose cows have died.”

This type of analysis could also help policymakers be less surprised by news when it happens, and let them monitor it in real time.

The bigger question, though, is how researchers take the step from watching trends develop in a live time series to anticipating them before they happen.

Some trends lend themselves more easily to this challenge — disease epidemics are one. HealthMap can start to identify outbreaks before officials even know what disease they’re looking at because the tool is designed to scan for basic symptoms like runny noses or a run on aspirin. HealthMap isn’t merely crawling the Web for Google searches of “Do I have SARS?”

But how do we anticipate “riots in London” when we don’t even know “riots in London” is what we’re looking for? Could open-source indicators have identified the rise of the Tea Party before it even started calling itself that?

• • • • • • • • • • • • • • •

“That’s the really, really big question, that’s the single biggest challenge in this IARPA initiative,” Schrodt said. He has worked on other government research projects, including one called the Political Instability Task Force. It’s clear, though, in that project, what researchers are looking for: political instability, which is suggested by fewer than five indicators. Here, though, the goal is to identify everything that’s publicly available, for anything that might be interesting.

“I can’t take a book off my shelf and open it up and say, ‘Oh, here’s what you do if you’re monitoring 100 different indicators using 200 different sources,’” Schrodt said. “That’s the new science on this. I don’t know if it will work or not.” (The political instability project, as an example, has gotten to about 80 percent accuracy forecasting probabilities that countries will collapse.)

Peter Gloor’s research at MIT’s Center for Collective Intelligence has been trying to solve this unknown prediction problem by identifying the most creative people, both those who are destructively creative, like terrorist plotters, and those who are constructively creative, like Justin Bieber. He’s trying to find the people behind trends (information producers, not the information itself) before trends take off.

“We’re looking for the trendsetters while they are being born and made,” Gloor said. “In Twitter, once you are Ashton Kutcher, everyone knows you; it’s clear you are a trendsetter. But Justin Bieber, whenever he was posting his first video on YouTube, it was not as clear.”

Gloor has done a lot of this work in Wikipedia, where pages are written and edited according to a revealing pattern: Generally, about 1 percent of contributors write 90 percent of the content; 9 percent of people write another 9 percent of the content; and 90 percent of contributors write just 1 percent of the crowdsourced encyclopedia. If you find the 1 percent of people who are doing all the heavy lifting, those are your future trendsetters (or “coolhunters”).
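To make the 1-9-90 pattern concrete, here is a small sketch that measures how much of the content comes from the most active slice of contributors. The input format (a mapping from contributor name to a count of contributions) and the function names are assumptions for illustration. This is not Gloor's method, which the article does not describe in detail.

```python
def top_share(edits_by_user, fraction=0.01):
    """Share of all contributions made by the most active `fraction` of users."""
    counts = sorted(edits_by_user.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one contributor, even in a tiny community.
    k = max(1, round(len(counts) * fraction))
    return sum(counts[:k]) / total


def heavy_lifters(edits_by_user, fraction=0.01):
    """Names of the most active `fraction` of users, busiest first."""
    k = max(1, round(len(edits_by_user) * fraction))
    return sorted(edits_by_user, key=edits_by_user.get, reverse=True)[:k]
```

On a community that matches the pattern exactly, `top_share` would return 0.9 for the default 1 percent slice. Raw activity is only a rough stand-in for influence, so a list like this is best treated as a starting point for a closer look.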

Charles Elkan, a computer scientist at the University of California at San Diego, suspects the biggest value of all these tools will come from quantifying what social scientists already know.

“Social scientists have said for 100 years that revolutions happen at times of rising expectations,” he said. “If, for example, society is becoming more prosperous, people start to have rising economic expectations, that spills over into rising political expectations, and that can make a revolution more likely. Social scientists have said this for a long time, and now it’s beginning to be possible to quantify it.”

Finland, for example, is obviously a more stable country than Sudan. But attaching a number to that statement — say, Sudan is eight times more likely to experience a coup or revolution in the next two years — is much trickier.

• • • • • • • • • • • • • • •

One thing open-source indicators probably can’t do is write the news before it happens.

“Predicting events is even harder than predicting trends,” Elkan said. “If you think of a trend as creating the probability for something, that probability combined with some spark makes the actual event.”

And then you have to predict the spark, too.

“We can say unemployment is trending upwards, and that layoffs are increasing,” Elkan went on, “but it’s still very difficult to predict which manager of which company will wake up in the morning and decide, ‘I can’t wait any longer, I need to lay off 10 people today.’”

It’s possible to predict, for instance, that riots are more likely in London than they are in Orlando. But no one would have predicted a week ahead of time that a specific police shooting would catalyze riots (as plenty of other police shootings — even in other seething areas — don’t set off riots). Similarly, no one would have predicted that the suicide of a Tunisian fruit vendor would spark a revolution that would sweep out of North Africa and into the Arabian Peninsula and Mideast.

In this sense, there may be a sizable gap between our imagination of what’s possible with trend prediction and the reality of what it can do for policymakers.

The primary limitation isn’t the data collection; it’s the data analysis. And there’s a point in the process where human judgment must take over from automated computer systems. That is inevitably the moment when controversial or expensive action looms: Should we double the FEMA budget, send support to Syrian revolutionaries or shift course on jobs policy to quell a domestic uprising?

The attention span of decision-makers, Elkan cautions, is a limited resource, too.

There’s another factor: Governments aren’t the only ones who might like to leverage this data. Its business application is obvious. Movie theaters could use trend prediction to book films. Department stores could use it to stock products. Corporations could mine Twitter chatter about them to craft corporate social responsibility policies.

“That’s one reason to be skeptical about the extent of the possibility,” Elkan said. “If it was really possible, for example, to predict unemployment much better than existing methods do, then people on Wall Street would be doing it and taking advantage of it.”

Like the other researchers, he’s cautious about where all of this is headed, even as he tries to design computer models to more accurately predict the future.

“We can easily specify, ‘Let’s monitor every newspaper and every radio station everywhere in the world, then as soon as there’s one school that has a local outbreak of some disease, then we can put that into our computer models and make predictions of when it will come to the United States,’” Elkan said. “That’s a realistic dream, but it’s still a dream at this point.”
