In one of the most famous episodes in the history of epidemiology, a London physician named John Snow traced the source of an 1854 London cholera epidemic to a single contaminated water pump. At the time, the prevailing belief was that cholera was transmitted by hard-to-prevent miasmatic vapors in the atmosphere, but Snow's data convinced him that the disease was water-borne. He mapped the detailed geographical locations of cholera deaths and saw that they correlated with the use of water from one particular pump on Broad Street. Acting quickly, Snow convinced the neighborhood council to remove the pump handle. With that simple fix, the cholera epidemic subsided, and Snow was eventually recognized as a pioneer in the field of "geographical epidemiology."
A century and a half later, a team of scientists at Google set out to follow in John Snow's footsteps by using the geographical locations of certain search engine queries to track outbreaks of the flu in the United States. In a 2009 paper, Google's researchers showed that their big data approach was able to accurately track flu outbreaks in real time, a big improvement over the one to two weeks it takes the Centers for Disease Control and Prevention's surveillance network to collect and process its data. It was easy to mistake this work for just another of Google's clever but temporary software tricks, but Google Flu Trends is in fact part of a growing new field called digital disease detection. It's a field that, like so much else associated with big data, offers great promise, but comes with some big and unresolved ethical questions, the sort that John Snow didn't need to worry about.
Digital disease detection means using Web data for public health surveillance. It's essentially a combination of big data and crowdsourcing used to track disease outbreaks and other public health issues more quickly and with a higher geographical resolution than traditional disease surveillance systems, which depend on official reports made by physicians and health departments. Google Flu Trends is one of the more public examples, but digital disease detection is almost as old as the World Wide Web itself.
ProMED-mail, established in 1994, is an email service that collects and disseminates reports of disease from local news outlets and other sources. The Global Public Health Intelligence Network, sponsored by the World Health Organization, has been tracking disease outbreaks by crawling the Web for disease-related news stories since 1997.
These digital disease detection tools are now well established, while others are newcomers that exploit the more recent growth in social media platforms and mobile phone adoption. For example, one 2011 paper described how a team of researchers used cell phone position data to track the mass movement of more than half a million people out of Port-au-Prince after Haiti's catastrophic 2010 earthquake. Real-time information like this could be used to make relief programs much more targeted and effective. And Google Flu Trends now has competition: a Twitter-based program to predict influenza trends.
Using big data from social media to carry out public health surveillance raises some clear ethical issues, very few of which are anywhere close to resolution. As an international quartet of digital disease detection researchers wrote recently in PLoS Computational Biology: "Current regulatory and ethical oversight mechanisms are ill-equipped to address the entire spectrum of [digital disease detection]-type activities." The authors argue that this developing field faces not only the traditional ethical issues of both big data and public health, considered separately, but also new ethical issues created by their merger.
"A context-sensitive understanding of ethical obligations," the researchers argue, "may reveal that some data uses that may not be acceptable within corporate activity (e.g., user profiling and data sharing with third parties) may be permissible for public health purposes." Do corporations with such data then have an ethical obligation to make the data available to public health officials? If companies do share, are they obligated to disclose that to their users or allow them to opt out? And should private companies in the health care sector, like insurance companies, have access to the data?
One of the most urgent ethical issues that the researchers identify lies at what they call "the nexus of ethics and methodology." The ethical issue can be reduced to one question: Do these methods actually work?
Ensuring that the methods work "is an ethical, not just a scientific, requirement," the researchers note. Unlike some other social media experiments, a flawed public health monitoring program can cause serious physical and economic harm to large numbers of people.
Digital disease detection programs are relatively easy to set up compared to traditional disease monitoring systems, which means there is a risk that the bar for entering this field might be dangerously low. An under-prediction of a disease outbreak can result in complacency and lack of preparedness by health officials or the public. An over-prediction could cause panic, misallocation of limited supplies of vaccines or medical resources, and, as some reactions to the recent Ebola outbreak demonstrated, damaging stigmatization of people or communities who don't pose a risk.
As the physicist Niels Bohr once noted, prediction is hard—especially about the future. Big data programs and algorithms often perform well when they’re used to “predict” the existing data that was used to help build them, but then do poorly when confronted with new data. That's where digital disease detection tools that use social media data often run into trouble.
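The gap between fitting old data and predicting new data can be seen in a toy example. The Python sketch below is purely illustrative and has nothing to do with how Google Flu Trends actually works: the weekly "case counts" are invented, and the two models (a straight line and a polynomial forced through every past data point) are assumptions chosen only to make the contrast visible. The flexible model reproduces the past perfectly and then misses the following weeks badly, while the simpler model is slightly wrong about the past but far closer on the future.

```python
# Illustrative only: every number here is invented, not real flu data.

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes exactly through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical weekly counts: a steady upward trend plus small wobbles.
train_weeks = [0, 1, 2, 3, 4, 5]
train_cases = [0.5, 1.6, 4.3, 5.4, 8.4, 9.8]
# "Future" weeks that neither model saw while it was being built.
test_weeks = [6, 7, 8]
test_cases = [12.1, 13.7, 16.2]

line = fit_line(train_weeks, train_cases)
interpolant = fit_interpolant(train_weeks, train_cases)

line_train_error = mean_abs_error(line, train_weeks, train_cases)
line_test_error = mean_abs_error(line, test_weeks, test_cases)
interp_train_error = mean_abs_error(interpolant, train_weeks, train_cases)
interp_test_error = mean_abs_error(interpolant, test_weeks, test_cases)

print("straight line: past error", line_train_error, "future error", line_test_error)
print("interpolant:   past error", interp_train_error, "future error", interp_test_error)
```

Judged only on the data used to build it, the interpolating model looks like the clear winner. The comparison reverses once both models are scored on weeks they have not seen.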
Google Flu Trends looked impressive in its initial report in 2009, where it was used to retroactively predict flu activity in previous years. But it largely missed the two waves of H1N1 swine flu that hit later in 2009. As the Google Flu researchers wrote, "Internet search behavior changed during pH1N1, particularly in the categories 'influenza complications' and 'term for influenza.'" Both are categories of search terms that are particularly important in the algorithm. The program also over-predicted the severity of the 2011-12 flu season by 50 percent.
The point is not that Google Flu Trends is a terrible idea—it's not. But we need to be cautious, for scientific as well as ethical reasons, and not succumb to "big data hubris," defined by a group of researchers at Harvard and Northeastern University as "the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis."
Digital disease detection with social media has the potential to be a powerful tool that John Snow would have embraced. However, flawed digital public health tools have more serious consequences than the premature, buggy releases of apps, Web services, and social media features that we all put up with. Flawed programs that are used to make real public health decisions are not just technical failures—they’re ethical ones too.