One day in high school science class, our teacher handed out black plastic boxes that were completely sealed. As we lifted, rotated, and shook the boxes, we could hear a heavy object rattling around inside, bumping into the barriers of an internal maze we couldn’t see. Our assignment was to come up with a mental model of what was inside the black box without smashing it open and looking inside. This was a simplified lesson, illustrating how science has worked for nearly 500 years: Scientists make observations and build conceptual models of how the world works. But a revolutionary change is underway. Big data and cheap, powerful computers are transforming what it means to do science, and the result could reshape science’s role in society.
One of science’s primary goals is to understand the relationship between cause and effect. Scientists devote substantial effort to explaining why things happen as a consequence of a few fundamental principles: The trajectory of a rocket is explained by Newton’s laws of mechanics, the behavior of a chemical reaction is explained by the principles of thermodynamics, and the effect of a genetic mutation is explained by DNA’s role as a blueprint for the protein components of a cell. These models relating cause and effect are useful in two ways. First, we can predict an effect just by knowing about the cause. Second, we can use the models to understand why something is a consequence of its causes. Prediction and understanding have been intimately tied together in science, but the influence of big data is now breaking them apart.
HOW CAN YOU PREDICT something without understanding it? Simple: Find some other phenomenon that tends to occur with the event you’re trying to predict. You may never know why your weekly poker buddy chooses to bluff on a particular hand, but his tell is a reliable indicator that he is, in fact, bluffing. With big data, it turns out that almost everything in nature and society has a tell, one that can be discovered with sophisticated computer models that run on inexpensive hardware and crunch through terabytes of data. If you measure enough variables, it doesn’t matter whether you understand the relationship between cause and effect; all you need is a relationship between one variable and another.
These computer models—Hidden Markov Models, Boltzmann Chains, Support Vector Machines, Petri Nets, and more—are based on mathematical and statistical concepts that have been around for decades, but thanks to ever cheaper and more powerful computers, they are only now beginning to realize their full potential. With enough data, these models can find trends among enormous numbers of variables in complex data, trends that the unaided human mind would never spot. Remarkably, these models don’t care what kind of data you give them. You can put the same type of model to work discovering terrorism with cell phone and email metadata, predicting flu outbreaks, finding genes connected to cancer, or identifying which of your customers are pregnant. Given enough data, these models can predict what is likely to happen, but without telling you why.
We’re beginning to realize how powerful and disruptive big data can be to us as consumers and citizens. Big data is shaking up some sciences as well, creating a fundamental philosophical controversy over what it means to do science. Nowhere is this more evident than in molecular biology, which in the aftermath of the Human Genome Project has turned to big data in a big way.
The rise of big data has generated a controversy in biology over how much emphasis and funding should be put on traditional, “hypothesis-driven” research, versus “unbiased,” big data projects. Hypothesis-driven research is designed to answer specific questions about cause and effect, such as the idea that a mutation in a particular gene causes cancer by hindering the ability of cells to repair damaged DNA. By testing ideas like this, we come to understand how DNA mutations cause cancer, and, in many cases, we can make a prediction about what effect a mutation will have.
A big data scientist, on the other hand, considers hypothesis-driven science too limiting. Cancer is a complex disease, involving many genes, and we’ll never understand it if we get bogged down in the time-consuming process of testing cause-and-effect relationships one at a time. Instead, we can tackle cancer much more effectively by measuring as many variables as possible in as many cancers as we can collect, without being biased by preconceived ideas. We organize big, collaborative (and expensive) projects, like The Cancer Genome Atlas, to collect and process the data. We can then use that data to build computer models of cancer that predict which mutations will be important in particular types of cancer, without any hypothesis about why those particular mutations are important.
In practice, the distinction between hypothesis-driven and big data research isn’t quite as sharp as I’ve portrayed it. In recent years, more traditional labs have incorporated big data into their work, and big data projects have sometimes spawned genuinely new hypotheses. Many biologists envision the two modes of research working in tandem, each approach enhancing the other. The National Institutes of Health has recently initiated a “Big Data to Knowledge” project in the hope that this will happen. But a fundamental tension among biologists remains unresolved: how much weight to give prediction and how much to give understanding, in an era when dizzying technological developments make it easier than ever to amass piles of data as part of big projects that capture headlines and become the public face of scientific progress.
WHAT DOES IT MEAN to understand a complex phenomenon like the biology of a cancer cell? Our minds will obviously never encompass all that is going on in a cancer cell by logical reasoning from a few fundamental principles. Traditionally, we’ve tackled something like cancer with a patchwork of cause-and-effect conceptual models, each one graspable by a normal human being. The emerging alternative is to understand something like cancer by “integrating” our big data, using powerful models that capture subtle relationships in the data without making any claims about what causes those relationships. Although their predictive power is impressive, the models themselves are black boxes, not meant to offer the sort of traditional conceptual insights scientists typically work with. Is this now what it means to understand a complex system?
Big data in science is powerful, and it is not going away, but how it will change science is uncertain. If testing ideas about cause and effect takes a secondary role, are we less likely to see genuinely new ideas that will lead scientists in radically new directions? Big data science exploits the variables that we already know how to measure. How will we figure out when to measure something new? And what does it mean for a scientific field to increasingly invest in universally applicable methods that don’t depend on a deep knowledge of that particular field? If you don’t need to know much biology to practice big data biology, what does it mean now to be an expert in biology? The answers to these questions may redefine what it means to do science—a change that will have important consequences in a society that relies on scientific innovation to solve problems.