Who Says Most Psychology Studies Can't Be Replicated?

A high-profile paper left that impression last year. Now, Harvard University researchers are offering a detailed rebuttal.
(Photo: John Moore/Getty Images)

Here's the thing about research: It can always be challenged, re-examined, re-interpreted. In a high-profile example of that process, a much-publicized 2015 paper looked at 100 psychology studies published in major journals, and found only 36 percent of them could be successfully replicated.

This led to many commentaries about a crisis in psychological research. But now the re-examination process has, in a sense, come full circle.

A group of researchers led by prominent Harvard University psychologist Daniel Gilbert has published a detailed rebuttal of the 2015 paper. It points to three statistical errors that, in their analysis, led the original authors to an unwarranted conclusion.

After controlling for those mistakes, "the data (of the original paper) are consistent with the opposite conclusion—namely, that the reproducibility of psychological science is quite high," the researchers write in this week's issue of the journal Science.

The issue also contains a less-than-robust defense from the original research team, led by the University of Virginia's Brian Nosek. It concedes that "both optimistic and pessimistic conclusions about reproducibility are possible, and neither are yet warranted."

Gilbert and his colleagues point to three major problems with the study: error (conducting "replication" studies that didn't truly re-create the study being tested); power (using a single attempt at replication as evidence, rather than making multiple attempts); and bias (using protocols that appear to be weighted toward giving the original study a failing grade).

On that first point, they note that many of the "replication" studies significantly differed from the originals.

"An original study that measured Americans' attitudes towards African-Americans was replicated with Italians, who do not share the same stereotypes," they write. "An original study that asked college students to imagine being called on by a professor was replicated with participants who had never been to college."

Also: "An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon. An original study that gave younger children the difficult task of locating targets on a large screen was replicated by giving older children the easier task of locating targets on a small screen."

On the issue of statistical power, Gilbert and his colleagues point to an earlier Nosek project called "Many Labs," in which 36 laboratories attempted to replicate each of 16 psychology studies. Those researchers attempted to replicate each study "35 or 36 times," and when that data was pooled, "a full 85 percent of the original studies were successfully replicated."

The simpler technique of the 2015 study, in which there was only one attempt to replicate each study, "severely underestimates the actual rate of replication," they conclude.
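A toy simulation makes the power argument concrete. This sketch is my own illustration, not the Many Labs analysis itself: it assumes an arbitrary 50 percent chance that any single replication attempt detects a genuinely true effect, and it counts a study as replicated if any one of 35 attempts succeeds, which is a loose stand-in for pooling the data across labs.

```python
import random

random.seed(0)

POWER = 0.5        # assumed chance one replication attempt detects a true effect
ATTEMPTS = 35      # roughly the per-study attempt count in "Many Labs"
TRIALS = 100_000   # simulated true effects

# Single-attempt approach (as in the 2015 paper): one try per study.
single = sum(random.random() < POWER for _ in range(TRIALS)) / TRIALS

# Multi-attempt approach: count the effect as replicated if any of the
# 35 attempts succeeds (a crude proxy for pooling data across labs).
pooled = sum(
    any(random.random() < POWER for _ in range(ATTEMPTS))
    for _ in range(TRIALS)
) / TRIALS

print(f"single-attempt replication rate: {single:.2f}")
print(f"multi-attempt replication rate:  {pooled:.2f}")
```

Under these assumptions, a single attempt replicates only about half of the true effects, while 35 attempts catch nearly all of them, which is the sense in which a one-shot design "severely underestimates the actual rate of replication."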

Regarding bias, they note that Nosek and his colleagues asked the authors of each original study whether they endorsed "the methodological protocol for the to-be-attempted replication."

"We found that the endorsed protocols were nearly four times as likely to produce a successful replication (59.7 percent) as were the unendorsed protocols (15.4 percent)," they write. "This strongly suggests that (the methods used) biased the replication studies toward failure."

In their rebuttal to the rebuttal, Nosek and his colleagues agree that "both methodological differences between original and replication studies and statistical power affect reproducibility," but add that the Gilbert team's "very optimistic assessment is based on statistical misconceptions and selective interpretation of correlational data."

"The combined results of (our) five indicators of reproducibility suggest that, even if (a suspect study's original conclusion was) true, most effects are likely to be smaller than the original results suggest," they write.

They also note that Gilbert and his peers did not contest findings that went against their interpretation, "such as evidence that surprising or more underpowered research designs were less likely to be duplicated."

"More generally," Nosek and his colleagues add, "there is no such thing as exact replication." As they see it, their paper "provides initial, not definitive, evidence—just like the original studies it replicated."

In a statement to the media, Gilbert insisted that his team's paper is "not a personal attack," but rather a "scientific critique."

"No one involved in (the 2015) study was trying to deceive anyone," he said. "They just made mistakes, as scientists sometimes do."

He added that Nosek and his team "quibbled about a number of minor issues" in the rebuttal, "but conceded the major one, which is that their paper does not provide evidence for the pessimistic conclusions that most people have drawn from it."

"We hope the (authors of the 2015 study) will now work as hard to correct the public misperceptions of their findings as they did to produce the findings themselves."

So, Nosek and his colleagues raised doubts about a series of attention-grabbing but questionable studies—and, in the process, created an attention-grabbing but questionable study. That surely calls for a commentary in the Journal of Irony.


Findings is a daily column by Pacific Standard staff writer Tom Jacobs, who scours the psychological-research journals to discover new insights into human behavior, ranging from the origins of our political beliefs to the cultivation of creativity.