For two days last January, the Harvard professor Raj Chetty, a recent winner of the John Bates Clark Medal for the best American economist under the age of 40, took the stand in a Los Angeles courtroom. He was appearing as a star witness for the plaintiffs in Vergara v. State of California, a case brought by a group of philanthropists hoping to overturn tenure and other union protections for public school teachers so that ineffective educators could be more easily fired.
On the stand, Chetty testified that he had data demonstrating the extent to which great teachers can transform children’s lives. He described a groundbreaking study that he had conducted with two other economists, John Friedman and Jonah Rockoff, and that the three have since published in the American Economic Review, under the title “Measuring the Impacts of Teachers.” Chetty and his colleagues examined 20 years of data from more than a million children and their teachers. What they discovered was striking. The students assigned to just one top teacher experienced small yet observable differences in life outcomes: on average, they went on to earn 1.3 percent more per year; they were 2.2 percent more likely to be enrolled in college at age 20; and they were 4.6 percent less likely to become teen parents. If there were a way to move the best teachers to the lowest-performing schools, the authors theorized, then most of the gap in test scores between poor and middle-class children could be closed.
In the courtroom, Chetty implied that value-added measurement—the complex statistical tool that he and his colleagues had used to conduct their research—could be used to determine which teachers to fire and lay off, regardless of tenure protections.
Value-added measurement is the most sophisticated and fairest research method ever developed to draw conclusions about teachers from test scores. It uses students’ past scores on standardized tests to estimate how well they will perform the following year. Teachers who preside over larger-than-expected jumps in test scores earn high value-added ratings, while teachers whose students do worse than expected earn low ratings. And because poor children experience slower academic growth than middle-class and affluent kids, value-added measurement controls for demographic traits that teachers cannot sway—among them family poverty, single parenthood, and, in some cases, how often children have moved homes or been held back a grade.
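To make the mechanics concrete, here is a minimal sketch, in Python, of the simplest version of the idea: regress this year’s scores on last year’s scores plus demographic controls, then average each teacher’s residuals. Real value-added models are considerably more elaborate (they pool multiple years of data and shrink noisy estimates toward the average), and the column names below ("score", "prior_score", "free_lunch", "single_parent", "teacher_id") are hypothetical stand-ins for whatever a district’s records actually contain.

```python
# A toy value-added calculation (illustrative only; not the model used
# in the Chetty, Friedman, and Rockoff study).
import pandas as pd
import statsmodels.formula.api as smf

def value_added(df: pd.DataFrame) -> pd.Series:
    """Rank teachers by their students' average residual: how much better
    (or worse) the students scored than the model predicted."""
    # Predict this year's score from last year's score plus demographic
    # controls the teacher cannot influence.
    model = smf.ols(
        "score ~ prior_score + free_lunch + single_parent", data=df
    ).fit()
    # A student's residual is the part of the score the controls don't explain.
    residuals = df["score"] - model.predict(df)
    # A teacher's rating is the mean residual across his or her students.
    return residuals.groupby(df["teacher_id"]).mean().sort_values(ascending=False)
```

Even in this toy version, the design choice is visible: a teacher is judged not on raw scores but on the gap between what students actually scored and what the model predicted for them.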
As novel and ingenious as value-added measurement is, however, it is also a manifestation of something very old: the abiding obsession in American education with sorting and ranking people. For more than 150 years, we’ve been experimenting with wildly different ways of identifying intelligence, aptitude, and performance—and always we’ve assumed that this kind of test-based sorting, carried out as scientifically as possible, will be the key to improving our schools. Over the generations, politicians and reformers have embraced a succession of education “sciences,” each of which, motivated by the urge to classify, has committed us to a regime of testing that has consumed lamentable amounts of funding, political will, and classroom time. Given how much this impulse has helped to establish our educational priorities, it’s worth scrutinizing where it came from—and why it has persisted so long.
BELIEVE IT OR NOT, the first “science” of American public education was phrenology, a Northern European craze that swept the United States in the 1830s and 1840s, just as the common-schools movement began its push for state-funded compulsory elementary education. Phrenology involved analyzing the sizes and shapes of people’s skulls in order to determine their moral and intellectual natures. It was racist (characterizing Mediterraneans as hotheaded and lazy, blacks as brutish, and Northern Europeans as hardworking and intelligent), and plenty of 19th-century Americans saw it for what it was: hogwash. Yet it had influential proponents, among them Horace Mann, the nation’s first state secretary of education.
Phrenology, Mann believed, opened up a path for progressive reform, because it provided a way to identify even the most naturally “criminal” or “dumb” children, and then to improve them through a system of common, public education. “Those orders and conditions of life amongst us, now stamped with inferiority,” he declared, “are capable of rising to the common level, and of ascending if that level ascends.” Systematically applied, Mann and other enthusiasts felt, phrenology would eradicate poverty and crime within just a few generations. It was an optimistic science.
Phrenology didn’t live up to its promise, of course, and by the turn of the 20th century American educators had embraced a new, more pessimistic science: educational psychology. One of the field’s founders, E.L. Thorndike, of Teachers College, Columbia University, believed passionately that intelligence and moral capacity were as unchangeable as height or eye color. Thorndike, who was also affiliated with the Eugenics Record Office (a laboratory in Cold Spring Harbor, New York, where researchers advocated for the forced sterilization of the mentally disabled), felt that the job of educational testing was to identify not those students most in need of educating but rather those who were of high enough intelligence to be worth educating. To that end, he designed multiple-choice tests whose scores would map onto standard scales for all major school subjects.
So began the modern era of standardized testing. Soon another member of the Cold Spring Harbor circle, a psychologist named Robert Yerkes, convinced the U.S. Army to allow him and several colleagues—among them Carl Brigham, who later developed the SAT—to administer intelligence tests to over a million World War I recruits. These tests measured not innate intelligence, to the extent that such a thing exists, but acquired knowledge. “The main factory of the Ford automobile,” one question read, “is in: Bridgeport, Cleveland, Detroit, Youngstown.”
Researchers claimed that the results of these exams established the existence of five levels of biological intelligence, each corresponding to an occupational class: professional and business, clerical, skilled trades, semi-skilled trades, and unskilled labor. Not surprisingly, native-born white Protestants scored at the top of the scale; Jewish and Catholic immigrants in the middle; African Americans at the bottom.
By the mid-1920s, respected researchers and journalists began to decry the wartime intelligence studies as bunk science, noting that they failed to account for the differences in recruits’ previous schooling. A 1932 study of over 100,000 New York City fifth graders, for example, found that socioeconomic factors such as family income and access to health care outweighed IQ scores—that is, scores derived from intelligence tests—as predictors of academic success. In the years that followed, studies appeared suggesting that IQ was changeable over time, too, and was not a measure of innate talent. The psychologist Otto Klineberg published a study in 1935 showing that Southern-born blacks were able to score higher on IQ tests after living in the North for several years.
In 1928, Brigham, the creator of the SAT, recanted his previous claims about the close relationship between race, ethnicity, and intelligence as supposedly demonstrated by IQ scores, later dismissing his own conclusions as “without foundation.” But it was too late. All over the country, schools were rushing to buy and administer standardized intelligence exams, which they used to assign students to general, vocational, or academic tracks. IQ testing had replaced phrenology as school reform’s favored science for sorting and classifying children, and would remain so for three decades. One federal study found that as early as 1925, 64 percent of elementary schools were already using intelligence tests to assign students to either academic or vocational tracks.
IN 1968, ROBERT ROSENTHAL, a Harvard psychologist, and Lenore Jacobson, a California educator, published a touchstone study titled “Pygmalion in the Classroom.” The study helped end the country’s love affair with IQ testing.
Rosenthal and Jacobson reported that in 1965, at a San Francisco elementary school, they had told teachers that several of their new students had performed well on the Harvard Test of Inflected Acquisition, and thus were likely to “bloom” academically as the year progressed. No such test existed, however. Instead, Rosenthal and Jacobson had simply selected 20 percent of the school’s students at random and labeled them as high achievers. Remarkably, at the end of the school year these students had indeed bloomed: They demonstrated bigger testing gains, even on IQ exams, than did their peers in the same classrooms. To Rosenthal and Jacobson, this was disturbing evidence of the effect that teachers’ expectations have on student performance—and, indeed, Teach for America still cites the study in its instructional materials, warning its teachers against assuming that poor children or children of color are less academically able.
Education policy shifted decisively away from IQ after the 1960s. But the IQ movement’s adherents, many of whom would significantly influence policy in the decades that followed, developed a passion for its successor science, standardized achievement testing, which was supposed to measure not aptitude but learning. An important proponent of achievement testing was Terrel (Ted) Bell, a Utah school administrator who, after coming of age in the IQ era and gaining a national reputation as a data-driven reformer, became President Reagan’s first secretary of education in 1981. (As a youth, Bell was so taken with intelligence testing that, while serving in the Marines during World War II, he mouthed off to a commanding officer by saying that he would register as a “moron” on an IQ test—for which Bell served time in solitary confinement.)
As education secretary, Bell appointed a commission to produce an inspirational plan for school reform that, he hoped, would unite both political parties, the media, and the public behind efforts to improve education. The commission produced a hugely influential report called “A Nation at Risk,” which, in succinct and readable prose, described the “rising tide of mediocrity” in American education, as made clear by, among other things, a 20-year decline in SAT scores. (What the report didn’t acknowledge was that a more economically and racially diverse group of students than ever before was sitting for the SAT.)
“A Nation at Risk” recommended a variety of policy fixes, among them higher pay and stiffer accountability for teachers; more challenging math, science, and foreign-language classes; a school day that was one hour longer and a school year that was 40 days longer; and a larger role for the federal government in setting and funding the national education agenda. These were noble and ambitious goals, but politically they were unattainable: The nation was in the grip of a budget-cutting fervor and the budding culture wars. New state-level achievement testing programs turned out to be the only major course of action that policymakers were able to agree on, because they were relatively inexpensive and uncontroversial. If scores were low, politicians could describe school funding as an emergency necessity; if scores improved, they could claim that the investments were working.
Some 30 years later, those making education policy in this country remain in achievement testing’s thrall. President George W. Bush’s No Child Left Behind used achievement scores to declare schools either “adequate” or “failing.” Today, thanks to Obama administration incentives, roughly two-thirds of the states require that student test scores be weighed in teacher evaluations.
ONE OF THE KEY findings of the value-added study published by Raj Chetty and his colleagues—a finding rarely mentioned in the media—was that out-of-school factors, such as family income and neighborhood poverty, currently have a far greater effect on the achievement gap than do differences in teacher quality between schools (differences that, the researchers reported, account for only 7 percent of the current gap). They also acknowledged that their study, like almost every other major value-added study ever conducted, took place in a low-stakes setting—that is, teachers were not being evaluated or paid according to their students’ test scores. In a higher-stakes setting, they warned, educators might teach to the test, or even cheat, in ways that would cause test scores to lose their predictive power. Nonetheless, they were hopeful: If the top value-added teachers in the country could somehow be moved systematically to the lowest-performing schools, they theorized, perhaps three-quarters of the current test-score achievement gap could be closed. That theory is almost impossible to test, however, given the unattractive working conditions in many low-income schools. When a Department of Education/Mathematica Policy Research trial offered more than 1,000 high-value-added teachers $20,000 to transfer to a poorer school, less than a quarter chose to apply. Inconveniently, too, those who did transfer produced test-score gains among elementary school students but not among middle schoolers—a reminder that teachers who succeed in one environment will not always succeed in another.
Contemporary education researchers, among them Andrew Butler and John Hattie, have written extensively on the most academically powerful uses of testing. And when it comes to gathering information about how teachers should actually teach, Butler and Hattie’s work suggests that value-added measurement, as useful as it is in other ways, is mostly beside the point. That’s because it’s based on standardized state tests given toward the end of the school year. Spending a lot of time preparing for those tests turns out to be counter-productive for learning. Research shows that kids learn best when classroom teaching is geared not toward high-stakes year-end tests, but toward low-stakes, unit-level quizzes, created and graded by classroom teachers who use the results to refine their instruction throughout the year. The soundest use of testing, in other words, is as an instrument to figure out what children do and do not know, so that we can teach them better along the way.
Any achievement testing attached to high stakes for educators invites teaching to the test, which often narrows the curriculum in counter-productive ways. Because of that, Jonah Rockoff, who co-authored the value-added study with Raj Chetty, suggests that we need to come up with new ways to measure teachers’ influence on students, perhaps by studying how teachers affect students’ behavior, attendance, and GPA. “Test scores are limited,” Rockoff says, “not just in their power and accuracy, but in the scope of what we want teachers and schools to be teaching our kids. … There’s not just one thing we care about our kids learning. We’re going to measure how kids do on socio-cognitive outcomes, and reward teachers on that, too.”
But is it really fair to judge teachers on their students’ attendance, given the role that, say, parenting and health play? Should a teacher be punished if a boy in her homeroom gets into a fistfight during recess? These are the kinds of questions we’ll need to grapple with as we experiment with new kinds of education science. And as we do, we’ll need to keep in mind the much bigger question suggested by the history of failed American school reforms: Should we continue to devote our limited political, financial, and human resources to measuring the performance of students and teachers, or should we devote those resources to improving instruction itself?