Earlier this year, a group of scientists — mostly in mathematics and evolutionary psychology — published an article in Science titled “Quantitative Analysis of Culture Using Millions of Digitized Books.” The authors' technique, called “culturomics,” would, they said, “extend the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.” The authors employed a “corpus” of more than 5 million books — 500 billion words — that have been scanned by Google as part of the Google Books project. These books, the authors assert, represent about 4 percent of all the books ever published, and will allow the kind of statistically significant analysis common to many sciences.
This sounds impressive. The authors point out that 500 billion words is more than any human could reasonably read in a lifetime. Their main method of analysis is to count the number of times a particular word or phrase (referred to as an n-gram) occurs over time in this corpus. (Google's Ngram Viewer lets readers try this themselves.) Their full data set includes over 2 billion such “culturomic trajectories.” One of the examples the authors give is to trace the usage of the year “1951.” They note that “1951” was not discussed much before the actual year 1951, that it appeared a lot in 1951, and that its usage dropped off after 1951. They call this evidence of collective memory.
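For readers curious about what this kind of counting involves, here is a minimal sketch in Python. It is not the authors' pipeline: the toy corpus, the function names `ngrams` and `trajectory`, and the choice to split on whitespace are all assumptions made for illustration. It tallies how often a phrase appears among all n-grams of the same length in each year.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as one space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def trajectory(corpus, phrase):
    """Map each year to the phrase's share of all same-length n-grams that year.

    corpus is a list of (year, text) pairs. Tokenization here is a plain
    whitespace split, an assumption that ignores punctuation entirely.
    """
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        grams = list(ngrams(text.lower().split(), n))
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Hypothetical three-"book" corpus, invented purely for illustration.
toy_corpus = [
    (1950, "plans for 1951 were modest"),
    (1951, "in 1951 the festival opened and 1951 was busy"),
    (1952, "few recalled 1951 by then"),
]
result = trajectory(toy_corpus, "1951")
```

On this invented corpus the share for “1951” peaks in the year 1951 and falls away on either side, which is the shape of trajectory the authors describe. The sketch says nothing about which books were counted in the first place.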
I initially reacted to this article with skepticism. As I read more — including a recent piece (one might call it a puff piece) in Nature on one of the co-authors, Erez Lieberman Aiden, in which he was dubbed “the prophet of digital humanities” — my skepticism became stronger. I think culturomics is a nifty tool, but we need to be cautious and critical about this kind of digital data and about claims that culturomics could make “much of what [historians] do trivially easy.” Historians do much more than follow trajectories, so I am not so sure that culturomics will lead to a new way of doing historical work. It’s not the game-changer it’s been claimed to be.
I would not call myself a Luddite — I use digital resources all the time, in my research and my teaching. I have hundreds of PDFs of books I have downloaded from a variety of online sources — Early English Books Online, Eighteenth Century Collections Online, Gallica (the digital service of the French National Library), and yes, Google Books — that I use in my research.
But when I read the Science article, I was immediately struck by what seems to me to be a fundamental flaw in its methodology: its reliance on Google Books for its sample. Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society. As any historian knows, every scholarly library is different and every library has its biases. And surely I am not the only historian who has noticed that the digitizing policy of Google Books does not, and perhaps cannot, result in anything like a uniform, or a uniformly random, sample of all books in a given period. Google’s ability to digitize books is dependent on a number of factors: the willingness of libraries to open their collections for digitization; the condition of the books being digitized; copyright regulations, which allow only “snippets” of many 20th-century books; and the quality of the digitization process itself.
The authors further narrow their range by admitting only publications for which they have “metadata” — that is, author, title, year, immediately confining the range of publications to books, and not periodicals or other more ephemeral literature — and to the period after 1800. The article itself gives no clue as to how the authors obtained this metadata. But surely it skews their data set even more toward a certain kind of book, while treating books as interchangeable pieces of data. In this universe, one book is much like another.
The authors equate size with representativeness and quantity of data with rigor. I am not sure that is true. I do not deny that some of their results are interesting, particularly the tracing of linguistic and grammatical changes over time, which is like watching a speeded-up newsreel. But some of the results are simply banal. The year “1951” appears most often in 1951. The word “slavery” appears more often during the U.S. Civil War. The word “influenza” appears more often during pandemics. Duh. Are these even historical questions?
Perhaps most disturbing to me are the underlying assumptions of such work about the humanities and about what scholars in the humanities do. One assumption is that the humanities need to be more like science and that we need to be more like scientists — that quantitative knowledge is the only legitimate knowledge and that humanities scholars are therefore not “rigorous.” For well over a century, historians and their critics have debated whether their discipline is a science or an art. When the journal Past and Present was founded by a group of Marxist historians in the early 1950s, it was billed as “a journal of scientific history.” By the mid-1960s this had changed to simply “a journal of historical studies.”
On the one hand, there are plenty of examples of humanities scholars who have been using sophisticated digital tools and quantification for years. The Cambridge population survey, with birth and death information gleaned from thousands of parish record books all over England, revolutionized social history when it began in the 1960s. When I was in graduate school in the 1980s, the SPSS statistical package could be mastered as an alternative to a second language. As cultural history became more prominent, quantitative history became less fashionable, but it never disappeared.
On the other hand, as these examples indicate, there is not just one kind of historical or, more broadly, humanities scholarship as the Science authors seem to think. Not all of us trace ideas over time. Some of us look at the people who had those ideas and the places they lived and worked, and the people they knew, and how they lived. Not all of this can be found in books but must be traced across a variety of published, manuscript and material media. Although the culturomics people are confident that they can apply their methods to manuscripts and maps, I’m not going to wait for that possibility.
Much as the digital catalog did when it replaced the long-lost card catalog, such a sweeping tool leaves out the chance juxtapositions and serendipities that often tell us much more than the texts themselves. I spent many years off and on at the British Library reading advertisements in the microfilmed Burney collection of 18th-century newspapers. Now these have been digitized, and I can search for “anatomy lectures” and come up with dozens of hits that once took me many eye-straining hours to find. But the search cannot tell me that on the previous page, or in the previous issue, there was an ad for a patent medicine, or a live animal combat, or another fascinating bit of 18th-century London life that lends meaning and context to the bare entry.
It is revealing of another kind of bias that the long list of authors of the Science article includes no historians, in fact no one from the humanities (Louis Menand also pointed this out in an interview in The New York Times). To be fair, “R. Darnton” and “C. Rosenberg” (presumably the Harvard historians Robert Darnton and Charles Rosenberg) are thanked at the end. The Nature article goes out of its way to point out that Erez Lieberman Aiden studied history and philosophy and even creative writing, which is something like saying I took physics in college, and therefore I can publish on quantum mechanics in Nature. Both articles show a nearly complete lack of understanding of what historians and other humanities scholars actually do.
When Lieberman Aiden and his co-authors presented their findings at the meeting of the American Historical Association in January, AHA President Tony Grafton expressed cautious praise of this new tool. In the Nature article he sounds decidedly more anxious: “You can’t help but worry that this is going to sweep the deck of all money for the humanities everywhere else.”