Computer Determines If Torah Is Mosaic … or a mosaic

In a marriage of traditional biblical scholarship and the latest in computerized textual analysis, a team of Israeli scholars has shed light on a long-simmering dispute over the authorship of the first five books of the Old Testament.

The new technique supports a scholarly consensus that the Torah, traditionally attributed solely to Moses, is based on two primary sources.

“It’s cool to be able to answer some of these millennia-old questions with cutting-edge 21st-century techniques,” says Idan Dershowitz, a graduate student in Biblical studies at the Hebrew University of Jerusalem, whose father, Nachum, was one of the computer scientists on the project.

The Torah (known to Christians as the Pentateuch) consists of Genesis, Exodus, Leviticus, Numbers, and Deuteronomy. Based on differences in writing style, such as thematic repetitions and how often specific terms appear in the text, scholars have long believed that the Torah was compiled from a variety of sources – although theories vary as to how many there might have been, Dershowitz says.

To research the problem, the Dershowitzes joined with Navot Akiva and Moshe Koppel, computer scientists at Israel’s Bar-Ilan University. The statistical analysis of word frequency to determine authorship dates back to the early 1960s, and Koppel has previously demonstrated computational methods that can determine, for example, whether a man or a woman wrote a given text, based on how many pronouns the author used.

With the biblical texts, the researchers theorized that looking at synonym usage–the frequency with which a writer might use “say” rather than “speak,” for example–would provide a content-neutral way of teasing out the differences, the younger Dershowitz says.

“If I can say ‘speak’ or ‘say,’ whether I choose one or another is completely a stylistic thing,” he says.

As a test, they turned to the prophetic books of Ezekiel and Jeremiah, each of which, scholars agree, largely had a single author. They mixed the unlabeled chapters from both books–100 in all–and asked the computer to divide them into two groups. Their two-step process first uses a clustering algorithm to categorize the texts according to sets of synonyms, then re-sorts them according to the usage of common biblical terms.

This method correctly grouped the texts back into the original books with 99-percent accuracy, Dershowitz says. “It makes sense,” he says, “because you would expect any stylistic preference to manifest itself in different preferred words.”
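The synonym-based first step described above can be sketched in a few lines of Python. This is a toy illustration, not the researchers' code: the synonym sets, the sample "chapters" and the function names are all invented here, the clustering is a plain two-cluster k-means chosen for simplicity, and the second step (re-sorting by common terms) is left out.

```python
from collections import Counter

# Hypothetical synonym sets for illustration only; the researchers
# built theirs from a concordance of the Hebrew text.
SYNSETS = [("say", "speak"), ("big", "large")]

def synonym_profile(text):
    """For each synonym set, the share of uses that went to the first word."""
    counts = Counter(text.lower().split())
    profile = []
    for first, second in SYNSETS:
        total = counts[first] + counts[second]
        # 0.5 stands for "no evidence either way" when neither word appears.
        profile.append(counts[first] / total if total else 0.5)
    return profile

def distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, rounds=10):
    """Plain 2-means clustering, seeded with the two points farthest apart."""
    pairs = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))]
    i, j = max(pairs, key=lambda p: distance(points[p[0]], points[p[1]]))
    centers = [points[i], points[j]]
    labels = []
    for _ in range(rounds):
        # Assign each point to its nearer center, then recompute the centers.
        labels = [0 if distance(p, centers[0]) <= distance(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up "chapters": two favor say/big, two favor speak/large.
chapters = [
    "they say the big river is near and they say it often",
    "we speak of a large river and speak of it again",
    "you say the big hill is far",
    "i speak about the large hill",
]
labels = two_means([synonym_profile(c) for c in chapters])
print(labels)
```

Because each profile records only which synonym was chosen, not what the passage is about, chapters with the same word preferences land in the same group regardless of subject matter, which is the content-neutral property Dershowitz describes.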

Having proven the synonym-based sorting concept, the researchers applied it to the Torah. Over the past century scholars poring over biblical texts have proposed different theories about the authorship of the first five books, with some suggesting that as many as four distinct sources were intertwined over time, Dershowitz says.

The least-controversial theory identifies a “priestly” writing style that is distinct from the non-priestly portions of the text, he says. “Almost everyone agrees that you have a whole lot of material that was written by priests, and that material is unique in style and content and form,” he says. “We’re probably not talking about a single author, but nevertheless, all of the priestly material is extremely similar.”

With that in mind, “We asked the computer to split up the Pentateuch into two and see what we get,” Dershowitz says. “When we compared our results to the priestly/non-priestly business, we found them to be very similar. It’s in the vicinity of 90-percent agreement with the [scholarly] consensus.”

The team is still trying to figure out what accounts for the discrepancy with the accepted scholarly view in the remaining 10 percent of the samples, Dershowitz says. “When we disagree, what’s going on? Is it because they’re wrong or is it because we’re wrong?”

The use of statistical analysis to determine authorship dates back to 1964, when Frederick Mosteller and David Wallace tackled the problem of who wrote each of a dozen disputed Federalist papers. After analyzing the frequency of 265 “function” words (such as “and,” “of” and “the”), they cast their vote for James Madison (as opposed to Alexander Hamilton), a verdict that has since been affirmed by other researchers. Computer-assisted studies of Shakespeare’s writings date back to the late 1960s, and word-frequency analysis has since been applied to poetry, song lyrics and other literature.
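To give a rough sense of how function-word counts can point to an author, here is a minimal Python sketch. It is an assumption-laden toy, not the statistical model Mosteller and Wallace used: the word list is a small illustrative subset, the sample texts and author labels are made up, and the comparison is a simple nearest-profile match.

```python
# Illustrative subset of function words; the 1964 study examined 265.
FUNCTION_WORDS = ["and", "of", "the", "upon", "while", "whilst"]

def rates(text):
    """Occurrences of each function word per 1,000 words of text."""
    words = text.lower().split()
    return [1000 * words.count(w) / len(words) for w in FUNCTION_WORDS]

def closest_author(disputed, samples):
    """Return the author whose function-word rates are nearest the disputed text's."""
    target = rates(disputed)

    def gap(author):
        # Sum of absolute differences between the two rate profiles.
        return sum(abs(a - b) for a, b in zip(rates(samples[author]), target))

    return min(samples, key=gap)

# Made-up writing samples for two hypothetical authors.
samples = {
    "author_a": "upon the whole the matter rests upon the will of the people while they wait",
    "author_b": "the matter rests on the will of the people and of the states whilst they wait",
}
guess = closest_author(
    "the power rests on the states and of the people whilst we wait", samples
)
print(guess)
```

Function words are useful for this kind of comparison because writers use them largely out of habit, whatever the topic at hand.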

Dershowitz views the synonym-based method as a step forward, but could it possibly be applied to other ancient texts?

Mark Liberman, a linguist and computer scientist at the University of Pennsylvania, says one limitation is that in compiling their synonym lists, the Israeli team relied on Strong’s Exhaustive Concordance of the Bible, a 19th-century concordance of the King James Bible that lists every original Hebrew or Greek word in the text, its meaning and how many times it is used.

“As far as I know, there is no equivalent for any other texts, religious or otherwise,” Liberman says. But if someone could devise an automated method of distinguishing between multiple word meanings in a text, “It’s possible in principle that synonym-choice would be useful as a feature for literary or forensic authorship attribution,” he says.

The Israeli researchers presented a paper on this work in June at the 49th annual conference of the Association for Computational Linguistics in Portland, Oregon.

“Just from chatting with people, some people like the idea of it a lot,” Dershowitz says. “Other people are a bit mistrusting of this technology, especially applying it to a field like this, which has never really been analyzed in that way.”
