The word “God” peaked in usage in the world’s books about 1830. “Women” overtook “men” in print after 1985. Sigmund Freud has gotten more ink in the past 60 years than Charles Darwin or Albert Einstein.
Researchers at Harvard University in Cambridge, Massachusetts, teamed up with Google Inc. to survey 5.2 million digitized books -- about 4 percent of all the volumes published in any language -- to analyze language patterns and quantify cultural trends from 1800 to 2000. The four-year project is described today in the journal Science.
The Harvard researchers dub their discipline “culturomics” -- evoking genomics, in which scientists use billions of bits of quantitative data to study genes. Google, which has digitized 12 percent of the 130 million books published worldwide, unveiled today an online tool that enables users to track the frequency of words and phrases. Google, the Mountain View, California-based creator of the most-used Internet-search engine, gave funding and staff to the project.
“This data is a new tool for the humanities, just one part of the puzzle that humanists can use to address questions about human society,” Jean-Baptiste Michel, the study’s coauthor and a postdoctoral researcher at Harvard’s Program for Evolutionary Dynamics, said in a telephone interview.
The researchers used the data to explore the growth of the English lexicon, society’s collective memory, the evolution of grammar, fame, and the effects of censorship.
“This tool gives us really easy access to the way that words or concepts have been used over time,” said Erez Lieberman Aiden, a coauthor, who is a member of Harvard’s Society of Fellows, scholars who are free to work without formal requirements. “This has always been notoriously difficult to achieve because you have to quantify how words were and are used now. Even if you’re extremely learned it’s just not possible to read everything that’s been written.”
About 72 percent of the database’s text is in English, followed by French, Spanish, German, Chinese, Russian, and Hebrew. It’s the largest data release in the history of the humanities and is available for download, Michel said.
While the database’s books date from the 1500s, users attempting scientific analyses should restrict their use to volumes published from 1800 to 2000, Michel said. There was a scarcity of material before 1800, and Google altered the criteria for the post-2000 books that were digitized, skewing the sample, he said.
Men vs. Women
Studying word frequency, the researchers found that “men” appeared in books almost nine times as often as “women” during the first half of the 19th century. The gap narrowed until 1985, when the two words were used with equal frequency, and by 1994 “women” appeared about 4.3 times for every 10,000 words, while “men” lagged behind at 3.3, according to the data.
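A rate such as “4.3 times for every 10,000 words” is a word's count divided by the total number of words, scaled to 10,000. The Python sketch below shows that arithmetic on a toy sentence. It is an illustration only, not the researchers' method: the function name per_10k, the simple tokenizing rule and the sample text are all hypothetical.

```python
import re
from collections import Counter


def per_10k(text, word):
    """Return occurrences of `word` per 10,000 word tokens in `text`."""
    # Assumption: a "word" is any run of letters or apostrophes, case-insensitive.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * 10_000 / len(tokens)


# Toy sample, not real corpus data: five tokens in total.
sample = "men and women and men"
rate_men = per_10k(sample, "men")      # 2 of 5 tokens
rate_women = per_10k(sample, "women")  # 1 of 5 tokens
```

Applied to all the books published in a given year rather than a single sentence, the same ratio would give one point on a word's trend line.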
There are other words to express the ideas behind “women” and “men,” Michel said. Studying the frequency of any specific word is just one way to determine cultural trends, he said.
“You’re also hearing from different authors in 2000 than you are in 1800,” Aiden said. “So it’s a combination of not only what’s being written changing, but also what voices are being heard.”
Members of the clergy produced a greater percentage of what was written in the early 1800s than later, Aiden said. That may help explain why “God” peaked around 1830, when it represented 12.5 of every 10,000 words, he said. By 2000, its prevalence had dropped to 2.6 of every 10,000 words.
“‘God’ is not dead, but needs a new publicist,” the authors wrote.
Year by year before 1950, the fame of Darwin, the 19th-century evolutionary biologist, was greater on average than that of the psychoanalyst Freud, the physicist Einstein or the 17th-century astronomer Galileo Galilei. Freud then took the lead.
The research team, which included staff members from the publishers of “Encyclopaedia Britannica” and “The American Heritage Dictionary,” concluded that the English language absorbs about 8,500 new words each year. From 1950 to 2000, the lexicon grew more than 70 percent. Dictionaries don’t account for the extent of this growth and fail to include an estimated 52 percent of the language, the authors wrote.
By restricting searches to certain languages, users can detect the effects of censorship, the researchers wrote. The prevalence of the Jewish artist Marc Chagall differs when comparing literature in English with that in German -- a result of Nazi censorship, according to the researchers.
“As with fossils of ancient creatures, the challenge of culturomics lies in the correct interpretation of this new evidence,” the authors wrote. “Many more fossils, with shapes no less intriguing, beckon.”