200 years of world literature in 0.4 seconds
Slow reading is out. Today, computers comb through millions of books and analyse their content in a flash. By Mirko Bischofberger
(From "Horizons" no. 104, March 2015)
It all began with data. Too much data. Back in the 1940s, the Italian Jesuit priest Roberto Busa was trying to cope with unmanageable amounts of data, his goal being to create a complete index of all 11 million words in the writings of the theologian Thomas Aquinas. It was a mighty task that he felt would surely never be completed in his lifetime. But then Father Busa had an idea: a machine could help him. He got support from Thomas Watson, the founder of IBM in the USA, with whose help the index was finished in just a few decades. The Index Thomisticus became a pioneering, 56-volume work of 70,000 pages. It was the first work ever to enable a user to perform a rapid content search through a complete corpus.
Digital world literature
Today, digital methods have entered all fields of the humanities. “These days, it’s above all literary studies and linguistics that are keen on getting digital access to their data”, says Martin Volk, a professor of Computer Linguistics at the University of Zurich. “This allows them to use numbers and statistics to prove or disprove certain research hypotheses”. In his research project Text+Berg, he scanned all the books published by the Swiss Alpine Club since 1864. That’s 250 volumes. “This material is a treasure trove of information on the Swiss mountains”, explains Volk. “It shows, for example, how our understanding of the mountains has altered over the course of time. Where the mountains used to be described as objects of exploration, today they’re seen more as a kind of sporting equipment. The word ‘contest’, for example, occurs far more often today than it ever did”.
Another Swiss project is being carried out by the University of Geneva, where a portion of the Bibliotheca Bodmeriana is being digitised. This extraordinary collection of books and manuscripts comprises more than 150,000 works in eighty languages from three millennia. These include the oldest-known manuscript of the Gospel according to St John and the original copies of the Grimm Fairy Tales.
Citizen science offers a helping hand
But scanning books is laborious. “The books first have to be cut open by hand and every page entered separately”, says Volk. He has processed over 120,000 pages in the course of the Text+Berg project, so he knows what he’s talking about. “After scanning them, you have thousands of computer images but no text”. To address this, software is used to recognise the letters in the images and to transform them into characters and words. “The error rate for this process is still relatively high, especially for the older 19th-century texts”. Volk’s project was producing roughly 12 mistakes per page that would otherwise have had to be checked by hand.
His team took an inventive approach to get around the problem, developing an online correction system cast as a kind of ‘game’ that enabled volunteers to eradicate the mistakes by hand. This citizen science project proved very popular with the members of the Swiss Alpine Club. “With their help, we were able to carry out more than 250,000 corrections over the space of half a year”, says Volk proudly. Now the digital ‘mountain’ of text is almost 100% correct. Once documents have been scanned, they can be archived and retrieved simply. “Especially in the case of documents that are old, rare or difficult to access, this is something that is otherwise impossible”, says Volk.
Freud, Einstein, Darwin
The world’s best-known archive of this kind, and probably the most comprehensive, is Google Books. Using a full-text search, the holdings of the university libraries of Harvard, Stanford and New York can be searched in a matter of seconds.
Even European libraries such as those of the University of Oxford or the Bavarian State Library have already joined Google Books and had their holdings digitised.
Taking Google Books as its starting point, in 2010, Google set up Ngram, a web application for calculating the frequency of a word or a series of words in all the books published since 1800 that have been scanned by Google. For example, this allows us to investigate historical events such as the abolition of slavery or to observe the way in which specific words in a language have changed. It also lets us observe the shifting popularity of historical figures over time. Scientists such as Sigmund Freud, Albert Einstein and Charles Darwin all appear often in the books, but since 1950 Freud has been mentioned at least twice as often as the other two.
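The kind of calculation described here can be illustrated with a minimal sketch. It is not Google's implementation: the function names, the tokenisation rule and the tiny sample corpus are all assumptions made for illustration. It counts how often a word or series of words occurs among all word sequences of the same length in the texts of a given year.

```python
import re


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Share of n-grams per year that match the phrase.

    corpus is a hypothetical structure: a dict mapping a publication
    year to a list of texts published in that year.
    """
    target = phrase.lower().split()
    n = len(target)
    key = " ".join(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        hits = 0
        total = 0
        for text in texts:
            # Simplistic tokenisation (an assumption): lower-case letter runs.
            tokens = re.findall(r"[a-z']+", text.lower())
            grams = ngrams(tokens, n)
            total += len(grams)
            hits += grams.count(key)
        result[year] = hits / total if total else 0.0
    return result


# Invented sample data, only to show the call.
sample_corpus = {
    1900: ["Darwin wrote on species", "the origin of species"],
    1950: ["Freud and Freud again"],
}
freud = relative_frequency_by_year(sample_corpus, "Freud")
of_species = relative_frequency_by_year(sample_corpus, "of species")
```

Dividing by the total number of word sequences per year, rather than reporting raw counts, keeps years with many scanned books from dominating the curve simply because more text exists for them.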
“Ngram is just one example of what is possible today with digitised cultural data”, says Jean-Baptiste Michel, a data scientist from Harvard and the author of the Google application used by millions of people. The digital humanities today would be unthinkable without Ngram. Volk confirms that “Ngram was pioneering in the digital humanities, especially because it made the methods broadly known”.
More than 100 million texts each day
Uploading existing literature is just one approach in linguistic analysis. “Today, via our phones and computers, we’re sending out more digital texts than ever before”, says Elisabeth Stark from the Institute of Romance Languages at the University of Zurich. In Germany in 2013 alone, more than 100 million text messages were sent every day. “Almost none of these texts are ever printed, but they are nevertheless part and parcel of our linguistic culture”, says Stark. In her SNSF project ‘Sms4science’ she is investigating the linguistic characteristics of text messages in Switzerland, and how the Swiss communicate in this medium.
In order to access this data, Stark and other researchers invited all Swiss users of cell phones to send a copy of their text messages to a free number. “In this way we were able to collect some 26,000 texts from Switzerland”, says Stark. One of the things of interest to her is the empirical analysis of linguistic ellipses, i.e. leaving out words. Typical examples of ellipsis are “leaving now” and “what you doing?”. In order to find out why the subject is absent in these cases, Stark’s team analysed all the text messages in French and German. They discovered that these ellipses are far rarer than we assume, and that they follow the same laws as everyday spoken language. “This contradicts the impression that you get from observing individual text messages”, says Stark. “And that’s why we need a large volume of data – so we can track down what actually happens in text messages”.
Insufficient resources in Switzerland?
The digital humanities allow us to analyse literature and language using numbers. And numbers have always been the trademark of the hard sciences. They allow us to describe quantitative patterns and relationships with a precision largely impossible with words. The next generation of scholars in the humanities will therefore work in a data-based manner, just as bioinformaticians have been doing since the end of the 20th century. “The field is being driven above all by the immense increase in the volume of digital texts”, says Volk. “Just as sequencing the human genome led to bioinformatics, the digitisation of our language and literature will soon, inevitably, become integral to the humanities”.
Researchers such as Volk and Stark are at just the beginning of a new era of research in the humanities. “Regrettably, resources for the digital humanities are still meagre in Switzerland”, says Volk. Stark is of the same opinion: “At the University of Zurich, for example, there isn’t a single professorship in the digital humanities, even though it’s high time there was”. Both researchers think it’s even more important to get access to bigger data consortia. “Although there are important initiatives on a European level, it’s sad to say that Switzerland often doesn’t take part in them at the moment”, says Stark. And Michel, who has been able to use the incredible reservoir of data at Google Books, says that “access to data is the most important driver!”
Mirko Bischofberger is a science contributor at the SNSF.