OCR works well for modern paperbacks, but what about 15th-century texts set by hand?
Texas A&M’s Early Modern OCR Project (eMOP), which trains software to recognize texts from the 15th through 17th centuries, has received two significant boosts in recent weeks. At the end of September, the project was awarded a two-year, $734,000 grant from the Andrew W. Mellon Foundation. And, earlier this month, ProQuest began providing the project with access to its Early English Books Online and Early European Books databases.
eMOP will use page images from these and other sources, including Gale Cengage’s Eighteenth Century Collections Online, to build a database of early typefaces used in English books and documents, and then train optical character recognition (OCR) software to read those documents. According to the current project timeline, the tools needed to convert most English texts from this historical period into fully searchable digital documents should be available worldwide within the next two years.
The project is a response to what many scholars view as an emerging problem for books published during this period.
“Collectively, the US, the UK, and scholars around the globe face a problem: rare books and pamphlets from the early modern era which have not yet been made available digitally threaten to become invisible to future scholars,” Laura Mandell, director of Texas A&M’s Initiative for Digital Humanities, Media, and Culture and primary investigator for eMOP, wrote in a grant proposal to the Mellon Foundation.
The problem, she told LJ, is that digital preservation efforts have often focused on preserving an image of a document and including basic metadata such as the title, subject, author, date, and publisher to make it discoverable.
With these documents, many of which were digitized from microfilm copies, “there’s a very slim amount of metadata that’s available,” said Mary Sauer-Games, ProQuest’s vice-president of scientific, technical, medical (STM) and humanities publishing. “There’s the original cataloging records, which have been enhanced with the English Short Title Catalogue. And our catalogers have done some additional coding, but you still don’t have the full text that you can search.”
Sea of Searchable Documents
Meanwhile, the volume of all types of fully searchable digital texts continues growing at an exponential pace. Documents from the early era of printing now risk getting lost in the shuffle.
“People could imagine that they’ve preserved something if they’ve just imaged it,” Mandell said. “And if it’s preserved in that way, and if researchers … think that getting online and searching is good enough, there are things that will never be found again.”
Discoverability isn’t the only problem. The less complete the digitized collection from this period, the greater the chance that future digital researchers will draw incorrect conclusions from it.
Mandell pointed to the Google Books Ngram Viewer for a simple example. Search for the word “presumption” between 1700 and 1900, and you’ll find that usage peaked in the years following the publication of Mary Shelley’s Frankenstein. Based on that data, one might assume that this increase was spurred at least partly by the popularity of a novel about a scientist who presumed to create life.
This would be a tenuous hypothesis at best. During the late 1700s and early 1800s, printers around the world were eliminating the medial, or long, “s” (ſ), a character that OCR engines routinely misread as an “f,” so that “preſumption” gets transcribed as “prefumption.” Search for terms like “prefumption” or “curiofity,” and you’ll find a decline in those spellings that mirrors the rise of the modern ones.
In a case like this, “what you get is a map of the history of typography, not a history of what’s going on culturally. That’s the problem,” Mandell said.
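To see how the long s skews corpus statistics, here is a minimal Python sketch, offered purely as an illustration and not as a description of eMOP’s tools. It folds common long-s transcription variants, where “ſ” was read as “f,” back into modern spellings before counting word frequencies; the variant map is a toy example.

```python
# Illustrative sketch: folding long-s ("ſ") transcription variants back into
# modern spellings before counting word frequencies. The variant map is a
# toy example, not part of eMOP's actual tools.
from collections import Counter

# The medial (long) s "ſ" closely resembles a lowercase "f", so OCR and
# naive transcriptions often render "preſumption" as "prefumption".
LONG_S_VARIANTS = {
    "prefumption": "presumption",
    "curiofity": "curiosity",
}

def normalize(token: str) -> str:
    """Map a known long-s variant to its modern spelling."""
    # Fold the actual long-s character too, when a transcription kept it.
    token = token.replace("\u017f", "s")  # U+017F LATIN SMALL LETTER LONG S
    return LONG_S_VARIANTS.get(token, token)

if __name__ == "__main__":
    tokens = ["prefumption", "presumption", "curiofity", "curiosity"]
    print(Counter(tokens))                        # raw counts, split by spelling
    print(Counter(normalize(t) for t in tokens))  # folded: each word counted once
```

Counted naively, the f-spellings and the modern spellings look like separate words with opposite trends; folded together, much of the apparent cultural shift turns out to be typographic.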
Printer Problems
The medial s is also one of the peculiarities of early modern printing that cause problems with current OCR technology. Automated digitization of works from this era has always posed a challenge, since OCR engines have a difficult time recognizing and adjusting for the typefaces, irregular spacing, and other quirks of early printing presses.
“The fonts are so old and the typefaces used are so old, and the methodology with which they were printing these books was so different,” noted Sauer-Games. “We’ve run some OCR engines over them, and you just get garbage. It’s nothing you can really read.”
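The article doesn’t name which OCR engines were tried or which eMOP will train. Purely as an illustration of what swapping in a period-trained model looks like, here is a sketch using the open-source Tesseract engine through its pytesseract Python wrapper; the “emop_blackletter” model name is hypothetical.

```python
# Illustrative sketch only: running OCR over a scanned page with Tesseract.
# "emop_blackletter" is a hypothetical traineddata name standing in for a
# model trained on early modern typefaces; the article does not specify
# eMOP's actual engine or models.
from PIL import Image
import pytesseract

def ocr_page(path: str, lang: str = "eng") -> str:
    """Run Tesseract over one page image and return the recognized text."""
    return pytesseract.image_to_string(Image.open(path), lang=lang)

if __name__ == "__main__":
    page = "page_scan.png"  # hypothetical path to a digitized page image
    # A stock modern-English model tends to produce "garbage" on early pages;
    # a model trained on period typefaces should fare better.
    print(ocr_page(page))                           # default model
    print(ocr_page(page, lang="emop_blackletter"))  # hypothetical period model
```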
Printing from this era was very inconsistent, Sauer-Games explained.
“You think about the original printing presses—they weren’t exactly precise,” she said. “Somebody was actually sitting there, placing letters across the printer. There might be gaps that weren’t supposed to be there, or maybe the ink wasn’t evenly applied. And the books, over time, the ink may have faded. You’ve got large fonts mixed with smaller font sizes. And you think about all of these small, individual presses that were creating their own typefaces to print these books. It wasn’t consistent like it is today.”
Nothing troubles a computer program more than pervasive inconsistencies, but automated OCR will be crucial if scholars hope to make most of the documents from this era full-text searchable. Across Early English Books Online and Early European Books from ProQuest and Eighteenth Century Collections Online from Gale Cengage, there are a combined 307,000 books and other documents from 1473 to 1800, totaling 45 million pages, Mandell said.
Projects like the Text Creation Partnership, led by the University of Michigan Libraries, have already made an admirable dent in this corpus by manually keying in the entirety of more than 40,000 texts. But that process is time-consuming and can be expensive. eMOP is hoping that a combination of specialized OCR software and crowdsourced corrections will offer a faster way to preserve a more comprehensive collection.
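How eMOP’s crowdsourced correction tools will work isn’t detailed here; one common pattern for this kind of tool, sketched below strictly as an assumption, is to gather several volunteer corrections of the same OCR line and keep a per-token majority vote.

```python
# Illustrative sketch of a generic crowdsourcing pattern, not a description
# of eMOP's actual tools: several volunteers correct the same OCR line, and
# a per-token majority vote picks the merged result.
from collections import Counter

def majority_vote(corrections: list[list[str]]) -> list[str]:
    """Token-wise majority vote over token-aligned corrected lines."""
    assert len({len(line) for line in corrections}) == 1, "lines must align"
    merged = []
    for column in zip(*corrections):          # one column per token position
        token, _count = Counter(column).most_common(1)[0]
        merged.append(token)
    return merged

if __name__ == "__main__":
    # Three volunteers correct the OCR output "the prefumption of fcience".
    lines = [
        ["the", "presumption", "of", "science"],
        ["the", "presumption", "of", "fcience"],  # one missed fix
        ["the", "prefumption", "of", "science"],  # another missed fix
    ]
    print(" ".join(majority_vote(lines)))  # -> "the presumption of science"
```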
In-House Expertise
The eMOP project brings together experts from Texas A&M’s English literature department, its computer science department, its Initiative for Digital Humanities, Media, and Culture, and from groups including IMPACT at KB National Library of the Netherlands, The Software Environment for the Advancement of Scholarly Research, Performant Software Solutions, and PRImA [Pattern Recognition and Image Analysis].
And for expertise on early printing techniques and typography, they certainly didn’t need to look far. Texas A&M’s Cushing Memorial Library and Archives hosts the annual Book History Workshop, an intensive five-day course where students learn how printing was done in the hand-press era, even spending lab time learning how to cast type using early methods.
“In addition to setting a lot of type and correcting it, and putting it on the bed of a common press and printing, we do an introduction to how typography worked,” said Todd Samuelson, director of the workshop, printer in residence, and curator of Rare Books and Manuscripts at the Cushing Memorial Library. Attendees “cast type from a matrix in a hand mould using 500 degree lead alloy and see how that process works… When Laura [Mandell] came over to look at our collections and saw how these questions of historical typography were an ongoing project for us, she thought we could have a really productive collaboration.”
Samuelson is now working with post-doctoral research fellow Jacob Heil to put together a database of common typefaces for the project.
“We’re trying to determine some of the most common, most significant typefaces in English typography from 1476 to 1820, and then [Mandell]’s team is trying to teach the OCR engines to read them. We’re going to see what kind of tolerances we can give [the system] in different type families.”
As a rare books librarian, Samuelson acknowledged that research is trending toward the convenience and accessibility of digitized, searchable text. An increase in searchable texts from the period will also have exciting implications for several fields in the humanities, including linguistics. But preservation remains important. For now, there are some things that full-text digital copies simply can’t communicate.
“There are types of evidence that can’t be translated through digital simulacra. The material object does have ways in which it signals to us what a book is about,” Samuelson said. For example, “a 17th-century book of poetry, if it were published in a small, duodecimo volume, might signal something to its readers, whether it’s devotional or intimate in some way—the same way that you can look at a paperback and get a sense of the genre… These are the sorts of tangible qualities of the book that, at least in the near future, won’t be available in digital formats.”
Comments

Thanks so much to Matt for this wonderful exposition, but I did want to correct a few factual errors (wish I’d had the opportunity to do so before it went live!). The Text Creation Partnership (http://www.textcreationpartnership.org/) is run SOLELY by the University of Michigan Libraries. I’m not sure how the OED got in there, Matt, but I think what I said is, wouldn’t it be wonderful if we could search all of EEBO by word? Then we would have an OED that REALLY told us the first time a word was used.

Second, the TCP has had to pay a lot of money to have the texts double-keyed at a particular percentage rate of correctness. The partners of the various phases of EEBO and the partners for the TCP-ECCO project have contributed. Hand-typing text is always far more expensive than OCR’ing, but the problem is getting OCR to work, which is the point of the eMOP project. I think the notion of volunteers working on typing may be connected to our crowd-sourced correction tools, which will help correct errors made by the OCR engines. We are building those as well as part of the eMOP project. Thanks again, Matt, for all your hard work.
Hi, Prof. Mandell,
Thanks for pointing that out. I’ve changed that graf and updated the story. Sorry I missed your comment earlier and didn’t see it until today. I was out of the office on Monday.
Thanks for your time!
-Matt