In a recent post, Rose Holley describes how the National Archives of Australia is using crowdsourcing to help transcribe scanned archival descriptions. Dubbed “The Hive” (a play on [arc]Hive), the site allows users to pick a scanned image marked easy, moderate, or difficult and edit the OCR’d version to better reflect the actual document. Work can be saved as partially complete or fully complete.
Holley reports that of the 800 lists available for description, over 300 have been completed in the first two weeks.
Holley is no stranger to crowdsourcing: she pioneered the hugely successful Australian Newspapers project, now housed at Trove, which enabled the transcription and correction of millions of stories from scanned newspapers.
It has been surprising to me how few libraries and archives have followed Holley’s lead. The clear success of enabling users to work at transcribing or correcting OCR’d documents seems to offer a wonderful opportunity to expand our digitization capabilities at little or no cost.
The State Library of NC is using volunteers to transcribe digitized documents. Details here: http://statelibrary.ncdcr.gov/digital/ncfamilyrecords/verticalfiles.html
It’s an easy way to volunteer, and volunteer efforts are making a real difference in increasing access to local information.
University of Iowa Libraries are also using this for their Civil War Diaries project: http://diyhistory.lib.uiowa.edu/
Iowa and the California Digital Newspaper Project both presented sessions on this at the most recent DLF Forum (CDNP abstract: http://www.diglib.org/forums/2012forum/no-tempest-in-my-teapot-analysis-of-crowdsourced-data-and-user-experiences-at-the-california-digital-newspaper-collection/)
Roy, you know I love you, but…
All the above projects (including those in the comments) counted on considerable dedicated technical staff to design and implement the UI and get the data where the data needed to be. There’s your “why not” answer, right there.
Now, tools are starting to exist that make this not quite such a slog. Scripto (http://scripto.org/) and T-PEN (http://t-pen.org/TPEN/) are two that I have my eye on; there’s also Islandora, which has had a TEI markup/correction tool for some time, if TEI is your thing.
This is all to the good. Erasing the technical labor that’s still necessary to bring crowdsourcing projects to fruition? Not so good. Please don’t do it.
Dorothea, no worries, you make an excellent point. It is no small thing to put this together technically, and I was wrong not to note that while wondering why more organizations don’t do it. Mea culpa, and kudos to those organizations that both have the technical chops and use them to these ends.
I’m not sure whether this is a quibble with Dorothea or a violent agreement. While tools like Scripto and T-PEN make it possible to avoid the ongoing effort of systems like UIowa’s original design (which required a staff member to cut-and-paste each transcript from an email account to their CMS) or the development effort of building a transcription tool from scratch, there are still costs.
What I tell people investigating the hosted version of my own FromThePage is that even though they don’t have to develop or install software, there is still a substantial amount of set-up work required for each set of documents they want to host. Their scanned images need to be scaled down from archival-quality TIFFs to something more reasonable, sometimes separated into recto/verso images, and uploaded or FTP’d to the transcription tool. (And if they’re importing from a CMS like the Internet Archive, that work still needs to be done to get the images into that system.) The appropriate metadata needs to be entered, which usually includes document-specific text like the desired transcription conventions, something staff may not have experience with.
More important than that, though, is the ongoing work of recruiting and motivating volunteers. If volunteers report problems or ask questions about the images, the handwriting, or the software, someone from the institution needs to reply quickly, or else those volunteers will find a more responsive project to work on. This kind of community management needs to be part of someone’s job on a near-daily basis, and so long as volunteers are still contributing–so long as the crowdsourcing project is successful–it’s going to take work.
I think it’s violent agreement, Ben. My answer tends more to the “why aren’t libraries/archives trying this?” whereas yours is closer to an answer to “why aren’t more crowdsourcing projects successful?”
The answer to both questions is “labor,” but the exact nature of the labor differs in each case.