October 24, 2020

Hadoop for Large Digital Libraries

Digital libraries will increasingly need to deal with massive amounts of information if they are to serve their clienteles well. Some libraries are already delving into data curation, and once you go down that road the sky’s the limit. Thankfully, there are new ways of dealing with massive amounts of data that have been pioneered by Google and others.

One such way is the growing ecology of software tools that all fall under the “Hadoop” umbrella. In an excellent overview article by O’Reilly and Associates, these various software tools are named and briefly explained (see graphic from the article).

At first, the sheer number of named tools seems daunting, but some of them simply work in the background, and you will mostly use them without even realizing it. Others you may never need.

We are adopting these technologies at my place of employment (OCLC) and I expect to be learning more about them and how to use them in the coming weeks and months. I’ll share what I learn here, or over on the OCLC blog where I write as well, hangingtogether.org.

Meanwhile, think about getting down and dirty with Pig. I was born an Indiana farm boy, and I can’t believe my adult self, as a high-falutin’ digital librarian, found a way to write that sentence with a straight face.


My thanks to my colleague Lorcan Dempsey for bringing this post to my attention.

About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.