June 19, 2018

The Next-Gen Repository: Part II

In a post last week, I introduced the eScholarship redesign and its refocused mission on the publishing needs of faculty, with repository capture being a side benefit. With this second part, I seek to lift the hood on the technical infrastructure, which I’m doing with the help of Lisa Schiff of the technical team that worked for two years on this major refactoring of eScholarship. As Lisa said in part in an email to the Code4Lib and XML4Lib mailing lists:

Some highlights of the technological innovations that were developed to support the new platform include:

Keyword in Context Pictures (KWIC Pics) is a display technology that provides visual clues to aid in information searches. After a user of the eScholarship site performs a search, a results page is displayed. By hovering the cursor over matching keywords on the results page, the user can view an excerpted image of the actual PDF article page containing the keywords — providing an early glimpse of the document.  From a technical perspective, hovering the cursor over a highlighted keyword activates a series of events that are largely invisible to the user:

  1. JavaScript code running in the user’s browser sends an AJAX request to the eScholarship server, calling out the particular PDF file, page number, and matching keywords to show.
  2. The server then renders just that part of the PDF file and returns it as an image. First it must interpret the PDF file, find the right page, and turn it into image data (or “pixels”). PDF files are highly variable and eccentric, so eScholarship relies on a robust and well-tested PDF library called Poppler, which is also used within many PDF image display tools for Linux.
  3. The eScholarship server then modifies blocks of pixels to add yellow background and red foreground highlighting to the matching keywords, using text coordinates stored in the XTF full-text index.
  4. Next the pixel data is compressed for quick transfer. If a small number of colors or only shades of gray are present, PNG compression is chosen; for full color images, JPEG compression is used instead. This choice logic ensures accurate color reproduction while reducing bandwidth (and thus time) as much as possible.
  5. Finally the compressed image is sent back to the user’s browser, where the JavaScript code displays it over the search results page.
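
To make steps 3 and 4 concrete, here is a minimal sketch in Python. It is not eScholarship's code: the function names, the in-memory pixel representation, the brightness threshold, and the 256-color cutoff are all assumptions chosen for illustration. It shows one way to recolor a keyword's bounding box and then pick PNG or JPEG based on the colors present.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each 0-255
Image = List[List[Pixel]]      # rows of pixels

YELLOW: Pixel = (255, 255, 0)
RED: Pixel = (255, 0, 0)


def highlight_box(image: Image, left: int, top: int, right: int, bottom: int,
                  dark_threshold: int = 128) -> None:
    """Recolor one keyword's bounding box in place (step 3).

    Dark pixels are treated as text and become red; all other pixels in
    the box become yellow. The threshold is an assumed value.
    """
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = image[y][x]
            brightness = (r + g + b) // 3
            image[y][x] = RED if brightness < dark_threshold else YELLOW


def choose_format(image: Image, max_colors: int = 256) -> str:
    """Pick a compression format for the rendered excerpt (step 4).

    PNG when the image is all gray or uses few distinct colors,
    JPEG otherwise. The cutoff of 256 colors is an assumption.
    """
    colors = {pixel for row in image for pixel in row}
    all_gray = all(r == g == b for r, g, b in colors)
    if all_gray or len(colors) <= max_colors:
        return "png"
    return "jpeg"
```

A typical scanned text page, even after highlighting, has only a handful of distinct colors and would take the PNG branch; a page containing a photograph would usually exceed the cutoff and take the JPEG branch.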

Rendering PDFs as Images
A very similar process occurs in eScholarship when displaying the full view of an article. The page is initially filled with blank images. JavaScript code monitors the scrollbar to find out which portion of the page is visible in the browser window. It then issues AJAX requests to render just those pages and displays them when they arrive from the server. This allows us to display the PDF document within the larger frame of the site.
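
The scroll-driven logic can be sketched as follows. This is an illustration in Python rather than the site's actual browser code, and it assumes every page placeholder has the same height; the function and parameter names are hypothetical.

```python
from typing import List, Set


def visible_pages(scroll_top: int, viewport_height: int,
                  page_height: int, page_count: int) -> range:
    """Return the zero-based indices of pages overlapping the viewport.

    Assumes all pages are stacked vertically with one uniform height.
    """
    if page_count <= 0 or page_height <= 0 or viewport_height <= 0:
        return range(0)
    first = max(0, scroll_top // page_height)
    last_visible_pixel = scroll_top + viewport_height - 1
    last = min(page_count - 1, last_visible_pixel // page_height)
    return range(first, last + 1)


def pages_to_request(visible: range, already_requested: Set[int]) -> List[int]:
    """Return visible pages not yet requested, and mark them as requested."""
    needed = [page for page in visible if page not in already_requested]
    already_requested.update(needed)
    return needed
```

Tracking which pages have already been requested keeps repeated scroll events from triggering duplicate render requests for the same page.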

Harvesting System
The harvesting system locates and copies each item and its metadata from the publishing backend. It then extracts text from the original PDF file to support keyword searching, classifies the item according to one or more disciplines from an “Academic Disciplines” taxonomy, and indexes the text.
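
The order of those stages can be sketched in Python as below. Everything here is a toy stand-in: the `Item` class, the `harvest` function, the fake text extractor, and the keyword lookup are hypothetical and do not represent XTF, the real classifier, or the eScholarship backend. The sketch only shows how extraction, classification, and indexing chain together for each harvested item.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Item:
    """One harvested item: the copied PDF plus its metadata."""
    item_id: str
    pdf_bytes: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    disciplines: List[str] = field(default_factory=list)


def harvest(items: Iterable[Item],
            extract_text: Callable[[bytes], str],
            classify: Callable[[str], List[str]],
            index: Dict[str, Set[str]]) -> None:
    """Run each item through extract -> classify -> index."""
    for item in items:
        item.text = extract_text(item.pdf_bytes)
        item.disciplines = classify(item.text)
        # Minimal inverted index: word -> ids of items containing it.
        for word in set(item.text.lower().split()):
            index.setdefault(word, set()).add(item.item_id)


def fake_extract_text(pdf_bytes: bytes) -> str:
    """Stand-in for PDF text extraction: treats the bytes as UTF-8 text."""
    return pdf_bytes.decode("utf-8")


# Assumed keyword-to-discipline table, for illustration only.
KEYWORDS = {"genome": "Biology", "poetry": "Literature"}


def keyword_classify(text: str) -> List[str]:
    """Stand-in classifier: assigns disciplines by simple keyword lookup."""
    words = set(text.lower().split())
    return sorted({d for k, d in KEYWORDS.items() if k in words})
```

Passing the extractor and classifier in as functions keeps the pipeline order fixed while letting each stage be swapped for a real implementation.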

The team also classified the papers into a subject taxonomy and identified those that were peer reviewed. This meant retrospectively classifying some 30,000 papers. Asked how they accomplished this feat, Lisa wrote in an email, "we hired an intern to create a training and test set by manually tagging a selection of the content with subject terms from the Academic Disciplines Taxonomy from the National Academies. We used those documents to train and test a KEA classifier, which we ran against the legacy documents." KEA, the Keyphrase Extraction Algorithm, is a tool for extracting keyphrases from text documents.

The technology stack includes the eXtensible Text Framework (XTF), which performs the indexing and much of the heavy lifting, along with JavaScript, AJAX, CSS, and some Java. The full rundown is available on their Technical Documentation page.

The technology is as impressive as the decision to rebrand eScholarship from a repository into a publishing service. I have seen the next-gen repository, and its name is eScholarship.

About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.