In a post last week, I introduced the eScholarship redesign and its refocused mission: serving the publishing needs of faculty, with repository capture as a side benefit. In this second part, I lift the hood on the technical infrastructure, with the help of Lisa Schiff of the technical team that spent two years on this major refactoring of eScholarship. As Lisa said in part in an email to the Code4Lib and XML4Lib mailing lists:
Some highlights of the technological innovations that were developed to support the new platform include:
Keyword in Context Pictures (KWIC Pics) is a display technology that provides visual clues to aid in information searches. After a user of the eScholarship site performs a search, a results page is displayed. By hovering the cursor over matching keywords on the results page, the user can view an excerpted image of the actual PDF article page containing the keywords — providing an early glimpse of the document. From a technical perspective, hovering the cursor over a highlighted keyword activates a series of events that are largely invisible to the user:
- The server then renders just that part of the PDF file and returns it as an image. First it must interpret the PDF file, find the right page, and turn it into image data (or “pixels”). PDF files are highly variable and eccentric, so eScholarship relies on a robust and well-tested PDF library called Poppler, which is also used within many PDF image display tools for Linux.
- The eScholarship server then modifies blocks of pixels to add yellow background and red foreground highlighting to the matching keywords, using text coordinates stored in the XTF full-text index.
- Next the pixel data is compressed for quick transfer. If a small number of colors or only shades of gray are present, PNG compression is chosen; for full color images, JPEG compression is used instead. This choice logic ensures accurate color reproduction while reducing bandwidth (and thus time) as much as possible.
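To make the highlighting and format-choice steps above concrete, here is a minimal sketch in plain Python. It is not eScholarship's code: the in-memory image layout, the `highlight` and `choose_format` names, the darkness threshold, and the 256-color cutoff are all assumptions made for illustration, and a real server would work on pixel data produced by Poppler and hand the result to an actual PNG or JPEG encoder.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each 0-255
Image = List[List[Pixel]]      # rows of pixels

YELLOW: Pixel = (255, 255, 0)
RED: Pixel = (255, 0, 0)


def highlight(image: Image, box: Tuple[int, int, int, int],
              dark_threshold: int = 128) -> None:
    """Recolor one keyword box in place (hypothetical helper).

    Dark pixels are treated as text and turned red; everything else in
    the box becomes the yellow background. The box is (left, top, right,
    bottom) with right and bottom exclusive, standing in for the text
    coordinates a full-text index might store.
    """
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = image[y][x]
            is_text = (r + g + b) / 3 < dark_threshold
            image[y][x] = RED if is_text else YELLOW


def choose_format(image: Image, max_palette: int = 256) -> str:
    """Return "PNG" for grayscale or small-palette images, else "JPEG".

    The 256-color cutoff is an assumed threshold, not a documented one.
    """
    colors = {pixel for row in image for pixel in row}
    is_gray = all(r == g == b for r, g, b in colors)
    return "PNG" if is_gray or len(colors) <= max_palette else "JPEG"
```

For example, a small white page with one dark pixel, highlighted and then classified, ends up with only a few distinct colors, so `choose_format` returns `"PNG"`; an image with more than 256 distinct non-gray colors returns `"JPEG"`.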
Rendering PDFs as Images
The harvesting system locates and copies each item and its metadata from the publishing backend. It then extracts text from the original PDF file in order to support keyword searching, classifies the item according to one or more disciplines from an “Academic Disciplines” taxonomy, and indexes the text.
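The order of operations in that harvesting pass can be sketched as follows. This is only an illustration under stated assumptions: the `Item` record, the `harvest` function, and the toy inverted index are hypothetical, and the text extractor and classifier are passed in as placeholders because the post does not describe how eScholarship implements them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Item:
    """A harvested item: identifier, metadata, and the original PDF bytes."""
    item_id: str
    metadata: Dict[str, str]
    pdf_bytes: bytes
    text: str = ""
    disciplines: List[str] = field(default_factory=list)


def harvest(
    item: Item,
    extract_text: Callable[[bytes], str],
    classify: Callable[[str], List[str]],
    index: Dict[str, Set[str]],
) -> Item:
    """Run one item through the extract, classify, index steps in order."""
    # 1. Extract text from the PDF so it can be keyword searched.
    item.text = extract_text(item.pdf_bytes)
    # 2. Assign one or more disciplines from the taxonomy.
    item.disciplines = classify(item.text)
    # 3. Index the text: a toy inverted index mapping word -> item ids.
    for word in item.text.lower().split():
        index.setdefault(word, set()).add(item.item_id)
    return item
```

Passing the extractor and classifier as arguments keeps the sketch honest about what is unknown: any PDF text extractor or taxonomy classifier could be plugged in without changing the sequence of steps.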
Some other things they accomplished were to classify the papers into a subject taxonomy and identify those that were peer reviewed. This meant retrospectively classifying some 30,000 papers. Asked how they accomplished this feat, Lisa wrote in an email, "we hired an intern to create a training and test set by manually tagging a selection of the content with subject terms from the Academic Disciplines Taxonomy from the National Academies. We used those documents to train and test a KEA classifier, which we ran against the legacy documents." KEA is an algorithm for keyword extraction.
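For readers curious what a KEA-style classifier looks at, here is a rough sketch of the two features KEA is usually described as computing for each candidate term: TF×IDF and the relative position of the term's first occurrence. This is an assumption drawn from the general KEA literature, not from Lisa's email; the `kea_features` function and the tokenized inputs are hypothetical, and a real KEA run would feed such features into a model trained on the manually tagged documents.

```python
import math
from typing import Dict, List


def kea_features(term: str, doc: List[str],
                 corpus: List[List[str]]) -> Dict[str, float]:
    """Compute two KEA-style features for a single-word candidate term.

    doc is a tokenized document; corpus is a list of tokenized documents
    used to estimate how common the term is overall.
    """
    # Term frequency: share of the document's words that are this term.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: rarer terms across the corpus score higher.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log2(len(corpus) / docs_with_term) if docs_with_term else 0.0
    # First occurrence: fraction of the document preceding the first hit
    # (1.0 is used here when the term never appears).
    first = doc.index(term) / len(doc) if term in doc else 1.0
    return {"tfidf": tf * idf, "first_occurrence": first}
```

Terms that are frequent in a paper, rare elsewhere, and introduced early tend to be better subject candidates, which is the intuition a hand-tagged training set lets the classifier calibrate.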
The technology is as impressive as was the decision to rebrand eScholarship from a repository into a publishing service. I have seen the next-gen repository, and its name is eScholarship.