December 6, 2025

The Power and the Glory of Data Munging

My most recent post was on some of my issues in working with Google Fusion Tables. As a part of that work, I found that I could not simply ingest the data as it was and have it depicted the way I wanted it. First I needed to merge some of the data points so that only one “dot” was produced for the same address. That seems trivial, and it sort of is, but if I did not know a programming language it likely would have been a lot more difficult than it was.
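As a rough illustration of that kind of merge, here is a minimal sketch in Python (not the Perl script the post refers to, which is not shown). The field names `name` and `address`, the sample rows, and the choice to join the merged names with a semicolon are all assumptions made for the example, not details of the actual Fusion Tables data.

```python
def merge_by_address(rows):
    """Collapse rows that share an address into one row per address."""
    merged = {}
    for row in rows:
        # Normalize case and spacing so trivial differences
        # do not split what is really the same address.
        key = " ".join(row["address"].lower().split())
        if key not in merged:
            merged[key] = {"address": row["address"].strip(), "names": []}
        merged[key]["names"].append(row["name"])
    # One output row (one "dot") per distinct address.
    return [
        {"address": m["address"], "name": "; ".join(m["names"])}
        for m in merged.values()
    ]


# Hypothetical sample data: two entries share an address.
rows = [
    {"name": "Main Library", "address": "100 Main St"},
    {"name": "Archive", "address": "100  main st"},
    {"name": "Branch", "address": "5 Oak Ave"},
]
result = merge_by_address(rows)
```

With this sample input, the two entries at 100 Main St end up in a single row whose name is "Main Library; Archive", so a map would show one dot for that address instead of two.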

As it turns out, I’ve spent a lot of time in my career transforming data (“transforming” being the highfalutin name for “munging”). Activities like:

  • Taking data in one format and exporting it in another.
  • Adding or removing or merging data elements.
  • Combining one pile of data with one or more other piles.
  • Extracting data from a larger pile and doing something interesting with it.
  • Scraping content out of an HTML file.
  • Etc. (the sky’s the limit as they say).
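To make the first two items in that list concrete, here is a small sketch in Python, using only the standard library, that reads CSV, drops one column, and writes JSON. The function name `csv_to_json`, the column names, and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text, drop=()):
    """Convert CSV text to a JSON array, omitting any columns named in `drop`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        {key: value for key, value in row.items() if key not in drop}
        for row in reader
    ]
    return json.dumps(records, indent=2)


# Hypothetical sample input: three columns, one of which we do not want.
sample = (
    "title,year,internal_id\n"
    "XML in Libraries,2002,a1\n"
    "Managing the Digital Library,2004,b2\n"
)
output = csv_to_json(sample, drop=("internal_id",))
```

Note that `csv.DictReader` returns every value as a string, so the years come out as `"2002"` and `"2004"`; converting types would be one more small munging step.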

My favorite tool for doing all of this is Perl, partly because it is what I’m familiar with but, more importantly, because it is so darn good at it. Heck, someone even wrote a book about it. Perl uses regular expressions to enable you to match just about any string of characters. Sure, the first “regex” you see will probably explode your head, but if you take it bit by bit the full power and the glory is revealed. Add to that a few other great operations like “split” (splitting apart a string or a file based on a matched string) and there is little you can’t do.
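For readers who have not yet met a regex, here is a minimal sketch of the two ideas, a split on a matched pattern and a regular expression read bit by bit. It is written in Python, whose `re.split` stands in for Perl's `split` here; the sample line and its pipe-delimited layout are assumptions for the example.

```python
import re

# Hypothetical input line: three fields separated by pipes.
line = "Tennant, Roy | 2002 | XML in Libraries"

# Split on a pipe plus any whitespace around it.
fields = re.split(r"\s*\|\s*", line)

# The same kind of pattern, written out piece by piece with comments.
name_pattern = re.compile(
    r"""
    ^(?P<last>[^,]+)   # everything up to the first comma: the surname
    ,\s*               # the comma and any spaces after it
    (?P<first>.+)$     # the rest of the string: the given name
    """,
    re.VERBOSE,
)

match = name_pattern.match(fields[0])
# Reorder "Last, First" into "First Last".
name = f"{match.group('first')} {match.group('last')}"
```

Here `fields` holds the three pieces of the line and `name` becomes "Roy Tennant". Spreading the pattern over several commented lines is one way to take a regex a piece at a time.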

One of the things I have discovered is that once you have acquired the ability to transform data, you begin seeing all sorts of opportunities for doing it. For example, in confronting the issue I mentioned in the first paragraph, I might have simply given up if I didn’t know how easy it would be to write a Perl script to transform the data.

There is a blog post that claims that data munging is one of the three sexy skills of data geeks. I’d go even further: it is the single most essential skill for data geeks. Without it, you are not a data geek at all; you’re just a geek. And that, my friend, is pitiful.

Photo by FotoosVanRobin licensed under Creative Commons Attribution-ShareAlike 2.0 Generic license.

About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.

Comments

  1. Celia White says:

    Still a favorite task of mine, and one I rarely get to do these days. Sigh.

  2. Roy, I’m showing this to my lib-tech students tonight, as they’ll be learning the basics of regular expressions this semester. I will have to tell them I didn’t pay you to write it!

  3. Dorothea, maybe we should just make it so! I accept checks or lots of unmarked bills in small denominations. ;-) Regular expressions are the mother’s milk of data munging. Can’t live without ’em!

  4. RIGHT THEN. Unmarked bills on their way.

    (My students enjoyed the photo you selected.)