The Power and the Glory of Data Munging

By Roy Tennant on August 17, 2011

My most recent post was on some of my issues in working with Google Fusion Tables. As a part of that work, I found that I could not simply ingest the data as it was and have it depicted the way I wanted it. First I needed to merge some of the data points so that only one “dot” was produced for the same address. That seems trivial, and it sort of is, but if I did not know a programming language it likely would have been a lot more difficult than it was.
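To make that merge step concrete, here is a minimal sketch of the idea, written in Python rather than the Perl I actually used. The field names ("name" and "address") and the normalization rule are assumptions made up for illustration; they are not the real data or the real script.

```python
def merge_by_address(rows):
    """Collapse rows that share an address into a single row.

    Each row is a dict with hypothetical "name" and "address" keys.
    Addresses are compared case-insensitively with whitespace collapsed,
    which is an assumed (and deliberately crude) normalization.
    """
    merged = {}
    for row in rows:
        # Build a comparison key: lowercase, single spaces, no edge whitespace.
        key = " ".join(row["address"].lower().split())
        if key not in merged:
            # Keep the first spelling of the address we encounter.
            merged[key] = {"address": row["address"].strip(), "names": []}
        merged[key]["names"].append(row["name"])
    # One output row per distinct address, names joined into one label.
    return [
        {"address": m["address"], "name": "; ".join(m["names"])}
        for m in merged.values()
    ]
```

Feeding the merged rows to the mapping tool instead of the raw ones should then give one "dot" per address, with every name that shares the address in its label.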

As it turns out, I’ve spent a lot of time in my career transforming (the high-falutin’ name for “munging”) data. Activities like:

Taking data in one format and exporting it in another.
Adding or removing or merging data elements.
Combining one pile of data with one or more other piles.
Extracting data from a larger pile and doing something interesting with it.
Scraping content out of an HTML file.
Etc. (the sky’s the limit as they say).
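As an illustration of the first item on that list, here is a small sketch that takes comma-separated text and exports it as JSON. It is in Python rather than Perl, and the function name csv_to_json is hypothetical; treat it as one possible shape for this kind of job, not a recipe.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text (with a header row) into a JSON array of objects."""
    # DictReader uses the first line as field names for every later row.
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; type conversion would be a separate step.
    return json.dumps(list(reader), indent=2)
```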

My favorite tool for doing all of this is Perl, partly because it is what I’m familiar with, but more importantly because it is so darn good at it. Heck, someone even wrote a book about it. Perl uses regular expressions to enable you to match just about any string of characters. Sure, the first “regex” you see will probably make your head explode, but if you take it bit by bit the full power and the glory is revealed. Add to that a few other great operations like “split” (splitting a string, such as a line read from a file, wherever a pattern matches) and there is little you can’t do.
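Here is a rough sketch of both ideas: a regular expression taken bit by bit, and a split on a matched pattern. It uses Python's re module rather than Perl, so the syntax around the patterns differs from what I would write, though the patterns themselves would look much the same. The date format, the pipe-delimited line layout, and the function names are assumptions for the sake of the example.

```python
import re

# A regex taken "bit by bit": verbose mode allows whitespace and comments.
DATE_RE = re.compile(
    r"""
    (\d{4})   # group 1: four-digit year
    -         # literal hyphen
    (\d{2})   # group 2: two-digit month
    -         # literal hyphen
    (\d{2})   # group 3: two-digit day
    """,
    re.VERBOSE,
)


def reformat_dates(text):
    """Rewrite every YYYY-MM-DD date in text as MM/DD/YYYY."""
    return DATE_RE.sub(r"\2/\3/\1", text)


def split_fields(line):
    """Split a line on pipe characters, ignoring whitespace around them."""
    return re.split(r"\s*\|\s*", line.strip())
```

For example, reformat_dates("Posted 2011-08-17.") gives "Posted 08/17/2011.", and split_fields("Tennant, Roy | 2011 |  Perl") gives the three fields with the stray spaces removed.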

One thing I have discovered is that once you have acquired the ability to transform data, you begin seeing all sorts of opportunities for doing it. For example, in confronting the issue I mentioned in the first paragraph, I might have simply given up if I didn’t know how easy it would be to write a Perl script to transform the data.

There is a blog post that claims that data munging is one of the three sexy skills of data geeks. I’d go even further: it is the single most essential skill for data geeks. Without it, you are not a data geek at all; you’re just a geek. And that, my friend, is pitiful.

Photo by FotoosVanRobin licensed under Creative Commons Attribution-ShareAlike 2.0 Generic license.

Comments

Celia White says:

August 22, 2011 at 3:27 pm

Still a favorite task of mine, and one I rarely get to do these days. Sigh.

Dorothea Salo says:

September 6, 2011 at 12:46 pm

Roy, I’m showing this to my lib-tech students tonight, as they’ll be learning the basics of regular expressions this semester. I will have to tell them I didn’t pay you to write it!

Roy Tennant says:

September 6, 2011 at 12:52 pm

Dorothea, maybe we should just make it so! I accept checks or lots of unmarked bills in small denominations. ;-) Regular expressions are the mother’s milk of data munging. Can’t live without ’em!

Dorothea Salo says:

September 6, 2011 at 10:16 pm

RIGHT THEN. Unmarked bills on their way.

(My students enjoyed the photo you selected.)
