May 22, 2018

Data Recovery From Corrupt MS Office Files

Anyone who follows me on Facebook knows that I recently experienced a corrupted Microsoft PowerPoint file. I still don’t know what caused it, but the upshot was that a file I had worked on for an hour (and saved!) would no longer open, despite several different attempts.

Finally, in frustration, I set out to recreate it. As anyone who has had to recreate something from scratch knows, one is left with the distinct impression that the second attempt will inevitably pale in comparison to the first, most brilliant, effort. So I was at some pains to recover whatever I could from the borked file. Thankfully, I knew exactly what I had to do.

Since I was working with a relatively recent version of Microsoft Office, and the file had the “.pptx” extension, I knew that what I really had wasn’t a single monolithic file with corrupted bits (and therefore likely irrecoverable), but instead a package of related files. Most people probably don’t know that MS Office files whose extensions end in “x” (.docx, .xlsx, .pptx) are really a set of XML files and associated media (e.g., image) files, all zipped up together into one file.
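For readers who would rather confirm this from a script than take it on faith, here is a minimal Python sketch using the standard `zipfile` module. The helper name `list_package_parts` is hypothetical, and the tiny in-memory package is a stand-in built only so the example runs without a real presentation; with a real file you would pass its path instead. A badly damaged archive may still be rejected by `zipfile`, so treat this as a first check rather than a guaranteed way in.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the member names of an Office package, which is a ZIP archive.

    `source` may be a file path or a file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    with zipfile.ZipFile(source) as package:
        return package.namelist()


# Build a tiny stand-in package in memory (NOT a valid presentation,
# just enough structure to show the idea).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("ppt/slides/slide1.xml", "<sld/>")
    package.writestr("ppt/media/image1.png", b"not a real image")

# is_zipfile() answers the basic question: is this "document" a ZIP archive?
looks_like_zip = zipfile.is_zipfile(buffer)
parts = list_package_parts(buffer)

print(looks_like_zip)
for name in parts:
    print(name)
```

With a real file, something like `list_package_parts("talk.pptx")` (the file name is made up) would show the same kind of folder layout that an unzip utility produces.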

Therefore, my first move was to drop the corrupted PowerPoint file on top of my “StuffIt Expander” utility (I’m on a Mac, so use whatever “unzip” utility you have), which created a folder with all of the constituent parts. The contents of the folder looked like this:

Then it simply became a matter of rooting around to find the bits I needed. In the /ppt/media folder I found the slide images, which I could then paste into a new presentation. Since they were saved in numeric order, the file names even preserved the order in which they appeared in the slide show. To recover the text on the slides, I simply needed to open up the XML files in the /ppt/slides folder and find the text bits among the XML markup. Again, since these files are saved in numeric order, the slide show order was preserved.
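The same rooting around can be sketched in Python, under some assumptions. The function names (`recover_slide_text`, `list_media`, `slide_number`) are hypothetical, and the package built at the bottom is a stand-in so the example runs on its own. The sketch assumes that slide text lives in `<a:t>` elements in the DrawingML namespace, and it sorts slides by the number in the file name, so that slide10 lands after slide2. The file numbering usually follows the slide show order, but the authoritative order is recorded in ppt/presentation.xml, so treat the numbering as a good first guess.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs on a slide are stored in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_number(name):
    """Pull the number out of a name like 'ppt/slides/slide12.xml'."""
    return int(re.search(r"slide(\d+)\.xml$", name).group(1))


def recover_slide_text(source):
    """Return {slide number: [text runs]} for every slide part in the package."""
    recovered = {}
    with zipfile.ZipFile(source) as package:
        slides = [n for n in package.namelist()
                  if re.fullmatch(r"ppt/slides/slide\d+\.xml", n)]
        # Sort numerically, not alphabetically, so slide10 follows slide2.
        for name in sorted(slides, key=slide_number):
            root = ET.fromstring(package.read(name))
            recovered[slide_number(name)] = [
                node.text for node in root.iter(A_NS + "t") if node.text
            ]
    return recovered


def list_media(source):
    """Return the names of the media parts (images and so on) in the package."""
    with zipfile.ZipFile(source) as package:
        return [n for n in package.namelist() if n.startswith("ppt/media/")]


# Stand-in package for demonstration only; slides are written out of order
# on purpose to show the numeric sort.
SLIDE = ('<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
         'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
         '<p:cSld><p:spTree><p:sp><p:txBody><a:p><a:r><a:t>{}</a:t></a:r></a:p>'
         '</p:txBody></p:sp></p:spTree></p:cSld></p:sld>')

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("ppt/slides/slide10.xml", SLIDE.format("Tenth slide"))
    package.writestr("ppt/slides/slide2.xml", SLIDE.format("Second slide"))
    package.writestr("ppt/slides/slide1.xml", SLIDE.format("First slide"))
    package.writestr("ppt/media/image1.png", b"not a real image")

text = recover_slide_text(buffer)
media = list_media(buffer)

for number, runs in text.items():
    print(number, runs)
print(media)
```

To pull the images out of a real file, `ZipFile.extract` or `ZipFile.extractall` from the same module will write the media parts to disk, which is what an unzip utility does for you.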

So although it was MS Office that somehow ruined my day, it was the fact that the “document” wasn’t really a document at all that ended up saving it. Microsoft: It’s a complicated relationship we have.

About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.


  1. Shane White says:

    I just had a look inside one of my docx files – brilliant. I can see this coming in handy whenever disaster strikes

  2. actually some 3rd party software can be helpful. Minitool power data recovery is one of my fav tools.