Friday, July 12, 2013

Reading the Microsoft Word docx file format

After doing some programming to read Microsoft Word files, I thought I'd write about how the Word 2007 file format, also known as Office Open XML, is put together. This isn't a complete reference, but it should get you started.

Cracking the door open

When investigating a mystery file, the first thing a Unix junkie does is run file on it. file is a nifty program that tries to identify what sort of data it's looking at by inspecting the contents, without paying any attention to the file extension. Let's do that now:
$ ls
Lecture 1.docx
$ file Lecture\ 1.docx 
Lecture 1.docx: Zip archive data, at least v2.0 to extract
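
If you'd rather do the same check from a program, here is a minimal Python sketch of the idea. It looks for the four-byte signature that starts a zip local file header and then lists the archive members with the standard zipfile module. The helper names (looks_like_zip, list_members, make_sample_archive) are my own inventions for illustration, and the sample archive is a tiny stand-in built in memory, not the real Lecture 1.docx.

```python
import io
import zipfile

# Signature that begins a zip local file header ("PK" followed by 0x03 0x04).
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_zip(data):
    """Return True if the bytes start with the zip local file header signature."""
    return data[:4] == ZIP_MAGIC


def list_members(data):
    """Return the member names stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def make_sample_archive():
    """Build a small in-memory zip as a stand-in for a .docx file (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<placeholder/>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_archive()
    print(looks_like_zip(sample))
    print(list_members(sample))
```

To try this on a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the same two functions.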