I recently learned about the basic structure of the .docx file (it's a specially structured zip archive). However, a .docx is not formatted like a .doc.
How does a doc file work? What is the file format, structure, etc?
The full format for binary .doc files is documented in this PDF, which is linked from the Wikipedia article on .doc.
It's not a direct answer to your question, but I highly recommend reading Joel Spolsky's article, Why are the Microsoft Office file formats so complicated? (And some workarounds). It will give you some insight into how complex the .doc format really is - and why. Joel also gives a very basic overview of what the .doc format consists of:
You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It's a whole hierarchical file system.
(The quote refers to Excel files, but it applies to Word documents as well.) The article is informative and helps explain why .docx and ODF files look so much more logically structured and designed when examined from the outside.
The basic idea behind the MS Word DOC format is an OLE Compound Document which, as Kibbee has already written, is basically a memory dump. It's a very complex and convoluted way to store documents, but if you've ever really dug into Word you'll know how many features it has, and if you have used it in a business setting you'll have a good feeling for how it integrates with the other programs in the Office suite.
In general, OLE Compound Documents are very extensible structures that allow you to stuff all kinds of data into one file, and even, to some degree, to handle data you don't have an application installed for. For example, if you insert an Equation object (from the MS Equation Editor) into a document, it gets stored as a sub-object, which is like a file inside the file. This object doesn't just contain the data Equation Editor needs to edit and render it; it also stores a generic bitmap (or metafile, maybe) representation, so it can be displayed, though not edited, on a machine without Equation Editor installed.
This was the why; for the how, you'll have to read the specifications other people have linked to already ;)
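To make the "file system inside a file" idea a bit more concrete, here is a minimal sketch that reads a few fields from the fixed 512-byte header at the start of an OLE compound file. The field offsets follow my reading of the published compound file format; the function name `parse_cfb_header` and the dictionary keys are my own, and this only inspects the header, it does not walk the directory or extract streams.

```python
import struct

# Every OLE compound file begins with this 8-byte signature.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

def parse_cfb_header(data: bytes) -> dict:
    """Parse a few header fields of an OLE compound file (hypothetical helper)."""
    if len(data) < 76 or data[:8] != OLE_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # Offset 24: minor version, major version, byte order mark,
    # sector shift, mini sector shift (five little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offset 44: number of FAT sectors, then the first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,        # usually 512 or 4096 bytes
        "mini_sector_size": 1 << mini_shift,     # usually 64 bytes
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

You would call it on the first 512 bytes of a file, e.g. `parse_cfb_header(open("some.doc", "rb").read(512))`, where `some.doc` is a placeholder path. The directory sector it points at is where the named streams live, which is the part that behaves like a file system.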
If you want the easy way out to work with the files though, make sure your software runs on a Windows machine with Word installed, then use COM/OLE Automation to open and manipulate the documents. You won't have to worry about file format then.
.doc is the binary format for Word documents. Here's the Microsoft Office Word 97-2007 Binary File Format Specification [*.doc] document.
The .doc format is quite complex. Like most Microsoft formats, it reflects a long history of changes between versions and legacy support. They published it not too long ago, so if you want to view it (and other pre-Office 2007 formats), knock yourself out here.
There's Microsoft Word's .doc and then there's plain text .doc. It sounds like you're wondering about the proprietary Microsoft format.
From Wikipedia:
The DOC format varies among Microsoft Office Word Formats. Word versions up to 97 used a different format from Microsoft Word version between 97 and 2003.
It wasn't until Word 2007 that .docx was introduced. A .docx is a packaged file (a ZIP archive) whose contents are structured XML documents.
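If you just need to tell the two container types apart, a rough sketch is to look at the first few bytes of the file. The function names `sniff_word_container` and `sniff_file` are hypothetical. Note the limits: the OLE signature only says "compound file" (an old .xls has the same one), the ZIP signature only says "ZIP archive", and a plain-text file named .doc will come back as unknown.

```python
# Leading bytes of the two container formats.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file (binary .doc)
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header (.docx)

def sniff_word_container(head: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if head.startswith(OLE_MAGIC):
        return "ole-compound"
    if head.startswith(ZIP_MAGIC):
        return "zip-package"
    return "unknown"

def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file and classify them."""
    with open(path, "rb") as f:
        return sniff_word_container(f.read(8))
```

This checks the container only. To confirm a ZIP really is a Word document you would still have to open it and look for the expected parts inside.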