41
votes

Has anyone come across a Git or Hg plugin for "meaningful" diffs/merging/branching of OpenOffice or Microsoft Word files?

I know I can check in .doc files, but both Git and Hg treat them as binary blobs. I'd like to be able to do all (or at least many) of the normal revision-based operations on the text of the file.

And yes, I do know that I should be using LaTeX or converting files back and forth to RTF. I'm just looking for a more "native" solution since I'm trying to manage collaboration between techies and "management people".

This is related to my question on Biostar here: http://biostar.stackexchange.com/questions/1749/writing-collaboration-with-source-control-and-microsoft-word

Thanks.

8
Not an answer: Use a wiki. If you need to version it/carry it around, look for a git/hg enabled wiki with a webserver (techies can use the wiki files/a local copy, management guys use the normal web frontend) - ZeissS
Management refuses to use anything besides Microsoft Word. Google Docs was almost a possibility, but they were put off by the idea of someone else seeing their manuscripts. I didn't tell them that during the e-mail process upwards of 20 computers "see" their manuscripts too, for fear of being reverted back to a paper system ;) - JudoWill
You probably should tell them that. Or how Google Docs or a properly set up wiki is more secure than bouncing unencrypted email everywhere. At best they'll listen and let you make things better. At worst you'll still be stuck with the same problem. - majinnaibu
In an academic setting, LaTeX is the way to go; because of the ability to separate form from content, it becomes a lot easier to collaborate on the text. Naturally, since it is source code, it is simple to put into version control using git. - TamaMcGlinn

8 Answers

10
votes

How about:

  1. Save your Word docs in XML.
  2. Commit your XML Word files.
  3. Diff using an external XML diff tool. For example:

    $ git difftool -t xmldiff c3d293 498571

Transforming the XML files to have one element per line should make the check-in process run efficiently and also allow the external XML diff tool to process quickly.
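The "one element per line" transformation could be sketched roughly as follows. This is only an illustration: `one_element_per_line` is a made-up helper name, and it assumes that breaking the line wherever two tags touch (`><`) is acceptable for your files, since text inside elements is left alone.

```sh
# Hypothetical helper (not part of Git or Word): insert a line break
# between every pair of adjacent tags so that line-based diffs become
# readable. Reads the named files, or stdin if none are given.
one_element_per_line() {
    awk '{ gsub(/></, ">\n<"); print }' "$@"
}
```

You would run something like `one_element_per_line report.xml > report.lines.xml` before committing, where `report.xml` is a placeholder file name.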


10
votes

A nice trick I was able to come up with that also works on Open Office files, PPTs, etc.:

http://xcafebabe.blogspot.hu/2012/09/sexy-comparison-of-word-documents-with.html

Here's a screenshot that demonstrates the result:

[Screenshot of the resulting document comparison]

9
votes

If you are on MS Windows, use TortoiseGit. I just had to go through this painful experience, and TGit, although inelegant, takes some of the pain out of it. A couple of other points:

  • Surprisingly git diff and gitk both do a reasonably good job of at least visualizing diffs between .docx (not sure about .doc, but I would assume it's the same). This is good for just a quick scan of diffs when doing commits.
  • You are completely out of luck as far as fast-forward and automerging are concerned. Unfortunately, I have not found a tool that can handle this (although I like the XML idea above), so you will have to do all merges manually.
  • Microsoft Word (MS Word) has a decent, if flawed, merge tool. AFAIK, it can only do 2-way merges (i.e., X0 + dX = X1), not the 3-way or 2-parent merges that are more common in version control (i.e., X0 + dX1 + dX2 = X1). You could resolve merge conflicts using this tool, but there would be some legwork: checking out each branch, exporting HEAD as an untracked version, etc.

    X0 = *.BASE.docx,
    X0 + dX1 = *.LOCAL.docx and
    X0 + dX2 = *.REMOTE.docx
    
  • Luckily, this is exactly what TGit (and TSVN too) does. Unfortunately, I would avoid rebase: if you have to replay several changes in a row, it can be very tiring. Merge for short documents is fine, just not great.
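If you are not using TGit, the BASE/LOCAL/REMOTE files can be produced by hand while a merge is stopped on a conflict. A minimal sketch, assuming the conflicted path is given relative to the repository root; `export_merge_stages` is a hypothetical helper name, while `:1:`, `:2:` and `:3:` are Git's index stages for the common ancestor, your side and the other side.

```sh
# Hypothetical helper: write the three versions of a conflicted file
# next to it, using the *.BASE / *.LOCAL / *.REMOTE naming from above.
# Only works while the file is in a conflicted (unmerged) state.
export_merge_stages() {
    file=$1
    stem=${file%.*}      # path without the extension
    ext=${file##*.}      # extension, e.g. docx
    git show ":1:$file" > "$stem.BASE.$ext" &&    # common ancestor (X0)
    git show ":2:$file" > "$stem.LOCAL.$ext" &&   # current branch (X0 + dX1)
    git show ":3:$file" > "$stem.REMOTE.$ext"     # branch being merged (X0 + dX2)
}
```

After something like `export_merge_stages report.docx` (a placeholder name), you could compare the exported files in Word, save the result over the original, and `git add` it to finish the merge.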

4
votes

Answering JudoWill's question - Workshare is probably the leading tool used by lawyers.

3
votes

I compiled instructions from multiple places here: http://bit.ly/17LaxVY

# download docx2txt by Sandeep Kumar
wget -O docx2txt.pl http://www.cs.indiana.edu/~kinzler/home/binp/docx2txt

# make a wrapper (quote "$1" so file names with spaces work)
echo '#!/bin/bash
docx2txt.pl "$1" -' > docx2txt
chmod +x docx2txt docx2txt.pl

# make sure docx2txt.pl and docx2txt are in your PATH. Here's a guide:
# http://shapeshed.com/using_custom_shell_scripts_on_osx_or_linux/
mv docx2txt docx2txt.pl ~/bin/

# set the git attributes (unfortunately I don't think this can be set by default; you have to create it for every project, from inside the repository)
echo "*.docx diff=word" > .git/info/attributes

# add the following to ~/.gitconfig
[diff "word"]
    binary = true
    textconv = docx2txt

# add a new alias
[alias]
    wdiff = diff --color-words

# try it
git init

# create my_file.docx, add some content

git add my_file.docx

git commit -m "Initial commit"

# change something in my_file.docx

git wdiff my_file.docx

# awesome!

It works great on OS X.
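If you can't install the Perl script, a very rough stand-in for the converter might look like the sketch below. It relies on a .docx being a zip archive whose main text lives in `word/document.xml`; the function names `strip_wordml` and `docx_to_text` are made up for illustration, and the output is much cruder than docx2txt's (no entity decoding, no lists or tables).

```sh
# Hypothetical helper: turn WordprocessingML read from stdin into plain
# text by ending a line at each paragraph close tag and dropping all
# remaining tags.
strip_wordml() {
    awk '{ gsub(/<\/w:p>/, "\n"); gsub(/<[^>]*>/, ""); print }'
}

# Hypothetical helper: print the body text of a .docx file.
# Assumes the `unzip` command is installed.
docx_to_text() {
    unzip -p "$1" word/document.xml | strip_wordml
}
```

To use it as the textconv command you would still need to put `docx_to_text` into a small executable script in your PATH, the same way the wrapper above is set up.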

2
votes

Git 1.6.1 and later come with the textconv feature, which allows using an arbitrary command to convert a file to text before diffing.

Check this also: https://gist.github.com/17twenty/4985374
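For a single repository, textconv can also be wired up with `git config` instead of editing files by hand. A minimal sketch, assuming you run it from the top of the working tree and that a converter named `docx2txt` is in your PATH; `enable_word_textconv` is a hypothetical helper name and `word` is just a label for the diff driver.

```sh
# Hypothetical helper: register a "word" diff driver in this repository
# and map *.docx files to it. The converter command (default: docx2txt)
# must take a file name and print plain text on stdout.
enable_word_textconv() {
    converter=${1:-docx2txt}
    git config diff.word.textconv "$converter" &&
    git config diff.word.binary true &&
    printf '%s\n' '*.docx diff=word' >> .gitattributes
}
```

Because this writes a `.gitattributes` file in the working tree, the mapping can be committed and shared, but each collaborator still has to set the `diff.word.*` options locally.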

1
votes

Law firms have extremely robust systems for doing this: ones that don't trust the revision history in the document (because it's externally sourced) and instead do their own comparisons and can provide deltas. If that's what they really need, you're better off buying that than putting a wrapper into git or mercurial that will never really be usable for them.

Sorry to sound like a pessimist, but it's more likely that the techies will use (while grumbling) the overpriced commercial tool than that the office folks will use git or mercurial to any level of satisfaction.

1
votes

Using svn (not git or hg, but you could have a gateway), there is an extension for OOo that works on uncompressed XML files; see my answer to a similar question. BTW, if you ever look at the plugin code and make it hg-aware instead of svn-aware, please let me know! ;-)