4
votes

I can't figure out why small differences to big files are causing my subversion repository to grow so much.

I have a zip file of the contents of a database used by some tests. I want to store each new version of the test data in our subversion repository.

I've done some experiments, checking in the last few versions of the data.zip and looking at what happens to the size of the repository. The uncompressed data is about 150MB; compressed and zipped it's ~50MB. Each new version of the data.zip file checked into the repository increases the repository's size by about 50MB. I think it should only increase by the size of a delta, which I expect to be much smaller.

Subversion uses xdelta to store compressed difference data. To confirm that SVN could do better, I downloaded xdelta and checked that there isn't much difference between two versions. Indeed,

xdelta3.0z.x86-64.exe -e -s v1_path\data.zip v2_path\data.zip v1v2_delta.file

produced a v1v2_delta.file which was about 3MB.

I've looked in the SVN repository at [myrepo]\db\revs and can see large files for each new revision

02/08/2011  11:12        57,853,082 4189
02/08/2011  11:40        51,713,289 4190
02/08/2011  11:46        52,286,060 4191

(The 4189, 4190 and 4191 are the names of files.)

I even tried zipping the data.zip without compression. This didn't make a difference to what SVN stores - from the look of it, my guess is that it is storing a compressed copy of the entire data.zip for every revision, not just the first. I'm running SVN 1.6 with an FSFS backend.

There are various other good Stack Overflow answers about committing binaries and how SVN stores deltas, e.g. "SVN performance after many revisions". But I cannot see from these why deltas aren't being stored in the above case, i.e. if xdelta can get such a small diff running standalone, surely SVN can too - or is it choosing not to?!

Edit: I've also tried tar (uncompressed) files, again SVN isn't storing them efficiently. Also I found that we have a zip file of the same data format (although much smaller) in a different repository where SVN has just stored diffs.

So the summarized version of this question is: SVN can efficiently store binary files, e.g. 10 slightly different CAD files are just 1.2 times the size of 1. SVN even can be space efficient with compressed zip files sometimes. But evidently it isn't always space efficient with binary files - under what conditions is this the case?

4
Regarding "avoid storing binary files". On Windows, this is unavoidable, especially if storing revisions of game-editor artifacts or office-based documents. "Avoid storing easily regenerable binary files" is more apt. The fact that svn can use binary deltas sets it apart from every other freely available source control system out there, as none of the others can do this -- they all recommit the binary fresh, which causes a large leap in the end size of the storage.
- user1167794

4 Answers

3
votes

Summary

Subversion will sometimes be worse than xdelta standalone because of how much memory is given to the compression. This is subversion behaviour that can't currently be changed, as of version 1.6.

Details

I asked on the subversion mailing list why the subversion repository files seemed to be bigger than they should be.

The conclusion is that xdelta can produce a smaller delta if you give it more memory.
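To get a feel for why the amount of memory matters, here is a toy sketch in Python. It is not Subversion's or xdelta's actual algorithm: it assumes a simplified scheme in which each fixed-size window of the new file may only copy data from the window at the same offset in the old file. The function name `unmatched_bytes` and the window and block sizes are made up for the illustration.

```python
import random


def unmatched_bytes(source: bytes, target: bytes, window: int, block: int = 16) -> int:
    """Toy windowed delta: count target bytes that cannot be copied from source.

    Assumption for illustration only: each target window may copy data
    solely from the source window that starts at the same offset.
    """
    unmatched = 0
    for start in range(0, len(target), window):
        src_win = source[start:start + window]
        tgt_win = target[start:start + window]
        # Index every block-sized substring of the source window.
        index = {src_win[i:i + block] for i in range(len(src_win) - block + 1)}
        # Walk the target window block by block and look each block up.
        for j in range(0, len(tgt_win), block):
            chunk = tgt_win[j:j + block]
            if chunk not in index:
                unmatched += len(chunk)
    return unmatched


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
# New version: 8 KiB of fresh data inserted at the front, rest unchanged.
new = bytes(rng.getrandbits(8) for _ in range(8 * 1024)) + old

small = unmatched_bytes(old, new, window=4 * 1024)
large = unmatched_bytes(old, new, window=len(new))
print(f"4 KiB window:      {small} of {len(new)} bytes stored as new data")
print(f"whole-file window: {large} of {len(new)} bytes stored as new data")
```

With the small window, the unchanged data has shifted further than one window, so the toy encoder finds nothing to copy. With a window covering the whole file, only the inserted bytes count as new.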

Reading back in this thread, there is another example of someone else who had the same problem.

With credit and thanks for this to various people on the subversion mailing lists, both recently and four years ago.

Also having this problem?

If you're analysing disk usage by the subversion repository, understand skip deltas and use this grep DELTA trick to figure out the base being used for the delta.
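If grep isn't handy (e.g. on Windows), the same idea can be sketched in Python. This assumes the FSFS rev file layout as I understand it, where each representation starts with a line reading `PLAIN`, `DELTA`, or `DELTA <rev> <offset> <length>`. The function names are my own, and since a rev file is mostly binary, a stray match inside the data is possible.

```python
import re

# Representation headers, as I understand the FSFS layout (an assumption):
#   PLAIN                          full text
#   DELTA                          delta against an empty base
#   DELTA <rev> <offset> <length>  delta against a representation in <rev>
REP_HEADER = re.compile(rb"^(PLAIN|DELTA(?: (\d+) \d+ \d+)?)$", re.MULTILINE)


def representation_headers(data: bytes):
    """Return (kind, base_revision) pairs; base_revision is None if absent."""
    found = []
    for m in REP_HEADER.finditer(data):
        kind = "PLAIN" if m.group(1) == b"PLAIN" else "DELTA"
        base = int(m.group(2)) if m.group(2) else None
        found.append((kind, base))
    return found


def scan_rev_file(path):
    """Read one file from db/revs and list its representation headers."""
    with open(path, "rb") as f:
        return representation_headers(f.read())


# Synthetic stand-in for a rev file, not real repository data.
sample = (
    b"DELTA 4189 0 57852000\nSVN\x01binary...ENDREP\n"
    b"PLAIN\nK 3\nENDREP\n"
    b"DELTA\nSVN\x01ENDREP\n"
)
print(representation_headers(sample))
```

Calling `scan_rev_file` on a file under [myrepo]\db\revs would show which revision, if any, each delta is based on. Because of skip deltas, that base is not necessarily the previous revision.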

And assuming, like me, you really do want to store binary files in the repository, here's my guess at some workarounds (none of them very easy!):

  1. Modify the subversion source code and build your own with the xdelta memory window set to be bigger
  2. Do your own xdelta-ing - check the deltas into source control and have some crazy ass process for reconstructing
  3. Migrate to Git - it's bound to have better compression (wild speculation)

1
votes

I would think that the compression will completely change the makeup of the binary file, so svn will have to store huge deltas. Even changing a few characters of the contents of a compressed file can drastically change it.

Storing binaries in source control is generally a bad idea and I think you should look for an alternative.

1
votes

A compressed file's binary content might change drastically when files are added to or modified in a compressed archive. Though it can happen that changes are confined to particular elements of the archive and large areas of the compressed file stay the same, it is a matter of "luck" whether this will be the case (of course there is no real luck in this, but it is a bit complex to plan on achieving it).

This is quite normal in entropy encoding algorithms, such as Huffman (to name the simplest one), as the frequencies of the symbols change when files are added or modified. If this takes place at the beginning of the archive's contents, it can severely affect the entire content of the file following the change.
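One way to check this on your own data is a small experiment like the following Python sketch. It uses zlib (DEFLATE, the algorithm typically used in zip files) on made-up sample text, changes a single byte at the start, and counts how many bytes of the compressed output differ. The sample data and variable names are purely illustrative, and results will vary with the input.

```python
import random
import zlib

rng = random.Random(1)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]

# Made-up, compressible sample data standing in for real file contents.
original = b" ".join(rng.choice(words) for _ in range(20000))
modified = b"X" + original[1:]  # change only the very first byte

c1 = zlib.compress(original, 9)
c2 = zlib.compress(modified, 9)

# Count positions where the two compressed streams disagree,
# treating any length difference as differing bytes too.
differing = sum(a != b for a, b in zip(c1, c2)) + abs(len(c1) - len(c2))

print(f"uncompressed: 1 of {len(original)} bytes changed")
print(f"compressed:   {differing} of {max(len(c1), len(c2))} bytes differ")
```

If the second number is a large share of the compressed size, a byte-level delta between the two compressed files has little to work with, even though the underlying data barely changed.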

-1
votes

Did you use the fsfs file system backing? As I recall, it stores a new copy each time (although it may be compressed). Why are you expecting SVN to store diffs of binary files? SVN is a source code control system (meaning text) not a general binary control system (although it doesn't do as badly as it could with storing binaries).