4
votes

We have an Hg repo that is over 6GB and has 150,000 changesets. It holds 8 years of history for a large application. Over those 8 years we have used a branching strategy: we create a new branch for each feature and, when it is finished, close the branch and merge it into default/trunk. We don't prune branches after changes are pushed into default.

As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.

One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.

I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startrev=<rev>. However, Mercurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo and maintain a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to me why the hg transplant command considered those changesets empty. Also, if I were to get this working, does anyone know whether it maintains a file's history, or is my repo going to see the transplanted portion as separate, maybe showing up as a delete/create or something?

Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.

Thanks

2
Hello Bryan, could you explain why you want to make the repo much smaller? Is it because cloning is too slow? Is it because some operations (commit, push, pull) are too slow? Some experimental changes have landed recently in Mercurial that could help you, but first I would need more information about your repository. Could you run hg heads -T "\n" | wc -l? It will give the number of open heads in your repository. - Boris Feld

2 Answers

5
votes

You might be hitting a known bug with the underlying storage compression. 6GB for 150,000 revisions is a lot.

This storage issue is usually encountered in very branchy repositories, in an internal data structure that stores the content of each revision. The current fix for this bug can reduce repository size by up to a factor of ten.

Possible Quick Fix

You can blindly try to apply the current fix for the issue and see if it shrinks your repository.

  • upgrade to Mercurial 4.7,
  • add the following to your repository configuration:

    [format]
    sparse-revlog = yes

  • run hg debugupgraderepo --optimize redeltaall --run (this will take a while)

Some other improvements are also turned on by default in 4.7, so upgrading to 4.7 and running debugupgraderepo should help in all cases.
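The quick fix above could be scripted roughly like this. It is only a sketch: it assumes Mercurial 4.7 or later is installed as hg, the function names are made up for illustration, and it is best tried on a copy of the repository first.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of Mercurial.

# Append the sparse-revlog setting to the repository's own hgrc,
# unless a sparse-revlog line is already present.
enable_sparse_revlog() {
    repo=$1
    hgrc="$repo/.hg/hgrc"
    if [ -f "$hgrc" ] && grep -q '^[[:space:]]*sparse-revlog' "$hgrc"; then
        return 0
    fi
    printf '\n[format]\nsparse-revlog = yes\n' >> "$hgrc"
}

# Without --run, debugupgraderepo only reports what it would change.
preview_upgrade() {
    hg -R "$1" debugupgraderepo --optimize redeltaall
}

# Recompute all deltas; this is the slow step.
run_upgrade() {
    hg -R "$1" debugupgraderepo --optimize redeltaall --run
}
```

A typical sequence would be enable_sparse_revlog /path/to/copy, then preview_upgrade /path/to/copy to check the report, then run_upgrade /path/to/copy.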

Finer Diagnostic

Can you tell us the size of the .hg/store/00manifest.d file compared to the full size of .hg/store?

In addition, can you provide us with the output of hg debugrevlog -m?
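If it helps, here is one way to get that size comparison as a single percentage. This is a sketch using only find, wc and awk; the helper names are made up, and it counts file bytes rather than disk blocks.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Sum the byte sizes of all regular files under a path (file or directory).
bytes_under() {
    find "$1" -type f -exec wc -c {} + |
        awk '$2 != "total" { sum += $1 } END { printf "%.0f\n", sum }'
}

# Print the manifest data file's share of the whole store, in percent.
manifest_share() {
    store=$1
    manifest=$(bytes_under "$store/00manifest.d")
    total=$(bytes_under "$store")
    awk -v m="$manifest" -v t="$total" 'BEGIN {
        if (t == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * m / t
    }'
}
```

Run it as manifest_share /path/to/repo/.hg/store. A share that dominates the store would point towards the manifest storage issue described above.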

Other reasons?

Another reason for a repository to grow is large files (usually binary) being committed to it. Do you have any of them?

0
votes

The problem is that the hash id for each revision is calculated based on a number of items including the parent id. So when you change the parent you change the id.
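To make that concrete, here is a toy model of the idea. It is not Mercurial's exact format: the real id is a SHA-1 over the binary parent ids and the revision data, whereas this sketch hashes readable strings, and the function names are invented for the example.

```shell
#!/bin/sh
# Toy model only: NOT Mercurial's real byte layout.

# SHA-1 of stdin as hex, using whichever common tool is installed.
sha1_hex() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum | cut -d' ' -f1
    else
        shasum -a 1 | cut -d' ' -f1
    fi
}

# $1, $2: parent ids (hex strings), $3: revision content.
toy_node_id() {
    printf '%s%s%s' "$1" "$2" "$3" | sha1_hex
}
```

Calling toy_node_id with the same content but a different first parent gives a different id, which is why reparenting a changeset also changes the id of every descendant.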

As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.

The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.


The way I would do it is to have three repos:

  • Working
  • Merge
  • Archive

The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.

STOP: If you're going to do this, back up your existing repos, especially the Archive repo, before trying it. It might get trashed if you run this over the top of it. It might also be fine, but I'm not having any problems on my conscience!

  1. Pull both Working and Archive into the Merge repo.
    You now have a Merge repo with two completely independent trees in it.

  2. Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
    So your splicemap would just be something like:
    hash-of-working-commit-2 hash-of-archive-old-tip

  3. Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
    hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
    You might want to try writing it to a differently named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive. I'm not sure if that will work, but if it does it would probably be quicker.

Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.
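The repeatable part could look something like the sketch below. The repository paths and function names are placeholders, it assumes the convert extension ships with your hg, and the 40-character check reflects my understanding that convert wants full hashes in the splicemap rather than short ones.

```shell
#!/bin/sh
# Sketch only: paths and function names are placeholders.

# Write a one-line splicemap: "<child hash> <new parent hash>".
write_splicemap() {
    child=$1 parent=$2 out=$3
    for h in "$child" "$parent"; do
        # Assumption: full 40-character lowercase hex hashes are expected.
        case $h in
            *[!0-9a-f]*) echo "not a hex hash: $h" >&2; return 1 ;;
        esac
        [ "${#h}" -eq 40 ] || { echo "not a full hash: $h" >&2; return 1; }
    done
    printf '%s %s\n' "$child" "$parent" > "$out"
}

# Pull the latest Working changesets into Merge, then convert into Archive.
update_archive() {
    working=$1 merge=$2 archive=$3 splicemap=$4
    hg -R "$merge" pull "$working" || return 1
    # saverev records the source revision id in each converted changeset.
    hg --config extensions.convert= \
       --config convert.hg.saverev=true \
       convert --splicemap "$splicemap" "$merge" "$archive"
}
```

Full hashes can be read with hg log -r REV -T '{node}\n'. The very first pull of the two unrelated repos into Merge will most likely need hg pull --force; later pulls from Working should not.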