Does git subtree inevitably produce duplicated commits?

2

votes

I just started using git subtree and I got confused.

I have a "main" repo and a "subtree" repo.
The "main" repo includes the "subtree" repo.

In this situation, my question is: does the "main" repo have no choice but to end up with duplicated commits?

For example, let's assume that I pushed some commits to the "main" repo and then ran 'git subtree push' to the "subtree" repo.

After that, when I run the 'git subtree pull ~' command on the "main" repo, all commits, even the ones I pushed from "main" to "subtree", are pulled into the "main" repo, and the "main" repo gets duplicated commits.

Is this unavoidable? Or did I make a mistake?

git git-subtree

0

votes

There are two different ways to integrate one repository into another: submodule and subtree.

subtree works by "copying" all commits to the target repository, which is then able to "live on its own".

submodule works by "referencing" a commit from another repository. No "copying" is needed, but this needs access to both repositories afterwards.

So yes. It is perfectly normal for subtree to "duplicate" your commits.

0

votes

First, with the existing git-subtree tool, the duplicates can be avoided by passing the --squash option. This "avoids" the problem by simply suppressing the individual commits. From the man page:

--squash
This option is only valid for add, merge, and pull commands. Instead of merging the entire history from the subtree project, produce only a single commit that contains all the differences you want to merge, and then merge that new commit into your project.

Use it on every add, merge, and pull, or you will pull in the duplicates. You will not see any of the remote commit history this way.

If you want to retain the remote commit history, duplicates are not unavoidable in any fundamental sense. They are only unavoidable with the subtree methods that have been described and implemented so far (and perhaps the only ones widely known).
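To make the --squash workflow concrete, here is a minimal sketch that runs entirely in a throwaway directory. It assumes git-subtree is installed (some distributions package it separately from git). The repo layout, the `lib` prefix, the file names, and the commit messages are all made up for illustration.

```shell
#!/bin/sh
set -e

# Throwaway workspace; every path and name below is illustrative.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in for the "subtree" repo, with two commits.
git init -q "$work/sub"
cd "$work/sub"
git symbolic-ref HEAD refs/heads/main
echo one > a.txt && git add a.txt && git commit -q -m "sub: add a.txt"
echo two > b.txt && git add b.txt && git commit -q -m "sub: add b.txt"

# A stand-in for the "main" repo, with one commit of its own.
git init -q "$work/main"
cd "$work/main"
git symbolic-ref HEAD refs/heads/main
echo main > README && git add README && git commit -q -m "main: initial"

# Import the subtree under lib/ as a single squashed commit.
git subtree add --prefix=lib "$work/sub" main --squash

# Upstream gains a commit; remember its hash for the check below.
cd "$work/sub"
echo three > c.txt && git add c.txt && git commit -q -m "sub: add c.txt"
sub_head=$(git rev-parse HEAD)

# Pull it in, again with --squash. -m supplies the merge message
# so no editor is opened.
cd "$work/main"
git subtree pull --prefix=lib "$work/sub" main --squash -m "Merge lib updates"

# The history shows squash and merge commits, not the "sub: ..." commits.
git log --oneline
```

The point to notice is that the subtree's own commits never become ancestors of the main repo's HEAD, so there is nothing to duplicate. The cost is that the per-commit history of the subtree is not browsable from the main repo.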

git-alltrees avoids these duplicates by using a more complex translation strategy. To understand duplicates you have to understand what identifies a commit. Commits are identified by their hash, which is a checksum that depends on essentially all of the data related to the commit. This includes the obvious commit content and also the parent ids stored in the commit, among other things. So if one commit changes, the hashes of all its descendants change.
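The dependence of a hash on the parent id can be checked directly with git's plumbing commands. The following is a small sketch, not part of any subtree tool. It pins the author, committer, and dates so that the parent is the only thing that differs between the two "child" commits. All names in it are invented for the demonstration.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Pin identity and timestamps so only the parent varies between commits.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE="2000-01-01 00:00:00 +0000"
export GIT_COMMITTER_DATE="2000-01-01 00:00:00 +0000"

# One tree object, reused by every commit below.
echo hello > file.txt
git add file.txt
tree=$(git write-tree)

# Two root commits that differ only in their message.
root_a=$(git commit-tree "$tree" -m "root A")
root_b=$(git commit-tree "$tree" -m "root B")

# Same tree, same message, same metadata; only the parent differs.
child_a=$(git commit-tree "$tree" -p "$root_a" -m "child")
child_b=$(git commit-tree "$tree" -p "$root_b" -m "child")

# Recreating the first child from identical inputs.
child_a_again=$(git commit-tree "$tree" -p "$root_a" -m "child")

echo "child on A:       $child_a"
echo "child on B:       $child_b"
echo "child on A again: $child_a_again"
```

The two children get different hashes even though their content is identical, while rebuilding a commit from exactly the same inputs yields the same hash again. That is why a commit that comes back from a subtree round trip with a different parent or tree shows up as a new, duplicate commit.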

With subtrees, when you push to a remote you necessarily change the content of each commit. Files outside the subtree are removed and directories are changed, so the hashes change. When you pull the commits back, the directories can be changed back, but the removed files are still missing.

git-alltrees re-associates and replaces the partial commits in the pulled branch with their original commits, thus restoring the original hash. Any new commits made from the remote will branch and merge naturally. The work is done using git-filter-repo.

I'm not trying to hide that this is my work, but it's work that was intended to answer exactly this question.
