I was unable to use the most popular answer because the --batch-check
command-line switch to Git 1.8.3 (which I have to use) does not accept any arguments. The following steps have been tried on CentOS 6.5 with Bash 4.1.2.
Key Concepts
In Git, a blob holds the contents of a file. A commit may change the contents stored at a given path, so the same path can refer to a different blob depending on the commit. A given file could be the biggest in the directory hierarchy in one commit, while not in another. Therefore, asking about large commits rather than large files puts matters in the correct perspective.
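For example, the following shows how one path can map to different blobs in two commits (the path and the choice of HEAD~1/HEAD are illustrative):
git ls-tree HEAD~1 -- data/archive.tar
git ls-tree HEAD -- data/archive.tar
If the file changed between the two commits, the two commands print different blob hashes for the same path.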
For The Impatient
The command to print the list of blobs in descending order of size is:
git cat-file --batch-check < <(git rev-list --all --objects | \
awk '{print $1}') | grep blob | sort -n -r -k 3
Sample output:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e blob 305971200
7c357f2c2a7b33f939f9b7125b155adbd7890be2 blob 289163620
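The third column is the blob size in bytes. To see which path a given blob hash corresponds to, one option that uses only the commands already shown is to grep the object listing for that hash (the hash here is the first one from the sample output):
git rev-list --all --objects | grep 3a51a45e12d4aedcad53d3a0d4cf42079c62958e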
To remove such blobs, use the BFG Repo Cleaner, as mentioned in other answers. Given a file blobs.txt that contains just the blob hashes, for example:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e
7c357f2c2a7b33f939f9b7125b155adbd7890be2
Do:
java -jar bfg.jar -bi blobs.txt <repo_dir>
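After BFG has rewritten the history, its documentation recommends expiring reflogs and repacking so the stripped blobs are actually discarded, along the lines of:
git reflog expire --expire=now --all && git gc --prune=now --aggressive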
The question, however, is about finding the commits, which is more work than finding the blobs. To find out how, please read on.
Further Work
Given a commit hash, a command that prints the hashes of all objects associated with it, including blobs, is:
git ls-tree -r <commit_hash>
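Each line of its output has the form <mode> <type> <hash>, followed by a tab and the path; a blob entry looks like this (the path is illustrative, the hash is taken from the sample output above):
100644 blob 3a51a45e12d4aedcad53d3a0d4cf42079c62958e	path/to/large-file.bin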
So, if we have such output available for every commit in the repo, then, given a blob hash, the commits we want are the ones whose output contains that hash. This idea is encoded in the following script:
#!/bin/bash

DB_DIR='trees-db'

# Print the hashes of all commits whose cached tree listing contains the
# blob hash given as the first argument.
find_commit() {
    cd "${DB_DIR}"
    for f in *; do
        if grep -q "$1" "${f}"; then
            echo "${f}"
        fi
    done
    cd - > /dev/null
}

# Cache the recursive tree listing of every commit in the repo,
# one file per commit, named after the commit hash.
create_db() {
    local tfile='/tmp/commits.txt'
    mkdir -p "${DB_DIR}" && cd "${DB_DIR}"
    git rev-list --all > "${tfile}"

    while read commit_hash; do
        if [[ ! -e ${commit_hash} ]]; then
            git ls-tree -r --full-tree "${commit_hash}" > "${commit_hash}"
        fi
    done < "${tfile}"

    cd - > /dev/null
    rm -f "${tfile}"
}

create_db

# Read blob hashes from stdin, one per line, and print the matching commits.
while read id; do
    find_commit "${id}"
done
If the contents are saved in a file named find-commits.sh, then a typical invocation is as follows:
cat blobs.txt | bash find-commits.sh
As earlier, the file blobs.txt lists blob hashes, one per line. The create_db() function saves a cache of all commit tree listings in a sub-directory of the current directory.
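Putting the pieces together, one way to produce blobs.txt and run the search, assuming only the ten largest blobs are of interest (the count is illustrative), is:
git cat-file --batch-check < <(git rev-list --all --objects | \
awk '{print $1}') | grep blob | sort -n -r -k 3 | \
head -n 10 | awk '{print $1}' > blobs.txt
cat blobs.txt | bash find-commits.sh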
Some stats from my experiments on a system with two Intel(R) Xeon(R) E5-2620 2.00GHz processors, presented by the OS as 24 virtual cores:
- Total number of commits in the repo = almost 11,000
- File creation speed = 126 files/s. The script creates a single file per commit. This occurs only when the cache is being created for the first time.
- Cache creation overhead = 87 s.
- Average search speed = 522 commits/s. The cache optimization resulted in an 80% reduction in running time.
Note that the script is single-threaded, so only one core is used at any one time.
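A rough way to use more cores, not part of the measurements above, is to run one grep per blob hash in parallel against the trees-db cache once create_db() has populated it; the nproc job count here is just a convenient default:
xargs -P "$(nproc)" -I {} grep -rl {} trees-db/ < blobs.txt
Each matching file name printed (with the trees-db/ prefix) is a commit that contains the corresponding blob.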