Removing Large Files From Git Commit History

1. Overview

In this tutorial, we’ll learn how to remove large files from the commit history of a git repository using various tools.

2. Using git filter-branch

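This is the long-standing built-in method, and it lets us rewrite the history of one or more branches. The Git documentation now discourages it in favor of git-filter-repo, but it’s available wherever Git is installed.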

For example, suppose we mistakenly drop a blob file inside a project folder, and after deleting it, we still notice the file in our git history:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 9e87646        (HEAD -> master) blob file removed
* 2583677        blob file
* 34ea256        my first commit

We can remove the blob file from our git history by rewriting the tree and its content with this command:

$ git filter-branch --tree-filter 'rm -f blob.txt' HEAD

Here, the rm command removes the file from the checked-out tree of each commit. Additionally, the -f option prevents rm from failing for commits in which the file doesn’t exist. Without -f, rm returns an error for those commits, and the rewrite aborts.

Here is our git log after we ran the command:

* 8f39d86        (HEAD -> master) blob file removed
* e99a81d        blob file
| * 9e87646      (refs/original/refs/heads/master) blob file removed
| * 2583677      blob file
|/  
* 34ea256        my first commit

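We can also limit the rewrite by passing a revision range instead of HEAD, such as <commit>..HEAD, where <commit> is the last commit before the file was added. Only the commits in that range are rewritten.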
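As a minimal sketch of restricting the rewrite, the script below builds a throwaway repository (the file names, identity, and commit messages are illustrative assumptions), finds the commit that first added blob.txt, and rewrites only from that commit onward. It assumes the file wasn’t added in the root commit, because the ^ parent suffix wouldn’t resolve there.

```shell
#!/bin/sh
set -e
# Silence the deprecation notice and delay that newer Git versions print
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with the same shape as the example history
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "my first commit"
root_before=$(git rev-parse HEAD)

echo "large payload" > blob.txt
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob file removed"

# Oldest commit that added blob.txt
first_add=$(git log --diff-filter=A --format=%H -- blob.txt | tail -n 1)

# Rewrite only that commit and its descendants
git filter-branch --tree-filter 'rm -f blob.txt' "${first_add}^..HEAD" >/dev/null 2>&1

root_after=$(git rev-list --max-parents=0 HEAD)
remaining=$(git rev-list --count HEAD -- blob.txt)

echo "root commit before: $root_before"
echo "root commit after:  $root_after"
echo "commits on HEAD still touching blob.txt: $remaining"
```

Commits outside the range keep their hashes, which is why the root commit is compared before and after the rewrite.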

Our git log still shows the old commits because git filter-branch keeps a backup of the original branch under refs/original. We can delete this backup ref:

$ git update-ref -d refs/original/refs/heads/master

The -d option deletes the named ref. If we also pass the ref’s old value as an extra argument, the ref is deleted only after verifying that it still contains that value.

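Next, we expire the reflog, which still references the old commits and would otherwise keep them from being pruned:

$ git reflog expire --expire=now --all

The expire subcommand prunes reflog entries older than the given time. With --expire=now and --all, it removes every entry from the reflogs of all refs.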

Finally, we need to clean up and optimize our repo:

$ git gc --prune=now

The --prune=now option prunes unreachable loose objects regardless of their age.

After running the command, here is our git log:

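* 8f39d86        (HEAD -> master) blob file removed
* e99a81d        blob file
* 34ea256        my first commit

We can see that the backup ref and the original commits have been removed. The two rewritten commits remain, but they no longer contain blob.txt.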
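To check that the blob is really gone from the object database and not just hidden from the log, the whole sequence can be sketched in a throwaway repository. The repository contents and identity below are illustrative assumptions; the script records the blob’s object ID before the rewrite and asks Git whether that object still exists afterward.

```shell
#!/bin/sh
set -e
# Silence the deprecation notice and delay that newer Git versions print
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "my first commit"
echo "large payload" > blob.txt
git add blob.txt && git commit -q -m "blob file"
blob_id=$(git rev-parse HEAD:blob.txt)   # object ID of the unwanted blob
git rm -q blob.txt && git commit -q -m "blob file removed"

# The default branch name varies between Git setups, so look it up
branch=$(git symbolic-ref --short HEAD)

# Rewrite, drop the backup ref, expire reflogs, then prune
git filter-branch --tree-filter 'rm -f blob.txt' HEAD >/dev/null 2>&1
git update-ref -d "refs/original/refs/heads/$branch"
git reflog expire --expire=now --all
git gc --quiet --prune=now

# cat-file -e exits non-zero when the object no longer exists
if git cat-file -e "$blob_id" 2>/dev/null; then
  blob_state="still present"
else
  blob_state="gone"
fi
echo "blob $blob_id is $blob_state"
```

Skipping either the update-ref or the reflog step leaves the blob reachable, so git gc keeps it.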

Alternatively, we can run:

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch blob.txt' HEAD

This produces the same result as --tree-filter, but it’s faster because it rewrites only the index (the staging area) and never checks out each commit’s files into a working directory. The --ignore-unmatch option prevents git rm from failing for commits that don’t contain the file.

We should note that both filter-branch variants process every commit one at a time, so they can be slow on repositories with a long history.
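Removing a file can leave behind commits that no longer change anything. As a small sketch, assuming the same illustrative three-commit history, the --prune-empty option of git filter-branch drops such commits during the rewrite:

```shell
#!/bin/sh
set -e
# Silence the deprecation notice and delay that newer Git versions print
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "my first commit"
echo "large payload" > blob.txt
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob file removed"

# Remove blob.txt from the index of every commit and
# discard commits that end up with no changes
git filter-branch --prune-empty \
  --index-filter 'git rm --cached --ignore-unmatch blob.txt' HEAD >/dev/null 2>&1

commit_count=$(git rev-list --count HEAD)
echo "commits left on HEAD: $commit_count"
git log --format='%h %s'
```

In this history, both later commits only touched blob.txt, so only the first commit is left on the branch.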

3. Using git-filter-repo

An alternative approach is to use the git-filter-repo command. It’s a third-party add-on that is simpler to use and faster than git filter-branch. Moreover, it’s the tool recommended in the official Git documentation.

3.1. Installation

It requires python3 >= 3.5 and git >= 2.22.0 at a minimum; some features require git 2.24.0 or higher.

We are going to install git-filter-repo on our Linux machine. For the Windows installation guide, we can refer to the documentation.

First, we install pip for Python 3 and then git-filter-repo with the following commands:

$ sudo apt install python3-pip
$ pip install --user git-filter-repo

Alternatively, we can use the below commands to install git-filter-repo:

# Add to bashrc.
export PATH="${HOME}/bin:${PATH}"

mkdir -p ~/bin
wget -O ~/bin/git-filter-repo https://raw.githubusercontent.com/newren/git-filter-repo/7b3e714b94a6e5b9f478cb981c7f560ef3f36506/git-filter-repo
chmod +x ~/bin/git-filter-repo
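After either installation route, a quick check can confirm that the prerequisites are on the PATH. This is only a sketch; check_tool is a hypothetical helper name, not part of Git or git-filter-repo.

```shell
#!/bin/sh
# check_tool NAME: report whether NAME can be found on the PATH
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

check_tool git || true
check_tool python3 || true

# Git runs "git filter-repo" by locating git-filter-repo on the PATH
if check_tool git-filter-repo; then
  git filter-repo --version
fi
```

If git-filter-repo is reported as missing after a pip --user install, the user-level bin directory is probably not on the PATH.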

3.2. Removing the File

Let’s run the command to check our git log:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* ee36517        (HEAD -> master) blob.txt removed
* a480073        project folder

Next, let’s analyze our repo:

$ git filter-repo --analyze
Processed 5 blob sizes
Processed 2 commits
Writing reports to .git/filter-repo/analysis...done.

This generates a directory of reports of the state of our repo. The report can be found at .git/filter-repo/analysis. This information may help determine what to filter in a subsequent run. It can also help us determine if our previous filtering command actually did what we wanted it to do.
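When git-filter-repo isn’t available yet, a similar overview of the largest blobs can be sketched with plain Git plumbing. The repository below is an illustrative assumption; the pipeline itself lists every object reachable from any ref, keeps the blobs, and sorts them by size.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "project folder"
printf '%2000s' x > blob.txt             # 2000-byte stand-in for a large file
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob.txt removed"

# rev-list prints "<id> <path>"; batch-check adds type and size,
# and %(rest) carries the path through
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn |
  head -n 5)

echo "$largest"
largest_path=$(echo "$largest" | head -n 1 | awk '{ print $2 }')
echo "largest blob in history: $largest_path"
```

The deleted file still shows up because the listing covers the whole history, not just the current tree.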

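Then, let’s run git filter-repo with the --path-match option, which specifies the file to keep in the filtered history. Combining it with --invert-paths reverses the selection, so the named file is removed and everything else is kept. The --force option lets the command run even when the repository isn’t a fresh clone: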

$ git filter-repo --force --invert-paths --path-match blob.txt

Here is our new git log:

* 8940776        (HEAD -> master) project folder

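The rewrite changes the hash of every modified commit and of all commits that descend from it. In addition, commits that become empty, such as the one that only removed blob.txt, are pruned from the history.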
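The effect can be verified in a throwaway repository. The sketch below assumes git-filter-repo is on the PATH and skips itself otherwise; the repository contents are illustrative, and --path is used as the shorter spelling of --path-match.

```shell
#!/bin/sh
set -e

if ! command -v git-filter-repo >/dev/null 2>&1; then
  echo "git-filter-repo not installed; skipping"
  result="skipped"
else
  repo=$(mktemp -d)
  cd "$repo"
  git init -q
  git config user.name "Demo User"
  git config user.email "demo@example.com"

  echo "hello" > index.html
  echo "large payload" > blob.txt
  git add . && git commit -q -m "project folder"
  git rm -q blob.txt && git commit -q -m "blob.txt removed"

  # Drop blob.txt from every commit; --force because this is not a fresh clone
  git filter-repo --force --invert-paths --path blob.txt >/dev/null 2>&1

  # Number of commits on HEAD that still touch blob.txt
  result=$(git rev-list --count HEAD -- blob.txt)
  git log --format='%h %s'
fi
echo "result: $result"
```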

4. Using BFG Repo-Cleaner

Another great option is BFG Repo-Cleaner, a third-party tool written in Scala that runs on the Java Virtual Machine.

It is faster than the git filter-branch approach. Additionally, it is good for removing large files, passwords, credentials, and other private data.

Let’s assume we want to remove blobs larger than 200 MB. This tool makes it easy to do this:

$ java -jar bfg.jar --strip-blobs-bigger-than 200M my-repo.git

Then, let’s move into the repository, expire the reflog, and clean out the dead data:

$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
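BFG is usually run against a bare mirror clone rather than a working copy. The function below sketches that workflow; bfg_strip_large is a hypothetical helper name, and the jar path, repository URL, and size limit are all supplied by the caller.

```shell
#!/bin/sh
# bfg_strip_large JAR REPO_URL SIZE
# Mirror-clone REPO_URL, strip blobs larger than SIZE, then prune dead data.
bfg_strip_large() {
  jar=$1
  url=$2
  size=$3

  # Name the local mirror after the last path component of the URL
  mirror=$(basename "$url")
  case $mirror in
    *.git) ;;
    *) mirror="$mirror.git" ;;
  esac

  git clone --mirror "$url" "$mirror" &&
  java -jar "$jar" --strip-blobs-bigger-than "$size" "$mirror" &&
  (
    cd "$mirror" &&
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
  )
}

# Example call (placeholder arguments):
# bfg_strip_large ./bfg.jar /path/to/my-repo.git 200M
```

The function stops short of pushing, so the rewritten mirror can be inspected before anything is published.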

5. Using git-rebase

We need the SHA1 key from the git log to use this approach:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 535f7ea        (HEAD -> master) blob file removed
* 8bffdfa        blob file
* 5bac30b        index.html

Our aim is to remove the blob file from our commit history. So we’ll use the hash of the commit that precedes the one we want to remove.

With this command, we enter into an interactive rebase:

$ git rebase -i 5bac30b

This opens our configured editor (nano, in our case) showing the commits from oldest to newest:

pick 8bffdfa blob file
pick 535f7ea blob file removed

# Rebase 5bac30b..535f7ea onto 5bac30b (2 commands)
#
# Commands:
# p, pick <commit> = use commit
# r, reword <commit> = use commit, but edit the commit message
# e, edit <commit> = use commit, but stop for amending
# s, squash <commit> = use commit, but meld into previous commit
# f, fixup <commit> = like "squash", but discard this commit's log message
# x, exec <command> = run command (the rest of the line) using shell
# b, break = stop here (continue rebase later with 'git rebase --continue')
# d, drop <commit> = remove commit
# l, label <label> = label current HEAD with a name
# t, reset <label> = reset HEAD to a label
# m, merge [-C <commit> | -c <commit>] <label> [# <oneline>]
# .  create a merge commit using the original merge commit's
# .  message (or the oneline, if no original merge commit was
# .  specified). Use -c <commit> to reword the commit message.

Now, we’ll modify this by deleting the line “pick 8bffdfa blob file”, which drops the commit that added the file. The remaining commit, which only deletes that file, then has nothing left to change.

We then save the file and quit the editor. Because the remaining commit is now empty, the rebase stops and leaves us at the terminal with a status like the following:

interactive rebase in progress; onto 5bac30b
Last command done (1 command done):
pick 535f7ea blob file removed
No commands remaining.
You are currently rebasing branch 'master' on '5bac30b'.
(all conflicts fixed: run "git rebase --continue")

Finally, let’s continue the rebase operation:

$ git rebase --continue
Successfully rebased and updated refs/heads/master.

We can then verify our commit history:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 5bac30b        (HEAD -> master) index.html

We should note that this approach is not as fast as git-filter-repo.
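When the unwanted commits are contiguous, the same result can be sketched without the editor by using git rebase --onto. The history below is an illustrative assumption with one extra commit after the removal, so that there is something to replay.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "<h1>demo</h1>" > index.html
git add index.html && git commit -q -m "index.html"
base=$(git rev-parse HEAD)               # last commit we want to keep as-is

echo "large payload" > blob.txt
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob file removed"
last_unwanted=$(git rev-parse HEAD)      # last commit we want to drop

echo "body { margin: 0 }" > style.css
git add style.css && git commit -q -m "style.css"

# Replay everything after $last_unwanted onto $base,
# which leaves out the two blob commits
git rebase -q --onto "$base" "$last_unwanted"

subjects=$(git log --format=%s | tr '\n' ',')
touching=$(git rev-list --count HEAD -- blob.txt)
echo "history: $subjects"
echo "commits on HEAD still touching blob.txt: $touching"
```

As with the other approaches, the dropped commits stay in the object database until the reflog is expired and git gc --prune=now is run.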

6. Conclusion

In this article, we learned different approaches to removing large files from the commit history of a git repository. We also saw that the Git documentation recommends git filter-repo because it’s fast and has fewer pitfalls than the other approaches.
