Ubuntu Manpage: git-filter-repo - Rewrite repository history


Provided by: git-filter-repo_2.45.0-1_all

NAME

       git-filter-repo - Rewrite repository history

SYNOPSIS

       git filter-repo --analyze
       git filter-repo [<path_filtering_options>] [<content_filtering_options>]
               [<ref_renaming_options>] [<commit_message_filtering_options>]
               [<name_or_email_filtering_options>] [<parent_rewriting_options>]
               [<generic_callback_options>] [<miscellaneous_options>]

DESCRIPTION

       Rapidly rewrite entire repository history using user-specified filters. This is a destructive operation
       which should not be used lightly; it writes new commits, trees, tags, and blobs corresponding to (but
       filtered from) the original objects in the repository, then deletes the original history and leaves only
       the new. See the section called “DISCUSSION” for more details on the ramifications of using this tool.
       Several different types of history rewrites are possible; examples include (but are not limited to):

       •   stripping large files (or large directories or large extensions)

       •   stripping unwanted files by path

       •   extracting wanted paths and their history (stripping everything else)

       •   restructuring the file layout (such as moving all files into a subdirectory in preparation for
           merging with another repo, making a subdirectory become the new toplevel directory, or merging two
           directories with independent filenames into one directory)

       •   renaming tags (also often in preparation for merging with another repo)

       •   replacing or removing sensitive text such as passwords

       •   making mailmap rewriting of user names or emails permanent

       •   making grafts or replacement refs permanent

       •   rewriting commit messages

       Additionally, several concerns are handled automatically (many of these can be overridden, but they are
       all on by default):

       •   rewriting (possibly abbreviated) hashes in commit messages to refer to the new post-rewrite commit
           hashes

       •   pruning commits which become empty due to the above filters (also handles edge cases like pruning of
           merge commits which become degenerate and empty)

       •   stripping of original history to avoid mixing old and new history

       •   repacking the repository post-rewrite to shrink the repo for the user

       And additional facilities are available via a config option:

       •   creating replace-refs (see git-replace(1)) for old commit hashes, which if manually pushed and
           fetched will allow users to continue to refer to new commits using (unabbreviated) old commit IDs

       Also, it’s worth noting that there is an important safety mechanism:

       •   abort if run from a repo that is not a fresh clone (to prevent accidental data loss from rewriting
           local history that doesn’t exist anywhere else). See the section called “FRESH CLONE SAFETY CHECK AND
           --FORCE”.

       For those who know that there is large unwanted stuff in their history and want help finding it, this
       command also

       •   provides an option to analyze a repository and generate reports that can be useful in determining
           what to filter (or in determining whether a separate filtering command was successful).

       See also the section called “VERSATILITY”, the section called “DISCUSSION”, the section called
       “EXAMPLES”, and the section called “INTERNALS”.

OPTIONS

Analysis Options
--analyze
Analyze repository history and create a report that may be useful in determining what to filter in a
subsequent run (or in determining if a previous filtering command did what you wanted). Will not
modify your repo.

Filtering based on paths (see also --filename-callback)
These options specify the paths to select. Note that much like git itself, renames are NOT followed so
you may need to specify multiple paths, e.g. --path olddir/ --path newdir/

--invert-paths
Invert the selection of files from the specified --path-{match,glob,regex} options below, i.e. only
select files matching none of those options.

--path-match <dir_or_file>, --path <dir_or_file>
Exact paths (files or directories) to include in filtered history. Multiple --path options can be
specified to get a union of paths.

--path-glob <glob>
Glob of paths to include in filtered history. Multiple --path-glob options can be specified to get a
union of paths.

--path-regex <regex>
Regex of paths to include in filtered history. Multiple --path-regex options can be specified to get
a union of paths.

--use-base-name
Match on file base name instead of full path from the top of the repo. Incompatible with
--path-rename, and incompatible with matching against directory names.

Renaming based on paths (see also --filename-callback)
Note: if you combine path filtering with path renaming, be aware that a rename directive does not select
paths, it only says how to rename paths that are selected with the filters.

--path-rename <old_name:new_name>, --path-rename-match <old_name:new_name>
Path to rename; if filename or directory matches <old_name> rename to <new_name>. Multiple
--path-rename options can be specified.

Path shortcuts
--paths-from-file <filename>
Specify several path filtering and renaming directives, one per line. Lines with ==> in them specify
path renames, and lines can begin with literal: (the default), glob:, or regex: to specify different
matching styles. Blank lines and lines starting with a # are ignored (if you have a filename that you
want to filter on that starts with literal:, #, glob:, or regex:, then prefix the line with
literal:).

--subdirectory-filter <directory>
Only look at history that touches the given subdirectory and treat that directory as the project
root. Equivalent to using --path <directory>/ --path-rename <directory>/:

--to-subdirectory-filter <directory>
Treat the project root as instead being under <directory>. Equivalent to using --path-rename
:<directory>/

Content editing filters (see also --blob-callback)
--replace-text <expressions_file>
A file with expressions that, if found, will be replaced. By default, each expression is treated as
literal text, but regex: and glob: prefixes are supported. You can end a line with ==> and some
replacement text to choose a replacement other than the default of ***REMOVED***.
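
As a minimal sketch of the expressions-file syntax described above, the following Python helper renders a few rules into that format. The helper name build_expressions, the rule tuples, and the sample secrets are all hypothetical and exist only for illustration; they are not part of git-filter-repo.

```python
# Illustrative helper (not part of git-filter-repo): renders rules in the
# expressions-file syntax described above.
def build_expressions(rules):
    lines = []
    for kind, pattern, replacement in rules:
        # Literal matching is the default, so it needs no prefix.
        prefix = "" if kind == "literal" else kind + ":"
        line = prefix + pattern
        # Without "==>", the default replacement ***REMOVED*** applies.
        if replacement is not None:
            line += "==>" + replacement
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical sample rules; the secrets and patterns are made up.
rules = [
    ("literal", "p455w0rd", None),
    ("literal", "old-hostname.example.com", "new-hostname.example.com"),
    ("regex", r"api_key=[0-9a-f]+", "api_key=REDACTED"),
    ("glob", "token-*", None),
]
expressions_text = build_expressions(rules)
print(expressions_text, end="")
```

The resulting text could be saved to a file (say, expressions.txt) and passed as --replace-text expressions.txt.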

--strip-blobs-bigger-than <size>
Strip blobs (files) bigger than the specified size (e.g. 5M, 2G, etc.).

--strip-blobs-with-ids <blob_id_filename>
Read git object ids from each line of the given file, and strip all of them from history.

Renaming of refs (see also --refname-callback)
--tag-rename <old:new>
Rename tags starting with <old> to start with <new>. For example, --tag-rename foo:bar will rename
tag foo-1.2.3 to bar-1.2.3; either <old> or <new> can be empty.
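
The prefix semantics described above can be illustrated with a small Python sketch. The function rename_tag is a hypothetical stand-in written for this example, not the tool's own code; it only mirrors the documented behavior that tags starting with <old> are changed to start with <new>, and that either side may be empty.

```python
# Hypothetical model of --tag-rename <old>:<new>; not git-filter-repo's code.
def rename_tag(tag, old, new):
    # Only tags that start with the old prefix are renamed.
    if tag.startswith(old):
        return new + tag[len(old):]
    return tag


print(rename_tag("foo-1.2.3", "foo", "bar"))    # the example from the text
print(rename_tag("v2.0", "", "proj-"))          # empty <old>: add a prefix
print(rename_tag("foo-1.2.3", "foo-", ""))      # empty <new>: strip a prefix
print(rename_tag("other-1.0", "foo", "bar"))    # no match: left alone
```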

Filtering of commit messages (see also --message-callback)
--replace-message <expressions_file>
A file with expressions that, if found in commit or tag messages, will be replaced. This file uses
the same syntax as --replace-text.

--preserve-commit-hashes
By default, since commits are rewritten and thus gain new hashes, references to old commit hashes in
commit messages are replaced with new commit hashes (abbreviated to the same length as the old
reference). Use this flag to turn off updating commit hashes in commit messages.

--preserve-commit-encoding
Do not reencode commit messages into UTF-8. By default, if the commit object specifies an encoding
for the commit message, the message is re-encoded into UTF-8.

Filtering of names & emails (see also --name-callback and --email-callback)
--mailmap <filename>
Use specified mailmap file (see git-shortlog(1) for details on the format) when rewriting author,
committer, and tagger names and emails. If the specified file is part of git history, historical
versions of the file will be ignored; only the current contents are consulted.

--use-mailmap
Same as: --mailmap .mailmap

Parent rewriting
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add, old-default}
How to handle replace refs (see git-replace(1)). Replace refs can be added during the history rewrite
as a way to allow users to pass old commit IDs (from before git-filter-repo was run) to git commands
and have git know how to translate those old commit IDs to the new (post-rewrite) commit IDs. Also,
replace refs that existed before the rewrite can either be deleted or updated. The choices to pass to
--replace-refs thus need to specify both what to do with existing refs and what to do with commit
rewrites. Thus update-and-add means to update existing replace refs, and for any commit rewrite (even
if already pointed at by a replace ref) add a new refs/replace/ reference to map from the old commit
ID to the new commit ID. The default is update-no-add, meaning update existing replace refs but do
not add any new ones. There is also a special old-default option for picking the default used in
versions prior to git-filter-repo-2.45, namely update-and-add upon the first run of git-filter-repo
in a repository and update-or-add if running git-filter-repo again on a repository.

--prune-empty {always, auto, never}
Whether to prune empty commits. auto (the default) means only prune commits which become empty (not
commits which were empty in the original repo, unless their parent was pruned). When the parent of a
commit is pruned, the first non-pruned ancestor becomes the new parent.

--prune-degenerate {always, auto, never}
Since merge commits are needed for history topology, they are typically exempt from pruning. However,
they can become degenerate with the pruning of other commits (having fewer than two parents, having
one commit serve as both parents, or having one parent as the ancestor of the other.) If such merge
commits have no file changes, they can be pruned. The default (auto) is to only prune empty merge
commits which become degenerate (not which started as such).

--no-ff
Even if the first parent is or becomes an ancestor of another parent, do not prune it. This modifies
how --prune-degenerate behaves, and may be useful in projects that always use merge --no-ff.

Generic callback code snippets
--filename-callback <function_body>
Python code body for processing filenames; see the section called “CALLBACKS”.

--message-callback <function_body>
Python code body for processing messages (both commit messages and tag messages); see the section
called “CALLBACKS”.

--name-callback <function_body>
Python code body for processing names of people; see the section called “CALLBACKS”.

--email-callback <function_body>
Python code body for processing email addresses; see the section called “CALLBACKS”.

--refname-callback <function_body>
Python code body for processing refnames; see the section called “CALLBACKS”.

--blob-callback <function_body>
Python code body for processing blob objects; see the section called “CALLBACKS”.

--commit-callback <function_body>
Python code body for processing commit objects; see the section called “CALLBACKS”.

--tag-callback <function_body>
Python code body for processing tag objects; see the section called “CALLBACKS”.

--reset-callback <function_body>
Python code body for processing reset objects; see the section called “CALLBACKS”.
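
To give a feel for what a <function_body> looks like, here is a sketch of a filename callback written out as a complete Python function. It assumes, as described in the section called “CALLBACKS”, that filenames arrive as Python bytestrings and that returning None removes the file. The def line is added here only so the snippet runs on its own; on the command line, only the indented body would be passed to --filename-callback. The directory names are made up.

```python
# Hypothetical wrapper: with --filename-callback, only the function body
# would be given on the command line.
def filename_callback(filename):
    if b"/src/" in filename:
        # Returning None drops the file from history.
        return None
    if filename.startswith(b"tools/"):
        # Rename tools/ to scripts/misc/.
        return b"scripts/misc/" + filename[len(b"tools/"):]
    # Keep every other path unchanged.
    return filename


for name in (b"lib/src/a.c", b"tools/run.sh", b"README.md"):
    print(name, "->", filename_callback(name))
```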

Location to filter from/to
Note
Specifying alternate source or target locations implies --partial. However, unlike normal uses of
--partial, this doesn’t risk mixing old and new history since the old and new histories are in
different repositories.

--source <source>
Git repository to read from

--target <target>
Git repository to overwrite with filtered history

Miscellaneous options
--help, -h
Show a help message and exit.

--force, -f
Ignore fresh clone checks and rewrite history (an irreversible operation, especially since it by
default ends with an immediate pruning of reflogs and old objects). See the section called “FRESH
CLONE SAFETY CHECK AND --FORCE”. Note that when cloning repos on a local filesystem, it is better to
pass --no-local to git clone than passing --force to git-filter-repo.

--partial
Do a partial history rewrite, resulting in a mixture of old and new history. This disables
rewriting refs/remotes/origin/* to refs/heads/*, disables removing of the origin remote, disables
removing unexported refs, disables expiring the reflog, and disables the automatic post-filter gc.
Also, this modifies --tag-rename and --refname-callback options such that instead of replacing old
refs with new refnames, it will instead create new refs and keep the old ones around. Use with
caution.

--refs <refs+>
Limit history rewriting to the specified refs. Implies --partial. In addition to the normal caveats
of --partial (mixing old and new history, no automatic remapping of refs/remotes/origin/* to
refs/heads/*, etc.), this also may cause problems for pruning of degenerate empty merge commits when
negative revisions are specified.

--dry-run
Do not change the repository. Run git fast-export and filter its output, and save both the original
and the filtered version for comparison. This also disables rewriting commit messages due to not
knowing new commit IDs and disables filtering of some empty commits due to inability to query the
fast-import backend.

--debug
Print additional information about operations being performed and commands being run. (If used
together with --dry-run, shows extra information about what would be run).

--stdin
Instead of running git fast-export and filtering its output, filter the fast-export stream from
stdin. The stdin must be in the expected input format (e.g. it needs to include original-oid
directives).

--quiet
Pass --quiet to other git commands called.

OUTPUT

       Every time filter-repo is run, files are created in the .git/filter-repo/ directory. These files are
       overwritten unconditionally on every run.

   Commit map
       The .git/filter-repo/commit-map file contains a mapping of how all commits were (or were not) changed.

       •   The first line is a header with the text "old" and "new"

       •   Commit mappings are in no particular order

       •   All commits in range of the rewrite will be listed, even commits that are unchanged (e.g. because the
           commit pre-dated when the large file(s) were introduced to the repo).

       •   An all-zeros hash, or null SHA, represents a non-existent object. When in the "new" column, this
           means the commit was removed entirely.

   Reference map
       The .git/filter-repo/ref-map file contains a mapping of which local references were changed.

       •   The first line is a header with the text "old", "new" and "ref"

       •   Reference mappings are in no particular order

       •   An all-zeros hash, or null SHA, represents a non-existent object. When in the "new" column, this
           means the ref was removed entirely.
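
       The commit map can be post-processed with a few lines of Python. The sketch below assumes each
       line after the header holds two whitespace-separated hashes (old, then new); that layout is an
       assumption made for this example, as are the helper names and the made-up hashes.

```python
def parse_commit_map(text):
    """Return {old_hash: new_hash}, skipping the header line."""
    mapping = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) == 2:
            old, new = fields
            mapping[old] = new
    return mapping


def is_null(hash_hex):
    # An all-zeros hash stands for a non-existent object.
    return set(hash_hex) == {"0"}


def classify(mapping):
    result = {"removed": [], "rewritten": [], "unchanged": []}
    for old, new in mapping.items():
        if is_null(new):
            result["removed"].append(old)
        elif old == new:
            result["unchanged"].append(old)
        else:
            result["rewritten"].append(old)
    return result


# Made-up hashes for illustration only.
sample = "\n".join([
    "old" + " " * 38 + "new",
    "a" * 40 + " " + "b" * 40,
    "c" * 40 + " " + "c" * 40,
    "d" * 40 + " " + "0" * 40,
]) + "\n"

summary = classify(parse_commit_map(sample))
print({kind: len(hashes) for kind, hashes in summary.items()})
```

       On a real repository, the text would come from reading .git/filter-repo/commit-map after a run.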

FRESH CLONE SAFETY CHECK AND --FORCE

       Since filter-repo does irreversible rewriting of history, it is important to avoid making changes to a
       repo for which the user doesn’t have a good backup. The primary defense mechanism is to simply educate
       users and rely on them to be good stewards of their data; thus there are several warnings in the
       documentation about how filter-repo rewrites history.

       However, as a service to users, we would like to provide an additional safety check beyond the
       documentation. There isn’t a good way to check if the user has a good backup, but we can ask a related
       question that is an imperfect but quite reasonable proxy: "Is this repository a fresh clone?"
       Unfortunately, that is also a question we can’t get a perfect answer to; git provides no way to answer
       that question. However, there are approximately a dozen things that I found that seem to always be true
       of brand new clones (assuming they are either clones of remote repositories or are made with the
       --no-local flag), and I check for all of those.

       These checks can have both false positives and false negatives. Someone might have a perfectly good
       backup of their repo without it actually being a fresh clone — but there’s no way for filter-repo to know
       that. Conversely, someone could look at all things that filter-repo checks for in its safety checks and
       then just tweak their non-backed-up repository to satisfy those conditions (though it would take a fair
       amount of effort, and it’s astronomically unlikely that a repo that isn’t a fresh clone randomly happens
       to match all the criteria). In practice, the safety checks filter-repo uses seem to be really good at
       avoiding people accidentally running filter-repo on a repository that they shouldn’t be running it on. It
       even caught me once when I did mean to run filter-repo but was in a different directory than I thought I
       was.

       In short, it’s perfectly fine to use --force to override the safety checks as long as you’re okay with
       filter-repo irreversibly rewriting the contents of the current repository. It is a really bad idea to get
       in the habit of always specifying --force; if you do, one day you will run one of your commands in the
       wrong directory like I did, and you won’t have the safety check anymore to bail you out. Also, it is
       definitely NOT okay to recommend --force on forums, Q&A sites, or in emails to other users without first
       carefully explaining that --force means putting your repositories’ data at risk. I am especially bothered
       by people who suggest the flag when it clearly is NOT needed; they are needlessly putting other people’s
       data at risk.

VERSATILITY

       filter-repo has a hierarchy of capabilities on the spectrum from easy to use convenience flags that
       perform pre-defined types of filtering, to choices that provide lots of flexibility in controlling how
       filtering occurs. This spectrum includes the following:

       •   Convenience flags making common types of history rewriting simple (e.g. --path,
           --strip-blobs-bigger-than, --replace-text, --mailmap)

       •   Options which are shorthand for others or which provide greater control than others (e.g.
           --subdirectory-filter could just be written using both a path selection (--path) and a path rename
           (--path-rename) filter; --paths-from-file can handle all other --path* options and more such as regex
           renaming of paths)

       •   Generic python callbacks for handling a certain type of data (the filename, message, name, email, and
           refname callbacks)

       •   Generic python callbacks for handling fundamental git objects, allowing greater control over the
           combination of data types the object holds (the commit, tag, blob, and reset callbacks)

       •   The ability to import filter-repo as a module in a python program and use its classes and functions
           for even greater control and flexibility while still leveraging lots of basic capabilities. One can
           even use this to write new tools with a completely different interface.

       For more information about callbacks, see the section called “CALLBACKS”. For examples on writing python
       programs that import filter-repo as a module to create new history rewriting tools, look at the
       contrib/filter-repo-demos/ directory. That directory includes, among other examples, a reimplementation
       of git-filter-branch which is faster than git-filter-branch, and a reimplementation of BFG Repo Cleaner
       with several bug fixes and new features.

DISCUSSION

Using filter-repo is relatively simple, but rewriting history is part of a larger discussion in terms of
collaboration. When you rewrite history, the old and new histories are no longer compatible; if you push
this history somewhere for others to view, it will look as though you’ve done a rebase of all branches
and tags. Make sure you are familiar with the "RECOVERING FROM UPSTREAM REBASE" section of git-rebase(1)
(and in particular, "The hard case") before proceeding, in addition to this section.

Steps to use git-filter-repo as part of the bigger picture of doing a history rewrite are roughly as
follows:

1. Create a clone of your repository (if you created special refs outside of refs/heads/ or refs/tags/,
make sure to fetch those too). You may pass --bare or --mirror to git clone, if you prefer. You
should pass --no-local if the repository you are cloning from is on the local filesystem. Avoid other
flags; some might confuse the fresh clone check, and others could cause parts of the data to be
missing that are needed for the rewrite.

2. (Optional) Run git filter-repo --analyze. This will create a directory of reports mentioning renames
that have occurred in your repo and also listing sizes of objects aggregated by
path/directory/extension/blob-id; this information may be useful in choosing how to filter your repo.
It can also be useful to re-run --analyze after filtering to verify the changes look correct.

3. Run filter-repo with your desired filtering options. Many examples are given below. For more complex
cases, note that doing the filtering in multiple steps (by running multiple filter-repo invocations
in a sequence) is supported. If anything goes wrong here, simply delete your clone and restart.

4. Push your new repository to its new home (note that refs/remotes/origin/* will have been moved to
refs/heads/* as the first part of filter-repo, so you can just deal with normal branches instead of
remote tracking branches). While you can force push this to the same URL you cloned from, there are
good reasons to consider pushing to a different location instead:

• People who cloned from the original repo will have old history. When they fetch the new history
you force pushed up, unless they do a git reset --hard @{u} on their branches or rebase their
local work, git will think they have hundreds or thousands of commits with very similar commit
messages as what exist upstream (but which include files you wanted excised from history), and
allow the user to merge the two histories, resulting in what looks like two copies of each
commit. If they then push this history back up, then everyone now has history with two copies of
each commit and the bad files have returned. You’re more likely to succeed in forcing people to
get rid of the old history if they have to clone a new URL.

• Rewriting history will rewrite tags; those who have already downloaded tags will not get the
updated tags by default (see the "On Re-tagging" section of git-tag(1)). Every user trying to use
an existing clone will have to forcibly delete all tags and re-fetch them; it may be easier for
them to just re-clone, which they are more likely to do with a new clone URL.

• Rewriting history may delete some refs (e.g. branches that only had files that you wanted excised
from history); unless you run git push with the --mirror or --prune options, those refs will
continue to exist on the server. If folks then merge these branches into others, then people have
started mixing old and new history. If users had already cloned these branches, removing them
from the server isn’t enough; you need all users to delete any local branches based on these refs
and run fetch with the --prune option as well. Simply re-cloning from a new URL is easier.

• The server may not allow you to force push over some refs. For example, code review systems may
have special ref namespaces (e.g. refs/changes/, refs/pull/, refs/merge-requests/) that they have
locked down.

5. If you still want to push your rewritten history back to the original URL despite my warnings above,
you’ll have to manage it very carefully:

• git-filter-repo deletes the "origin" remote to help avoid people accidentally repushing to the
same repository, so you’ll need to remind git what origin’s URL was. You’ll have to look up the
command for that.

• You’ll need to carefully synchronize with everyone who has cloned the repository, and will also
need to carefully synchronize with everything (e.g. CI systems) that has cloned it. Every single
clone will either need to be thrown away and re-cloned, or need to take all the steps outlined in
item 4 as well as follow the necessary steps from "RECOVERING FROM UPSTREAM REBASE" section of
git-rebase(1). If you miss fixing any clones, you’ll risk mixing old and new history and end up
with an even worse mess to clean up.

• Finally, you’ll need to consult any documentation from your hosting provider about how to remove
any server-side references to the old commits (example: GitLab’s excellent docs on reducing
repository size[1], or the first and second steps under "Fully removing the data from
GitHub"[2]).

6. (Optional) Some additional considerations

• filter-repo has a --replace-refs option to allow creating replace refs (see git-replace(1)) for
each rewritten commit ID, allowing you to use old (unabbreviated) commit hashes in the git
command line to refer to the newly rewritten commits. If you want to use these replace refs,
manually push them to the relevant clone URL and tell users to manually fetch them (e.g. by
adjusting their fetch refspec, git config --add remote.origin.fetch
+refs/replace/*:refs/replace/*). Sadly, replace refs are not yet widely understood; projects like
jgit and libgit2 do not support them and existing repository managers (e.g. Gerrit, GitHub,
GitLab) do not yet understand replace refs. Thus one can’t use old commit hashes within the UI of
these other systems. This may change in the future, but replace refs at least help users locally
within the git command line interface. Also, be aware that commit-graphs are excessively cautious
around replace refs and just turn off entirely if any are present, so after enough time has
passed that old commit IDs become less relevant, users may want to locally delete the replace
refs to regain the speedups from commit-graphs.

• If you have a central repo, you may want to prevent people from pushing old commit IDs, in order
to avoid mixing old and new history. Every repository manager does this differently; some provide
specialized commands (e.g.
https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html), others require you to
write hooks.

EXAMPLES

   Path based filtering
       To only keep the README.md file plus the directories guides and tools/releases/:

           git filter-repo --path README.md --path guides/ --path tools/releases

       Directory names can be given with or without a trailing slash, and all filenames are relative to the
       toplevel of the repo. To keep all files except these paths, just add --invert-paths:

           git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths

       If you want to have both an inclusion filter and an exclusion filter, just run filter-repo multiple
       times. For example, to keep the src/main subdirectory but exclude files under src/main named data, run:

           git filter-repo --path src/main/
           git filter-repo --path-glob 'src/*/data' --invert-paths

       Note that the asterisk (*) will match across multiple directories, so the second command would remove
       e.g. src/main/org/whatever/data. Also, the second command by itself would also remove e.g.
       src/not-main/foo/data, but since src/not-main/ was removed by the first command, that’s not an issue.
       Also, the use of quotes around the asterisk is sometimes important to avoid glob expansion by the shell.
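
       The point about the asterisk crossing directory boundaries can be checked with Python's fnmatch
       module, whose * also matches /. This sketch assumes --path-glob behaves like fnmatch-style
       matching, which is consistent with the note above; the sample paths are made up.

```python
from fnmatch import fnmatchcase

pattern = "src/*/data"

# Made-up sample paths.
paths = [
    "src/main/data",
    "src/main/org/whatever/data",   # '*' spans several directories
    "src/not-main/foo/data",        # also matched by the glob alone
    "src/main/data.txt",            # different basename: not matched
    "src/data",                     # no intermediate directory: not matched
]

matches = {path: fnmatchcase(path, pattern) for path in paths}
for path, matched in matches.items():
    print(f"{path}: {matched}")
```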

       You can also select paths by regular expression (see
       https://docs.python.org/3/library/re.html#regular-expression-syntax). For example, to only include files
       from the repo whose name is in the format YYYY-MM-DD.txt and is found at least two subdirectories deep:

           git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$'
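
       Since the pattern uses Python regular-expression syntax, it can be tried out with the re module
       before rewriting anything. In this sketch the dot before txt is escaped so that it only matches a
       literal period, and plain strings are used for readability; the sample paths are made up.

```python
import re

# Dated files at least two subdirectories deep.
pattern = re.compile(r"^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$")

# Made-up sample paths.
paths = [
    "logs/2019/2019-03-07.txt",    # two directories deep: selected
    "a/b/c/2020-12-31.txt",        # deeper still: selected
    "logs/2019-03-07.txt",         # only one directory deep
    "2019-03-07.txt",              # top level
    "logs/2019/2019-03-07xtxt",    # no literal '.' before txt
    "logs/2019/2019-3-7.txt",      # month and day are not two digits
]

selected = [path for path in paths if pattern.search(path)]
print(selected)
```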

       If you want two directories to be renamed (and maybe merged if both are renamed to the same location),
       use --path-rename; for example, to rename both cmds/ and src/scripts/ to tools/:

           git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/

       As with --path, directories can be specified with or without a trailing slash for --path-rename.

       If you do a --path-rename to something that was already in use, it will be silently overwritten. However,
       if you try to rename multiple files to the same location (e.g. src/scripts/run_release.sh and
       cmds/run_release.sh both existed and had different content with the renames above), then you will be
       given an error. If you have such a case, you may want to add another rename command to move one of the
       paths somewhere else where it won’t collide:

           git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
                           --path-rename cmds/:tools/ \
                           --path-rename src/scripts/:tools/

       Also, --path-rename brings up ordering issues; all path arguments are applied in order. Thus, a command
       like

           git filter-repo --path-rename sources/:src/main/ --path src/main/

       would make sense but reversing the two arguments would not (src/main/ is created by the rename so
       reversing the two would give you an empty repo). Also, note that the rename of cmds/run_release.sh a
       couple examples ago was done before the other renames.

       Note that path renaming does not do path filtering, thus the following command

           git filter-repo --path src/main/ --path-rename tools/:scripts/

       would not result in the tools or scripts directories being present, because the single filter selected
       only src/main/. It’s likely that you would instead want to run:

           git filter-repo --path src/main/ --path tools/ --path-rename tools/:scripts/

       If you prefer to filter based solely on basename, use the --use-base-name flag (though this is
       incompatible with --path-rename). For example, to only include README.md and Makefile files from any
       directory:

           git filter-repo --use-base-name --path README.md --path Makefile

       If you wanted to delete all .DS_Store files in any directory, you could either use:

           git filter-repo --invert-paths --path '.DS_Store' --use-base-name

       or

           git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'

       (the --path-glob isn’t sufficient by itself as it might miss a toplevel .DS_Store file; further, while
       something like --path-glob '*.DS_Store' would work around that problem, it would also grab files named
       foo.DS_Store or bar/baz.DS_Store)
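
       The reasoning in the parenthetical can be checked with a short sketch, again assuming fnmatch-style
       glob semantics and modeling --path '.DS_Store' as an exact match on the toplevel file. Both are
       simplifications made for this example, and the paths are made up.

```python
from fnmatch import fnmatchcase


def selected_by_both(path):
    # Union of --path-glob '*/.DS_Store' and --path '.DS_Store'.
    return fnmatchcase(path, "*/.DS_Store") or path == ".DS_Store"


paths = [".DS_Store", "docs/.DS_Store", "a/b/.DS_Store",
         "foo.DS_Store", "bar/baz.DS_Store"]

glob_only = [p for p in paths if fnmatchcase(p, "*/.DS_Store")]
too_broad = [p for p in paths if fnmatchcase(p, "*.DS_Store")]
combined = [p for p in paths if selected_by_both(p)]

print("glob only:", glob_only)    # misses the toplevel .DS_Store
print("too broad:", too_broad)    # also grabs foo.DS_Store, bar/baz.DS_Store
print("combined: ", combined)
```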

       Finally, see also the --filename-callback from the section called “CALLBACKS”.

   Filtering based on many paths
       If you have a long list of files, directories, globs, or regular expressions to filter on, you can stick
       them in a file and use --paths-from-file; for example, with a file named stuff-i-want.txt with contents
       of

           # Blank lines and comment lines are ignored.
           # Examples similar to --path:
           README.md
           guides/
           tools/releases

           # An example that is like --path-glob:
           glob:*.py

           # An example that is like --path-regex:
           regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$

           # An example of renaming a path
           tools/==>scripts/

           # An example of using a regex to rename a path
           regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt

       then you could run

           git filter-repo --paths-from-file stuff-i-want.txt

       to get a repo containing only the toplevel README.md file, the guides/ and tools/releases/ directories,
       all Python files, and files named like YYYY-MM-DD.txt that are at least two subdirectories deep. The
       same run would also rename tools/ to scripts/ and rename files like foo/bar/baz.text to
       bar/foo/baz.txt. Note the special line prefixes of glob: and regex: and the special string ==>
       denoting renames.
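
       As a rough illustration of that file format, the following sketch splits each line into a kind, a
       match, and an optional rename target, then applies the regex rename to one path. The function
       parse_paths_file is hypothetical and is not filter-repo’s own parser; it only covers the line forms
       shown above.

```python
import re

def parse_paths_file(text):
    """Split each meaningful line into (kind, match, rename_or_None)."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(b"#"):
            continue  # blank lines and comment lines are ignored
        match, arrow, rename = line.partition(b"==>")
        if match.startswith(b"glob:"):
            kind, match = "glob", match[len(b"glob:"):]
        elif match.startswith(b"regex:"):
            kind, match = "regex", match[len(b"regex:"):]
        else:
            kind = "literal"
        rules.append((kind, match, rename if arrow else None))
    return rules

EXAMPLE = rb"""
# Blank lines and comment lines are ignored.
README.md
guides/
glob:*.py
tools/==>scripts/
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
"""

rules = parse_paths_file(EXAMPLE)
kind, pattern, rename = rules[-1]
# Apply the regex rename rule to a sample path.
renamed = re.sub(pattern, rename, b"foo/bar/baz.text")
```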

       Sometimes you have a way of easily generating all the files you want. For example, suppose you know
       that none of the currently tracked files have any newlines or special characters in their names (see
       core.quotePath from git config --help), so that git ls-files prints every file literally, one per
       line. If you also want to keep only the files that are currently tracked (thus deleting from all
       commits in history any files that only appear on other branches or that only appear in older
       commits), then you could use a pair of commands such as

           git ls-files >../paths-i-want.txt
           git filter-repo --paths-from-file ../paths-i-want.txt

       Similarly, you could use --paths-from-file to delete many files. For example, you could run git
       filter-repo --analyze to get reports, look in one such as
       .git/filter-repo/analysis/path-deleted-sizes.txt and copy all the filenames into a file such as
       /tmp/files-i-dont-want-anymore.txt and then run

           git filter-repo --invert-paths --paths-from-file /tmp/files-i-dont-want-anymore.txt

       to delete them all.

   Directory based shortcuts
       Let’s say you had a directory structure like the following:

           module/
              foo.c
              bar.c
           otherDir/
              blah.config
              stuff.txt
           zebra.jpg

       If you wanted just the module/ directory and you wanted it to become the new root so that your new
       directory structure looked like

           foo.c
           bar.c

       then you could run:

           git filter-repo --subdirectory-filter module/

       If you wanted all the files from the original repo, but wanted to move everything under a subdirectory
       named my-module/, so that your new directory structure looked like

           my-module/
              module/
                 foo.c
                 bar.c
              otherDir/
                 blah.config
                 stuff.txt
              zebra.jpg

       then you would instead run

           git filter-repo --to-subdirectory-filter my-module/
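
       The effect of the two shortcuts on individual paths can be modelled as below. This is only a sketch
       of the resulting path mapping, not of how filter-repo implements it; both function names are
       hypothetical, and None stands for a path that is dropped.

```python
def subdirectory_filter(path, directory):
    """Keep only paths under directory and strip that prefix; None means dropped."""
    prefix = directory.rstrip(b"/") + b"/"
    if path.startswith(prefix):
        return path[len(prefix):]
    return None

def to_subdirectory_filter(path, directory):
    """Move every path underneath directory."""
    prefix = directory.rstrip(b"/") + b"/"
    return prefix + path

# The example tree from above.
TREE = [
    b"module/foo.c",
    b"module/bar.c",
    b"otherDir/blah.config",
    b"otherDir/stuff.txt",
    b"zebra.jpg",
]

# Paths that survive --subdirectory-filter module/, with the prefix stripped.
only_module = [p for p in (subdirectory_filter(t, b"module/") for t in TREE)
               if p is not None]
# Every path relocated by --to-subdirectory-filter my-module/.
nested = [to_subdirectory_filter(t, b"my-module/") for t in TREE]
```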

   Content based filtering
       If you want to filter out all files bigger than a certain size, you can use --strip-blobs-bigger-than
       with some size (K, M, and G suffixes are recognized), e.g.:

           git filter-repo --strip-blobs-bigger-than 10M

       If you want to strip out all files with specified git object ids (hashes), list the hashes in a file and
       run

           git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS

       If you want to modify file contents, you can do so based on a list of expressions in a file, one per
       line. For example, with a file named expressions.txt containing

           p455w0rd
           foo==>bar
           glob:*666*==>
           regex:\bdriver\b==>pilot
           literal:MM/DD/YYYY==>YYYY-MM-DD
           regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2

       then running

           git filter-repo --replace-text expressions.txt

       will go through and replace p455w0rd with ***REMOVED***, foo with bar, any line containing 666 with a
       blank line, the word driver with pilot (but not if it has letters before or after; e.g. drivers will be
       unmodified), replace the exact text MM/DD/YYYY with YYYY-MM-DD and replace date strings of the form
       MM/DD/YYYY with ones of the form YYYY-MM-DD. In the expressions file, there are a few things to note:

       •   Every line has a replacement, given by whatever is on the right of ==>. If ==> does not appear on the
           line, the default replacement is ***REMOVED***.

       •   Lines can start with literal:, glob:, or regex: to specify whether to do literal string matches,
           globs (see https://docs.python.org/3/library/fnmatch.html), or regular expressions (see
           https://docs.python.org/3/library/re.html#regular-expression-syntax). If none of these are specified,
           literal: is assumed.

       •   If multiple matches are found, all are replaced.

       •   globs and regexes are applied to the entire file, but without any special flags turned on. Some folks
           may be interested in adding (?m) to the regex to turn on MULTILINE mode, so that ^ and $ match the
           beginning and ends of lines rather than the beginning and end of file. See
           https://docs.python.org/3/library/re.html for details.
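
       The rules above can be approximated with the following sketch, which handles literal and regex
       lines and applies the default replacement when ==> is absent. It is an illustration under stated
       assumptions rather than filter-repo’s code: parse_expressions and apply_rules are hypothetical
       names, and glob: lines are deliberately left out. The last two lines show the effect of (?m).

```python
import re

DEFAULT_REPLACEMENT = b"***REMOVED***"

def parse_expressions(lines):
    """Parse expression lines into (kind, pattern, replacement) rules."""
    rules = []
    for line in lines:
        line = line.rstrip(b"\r\n")
        if not line:
            continue
        pattern, arrow, replacement = line.partition(b"==>")
        if not arrow:
            replacement = DEFAULT_REPLACEMENT
        if pattern.startswith(b"regex:"):
            rules.append(("regex", re.compile(pattern[len(b"regex:"):]), replacement))
        elif pattern.startswith(b"glob:"):
            raise NotImplementedError("glob: rules are left out of this sketch")
        else:
            if pattern.startswith(b"literal:"):
                pattern = pattern[len(b"literal:"):]
            rules.append(("literal", pattern, replacement))
    return rules

def apply_rules(data, rules):
    """Apply every rule in order, replacing all matches."""
    for kind, pattern, replacement in rules:
        if kind == "regex":
            data = pattern.sub(replacement, data)
        else:
            data = data.replace(pattern, replacement)
    return data

rules = parse_expressions([
    b"p455w0rd",
    b"foo==>bar",
    rb"regex:\bdriver\b==>pilot",
    b"literal:MM/DD/YYYY==>YYYY-MM-DD",
    rb"regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2",
])
sample = b"p455w0rd foo driver drivers MM/DD/YYYY 12/31/2020"
result = apply_rules(sample, rules)

# Without (?m), ^ and $ anchor to the whole file; with it, to each line.
whole_file = re.sub(rb"^secret$", b"x", b"a\nsecret\nb")
per_line = re.sub(rb"(?m)^secret$", b"x", b"a\nsecret\nb")
```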

       See also the --blob-callback from the section called “CALLBACKS”.

   Updating commit/tag messages
       If you want to modify commit or tag messages, you can do so with the same syntax as --replace-text,
       explained above. For example, with a file named expressions.txt containing

           foo==>bar

       then running

           git filter-repo --replace-message expressions.txt

       will replace foo in commit or tag messages with bar.

       See also the --message-callback from the section called “CALLBACKS”.

   Refname based filtering
       To rename tags, use --tag-rename, e.g.:

           git filter-repo --tag-rename foo:bar

       This will rename any tags starting with foo to now start with bar. Either side of the colon could be
       blank, e.g.

           git filter-repo --tag-rename '':'my-module-'

       For more general refname modification, see --refname-callback from the section called “CALLBACKS”.
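
       The prefix semantics of --tag-rename can be pictured with a short sketch. The function tag_rename is
       hypothetical and assumes fully qualified refnames under refs/tags/; other refs pass through
       unchanged.

```python
def tag_rename(refname, old, new):
    """Rename tags whose short name starts with old so they start with new."""
    prefix = b"refs/tags/" + old
    if refname.startswith(prefix):
        return b"refs/tags/" + new + refname[len(prefix):]
    return refname

renamed = tag_rename(b"refs/tags/foo-1.0", b"foo", b"bar")
prefixed = tag_rename(b"refs/tags/v1.0", b"", b"my-module-")
branch = tag_rename(b"refs/heads/foo", b"foo", b"bar")
```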

   User and email based filtering
       To modify user names and emails of commits, you can create a mailmap file in the format accepted by
       git-shortlog(1). For example, if you have a file named my-mailmap you can run

           git filter-repo --mailmap my-mailmap

       and if the current contents of that file are as follows (if the specified mailmap file is version
       controlled, historical versions of the file are ignored):

           Name For User <email@addre.ss>
           <new@ema.il> <old1@ema.il>
           New Name And <new@ema.il> <old2@ema.il>
           New Name And <new@ema.il> Old Name And <old3@ema.il>

       then filter-repo will update user names and/or emails based on the specified mapping.

       See also the --name-callback and --email-callback from the section called “CALLBACKS”.
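
       To make the four line forms concrete, here is a minimal sketch of a mailmap lookup. It is a
       simplification, not filter-repo’s implementation: parse_mailmap and translate are hypothetical
       names, it works on text strings rather than bytestrings, and it assumes emails compare
       case-insensitively while names must match exactly.

```python
import re

# One mailmap entry: "[New Name] <email> [[Old Name] <old email>]".
ENTRY = re.compile(
    r"^(?P<name1>[^<]*?)\s*<(?P<email1>[^>]*)>"
    r"(?:\s*(?P<name2>[^<]*?)\s*<(?P<email2>[^>]*)>)?\s*$"
)

def parse_mailmap(text):
    """Return (old_name, old_email, new_name, new_email); None means any/keep."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = ENTRY.match(line)
        if m is None:
            continue
        if m["email2"] is None:
            # "Proper Name <commit@email>": only the name changes.
            entries.append((None, m["email1"], m["name1"] or None, None))
        else:
            entries.append((m["name2"] or None, m["email2"],
                            m["name1"] or None, m["email1"]))
    return entries

def translate(name, email, entries):
    """Return the (name, email) pair after applying the first matching entry."""
    for old_name, old_email, new_name, new_email in entries:
        if email.lower() != old_email.lower():
            continue
        if old_name is not None and old_name != name:
            continue
        return (new_name or name, new_email or email)
    return (name, email)

MAILMAP = """\
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
"""
entries = parse_mailmap(MAILMAP)
```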

   Parent rewriting
       To replace $commit_A with $commit_B (e.g. make all commits which had $commit_A as a parent instead have
       $commit_B for that parent), and rewrite history to make it permanent:

           git replace $commit_A $commit_B
           git filter-repo --force

       To create a new commit with the same contents as $commit_A except with different parent(s) and then
       replace $commit_A with the new commit, and rewrite history to make it permanent:

           git replace --graft $commit_A $new_parent_or_parents
           git filter-repo --force

       The reason to specify --force is two-fold: filter-repo will error out if no arguments are specified, and
       the new graft commit would otherwise trigger the not-a-fresh-clone check.

   Partial history rewrites
       To rewrite the history on just one branch (which may cause it to no longer share any common history with
       other branches), use --refs. For example, to remove a file named extraneous.txt from the master branch:

           git filter-repo --invert-paths --path extraneous.txt --refs master

       To rewrite just some recent commits:

           git filter-repo --invert-paths --path extraneous.txt --refs master~3..master

CALLBACKS

       For flexibility, filter-repo allows you to specify functions on the command line to further filter all
       changes. Please note that there are some API compatibility caveats associated with these callbacks that
       you should be aware of before using them; see the "API BACKWARD COMPATIBILITY CAVEAT" comment near the
       top of git-filter-repo source code.

       All callback functions are of the same general format. For a command line argument like

           --foo-callback 'BODY'

       the following code will be compiled and called:

           def foo_callback(foo):
             BODY

       Thus, you just need to make sure your BODY modifies and returns foo appropriately. One important thing to
       note for all callbacks is that filter-repo uses bytestrings (see
       https://docs.python.org/3/library/stdtypes.html#bytes) everywhere instead of strings.
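
       The wrapping described above can be imitated in a few lines of Python. This is a sketch of the idea
       only: make_callback is a hypothetical helper, and it assumes that callback bodies can use modules
       such as re and os without importing them, as the examples in this section do.

```python
import os
import re
import textwrap

def make_callback(kind, body):
    """Build <kind>_callback(<kind>) from a BODY string."""
    source = "def {0}_callback({0}):\n{1}\n".format(
        kind, textwrap.indent(body, "  "))
    # Assumption: callback bodies can see modules such as os and re.
    namespace = {"os": os, "re": re}
    exec(compile(source, "<{}-callback>".format(kind), "exec"), namespace)
    return namespace[kind + "_callback"]

# A one-line body, as it would be passed to --name-callback.
name_callback = make_callback(
    "name", 'return name.replace(b"Wiliam", b"William")')
fixed = name_callback(b"Wiliam Shakespeare")

# A multi-line body; note the bytestring literals throughout.
refname_callback = make_callback("refname", '''
  rdir, rpath = os.path.split(refname)
  return rdir + b"/prefix-" + rpath''')
prefixed = refname_callback(b"refs/heads/master")
```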

       There are four callbacks that allow you to operate directly on raw objects that contain data that’s easy
       to write in git-fast-import(1) format:

           --blob-callback
           --commit-callback
           --tag-callback
           --reset-callback

       We’ll come back to these later because it is often the case that the other callbacks are more convenient.
       The other callbacks operate on a small piece of the raw objects or operate on pieces across multiple
       types of raw object (e.g. author names and committer names and tagger names across commits and tags, or
       refnames across commits, tags, and resets, or messages across commits and tags). The convenience
       callbacks are:

           --filename-callback
           --message-callback
           --name-callback
           --email-callback
           --refname-callback

       In each, you are expected to simply return a new value based on the one passed in. For example,

           git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'

       would result in the following function being called:

           def name_callback(name):
             return name.replace(b"Wiliam", b"William")

       The email callback is quite similar:

           git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'

       The refname callback is also similar, but note that the refname passed in and returned are expected to be
       fully qualified (e.g. b"refs/heads/master" instead of just b"master" and b"refs/tags/v1.0.7" instead of
       b"1.0.7"):

           git-filter-repo --refname-callback '
             # Change e.g. refs/heads/master to refs/heads/prefix-master
             rdir,rpath = os.path.split(refname)
             return rdir + b"/prefix-" + rpath'

       The message callback is quite similar to the previous three callbacks, though it operates on a bytestring
       that is likely more than one line:

           git-filter-repo --message-callback '
             if b"Signed-off-by:" not in message:
               message += b"\nSigned-off-by: Me My <self@and.eye>"
             return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'

       The filename callback is slightly more interesting. Returning None means the file should be removed from
       all commits, returning the filename unmodified marks the file to be kept, and returning a different name
       means the file should be renamed. An example:

           git-filter-repo --filename-callback '
             if b"/src/" in filename:
               # Remove all files with a directory named "src" in their path
               # (except when "src" appears at the toplevel).
               return None
             elif filename.startswith(b"tools/"):
               # Rename tools/ -> scripts/misc/
               return b"scripts/misc/" + filename[6:]
             else:
               # Keep the filename and do not rename it
               return filename
             '
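
       Because these bodies are ordinary Python, they can be tried out as plain functions before running a
       rewrite. The sketch below copies the filename and message bodies from above into functions and
       feeds them sample inputs; the sample paths and message are made up for illustration.

```python
import re

def filename_callback(filename):
    if b"/src/" in filename:
        # A "src" directory below the toplevel: drop the file.
        return None
    elif filename.startswith(b"tools/"):
        # Rename tools/ -> scripts/misc/
        return b"scripts/misc/" + filename[len(b"tools/"):]
    else:
        # Keep the filename and do not rename it
        return filename

def message_callback(message):
    if b"Signed-off-by:" not in message:
        message += b"\nSigned-off-by: Me My <self@and.eye>"
    return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)

dropped = filename_callback(b"lib/src/util.c")   # contains "/src/"
kept = filename_callback(b"src/util.c")          # toplevel src has no leading "/"
moved = filename_callback(b"tools/build.sh")
new_message = message_callback(b"Fix E-mail parsing\n")
```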

       In contrast, the blob, reset, tag, and commit callbacks are not expected to return a value, but are
       instead expected to modify the object passed in. Major fields for these objects are (subject to API
       backward compatibility caveats mentioned previously):

       •   Blob: original_id (original hash) and data

       •   Reset: ref (name of reference) and from_ref (hash or integer mark)

       •   Tag: ref, from_ref, original_id, tagger_name, tagger_email, tagger_date, message

       •   Commit: branch, original_id, author_name, author_email, author_date, committer_name, committer_email,
           committer_date, message, file_changes (list of FileChange objects, each containing a type, filename,
           mode, and blob_id), parents (list of hashes or integer marks)

       An example of each:

           git filter-repo --blob-callback '
             if len(blob.data) > 25:
               # Mark this blob for removal from all commits
               blob.skip()
             else:
               blob.data = blob.data.replace(b"Hello", b"Goodbye")
             '

           git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'

           git filter-repo --tag-callback '
             if tag.tagger_name == b"Jim Williams":
               # Omit this tag
               tag.skip()
             else:
               tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'

           git filter-repo --commit-callback '
             # Remove executable files with three 6s in their name (including
             # from leading directories).
             # Also, undo deletion of sources/foo/bar.txt (change types are
             # either b"D" (deletion) or b"M" (add or modify); renames are
             # handled by deleting the old file and adding a new one)
             commit.file_changes = [
                    change for change in commit.file_changes
                    if not (change.mode == b"100755" and
                            change.filename.count(b"6") == 3) and
                       not (change.type == b"D" and
                            change.filename == b"sources/foo/bar.txt")]
             # Mark all .sh files as executable; modes in git are always one of
             # 100644 (normal file), 100755 (executable), 120000 (symlink), or
             # 160000 (submodule)
             for change in commit.file_changes:
               if change.filename.endswith(b".sh"):
                 change.mode = b"100755"
             '
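
       The in-place style of these callbacks can be exercised without a repository by using stand-in
       objects. StubBlob, StubChange, and StubCommit below are hypothetical test doubles that carry only
       the fields used here; they are not filter-repo’s classes, whose exact shape is subject to the API
       caveats mentioned earlier.

```python
from dataclasses import dataclass, field

@dataclass
class StubBlob:
    data: bytes
    skipped: bool = False

    def skip(self):
        self.skipped = True

@dataclass
class StubChange:
    type: bytes
    filename: bytes
    mode: bytes = b"100644"

@dataclass
class StubCommit:
    file_changes: list = field(default_factory=list)

def blob_callback(blob):
    if len(blob.data) > 25:
        blob.skip()
    else:
        blob.data = blob.data.replace(b"Hello", b"Goodbye")

def commit_callback(commit):
    # Same filtering logic as the --commit-callback example above.
    commit.file_changes = [
        change for change in commit.file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    for change in commit.file_changes:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"

small = StubBlob(b"Hello world")
large = StubBlob(b"x" * 26)
blob_callback(small)
blob_callback(large)

commit = StubCommit([
    StubChange(b"M", b"a666.bin", b"100755"),
    StubChange(b"D", b"sources/foo/bar.txt"),
    StubChange(b"M", b"run.sh"),
    StubChange(b"M", b"doc666.txt"),
])
commit_callback(commit)
remaining = [(c.filename, c.mode) for c in commit.file_changes]
```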

INTERNALS

You probably don’t need to read this section unless you are just very curious or you are trying to do a
very complex history rewrite.

How filter-repo works
Roughly, filter-repo works by running

git fast-export <options> | filter | git fast-import <options>

where filter-repo not only launches the whole pipeline but also serves as the filter in the middle.
However, filter-repo does a few additional things on top in order to make it into a well-rounded
filtering tool. A sequence that more accurately reflects what filter-repo runs is:

1. Verify we’re in a fresh clone

2. git fetch -u . refs/remotes/origin/*:refs/heads/*

3. git remote rm origin

4. git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger
--signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes
--mark-tags --all | filter | git -c core.ignorecase=false fast-import --date-format=raw-permissive
--force --quiet

5. git update-ref --no-deref --stdin, fed with a list of refs to nuke, and a list of replace refs to
delete, create, or update.

6. git reset --hard

7. git reflog expire --expire=now --all

8. git gc --prune=now

Some notes or exceptions on each of the above:

1. If we’re not in a fresh clone, users will not be able to recover if they used the wrong command or
ran in the wrong repo. (Though --force overrides this check, and it’s also off if you’ve already run
filter-repo once in this repo.)

2. Technically, we actually use a git update-ref command fed with a lot of input due to the fact that
users can use --force when local branches might not match remote branches. But this fetch command
catches the intent rather succinctly.

3. We don’t want users accidentally pushing back to the original repo, as discussed in the section
called “DISCUSSION”. It also reminds users that since history has been rewritten, this repo is no
longer compatible with the original. Finally, another minor benefit is this allows users to push with
the --mirror option to their new home without accidentally sending remote tracking branches.

4. Some of these flags are always used but others are actually conditional. For example, filter-repo’s
--replace-text and --blob-callback options need to work on blobs so --no-data cannot be passed to
fast-export. But when we don’t need to work on blobs, passing --no-data speeds things up. Also, other
flags may change the structure of the pipeline as well (e.g. --dry-run and --debug)

5. We use this step to write replace refs for accessing the newly written commit hashes using their
previous names. Also, if refs were renamed by various steps, we need to delete the old refnames in
order to avoid mixing old and new history.

6. Users also have old versions of files in their working tree and index; we want those cleaned up to
match the rewritten history as well. Note that this step is skipped in bare repos.

7. Reflogs will hold on to old history, so we need to expire them.

8. We need to gc to avoid mixing new and old history. Also, it shrinks the repository for users, so they
don’t have to do extra work. (Odds are that they’ve only rewritten trees and commits and maybe a few
blobs, so --aggressive isn’t needed and would be too slow.)

Information about these steps is printed out when --debug is passed to filter-repo. When doing a
--partial history rewrite, steps 2, 3, 7, and 8 are unconditionally skipped, step 5 is skipped if
--replace-refs is update-no-add, and just the nuke-unused-refs portion of step 5 is skipped if
--replace-refs is something else.
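
The conditional use of --no-data from note 4 can be sketched as follows. The function fast_export_args
is hypothetical and only assembles the flags listed in step 4; it is not how filter-repo builds its
command line, and it ignores the other options that can change the pipeline.

```python
def fast_export_args(need_blob_contents):
    """Assemble the step-4 fast-export flags, dropping --no-data when blobs are needed."""
    args = [
        "git", "fast-export",
        "--show-original-ids",
        "--reference-excluded-parents",
        "--fake-missing-tagger",
        "--signed-tags=strip",
        "--tag-of-filtered-object=rewrite",
        "--use-done-feature",
    ]
    if not need_blob_contents:
        # Skipping blob data speeds things up when no filter touches contents.
        args.append("--no-data")
    args += ["--reencode=yes", "--mark-tags", "--all"]
    return args

paths_only = fast_export_args(need_blob_contents=False)
with_replace_text = fast_export_args(need_blob_contents=True)
```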

Limitations
Inherited limitations
Since git filter-repo calls fast-export and fast-import to do a lot of the heavy lifting, it inherits
limitations from those systems:

• extended commit headers, if any, are stripped

• commits get rewritten meaning they will have new hashes; therefore, signatures on commits and
tags cannot continue to work and instead are just removed (thus signed tags become annotated
tags)

• tags of commits are supported. Prior to git-2.24.0, tags of blobs and tags of tags are not
supported (fast-export would die on such tags). tags of trees are not supported in any git
version (since fast-export ignores tags of trees with a warning and fast-import provides no way
to import them).

• annotated and signed tags outside of the refs/tags/ namespace are not supported (their location
will be mangled in weird ways)

• fast-import will die on various forms of invalid input, such as a timezone with more than four
digits

• fast-export cannot reencode commit messages into UTF-8 if the commit message is not valid in its
specified encoding (in such cases, it’ll leave the commit message and the encoding header alone).

• commits without an author will be given one matching the committer

• tags without a tagger will be given a fake tagger

• references that include commit cycles in their history (which can be created with git-replace(1))
will not be flagged to the user as an error but will be silently deleted by fast-export as though
the branch or tag contained no interesting files

There are also some limitations due to the design of these systems:

• Trying to insert additional files into the stream can be tricky; since fast-export only lists
file changes in a merge relative to its first parent, if you insert additional files into a
commit that is in the second (or third or fourth) parent history of a merge, then you also need
to add it to the merge manually. (Similarly, if you change which parent is the first parent in a
merge commit, you need to manually update the list of file changes to be relative to the new
first parent.)

• fast-export and fast-import work with exact file contents, not patches. (e.g. "Whatever the
current contents of this file, update them to now have these contents") Because of this, removing
the changes made in a single commit or inserting additional changes to a file in some commit and
expecting them to propagate forward is not something that can be done with these tools. Use
git-rebase(1) for that.

Intrinsic limitations
Some types of filtering have limitations that would affect any tool attempting to perform them; the
most any tool can do is attempt to notify the user when it detects an issue:

• When rewriting commit hashes in commit messages, there are a variety of cases when the hash will
not be updated (whenever this happens, a note is written to .git/filter-repo/suboptimal-issues):

    • if a commit hash does not correspond to a commit in the old repo

    • if a commit hash corresponds to a commit that gets pruned

    • if an abbreviated hash is not unique

• Pruning of empty commits can cause a merge commit to lose an entire ancestry line and become a
non-merge. If the merge commit had no changes then it can be pruned too, but if it still has
changes it needs to be kept. This might cause minor confusion since the commit will likely have a
commit message that makes it sound like a merge commit even though it’s not. (Whenever a merge
commit becomes a non-merge commit, a note is written to .git/filter-repo/suboptimal-issues)

Issues specific to filter-repo
• Multiple repositories in the wild have been observed which use a bogus timezone (+051800); google
will find you some reports. The intended timezone wasn’t clear or wasn’t always the same, so
filter-repo replaces it with a different bogus timezone that fast-import will accept (+0261).

• --path-rename can result in pathname collisions; to avoid excessive memory requirements of
tracking which files are in all commits or looking up what files exist with either every commit
or every usage of --path-rename, we just tell the user that they might clobber other changes if
they aren’t careful. We can check if the clobbering comes from another --path-rename without much
overhead. (Perhaps in the future it’s worth adding a slow mode to --path-rename that will do the
more exhaustive checks?)

• There is no mechanism for directly controlling which flags are passed to fast-export (or
fast-import); only pre-defined flags can be turned on or off as a side-effect of other options.
Direct control would make little sense because some options like --full-tree would require
additional code in filter-repo (to parse new directives), and others such as -M or -C would break
assumptions used in other places of filter-repo.

• Partial-repo filtering, while supported, runs counter to filter-repo’s "avoid mixing old and new
history" design. This support has required improvements to core git as well (e.g. it depends upon
the --reference-excluded-parents option to fast-export that was added specifically for this usage
within filter-repo). The --partial and --refs options will continue to be supported since there
are people with usecases for them; however, I am concerned that this inconsistency about mixing
old and new history seems likely to lead to user mistakes. For now, I just hope that long
explanations of caveats in the documentation of these options suffice to curtail any such
problems.

Comments on reversibility
Some people are interested in reversibility of a rewrite; e.g. rewrite history, possibly add some
commits, then unrewrite and get the original history back plus a few new "unrewritten" commits.
Obviously this is impossible if your rewrite involves throwing away information (e.g. filtering out
files or replacing several different strings with ***REMOVED***), but may be possible with some
rewrites. filter-repo is likely to be a poor fit for this type of workflow for a few reasons:

• most of the limitations inherited from fast-export and fast-import are of a type that cause
reversibility issues

• grafts and replace refs, if present, are used in the rewrite and made permanent

• rewriting of commit hashes will probably be reversible, but it is possible for rewritten
abbreviated hashes to not be unique even if the original abbreviated hashes were.

• filter-repo defaults to several forms of irreversible rewriting that you may need to turn off
(e.g. the last two bullet points above or reencoding commit messages into UTF-8); it’s possible
that additional forms of irreversible rewrites will be added in the future.

• I assume that people use filter-repo for one-shot conversions, not ongoing data transfers. I
explicitly reserve the right to change any API in filter-repo based on this presumption (and a
comment to this effect is found in multiple places in the code and examples). You have been
warned.

GIT

       Part of the git(1) suite

NOTES

        1. GitLab’s excellent docs on reducing repository size
           https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html

        2. the first and second steps under "Fully removing the data from GitHub"
           https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#fully-removing-the-data-from-github