When it comes to organizing Git repositories, there’s only one way to do it: a repository per project or component (a web server, an iOS app, etc…).
If you’ve done anything other than that, you’ve fucked up. Maybe you have a single repository for everything? Some combination? Sure, it might be easier to just have one repository, but as your system grows, the ramifications of that fuck-up become clear:
-
A 1-line change to the app performs a full-redeploy of your web-server - now you have to have your CI server only build certain jobs based on directory’s touched. Yuck.
-
Every single server has the code for all the other servers. Now you have to perform a build step to remove files from deployments. Yuck.
-
Large commits cutting across multiple components. Commit messages could be considered novels. Now we have to selectively commit certain files that touch a server, and certain files that touch the app, etc. Yuck.
If this sounds like your project, don’t keep working around the problem - let’s fix it. It’s probably going to be a bit of work, but it’s necessary, so take a deep breath, and let’s get started.
The first choice you have to make is whether to preserve history or not. For almost all cases you should preserve history, but if for some reason you decide it’s not important, then “these aren’t the droids you’re looking for”. You just need to create new repositories, copy files to each as appropriate, and you’re done. That might be the right choice for your project, but this isn’t the post for you.
Git gives us many tools to rewrite history, so let’s look at situations you’ll run into, and how to use Git to deal with them.
Extract a sub-directory
This is the best-case scenario: everything is perfect about your repository,
only you have each project/component in a sub-directory of your large repo.
For this, git subtree split
is exactly what you want.
git subtree split -prefix=subdir_to_remove -b new_project_branch
For some older versions of git, the
subtree
subcommand may be unavailable. Instead, you can use thefilter-branch
subcommand:git filter-branch --subdirectory-filter subdir_to_remove new_project_branch
Then you can push each new branch to its own remote.
Arbitrary paths
What if a project has two top-level directories, or shares files between projects?
Maybe a .env
or Procfile
?
In this case you’ll have to extract history by arbitrary paths.
To do this, first we remove all non-wanted files from all git commits with the
git filter-branch
command. Any commits that do not touch those paths will be
removed because of the --ignore-unmatch
flag.
PATHS_TO_KEEP="./some_dir ./some_other_dir .env Procfile"
git filter-branch -f --index-filter "git rm --ignore-unmatch --cached -qr -- . && git reset -q \$GIT_COMMIT -- $PATHS_TO_KEEP" --prune-empty -- HEAD
The
-- HEAD
means to only do this transformation in the current branch. You can replace it with-- --all
to do it across all branches. Or--tag-name-filter cat -- --all
to do it across all branches AND tags. Pick your poison.
While this is great, empty merge commits are not removed by the last command!
A parent-filter
can be used to rewrite commit parents to remove them:
tmpfile=`mktemp -t splitgit`
chmod 755 $tmpfile
echo '#!/usr/bin/env ruby
old_parents = gets.chomp.gsub("-p ", " ")
new_parents = old_parents.empty? ? [] : `git show-branch --independent #{old_parents}`.split
puts new_parents.map{|p| "-p " + p}.join(" ")' > $tmpfile
git filter-branch -f --prune-empty --parent-filter $tmpfile HEAD
rm $tmpfile
You can simplify this by writing $tmpfile to disk in a well-known location, instead of using
mktemp
each time.
Moving files around
Let’s say a new repo should still have a file, but you want to change its
path in the new repo? If so, tree-filter
can help with this. For this example,
we want to move an entire sub-directory up one level.
git filter-branch -f --tree-filter 'shopt -s dotglob nullglob; test -d sub_dir && mv sub_dir/* . || :' HEAD
Deleting files
What if you just don’t want a particualr file around?
git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch Rakefile' --prune-empty -- HEAD
We use
index-filter
here to not require files be checked out for each commit, which makes it faster. This is different fromtree-filter
in that it checks out each commit’s files. This is good in that it lets you userm
andtest
, but is slower. When formulating your own history manipulations, keep this in mind.
Keeping only certain lines in a file
(I know, this is crazy…) What if, in your old repo, you had a configuration
file that was for all projects, but now you want to only keep commits that
affected the variables for the particuar sub-project? tree-filter
+ cat
+ grep
:
git filter-branch -f --tree-filter 'test -f .env && cat .env | grep "\(^VAR_X\|^VAR_Y\|^VAR_Z\)" > .newenv && mv .newenv .env || :' --prune-empty -- HEAD
Fucking symlinks
OK, this is harder, but all your fault. You have symlinks in your repo, presumably to share code between projects. Bad. Share code through a package manager, a submodule, or some other means, but not through a symlink.
This is a problem when your repository split makes a symlink invalid, since Git stores symlinks as a file with its only contents a path.
To fix this, we’ll convert all symlinks into hard links, which Git will then treat as any other file and commit its contents.
Convert all symlinks to hard links (if the symlinks are relative paths):
git filter-branch -f --tree-filter \
'find . -type l -exec bash -c '"'"'ln -f "$(dirname "$0")/""$(readlink -n "$0")" "$0"'"'" {} \;' \
--prune-empty -- HEAD
If the paths are absolute, WTF? How the hell did that ever work to begin with??
Empty commits
If for some reason you have empty commits (maybe you forgot a --ignore-unmatch
?),
you can remove them with a special --commit-filter
:
git filter-branch --commit-filter 'git_commit_non_empty_tree "$@"' HEAD
The initial commit is still empty, what gives?
That’s the only price your going to pay for not doing the right thing to start, so be OK with it. Your initial commit is most likely going to be an empty one saying “Initial commit” or whatever you said.
You could rebase your entire repository (git rebase --interactive --root
)
but if you have any legitimate merge commits you’ll find yourself attempting to merge
years old code in the right order. It’s not worth it.
Fixing author info
If you notice certain committers used a non-company email, or there are commits
from non-human accounts, you can use a --env-filter
to rewrite author and
committer name and emails. GitHub’s Help site has a good example of this.
Holy crap I deleted the wrong thing/everything!
No problem! If anything goes wrong, you can undo the last filter-branch
:
git reset --hard refs/original/refs/heads/master
Next steps
After you’ve got all your new shiny respositories how you want them, you’re may not be entirely done. Some things that you probably need to do:
- Create new CI or deployment jobs to work against new repos
- Reference GitHub issues in commits? Well, now
#1234
points to your new repo, not the old one. To reference an issue in a different repo than the current one, useuser/repo#1234
(and make it a habit to always reference them like that). I’ll leave it to the reader to figure out how to rewrite commit messages from#1234
touser/repo#1234
. - Delete old code from old repo, and put a message up on your old repo telling visitors where to look for the new repo.
Happy history rewriting!