Counting objects Counting objects

The Systems Team at GitHub works to solve complex bugs and performance bottlenecks at the lowest levels of our infrastructure. Over the past two years we’ve undertaken a major project to improve the performance of Git network operations (like clones or fetches) for the repositories we host.

Two years ago, if you cloned a large repository from any Git host, you probably found yourself waiting several minutes before data started being sent to your machine, while the server proudly announced that it was “counting objects”.

$ git clone
Cloning into ‘linux’…
remote: Counting objects: 4350078, done.
remote: Compressing objects: 100% (4677/4677), done.
Receiving objects: 4% (191786/4350078), 78.19 MiB | 8.70 MiB/s
However, if you try to clone the equivalent repository hosted on today, or on a GitHub Enterprise instance, you’ll notice that data starts being sent immediately and the transfer is only limited by your available bandwidth to the server.

Inquiring minds may wonder: What are these objects, why did you have to count them every single time, and why are you not counting them anymore?

The objects

All data in a Git repository is stored in the shape of a Directed Acyclic Graph. The history of the repository is ordered throughout time by the way commits are linked to each other. Each commit has a link to its parent, the commit that came right before it; some commits have two parents, the result of merging two branches together. At the same time, each commit has a link to a tree. The tree is a snapshot of the contents of the working directory in the moment where the commit was created. It contains links to all the files in the root of your repository, and links to other subtrees which are the folders and recursively link to more files and subtrees.

The result is a highly dense forest of interlinked nodes stored in the shape of a graph, but essentially accessed as a key-value store (each object in the database is only indexed by the SHA1 of its contents).

The counting

When you connect to a Git daemon to perform a fetch, the client and the server perform a negotiation. The server shows the tips of all the branches it has, and the client replies by comparing those against the tips of its own branches. “I’ve already got all these objects. I want those! And those!”

There’s a very specific case of a fetch operation: a fresh clone, where little negotiation is necessary. The server offers the tips of its branches, and the client wants all of them, because it doesn’t have any.

This is when this whole ordeal starts becoming expensive, because the server has little information on what to actually send. Git doesn’t keep a definite list of all objects reachable from the graph, and it cannot send every single object in its database as a whole, because it could very well be that some of those objects are not reachable at all in the repository and should be thrown away instead of sent to the client. The only thing Git knows are the tips of all branches, so its only option is to walk down the graph, all the way to the beginning of the history, listing every single object that needs to be sent.

Each commit in the history is loaded and added to the list of objects to send, and from each commit its tree is also expanded, and blobs and subtrees are also added to the list. This process scales with the size of the history of the repository (i.e. the amount of commits) and the actual size of the repository. The more files that exist on the repository, the longer it takes to iterate through the trees of each commit to find the few blobs that haven’t been listed yet.

This is the “Counting Objects” phase, and as you may gather, some repositories have a pretty big graph, with lots of objects to count.

The problem

At GitHub we serve some of the highest-traffic Git repositories on the internet, repositories which generally are also very large. A clone of the Linux kernel can take up to 8 minutes of CPU time in our infrastucture. Besides the poor user experience of having to wait several minutes before you can start receiving data in a clone, a thundering herd of these operations could significantly impact the availability of our platform. This is why the Systems team made it a priority to start researching workarounds for this issue.

Our initial attempts involved caching the result of every clone in order to replay it to other clients asking for the same data. We quickly found out that this approach was not elegant and definitely not scalable.

When it comes to caching clones, the repositories that matter the most are the most active ones. This makes caching the full result of a clone pointless, since the state of the repository changes constantly. On top of that, the initial caching still takes the same unreasonable amount of CPU time, and the size of the cache is about the same as that of the repository on disk. Each cached response roughly doubles the amount of disk space needed by the repository.

Because of these reasons, we decided against pursuing the caching approach. Generally speaking, caching specific results to queries is a weak approach to performance in complex systems. What you want to do instead is caching intermediate steps of the computation, to be able to efficiently answer any kind of query. In this case, we were looking for a system that would not only allow us to serve clones efficiently, but also complex fetches. Caching responses cannot accomplish this.

In 2013 we organized and obviously attended Git Merge in Berlin, the yearly conference (previously known as GitTogether) where developers of all the different Git implementations meet, mingle and attempt to cope with the fact that it’s 2015 and we’re writing version control systems for a living.

There, we were delighted to learn about the “bitmap index” patchset that Google’s JGit engineers had written for their open-source Java implementation of Git. Back then the patchset was still experimental, but the concept was sound and the performance implications were very promising, so we decided to tread down that path and attempt to port the patchset to Git, with the hopes of being able to run it in our infrastucture and push it upstream for the whole community to benefit.

The idea

The idea behind bitmap indexes is not particularly novel: it is the same method that databases (especially relational ones) have used since the 1970s to speed up their more complex queries.

In Git’s case, the queries we’re trying to perform are reachability queries: given a set of commits, what are all of the objects in the graph that can be reached from those commits? This is the operation that the “Counting Objects” phase of a fetch performs, but we want to find the set of reachable objects without having to traverse every single object on the way.

To do so, we create indexes (stored as bitmaps) that contain the information required to answer these queries. For any given commit, its bitmap index marks all the objects that can be reached from it. To find the objects that can be reached from a commit, we simply look up its bitmap and check the marked bits on it; the graph doesn’t need to be traversed anymore, and an operation that used to take several minutes of CPU time (loading and traversing every single object in the graph) now takes less than 3ms.

Of course, a practial implementation of these indexes has many details to work around:

First and foremost, we cannot create an index for every single commit in the repository. That would consume too much storage! Instead, we use heuristics to decide the set of commits that will receive bitmaps. We’re interested in indexing the most recent commits: the tips of all branches need an index, because those are the reachability computations that we will perform with a fresh clone. Besides that, we also add indexes to recent commits; it is likely that their reachability will be computed when people fetch into a clone that is slightly out of date. As we progress deeper into the commit graph, we decrease the frequency at which we pick commits to give indexes to. Towards the end of the graph, we pick one of every 3000 commits.

Last, but not least, we minimize the amount of disk used by the indexes by compressing the bitmaps. JGit’s implementation originally used Daniel Lemire’s EWAH Bitmaps, so we ported his C++ implementation to C to be backwards compatible [1]. EWAH Bitmaps offer a very reasonable amount of compression while still being extremely fast to decompress and work with. The end result is an index that takes between 5 and 10% of the original size of the repository — significantly less than caching the result of fetches, whilst allowing us to speed up any fetch negotiation, not only those of fresh clones.

The implementation

Once we started working on the Git implementation, we found that JGit’s design was so dramatically different from Git’s that it was simply unfeasible to port or translate the patchset. JGit is after all a Java library, while Git is a collection of small binaries written in C.

What we did instead was take the idea of bitmap indexes and re-implement it from scratch, using only the on-disk format of JGit’s bitmap indexes as reference (as to ensure interoperability between the two implementations).

The first iteration of the patchset took about 2 months of work, and was drastically different from the original JGit implementation. But it seemed to be compatible. We could open and query bitmap indexes generated by JGit. We did several rounds of synthetic benchmarks, which showed that our implementation was slightly faster than JGit’s (mostly because of performance differences between the languages, not between the different designs).

So we started testing cloning real production repositories in a staging environment. The benchmarks showed an exciting improvement of 40%… In the wrong direction. The bitmap-based code was about 40% slower to generate the packfile for a repository.

These results raised many questions: how can a clone operation be significantly slower than in vanilla Git, when we’ve already tested that the bitmap indexes are being queried instantly? Did we really spend 3 months writing 5000 lines of C that make clones slower instead of faster? Should we have majored in Math instead of Computer Science?

Our existential crisis vanished swiftly with some in-depth profiling. If you break down the graph by the amount of time spent during each phase of the cloning operation you can see that, indeed, we’ve managed to reduce an operation that took minutes to mere miliseconds, but unfortunately we’ve made another operation — seemingly unrelated — three times slower.

Deltas all the way down

The job of a Version Control System is tracking changes on a set of files over time; hence, it needs to keep a snapshot of every single file in the repository at every changeset: thousands of versions of thousands of different files which, if stored individually, would take up an incredible amount of space on disk.

Because of this, all version control systems use smarter techniques to store all these different versions of files. Usually, files are stored as the delta (i.e. difference) between the previous version of a file and its current one. This delta is then compressed to further reduce the amount of size it takes on disk.

In Git’s case, every snapshot of a file (which we call a blob) is stored together with all the other objects (the tags, commits and trees that link them together) in what we call a “packfile”.

When trying to minimize the size of a packfile, however, Git mostly disregards history and considers blobs in isolation. To find a delta base for a given blob, Git blindly scans for blobs that look similar to it. Sometimes it will choose the previous version of the same file, but very often it’s a blob from an unrelated part of the history.

In general, Git’s aggressive approach of delta-ing blobs against anything is very efficient and will often be able to find smaller deltas. Hence the size of a Git repository on-disk will often be smaller than its Subversion or Mercurial equivalent.

This same process is also performed when serving a fetch negotiation. Once Git has the full list of objects it needs to send, it needs to compress them into a single packfile (Git always sends packfiles over the network). If the packfile on disk has a given object X stored as a delta against an object Y, it could very well be the case that object X needs to be sent, but object Y doesn’t. In this case, object X needs to be decompressed and delta’ed against an object that does need to be sent in the resulting packfile.

This is a pretty frequent ocurrence when negotiating a small fetch, because we’re sending few objects, all from the tip of the repository, and these objects will usually be delta’ed against older objects that won’t be sent. Therefore, Git tries to find new delta bases for these objects.

The issue here is that we were actually serving full clones, and our benchmarks were showing that we were recompressing almost every single object in the clone, which really makes no sense. The point of a clone is sending all the objects in a repository! When we send all objects, we shouldn’t need to find new delta bases for them, because their existing delta bases will be sent too!

And yet the profiling did not lie. There was something fishy going on the compression phase.

Your very own fork of Rails

Very early on we figured out that actually forking people’s repositories was not sustainable. For instance, there are almost 11,000 forks of Rails hosted on GitHub: if each one of them were its own copy of the repository, that would imply an incredible amount of redundant disk space, requiring several times more fileservers than the ones we have in our infrastructure.

That’s why we decided to use a feature of Git called alternates. When you fork a repository on GitHub, we create a shallow copy of it. This copy has no objects of its own, but it has access to all the objects of an alternate, a root repository we call network.git and which contains the objects for all the forks in the network. When you push to your repository, eventually we move over the objects you pushed to the network.git root, so they’re already available in case you decide to create a Pull Request against the original repository.

With all the repositories in the network sharing the same pool of objects, we can keep all the objects inside of a single packfile, and hence minimize the on-disk size of the repository network by not storing all the duplicated objects between the forks.

Here’s the issue though: when we delta two objects inside this packfile (a packfile that is shared between all the forks of a repository), Git uses the straightforward, aggressive heuristic of “finding an object that looks alike”. As stated earlier, this is very effective for minimizing the size of a packfile on disk, but like all simple things in life, this comes back to bite us.

What happens when you delta an object of a fork against an object of another fork? Well, when you clone that fork, that object cannot be sent as a delta, because its base is not going to be sent as part of the clone. The object needs to be blown up (recomposed from its original delta), and then delta’ed on the fly against an object that is actually going to be sent on the clone.

This is the reason why the “Compressing Objects” phase of our fetches was recompressing all the objects and taking so long to run: yes, we were sending a whole repository in the clone, but the objects in that repository were delta’ed against objects of a different fork altogether. They did need to be recompressed on the fly. Ouch!

The missing optimization

We had a solid explanation of why we were seeing all these objects being recompressed, but something was still off. The “compressing objects” phase is completely independent of the “counting objects” phase, and our bitmap changes didn’t affect it at all.

The behavior of this compression phase is dictated by the way we store forks jointly on disk, so it made no sense that it would take twice as long after our patch, again considering these code paths were not touched. Why was object compression suddenly so slow?

Here’s what happens: when traversing the graph during the “Counting Objects” phase, traditional Git collected metadata about blobs, like their filenames, sizes and paths. During the compression phase, it used this metadata as part of its heuristics for finding good delta bases.

However, after three months of work in our bitmap index changes, we could now find the list of all the objects to send through the bitmap index, without having to actually load a single one of them. This was great for removing the intense CPU requirements of the “counting objects” phase, but it had unintended side effects when it was time to to compress the objects.

When Git noticed that none of the objects on the list could be sent because they were delta’ed against other objects that were not in the list, it was forced to re-delta these objects. But without any metadata to hint at the size or contents of the objects, it had to randomly compare them against each other (obviously by loading them and forfeiting all the benefits that the bitmap index had in the first place) until it was able to find similar objects to delta against and piece together a packfile that wasn’t outrageously large.

This is how we were losing all the benefits of our optimization, and in fact making the process of generating a packfile 40% slower than it was before, despite the fact that the most expensive phase of the process was essentially optimized away.

The performance fix

Our dooming performance issue was being caused, again, because we were merging several forks of a repository in the same packfile, but unfortunately, splitting these forks into individual packs was not an option because it would multiply up the storage costs of repositories several times.

After some thought, we agreed that the bitmap index approach was too promising to give up on, despite the fact it could not run properly with our current infrastucture. Therefore, we started looking for a workaround in the way we store the different forks of repositories on disk.

The idea of “forks” is foreign to Git’s storage layer: when creating the packfile for a network of forks, Git only sees hundreds of thousands of objects and is not aware of the fact that they actually come from different forks of the same repository.

However, we are aware of the source of the objects as we add them to the packfile for delta’ing, and we can influence the way that Git will delta these packfiles.

What we did was “paint” the graph of all forks, starting from the roots, with a different color for each fork. The “painting” was performed by marking a bit on each object for each fork that contained the object.

With this, it’s trivial to propagate the markings to the descendant objects, and change the delta-base heuristic to only consider a parent object as a base if its marking were a strict superset of the markings of the child object (i.e. if sending the child object would always imply sending also the parent, for all forks).

As the markings trickle down, the final packfile will only have objects that can be sent as-is, because when cloning any of the forks in the network, every object will only be delta’ed against another object that will also be sent in the clone.

The results

With this new heuristic, we repacked the original repository for our benchmarks (which in our case meant repacking the whole network of forks that contained the repository), and ran the packfile generation test again.

The first thing we verified is that our heuristics worked, and that we had to re-compress exactly 0 objects during the “Compressing Objects” phase. That was the case. As for the performance of the pack generation phase… Well, let’s just say we hit our performance target.

It took another couple months to work around the kinks in our implementation, but we eventually deployed the patchset to production, where the average CPU time spent on Git network operations was reduced by more than 90% — an order of magnitude faster than the old implementation.

Shortly after our initial deploy, we also started the process of upstreaming the changes to Git so the whole community could benefit from them. The bitmap-index patchset [2] finally shipped in the much anticipated Git 2.0 release, and now Git network operations from most major Git hosts are enjoying the overhaul.

Since the initial deployment of our patchset more than a year ago, it has already saved our users roughly a century of waiting for their fetches to complete (and the equivalent amount of CPU time in our fileservers).

But this is only the beginning: now that bitmap indexes are widely available, they can be used to accelerate many other Git operations. We’ve implemented a number of such improvements which we already run on production, and we are preparing them for submission to the open source Git project.

The story behind these second generation optimizations, however, will have to wait until another day and another blog post.

1. Daniel Lemire was extremely helpful assisting us with questions regarding his EWAH compression algorithm, and allowed us to relicense our C port under the GPLv2 to make it compatible with Git. His state-of-the-art research on bitmap and array compression is used in a lot of open-source and commercial software.

2. The final series that made it upstream was composed of 14 commits. They include thorough documentation of the whole implementation in their commit messages. The key parts of the design are explained in compressed bitmap implementation, add documentation for the bitmap format, add support for bitmap indexes, use bitmaps when packing objects and implement bitmap writing.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s