What is a Monorepo, Really?

There are often discussions at software companies about whether they should or shouldn’t have a “monorepo,” meaning “a single, version-controlled repository for all code at the company.” Very often, people base this decision on the fact that this is how Google stores its code.

I have now worked in developer productivity organizations at a company with a very advanced monorepo (Google) and a company with a very advanced multi-repo system (LinkedIn), and I have to tell you: most of the valuable properties that people associate with a monorepo have nothing to do with how many source control repositories you have. In fact, what people (and Google) consider a monorepo is actually multiple different concepts:

  1. Atomic commits across different projects. (And thus a single “head” commit that moves forward atomically for all code.)
  2. A universal directory hierarchy and a single view of all source code.
  3. A single place where you go to check out or commit code. (This includes all tools that read or write code.)
  4. (Sometimes) The smallest unit of check out, commit, and dependency is a file.
  5. (Usually) No concept of a project, only concepts of directories and files.
  6. (Sometimes) The One Version Rule: There may only be one version of any dependency in the repository at any one time.
  7. The ability to require library maintainers to solve the problems they cause.

I’ll talk about these in more detail, including some of their upsides and downsides.

Atomic Commits Across Projects

Let’s say we have two separate projects, A and B. We want to make a change that affects both of them. Part of a “monorepo” is the guarantee that you can commit atomically to both of these projects simultaneously. There is no view of the repository where Project A is at Commit #1 but Project B is at Commit #2.

This is especially important when a change would break Project A or Project B unless both are changed at exactly the same time. For example, let’s say we have one project called App, and it depends on a project called Library. We want to change the signature of a function in Library and update App at the same time. If we just update Library or just update App, then App is broken.

This is the feature that most depends on things being in a single source code repository, because practically the definition of “a repository” is “a location to which you can commit multiple files atomically, which tracks those atomic commits, and from which you can check out at any point in that atomic commit history.”

This feature also implies that there is a single definition of “head” (the most recent commit) for the entire repository. This is important to think about because when developers check out from a repository, they usually check out at “head.” This means that when developers check out, they are guaranteed a consistent view of the entire source code tree, no matter how many projects they check out simultaneously. They never have to think about whether they checked out App and Library at two different versions that are incompatible with each other. For the most part (as long as you have a good testing system that validates that all commits actually work, which is a complex problem in and of itself) code checked out at any given commit should all work together.
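To make the idea of an atomic, repository-wide “head” concrete, here is a minimal Python sketch. It is a toy model, not how any real version control system is implemented: the ToyRepo class, its methods, and the file paths are all invented for illustration. Each commit records a full snapshot, so App and Library can only ever move forward together.

```python
class ToyRepo:
    """Toy model (hypothetical): every commit is a full snapshot of {path: content}."""

    def __init__(self):
        self._snapshots = [{}]  # revision 0 is the empty tree

    def commit(self, changes):
        """Apply all changes in one step and return the new revision number."""
        snapshot = dict(self._snapshots[-1])
        snapshot.update(changes)
        self._snapshots.append(snapshot)
        return len(self._snapshots) - 1

    def head(self):
        """The single most recent revision for the whole repository."""
        return len(self._snapshots) - 1

    def checkout(self, revision=None):
        """Return a consistent view of every file at one revision (default: head)."""
        if revision is None:
            revision = self.head()
        return dict(self._snapshots[revision])


repo = ToyRepo()

# Revision 1: Library and App agree on a one-argument greet().
r1 = repo.commit({
    "library/api.py": "def greet(name): ...",
    "app/main.py": "greet('world')",
})

# Revision 2: the signature change and the caller update land together.
r2 = repo.commit({
    "library/api.py": "def greet(name, punctuation): ...",
    "app/main.py": "greet('world', '!')",
})
```

There is no revision in this model where library/api.py has the new signature and app/main.py still has the old call, which is exactly the guarantee described above.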

A Standardized Cross-Project Directory Structure

All code in a monorepo is thought of as being in a single directory structure. This has advantages when you are developing, and advantages when you are browsing through code.

While Developing: Checking Out Is Standardized

During development, if Project A is stored at /path/to/project/A in the repository and Project B is stored at /path/to/project/B in the repository, they will be in directories right next to each other when I check them both out. I can guarantee that that will be the directory structure. I never have to think about where I should place Project A on the disk in relationship to Project B, if I need to have them work together while I am developing.

For those who are used to a monorepo, this may seem like a small detail. However, in most multi-repo systems, this can be very confusing. If I am working on an App that depends on a Library, and I want to modify them both on my disk to test how the two modifications will work together, it can be very confusing to figure out how to get the App to consume my modified Library.

All this said, there’s nothing about this principle that actually requires a single source code repository. There could be a standardized way, provided by tools, that projects are always checked out, even if you have multiple repos.

A Uniform Way of Browsing Code

Since you have a single directory structure, it’s relatively straightforward to browse through directories in your code search tool, and to have a single code search tool that searches that one repository.

However, there’s nothing preventing you from having a single, universal view of a multi-repo system via some UI tool or some virtual filesystem. It’s more complicated because there isn’t an atomic “head” for a multi-repo system: all repositories are at different versions at different times. However, you could either (a) account for that in the UI of your code search tool (such as by making the version number part of the “path” people see when they are browsing, or letting people choose versions somehow) or (b) decide that when you’re browsing or searching, you always see the “head” commit of every repository (which is how most code search tools work today anyway).
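As a rough sketch of what such a universal view could look like, the Python below maps a path in one shared hierarchy onto a repository plus a path inside it, using option (b), where every repository is viewed at its own head. The mount table, the repository names, and the resolve function are all hypothetical; a real tool would need a lot more than this.

```python
from pathlib import PurePosixPath

# Hypothetical mount table: where each repository appears in the universal tree.
MOUNTS = {
    "/code/app": "app-repo",
    "/code/libs/library": "library-repo",
}


def resolve(universal_path, mounts=MOUNTS):
    """Map a universal path to (repository, path inside that repository).

    The longest matching mount point wins, so nested mounts would also work.
    """
    path = PurePosixPath(universal_path)
    best = None
    for mount, repo in mounts.items():
        mount_path = PurePosixPath(mount)
        if path == mount_path or mount_path in path.parents:
            if best is None or len(mount_path.parts) > len(best[0].parts):
                best = (mount_path, repo)
    if best is None:
        raise KeyError(f"no repository is mounted at {universal_path}")
    mount_path, repo = best
    return repo, str(path.relative_to(mount_path))
```

With this table, resolve("/code/libs/library/src/util.py") returns ("library-repo", "src/util.py"), so a browsing or search tool could present one directory tree while still fetching from separate repositories underneath.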

A Single Place to Check Out and Commit

This may seem unimportant, but one of the values of a monorepo is not having to think “which repository do I check out from?” Instead developers just have to think about what code they need to check out. Similarly, all commits go to that same repository.

This also means that you have a single view of all the commits throughout history, which can sometimes be helpful (such as when you are trying to figure out everything that could have changed between Time A and Time B, for debugging purposes).

And finally, all the tools only have to worry about accessing a single repository—all they have to care about is directory and file names.

Once again, this doesn’t really require having just one repository. You could have a facade in front of your multi-repo system that provides the important parts of this functionality, such as a unified view of history, a single place to check out from, and a single place to commit to, if that was really important.
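For example, the “unified view of history” part of such a facade could be sketched in Python like this. The Commit record, the repository names, and the integer timestamps are assumptions made for illustration, and the sketch assumes each repository’s log is already sorted oldest-first.

```python
import heapq
from typing import NamedTuple


class Commit(NamedTuple):
    """Hypothetical commit record; real systems carry much more metadata."""
    timestamp: int
    repo: str
    commit_id: str
    message: str


def unified_history(*logs):
    """Merge per-repository logs (each sorted oldest-first) into one timeline."""
    return list(heapq.merge(*logs, key=lambda commit: commit.timestamp))


def changes_between(history, start, end):
    """Everything that changed anywhere with start <= timestamp < end."""
    return [c for c in history if start <= c.timestamp < end]


app_log = [
    Commit(100, "app-repo", "a1", "Add login page"),
    Commit(300, "app-repo", "a2", "Call new Library API"),
]
library_log = [
    Commit(200, "library-repo", "l1", "Add new API"),
    Commit(400, "library-repo", "l2", "Remove old API"),
]

history = unified_history(app_log, library_log)
```

Note that this only gives you an ordering by time. It does not give you the atomic cross-repository commits discussed earlier, which is the one property that really does depend on having a single repository.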

Files Are the Smallest Unit of Checkout, Commit, and Dependency

In most monorepos, the smallest thing you can commit that the versioning system tracks is a file. The system knows that “a file” is what changed. It might seem to be aware of lines in a file, but that’s only because it can reproduce the changes to a file as a “diff” by comparing the previous version to the current version. Conceptually, each commit records a complete new version of the file you modified, not a list of changed lines.

In some monorepos, you can also check out individual files without checking out the entire repository. In fact, if the repository gets very large, this becomes a very important productivity feature. Otherwise you could be forced to check out gigabytes of code that have nothing to do with what you’re working on.

Also, in some monorepos (Google’s in particular) the smallest unit of dependency is a file. That means that the build system can be aware that one file depends upon another file. It can’t be aware that one function depends on another function, or that one class depends on another class. This means that when you build, you only have to build the specific files that you need, transitively across all of your dependencies. (It should be noted that in Google’s monorepo, sometimes you can only depend upon a group of files or an entire directory, and sometimes that makes more sense.)
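A minimal sketch of file-level dependencies, in Python: given a map from each file to the files it depends on, the build only needs the transitive closure of the target. The DEPS table and the files_needed function are hypothetical, and this is not how Google’s build system (or any particular build system) is actually implemented.

```python
# Hypothetical dependency map: each file lists the files it depends on directly.
DEPS = {
    "app/main.py": ["library/api.py", "app/config.py"],
    "app/config.py": [],
    "library/api.py": ["library/util.py"],
    "library/util.py": [],
    "other/tool.py": ["library/util.py"],
}


def files_needed(target, deps=DEPS):
    """Return the target plus every file it depends on, transitively."""
    needed = set()
    stack = [target]
    while stack:
        current = stack.pop()
        if current in needed:
            continue  # already visited; also protects against cycles
        needed.add(current)
        stack.extend(deps.get(current, []))
    return needed
```

Building app/main.py here touches four files and never looks at other/tool.py, even though both live in the same repository.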

None of this requires having a single repository, at all.

No Concept of a Project

Since everything is in the same repository, there’s no inherent concept that a collection of different directories could all represent a single “project.” The build system probably knows that some directories are compiled together to produce a particular artifact, but there’s no universal way to see that just by looking at the directory structure. Any level of the directory hierarchy could have any significance. There could be a top-level directory in the repository that is a whole project. There could be a directory three levels down that’s a project, like /code/team/project. There are no inherent rules, except that top-level directories are usually mandated to be very broad categories that contain many projects in their tree.

In contrast, a multi-repo system could say that each repository is a project, which would give you a more concrete artifact to represent a project. However, there’s also nothing really enforcing this in a multi-repo system either. There could be four projects in one repo and two projects in another.

In reality, most of this ends up being defined by your build system’s configuration files, not by your source code repository.

The One Version Rule

Often, a monorepo will mandate that only one version of any given piece of software can exist in the repository at the same time. If you check in a library, you may only check in one version of that library in the entire repository. Since you have a monorepo, that ends up meaning that only one version of that library may exist at the company at any given time. This is the way (mostly) that Google’s monorepo works.

This is done for multiple reasons.

First off, it makes it much easier to reason about the behavior of your system. You understand which version of your dependencies you’re going to get, always. You don’t have to inspect your transitive dependency tree every time you check out a piece of code to understand what you’re actually getting, because you’re getting the version of that dependency that exists in the repository when you check out.

But perhaps the most important reason this is done is that most programming languages mandate having only one version of any particular dependency exist in a final program. Otherwise, they end up having weird behavior at runtime when you include multiple versions of the same thing. For example, in Java, it’s essentially random (from the viewpoint of the programmer) which version of a dependency will get used, if you include both in your binary. Including multiple versions in a program can lead to some very complex and difficult-to-debug errors at runtime.

This problem can be solved, and many dependency-resolution systems in modern languages or frameworks do solve this. Some systems allow for multiple versions of a dependency to exist, and for calling code to actually “know” which version they expect to be calling. Other systems will “force upgrade” all versions of a dependency to be the most recent one, or “force downgrade” all versions to be the oldest one.
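The “force upgrade” and “force downgrade” strategies can be sketched in a few lines of Python. This is a deliberately simplified illustration: the resolve_versions function, the library names, and the tuple version format are assumptions, and real resolvers also have to check whether the chosen version is compatible with everything that asked for a different one.

```python
def resolve_versions(requested, strategy="newest"):
    """Pick exactly one version per dependency.

    requested maps a dependency name to every version asked for anywhere in
    the transitive dependency tree. Versions are tuples such as (1, 4, 1),
    which compare element by element.
    """
    if strategy == "newest":
        pick = max  # "force upgrade"
    elif strategy == "oldest":
        pick = min  # "force downgrade"
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {name: pick(versions) for name, versions in requested.items()}


# Hypothetical example: two parts of the tree ask for different json-lib versions.
requested = {
    "json-lib": [(1, 2, 0), (1, 4, 1)],
    "http-lib": [(2, 0, 0)],
}
upgraded = resolve_versions(requested, "newest")
downgraded = resolve_versions(requested, "oldest")
```

Either way, the final program ends up with a single version of each dependency, which is the property that matters at runtime.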

However, all of that only works if your system has the concept of projects and versions of those projects, which most monorepos don’t have.

This rule has some pretty significant downsides. If you own a piece of code that a lot of people depend on, it can be very difficult to upgrade that piece of code, because any change you make will break somebody. You can’t fork your codebase, move everybody who depends on you incrementally to the new version, and then delete the old version. Instead, when you make a breaking change you have to either:

(a) commit to every project that depends on you, all at once,
(b) do a dance where you create a new function with no callers, commit that, then move your callers to use the new function over lots of commits, then delete the old function, or
(c) decide never to make breaking changes even though you’re an internal library.

Honestly, option (b) above is not that bad. It’s actually kind of a good software practice, but it can be a lot of work for a library maintainer, sometimes so much work that maintainers opt for (c) by default and let their systems stagnate more and more over time.
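Here is roughly what the dance in option (b) looks like, using Python and made-up function names (greet, greet_with_punctuation, and app_main are all hypothetical). The snippet shows the code as it stands in the middle of the migration, with comments marking which commit each piece belongs to.

```python
# Commit 1 (Library): add the new function next to the old one.
# Nothing calls it yet, so nothing can break.
def greet(name):
    """Old signature. Existing callers still use this."""
    return "Hello, " + name


def greet_with_punctuation(name, punctuation):
    """New signature. Callers move here one commit at a time."""
    return "Hello, " + name + punctuation


# Commits 2..N (one per consumer): switch each caller to the new function.
def app_main():
    return greet_with_punctuation("world", "!")


# Final commit (Library): once no callers of greet() remain, delete it.
```

Every intermediate commit leaves the repository in a working state, which is why this approach is safe, and also why it takes so many commits.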

Where this really becomes a problem is third-party libraries. If all code must live in your repository, that means you have to check third-party libraries into your repository. And there can be only one version of them, for everybody in the company at once. But you’re not the maintainer of those libraries, and you can’t realistically do the function dance of option (b) above.

Plus, the outside world is not a monorepo. Libraries out there depend on specific versions of other libraries. Let’s say you check in Library A that causes you to have to check in Library B, C, and D as dependencies. But then somebody wants to check in Library X that requires a newer version of C. But that requires them to now have to upgrade Library A. But the upgrade to Library A breaks all of the people who depend on Library A, so now the person who just wants to check in a single library so that they can use it has to upgrade everybody who depends on Library A.

This gets even worse when you have a very-broadly-used third-party library inside of the repository. Often, they get “stuck” at a particular version and never get upgraded, because upgrading them is just so hard. Instead, people start bringing in selective patches to the library that they know won’t break it. Or they start making their own fixes to it and diverging from upstream, making it difficult or impossible to upgrade to the external version later.

One other thing about the one-version rule is that systems in production in a complex multi-service environment were all built at different versions, so the reality is that you’re actually always experiencing multiple versions of things in production. The one-version rule provides a polite fiction that makes life easier at development time for most situations, but it can also make you forget that it’s not actually true when you have multiple programs interacting with each other.

It’s worth noting that this rule doesn’t really require a monorepo. You could allow only one version of a dependency to exist across all of your repositories. Then you just have to mandate that all repositories across your company always build at head and only consume each other’s code at head, and you would have essentially the same effect. I’m not recommending that you do so, just pointing out that you could. Whether you do it is up to you.

Making Library Maintainers Solve the Problems They Cause

In a monorepo world, if you own a library, you can break the builds of every project that depends on you by checking in something incompatible with those projects. This is especially true in a one-version world, where library owners must check in to the single version of the library that everybody depends on. This means that library maintainers can’t just force their consumers to do all the work of upgrading to a new version of the library. The library maintainers have to dig in and do the work themselves. If they think that making a breaking change is worthwhile, they have to bear the cost for the business. Otherwise, library maintainers could create a lot of unplanned work for their consumers without talking to their consumers. (Sometimes those consumers represent projects that don’t even have developers on them anymore, but are still important to the business, so there’s nobody even there to do upgrade work.)

This is mostly a matter of company policy, but it’s much easier to do in a world where you can actually enforce it, and where there is some system that causes pain for the library developers when they cause pain to others. For example, having a lot of teams complain that their builds are broken can be that pain. In some monorepos, you can actually prevent the library maintainers from checking in their change at all, because the test system runs the tests of all their consumers and stops breaking changes from going in.
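The “run the tests of all their consumers” check could be sketched like this in Python. Everything here is hypothetical: the DEPS table, the project names, and the run_tests callback stand in for whatever your build and test systems actually provide.

```python
# Hypothetical dependency map: each project lists what it depends on directly.
DEPS = {
    "app": ["library"],
    "batch-job": ["library"],
    "dashboard": ["app"],
    "unrelated": [],
}


def consumers_of(target, deps):
    """Every project that depends on target, directly or transitively."""
    reverse = {}
    for project, project_deps in deps.items():
        for dep in project_deps:
            reverse.setdefault(dep, set()).add(project)

    affected = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for consumer in reverse.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                stack.append(consumer)
    return affected


def may_submit(changed, deps, run_tests):
    """Allow the change only if the changed project and all consumers pass.

    run_tests is a callback that takes a project name and returns True when
    that project's tests pass with the proposed change applied.
    """
    to_test = consumers_of(changed, deps) | {changed}
    failures = sorted(project for project in to_test if not run_tests(project))
    return (not failures, failures)
```

Here, a change to library that breaks batch-job would be rejected with ["batch-job"] as the list of failures, which puts the cost of the breakage back on the person making the change.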

This enforcement doesn’t exactly require a single source repository. There are various ways to accomplish this, or parts of it, in a multi-repository system.

Summary

So you can see that a “monorepo” is actually a lot more than having just one source code repository where you put all your stuff. Some people have grouped all of these things together, because the above is basically a description of the Google monorepo, and most people seem to be thinking of that system when they talk about “a monorepo.” But it’s important to separate out these concepts, because a lot of them can be implemented in the systems you have today. Plus, maybe not all of these things are actually good, and perhaps you should be intentional about which ones of them you are trying to adopt at your business.

-Max
