How to Build a Great Platform

In the field of software engineering, we often talk about using or building a “platform.” Let’s talk about what that really means and the principles for how to build a great platform.

What is a Platform?

Sometimes people think of a platform as being “a thing a lot of people use” or “a system built by one team that is depended on by a lot of other teams.” I have a more nuanced definition that I think helps us differentiate between a product and a platform:

A platform is a system that has multiple independent customers, where the customers of that system can use the system and modify its behavior without requiring coordination with other customers.

Often, we think of a platform as a system that “runs” other people’s systems. A good example of that type of platform is Amazon Web Services. Amazon provides a way for you to run your own systems (including various pieces you would need to run those systems, like storage, routing, etc.) “on top” of AWS. When you use AWS, you don’t have to ask every other AWS customer things like, “Is it okay if I deploy right now?” or “Can I get five hosts for running my service?” It’s very hard for what you are doing to negatively affect other customers, and it’s very hard for what other customers are doing to negatively affect you.

There are other types of platforms, though. For example, I once helped develop a platform that allowed people to display their own metrics in a common dashboard. They wrote the code that generated the metric, ran that code somewhere else, and our system just displayed the metric in a standardized fashion on a company-wide dashboard. Nothing was really “running” on our platform, but customers still could use the system and modify its behavior for their own needs without requiring coordination with other customers.

Broad and Narrow Platforms

Theoretically, a website like YouTube is a platform in the most limited sense. You upload a video, and that “modifies the behavior of the system” so that it now shows your video. You don’t have to check with every other person uploading a video on YouTube if it’s okay for you to do it right now. And you have some power to configure how that video displays and how your YouTube channel works.

So from this, we can see that platforms have different levels of freedom that they allow. Something like AWS would be a very broad platform—you can run almost any type of software on it. YouTube, on the other hand, is a very narrow platform—you have only limited freedom to modify how the system works.

Getting this trade-off right for your platform is one of the most important things you will have to do as a platform owner. How many “degrees of freedom” you choose to allow your customers will determine how difficult it is to design, build, and maintain your platform successfully.

Automated vs Manual Platforms

There is another property that the best platforms have:

The best platforms do not require manual intervention from the platform owner in order for the customer to use them successfully.

Imagine that you want to release new versions of your system five times a day to your customers. What if you had to email a person at Amazon and ask them to manually deploy your code every time you wanted to do that? Not only would that be a terrible experience for you, it would dramatically harm the success of AWS as a product. Amazon’s customers would rapidly look for a different platform to use.

There are different levels of manual intervention that a platform can require. Two of the most common are:

Requiring manual intervention from the platform owner in order to onboard to the platform.
Requiring manual intervention from the platform owner in order to execute a common configuration change or fix a common support issue.

These have two problems:

You make your customers wait until a human technician is available to support them or onboard them. Often that’s not acceptable for their timelines and so they will search for other solutions. At the very least, it slows your business down significantly.
It drowns the platform owner in manual work and they become less and less able to actually work on the platform over time. You can even get into a vicious cycle where there’s so much manual work that you no longer have engineers available to develop the automation necessary to eliminate that manual work.

And of course, manual work can also be more prone to error and so you open up your system to more malfunctions. That is usually a minor issue, in practice, but it does happen.

Overall, if you have a platform that requires manual intervention, it’s still a platform, but it’s not ideal, and you’re at risk of it degrading over time as your engineers spend more and more of their time doing manual work and less and less of their time improving the platform itself.

Building a Platform

Okay, now we know what a platform is. How do you make one? What I always tell teams is:

You start with a product and then you turn that into a platform over time.

A “product” here would be a system that helps people accomplish some goal, but where the users don’t control the fundamental behavior of the system. Most software that we interact with is a product, at its core: Gmail, Microsoft Word, Zoom, Photoshop, etc.

With a product, you, the developer of the product, are focused on creating a great experience for your users. You’re focused on helping them accomplish a known, specific goal, and you are curating the entire experience they have of using your product. Microsoft has put incredible resources into optimizing the user interface of Office, as an example, so that people writing documents or creating spreadsheets can focus simply on the task they need to do and have the product “get out of their way” as much as possible.

Above we talked about how YouTube is a very narrow platform, for video creators. But for video viewers, YouTube is entirely a product: you show up, it recommends videos for you to watch, you watch them, done. I worked at YouTube, and I can tell you, the company cares deeply about curating that experience for you.

If you intend to set out to build a platform, you should first figure out: what’s the core value that I intend to deliver to my customers? What are the parts of that that need to be a carefully curated experience in order for my system to be successful?

For example, let’s imagine you wanted to build a deployment platform for your company: a system that takes code and ships it to production. The core value that system delivers is this: I no longer have to do manual work to successfully and safely push my code into production. So the first thing you should build is not a platform, but a product that provides that carefully curated experience.

What would that look like in actual practice? Well, you would find a team that’s experiencing some pain in terms of manual deployments. You don’t want them to be too much of an edge case in the company (like, if there’s one team of three mobile developers and the rest of the company writes Java backend services, don’t start with building something for the three mobile developers). You also don’t want them to be too important to the business—never start with the largest customer first. You’re going to be building a new product and you need some leeway to experiment, learn, and make mistakes. If you try building this product for some team that the whole company’s revenue depends on, you won’t have that space to experiment—instead, you’ll get pushed on hard for deadlines that cause you to cut corners, and be forced to build features that you shouldn’t support or create this early on in the lifecycle of your product.

So basically, you’re looking for a customer that is representative of how a lot of the company works, who’s willing and able to take some risks, and who’s able to work with you closely to provide feedback on what you’re building.

You then do your best to build a great deployment system for just that customer. At this point, it doesn’t matter if all the work to modify or reconfigure the system is manual. In fact, it probably should be, because what’s going to happen is you’re going to build the wrong thing, give it to your customer, discover it’s the wrong thing, and re-work it until it’s the right thing. You will never get the product right on the first try. So it’s not worth it to try to automate everything up front, because you’ll have to throw out that automation and rewrite it again. The deployments themselves happen automatically for the customer, but if they need to change how those deployments work, it’s fine if that requires you, the product owner, to go make manual changes to code.

The one thing that’s really important to keep in mind while you’re building this product, though, is don’t back yourself into a corner where you can only support this one customer, ever. It should be possible to modify the system safely to take another customer in the future. The primary way you do this is by following the laws of software design. Don’t make things generic before they need to be—just keep in mind that you will have to onboard other customers in the future, and don’t lock yourself out of being able to do that. Really, the best way to do that is keep the system simple so that you can easily modify it in the future.

Getting this product right and polishing it is supremely important, especially during this first customer engagement. You want to work closely with this customer until you are sure that you have created a great experience for this customer. Then you go out and you find some other customers that are representative in different ways—people who have different requirements from your original customer—and go through this same process with them, still building a product.

You will know that you can expand beyond the first customer when you stop learning significant new things on a regular basis. That’s true for each of your other pilot customers, too—you can expand beyond that set of users when you stop learning significant new things from them that cause significant changes in your product. You don’t have to be completely “done” with each customer before expanding, you just have to be confident the product is successful for that customer and that you aren’t discovering new requirements very often anymore.

All of this will form the core foundation of your future platform. Most platforms that are clunky, difficult, or failing did not spend enough time polishing their core product before they “opened the floodgates” and became a platform.

Transitioning to a Platform

If you have done a great job of building a product, people will start knocking on your door and demanding that they be allowed to use your product, too. You might have to do some light marketing of the product, but I usually find that for platforms that are internal in a business, if you’ve built a great product, word of mouth will spread and people will just start demanding that they be allowed to use it too.

The first people you want to onboard are those whose requirements are nearly identical to your initial customers. Essentially, at first you want to stay a product, and you want to see what it’s like when you have to scale out that product to many customers. You will learn a lot about your product and improve it dramatically at this stage. You may even start to automate onboarding and provide some configuration settings for very common configuration requests that were requiring manual work.

However, eventually you will reach a point where you feel like the “barbarians are at the gates.” You will see a few things happen that indicate that you’re at this stage:

You start to have more and more insistent demands to onboard from your customers. They may start complaining that you’re actually harming them by not letting them onboard.
You will start to get more and more feature requests that seem questionable to implement in the core product. They address only small edge cases, but if you implemented them in the core product, they would be disruptive to all your customers.

This is the point at which you build a platform. Note:

It is very important that you work very hard and very fast during the “product” stage to polish the product before you get to “the barbarians are at the gates” stage. Otherwise you will be forced to build a platform before you are ready, and your platform will suffer for the rest of its lifetime, as will the experience of all your users.

What Do We “Platform-ize”?

One of the key decisions you will have to make over and over as a platform owner is: what do we implement in the core product, and what do we allow our customers to develop or configure on their own as part of our platform?

This should always be driven by the requirements of your customers, but be wary of customers who simply demand that they be allowed to have total control and configure everything. If you really let them configure everything, you would be totally defeating the value of your platform—you’re just making them write their own software in a new and different way. So how much leeway do you give to your customers?

Well, there are a few principles that drive this.

Non-Interference

Our first principle is:

One customer must not be able to interfere with another customer.

In our deployment platform example, let’s imagine that a customer came to us and said, “I want to be able to pause all other deployments at the company other than mine, whenever I want to.” Under almost all circumstances, you would say no to that. The only people who should be able to affect all customers (or even more than one customer at a time) should be the platform owners. What you would do with a requirement like that, if it was really a legitimate requirement (I personally would dig into that one more and ask a lot of “what problem are you trying to solve?” questions) is you would say, “If you need to do that, please let us know and we will do it on your behalf.” That lets you maintain control and puts responsibility in the right place—you know what’s going on with your system the best, and so if anybody is going to do something that affects the whole system, it should be you.

However, if you think about this more deeply, you’ll realize that this principle of non-interference also means you need to implement guardrails. For example, if you’re AWS, it should not be possible for one customer to request so many resources in a data center that suddenly no other customer can run their critical workloads. If you’re a deployment system, it should not be possible for one customer to put so much load on your system that it breaks other people’s deployments. Building these guardrails is often most of the work of building a platform. You have to think through “how could one customer break another customer” and design the system so that that isn’t possible.
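To make the idea of a guardrail a little more concrete, here is a minimal sketch of a per-customer quota on a shared pool of hosts. Everything in it (the HostPool class, the QuotaExceeded exception, the limits) is hypothetical and only for illustration. A real platform would need persistence, concurrency control, and much more.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a request would exceed a customer's cap or the pool size."""


@dataclass
class HostPool:
    """A shared pool of hosts with a hard per-customer cap (hypothetical)."""

    total_hosts: int
    per_customer_limit: int
    allocated: dict[str, int] = field(default_factory=dict)

    def request_hosts(self, customer: str, count: int) -> int:
        """Grant `count` hosts and return the customer's new total."""
        if count <= 0:
            raise ValueError("count must be positive")
        current = self.allocated.get(customer, 0)
        # Guardrail: check the customer's own cap before shared capacity.
        if current + count > self.per_customer_limit:
            raise QuotaExceeded(
                f"{customer} is limited to {self.per_customer_limit} hosts"
            )
        if sum(self.allocated.values()) + count > self.total_hosts:
            raise QuotaExceeded("the pool is at capacity")
        self.allocated[customer] = current + count
        return self.allocated[customer]


pool = HostPool(total_hosts=100, per_customer_limit=20)
pool.request_hosts("team-a", 20)      # team-a is now at its cap
try:
    pool.request_hosts("team-a", 1)   # refused: over team-a's own limit
except QuotaExceeded as err:
    print(err)
pool.request_hosts("team-b", 5)       # still succeeds for another customer
```

The point of the sketch is that the check lives in the platform, not in an agreement between customers. One team hitting its limit never requires another team to do anything.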

Reasoning About the System

One of the most common ways to kill a platform is to make it difficult or impossible for the platform owners to evolve the platform over time.

Whenever you provide some freedom to your customers, ask yourself this question: “What happens if we change our mind about how this feature works, or how the core product works?” Would you be able to make that change in a safe, automated fashion just by yourself as the platform team, or would you have to go ask all your customers to do manual work? Even more importantly, how would you even know if a change is safe to make in the future? Would you have to go around and talk to every customer and ask them “how are you using the platform?” and “is this change I want to make safe for you?” Or is there some way to do all analysis of your customers in a programmatic way that gives you total confidence about how your customers are using your product and what changes are safe to make?

In essence:

You must continue to be able to reason about how your system behaves, no matter what customers do.

It needs to be easy to make logical statements like, “If we change the tool that pushes code onto a server, for everybody, I know that’s safe without having to ask our customers, because ______________.”

You will violate this principle if you allow too much freedom to your customers. For example, let’s imagine that you are designing a platform that takes your customer’s code, builds it into a binary, and runs its tests (a continuous integration system, basically). What if you let every customer write completely different build scripts that can do anything they want? You’ve stopped being a continuous integration system and become a totally generic task orchestrator. You can no longer provide any value to your customers that would be specific to continuous integration. You can’t even reason about tradeoffs between security and value (like “should we let these scripts access the Internet?”) anymore. It even “infects” the testing part of your system, because the testing part of the system has no idea what it’s getting—it could be any output of any script. So then the testing system also becomes a generic task orchestrator. If that’s all you wanted, you could have just used one of the many open-source task orchestrators and provided very little value to your customers right from the start.

It’s also extremely difficult to get out of that situation once you’re in it. Imagine that you want to move from that state into having a more restricted, more standardized build system. You have hundreds or thousands of different build scripts all across the company, and now you have to either go manually look at all of them yourself or ask all of your customers to go “fix” their scripts (good luck with that). This brings up another thing you learn after working on platforms like this for a long time:

It is much harder to go from freedom to restriction than from restriction to freedom.

Essentially, you never want to give people something and then have to take it away. You always want to be “giving” them things that you will never have to take back. And when I say the “freedom to restriction” path is much harder, I mean orders of magnitude more effort—sometimes to the point that it’s impossible and you just have to abandon all hope of ever having control of that part of your platform ever again. (This is what causes platform owners to create backwards-incompatible “Version 2.0” editions of their platforms where they abandon all their existing customers, but boy oh boy, does that have its own whole new set of terrible problems!)

Overall, you must never allow so much freedom to your customers that you can no longer reason about the behavior of the system, and thus can no longer evolve it or enhance it usefully.
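As a rough illustration of what “being able to reason about the system” can look like in the continuous integration example, here is a sketch where customers declare a small, constrained build configuration instead of writing arbitrary scripts. The BuildConfig type, the set of supported tools, and the helper functions are all assumptions I made up for this example, not a description of any real CI system.

```python
from dataclasses import dataclass

# Hypothetical closed set of build tools the platform agrees to support.
SUPPORTED_BUILD_TOOLS = {"maven", "gradle", "bazel"}


@dataclass(frozen=True)
class BuildConfig:
    """What a customer may declare; there is no field for arbitrary scripts."""

    customer: str
    build_tool: str
    needs_internet: bool = False

    def __post_init__(self) -> None:
        if self.build_tool not in SUPPORTED_BUILD_TOOLS:
            raise ValueError(f"unsupported build tool: {self.build_tool}")


def affected_by_tool_change(configs: list[BuildConfig], tool: str) -> list[str]:
    """Return the customers a change to `tool` would touch, sorted by name."""
    return sorted(c.customer for c in configs if c.build_tool == tool)


def safe_to_block_internet(configs: list[BuildConfig]) -> bool:
    """True when no customer has declared that their build needs the Internet."""
    return not any(c.needs_internet for c in configs)


configs = [
    BuildConfig("payments", "gradle"),
    BuildConfig("search", "bazel", needs_internet=True),
    BuildConfig("billing", "gradle"),
]
print(affected_by_tool_change(configs, "gradle"))  # ['billing', 'payments']
print(safe_to_block_internet(configs))             # False
```

Because every customer’s choices are data in a known shape, the platform team can answer “who does this change affect?” with a query instead of a round of meetings. With free-form scripts, neither function above could be written.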

Escape Hatches

You may have heard of the “80/20 rule” for platforms. This means the core product offering should handle 80% of the use cases successfully in a way that’s really simple and great for your customers. Then the 20% of your customers with really unique requirements get more power in the platform and are able to service their more complex needs, even though that means they have a more complex user experience.

Here’s what really ends up happening a lot of the time: that 20% of users have a lot of power. Often, the largest and most important customer systems are in that 20% of users. They show up and demand total freedom, and you get into a situation where you can no longer reason about or maintain your platform.

All platforms need to keep this principle in mind:

There must be some way for customers to fulfill their valid requirements.

Your customers have real needs and they need to be able to execute them. They don’t live in a magical walled garden that you may have designed in your mind for your platform. There will always be customers whose requirements are way beyond what you ever intended to support in your system.

However, and I cannot stress this enough: the way those customers fulfill their requirements does not have to be your platform. There just needs to be some way that they can fulfill them.

All platforms need “escape hatches” that allow a limited set of customers to do anything they need to do. However, the burden of maintenance and support for that must go on those customers. As much as possible, the platform owners must be isolated from the cost of supporting those customers.

Usually, the way that you accomplish that is that you have essentially two layers:

A set of tools that let people do anything they need to do. For developer platforms, these are often command line tools with complex interfaces and immense power.
A “platform” built on top of those tools that creates a great user experience but severely limits the power of the system compared to the tools. This solves 80% or more of the common use cases, but requires customers to conform to a specific, standardized way of working in order to get the benefits. We often call this the “paved path” or “golden path.” It’s a little like a freeway—it doesn’t go everywhere, but it goes to the places that most people want to go in a fashion that’s much more streamlined than taking the back roads.

The vast majority of your users should be using the platform. For those who aren’t, you want some place where you record which customers have taken the “escape hatch” and been allowed to use the underlying tools due to their complex requirements. There will be a lot of people who will want to know who those teams are in the future, like your security team (“I need a list of everybody who uses the escape hatch, because there’s a vulnerability that we fixed centrally for everybody who uses the platform, but all the escape hatch users need to fix it on their own.”).
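The record of escape hatch users does not need to be fancy. Here is a minimal sketch of one, assuming an in-memory registry keyed by team name. The EscapeHatchRegistry class and its fields are hypothetical. In practice this might just as well be a database table or a checked-in file.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EscapeHatchEntry:
    """One team that was allowed to use the underlying tools directly."""

    team: str
    reason: str
    approved_on: date


class EscapeHatchRegistry:
    """Records which teams have taken the escape hatch, and why (hypothetical)."""

    def __init__(self) -> None:
        self._entries: dict[str, EscapeHatchEntry] = {}

    def record(self, team: str, reason: str, approved_on: date) -> None:
        # Require a reason so that future readers know why the exception exists.
        if not reason.strip():
            raise ValueError("a reason is required")
        self._entries[team] = EscapeHatchEntry(team, reason, approved_on)

    def uses_escape_hatch(self, team: str) -> bool:
        return team in self._entries

    def teams(self) -> list[str]:
        """The list a security team would ask for, sorted by team name."""
        return sorted(self._entries)


registry = EscapeHatchRegistry()
registry.record("ml-training", "needs custom GPU scheduling", date(2024, 3, 1))
registry.record("legacy-billing", "mainframe deploy steps", date(2024, 5, 12))
print(registry.teams())  # ['legacy-billing', 'ml-training']
```

Storing the reason alongside the team name is a deliberate choice in this sketch. It lets you notice later when several teams left the platform for the same reason, which is a hint about what the platform is missing.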

If you don’t have these escape hatches, you can’t say “no” to any customer. And you must be able to say “no” to functionality that shouldn’t be in the platform. Otherwise, your platform will degrade over time—sometimes even to the point that it would be better if you had never had the platform in the first place. When you have to be “everything to everyone,” your product can become so complex that it actually slows customers down rather than speeds them up.

One thing to watch out for, though, is when a lot of your users start using the escape hatch instead of the platform. This indicates a few different possible situations that could be going on:

A lot of teams have legitimate requirements that you haven’t implemented successfully in the platform.
There are no incentives in the business that cause teams to get onto the platform and so they “take the shortcut” that is less immediate effort on their part (using the escape hatch) even though it will be more expensive for them in the long run.

Almost always it’s an issue with your platform. Never be too confident that you’ve gotten everything right and there’s nothing left to learn. Even when it seems like the issue is incentives (people feeling they won’t get rewarded individually for doing the work to migrate to your platform, leaders feeling like the work isn’t important, etc.) often the real issue is something like: it’s much too hard to onboard to the platform.

People make decisions based on the data they have and the purpose they are trying to accomplish. If the data they have about your platform is “it will be hard to accomplish my purpose if I use that platform,” you’re going to have a hard time getting people to adopt it.

What is a Valid Requirement?

Often above I talk about “valid requirements” or “legitimate requirements.” What are those?

A valid requirement is something that would be best for the company as a whole, and which comes from a real problem the customer has.

As a platform owner, I would also generally say that there is a timeline aspect here: requirements are more valid when they are the right thing in the long term, not just in the short term. Once in a while there is a really compelling reason to compromise for the short term, but if you do, think about how you could eventually build toward the right long-term solution.

This is true even when you’re building your first product for a single customer. I don’t mean make guesses about the future. Don’t worry about “am I implementing this feature of my product in the best way for all other possible future customers?” That road leads to bloated, terrible products. Instead, just ask, “Is what I’m doing for this customer going to deliver the best outcome for the business?”

Later on, when you’re building a platform, it’s still the same question: “is what this customer wants to do the right thing for the company as a whole?” For example, imagine you are a company that has written everything in Java. A customer comes to you and says, “I intend to build this system in Python and so I need you to support Python systems.” Well, that could be legitimate. Let’s find out more: why do they want to use Python? “Oh, I just like it more.” So this customer wants to break the company’s standards based on a personal preference. That’s not a legitimate requirement.

On the other hand, let’s say that same customer had a different answer for that same question: “We are building a machine learning system and Python is the standard language for machine learning systems across the industry. The tooling we have available to us in Python makes a huge difference for the success of this product.” That sounds like a very legitimate requirement. Also, we just learned a lot more—the customer wants to deploy a machine learning system, which actually is going to have a whole additional set of requirements.

As a platform owner, you must say no to implementing requirements that are not the right thing for the company as a whole. Not only do those requirements cause trouble for the company, they also tend to degrade the quality of your platform over time. They add “cruft” that most customers don’t really want in order to provide functionality that the company doesn’t really need.

Invalid Requirements

It is very surprising how many platform requirements are not actually legitimate when you look into them. A common one is “it would be slightly more convenient for us if you did a huge amount of work to save us a little bit of work.” Sometimes customers don’t even look at how much work it would be to migrate their systems onto your platform. They are busy and they don’t even want to think about doing more work. It could be a few hours’ work for them to fix their system to be more standardized, but they feel overwhelmed and haven’t even developed that estimate.

If you had 1000 customers who would all be saved one hour of work if you did 20 hours of work, that sounds like a great tradeoff that you absolutely should do! But if you would be doing 20 hours of work to save 10 hours of work for one customer, that’s obviously not the right decision for the company as a whole, so don’t do it.
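The arithmetic behind those two cases is simple enough to write down. This is only a sketch of the comparison using the numbers from the paragraph above, and the function name is my own. A real decision would also weigh ongoing maintenance cost and other factors that a single subtraction ignores.

```python
def net_hours_saved(customers: int, hours_saved_each: float, platform_hours: float) -> float:
    """Total hours saved across all customers, minus the hours the platform team spends."""
    return customers * hours_saved_each - platform_hours


# 1000 customers each save 1 hour; the platform team spends 20 hours.
print(net_hours_saved(1000, 1, 20))  # 980

# 1 customer saves 10 hours; the platform team spends 20 hours.
print(net_hours_saved(1, 10, 20))    # -10
```

A positive result means the company as a whole comes out ahead. A negative result means the platform team would spend more time than the customers would save.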

Another common issue is imaginary requirements. Customers sometimes believe they have requirements that they don’t actually have. A customer comes and says, “This must scale to one million queries per second.” You ask them why, and they say, “because our Product Manager said so.” You ask the Product Manager and they say, “I just want to be sure the system can do way more than what we need, in case we expand a little faster than what we expected.” When you make people do the real math, you find out that even if the product exceeds everyone’s wildest dreams in terms of usage, the system will actually be handling one thousand queries per second.

So you need to make sure your customer’s requirements are grounded in the reality of the actual problems they need to solve. Be polite about it; don’t go saying that they are imagining things. Just ask for clarification so that you understand the requirement as fully as you can. Often, even if it’s a legitimate requirement, you’ll discover a lot more about it by doing this. For example, if that was a legitimate requirement to serve one million queries per second, you’d learn a lot about things like: what latency do we have to serve those queries at? How many different regions do we have to host this system in? Does traffic peak at certain times of day? And so forth.

One of the most dangerous types of requirements to accept is “please build this solution for us.” Like, imagine that you have built a security platform that scans code using different tools to find vulnerabilities, but the outcome you’re going for is successfully finding vulnerabilities, not just “running tools.” A customer comes to you and says, “Please start running this new vendor tool called QuxBugs.” Your response should always be, “Oh, thanks for your request. Can you tell me more about the problem you’re trying to solve?” Never accept solutions from your platform customers, only accept descriptions of the problems they are trying to solve. Once in a rare while, the solution they proposed will be the right solution, but most of the time it’s not. And unfortunately, even if you deliver that solution to them, they will end up hating it eventually because it’s the wrong solution.

Most engineers are not experts at specifying their requirements. You will have to help them. Usually, you know when you have a valid requirement because the solution becomes obvious (at least, obvious to you, the platform owner). Even when the solution doesn’t become fully obvious, having the right requirement opens the door to figuring out the right solution, and helps clarify the answer to any questions that come up when you’re developing the solution. In general, if you aren’t sure how you should implement something, there’s probably something about the problem you don’t fully understand.

The Challenges of Platforms

The above lays out what I know about the basic principles of how to design and develop great platforms that users love. The challenges in doing these things are mostly organizational and social challenges: changing people’s minds, getting leaders on board with “doing it the right way,” getting the funding and time required to develop the product and eventually the platform, etc. All of those are very real problems, but at the very least, if you understand and apply the principles and processes above, you’ll know how you should get to your destination. I hope that you do, and get to experience the joy of implementing, operating, and maintaining a truly great platform.