Recently I’ve been making a point to engineers that I think would be valuable to share more widely.
When you do engineering work, you get handed different types of tasks. Some are emergencies or short-term work. We sometimes call this “putting out fires,” especially when the work involves handling something that is urgently broken or needed immediately.
Other tasks are strategic in nature. You have collected information about what your users need or want, you’ve designed a solution, and you’re working toward it methodically and intelligently.
It is important to understand when you are doing which type of work, and to think about them differently.
When you’re putting out a fire, the goal is to put out the fire. You want to do the minimal work required to put it out, so that you can get back to your long-term strategic work. You don’t want to build huge, complex systems that will live forever just to put out a fire. Emergencies are the time for “quick and dirty” work. That doesn’t mean you should do bad work, but you shouldn’t build a long-term, high-maintenance system around putting out a fire.
There are different types of fires. Sometimes an executive or another team comes to you with an immediate demand: something that must get done in the next few weeks or so. What you want to do is figure out how to get that task done and out of the way so that you can get back to your long-term strategic work.
Other times, you have an actual emergency, like an outage. In that case it’s clearer that you should just fix the outage and not run around doing a bunch of other stuff. An outage is not the time to say, “Well, let’s wait to write a design document and review it next week with our senior engineers.” The same is true of any fire, though: a fire is not the time to apply the methods and systems of long-term software design.
Let’s look at a more concrete example. Say an executive comes to you and says, “We have a customer who wants to give us a million dollars next week, but before they do, we have to produce a graph that shows how our servers stand up under high load.” And let’s say you don’t even have a system for recording the load on your servers.
If you were thinking about this in a long-term, strategic manner, you might say, “Ah, well, we should have a system that tracks the load on our servers. We have to work out in detail how the storage for this would work, how we can be sure that it’s accurate, how we monitor it, and how we test it. Then we should work with a user experience designer to make sure that the graphs it produces can be understood well by its users, by conducting standard user research and working out a UI design from that.”
That’s not going to get done in a week. It’s also a waste of time. You have no idea whether this fire will ever happen again. Just because somebody has come to you with an urgent demand once doesn’t mean it will be a long-term need at all. It might seem like it will be, and you could guess that it will be, but why are you guessing about long-term strategic designs? There’s no need to guess about long-term work. When you’re doing long-term work, you have the luxury of doing research to find out what the actual user needs and requirements are. So do that, and build things based on that, not on your guesses.
Instead, you should say something like, “Okay, tomorrow I will write a very basic load test that I can run manually from a script on my machine. I will roll out a new version of the server that just writes information about its load to a log file, and then I will manually make a graph by parsing that log.” That is roughly the minimum work required to solve the problem.
Even that solution comes with a risk, though: you instrumented the server to log something related to load. Later, somebody might come along, think that you intended this to be a long-term, supported mechanism for tracking the system’s load, and rely on it being well-designed and well-thought-out when it isn’t. This highlights a very important point:
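To make that “minimum work” a bit more concrete, here is a rough sketch of the last step: turning the log into numbers you can graph by hand. It assumes a hypothetical log format of one sample per line, like “2024-05-01T12:00:00 load=0.73”. Your server would log whatever it logs, so treat the pattern, the file names, and the function names as placeholders. It deliberately stops at CSV text you can paste into a spreadsheet, because that is all this particular fire requires.

```python
import csv
import io
import re
import sys

# Hypothetical log format (an assumption for this sketch), one sample per line:
#   2024-05-01T12:00:00 load=0.73
LOAD_LINE = re.compile(r"^(\S+)\s+load=([0-9]*\.?[0-9]+)\s*$")


def parse_load_log(lines):
    """Return (timestamp, load) pairs, silently skipping lines that don't match."""
    samples = []
    for line in lines:
        match = LOAD_LINE.match(line.strip())
        if match:
            samples.append((match.group(1), float(match.group(2))))
    return samples


def to_csv(samples):
    """Render samples as CSV text that can be pasted into a spreadsheet and graphed."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["timestamp", "load"])
    writer.writerows(samples)
    return out.getvalue()


if __name__ == "__main__":
    # Usage (hypothetical file names): python parse_load.py server.log > load.csv
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as log_file:
            sys.stdout.write(to_csv(parse_load_log(log_file)))
```

There is no configuration, no storage, and no error handling beyond skipping lines it doesn’t recognize. That is the point: it’s a throwaway script for one graph, not a system anybody should come to depend on.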
Never make long-term decisions or implement long-term solutions during a fire.
In fact, you might even want to intentionally undo all the work you did during the fire, such as removing that log line, just so nobody thinks you made some long-term decision.
This rule applies not just to technical implementation details, but also to organizational changes, or really to any decision. For example, say there is an outage in progress. The middle of the outage is not the time to talk about how you will prevent it from happening in the future, or how you should change your normal, everyday processes.
The one time it is safe to make long-term decisions based on a fire is during a “postmortem,” a rational review of the situation after the fire has been “put out.” Then you can sit down and ask, “Okay, what strategic work do we want to do to prevent fires like this from happening again?” or “What did we learn from this that we could use to change how we work?”
This rule is extremely important. Violating it builds up insanities that can destroy groups. If you based all of your company’s policies and work patterns on decisions made during times of extreme emergency, it would eventually look like a totally crazy company, and it would probably fail.
At the other end of the spectrum from “putting out fires” (and it is a spectrum, not black and white) is strategic work. You have a known goal and you’re working toward it, applying all of the basic principles of software design, thinking about the long term, and working together with your group intelligently to create something sustainable.
Similarly, if you apply the methods and systems of “putting out fires” to strategic work, you will cause a disaster. If you treat every single project as though it’s an emergency and just dash it out “quick and dirty” because it “has to be done tomorrow” (even though it really doesn’t), you’ll end up with a mess. What will actually happen is that you will create fires! Your system will be so poorly designed that it will fall over, cause trouble, be hard to maintain, and eventually consume you entirely in putting out fires around this poorly designed mess.
When you apply the methods of fires to strategic work, you never actually get your strategic work done. If you see an engineering organization that just can’t seem to get things done over the long term, this is very often the reason: they have been treating everything as though the world is on fire, and so they can never actually move forward.
Strategic work requires a lot of saying, “Okay, we understand your requirements. Thanks for telling us what your problems are. We are building a solution for you, we are doing it the right way, and it will take a little bit of time. Not forever, but it will take some time to get it done.”
I think that sometimes executives worry that if they tell engineers to “take enough time,” the engineers will get lazy and never complete the work. This might be a legitimate concern in some companies, and certainly executives have an interest in keeping things moving along so that the company can deliver its products! But there has to be a balance between encouraging people to deliver on time and making sure they follow the processes and procedures of long-term software development. In general, when doing strategic work, it’s best to err on the side of a little too much design, a little too much review, etc. I’m not saying go overboard and stop building things, or put everybody through unnecessary reviews just because something “might need it.” I’m just saying that if you’re uncertain, this is the direction to err in.
As long as you apply the general principles above, it’s possible for one team (or one person) to handle both strategic work and fires simultaneously (at least, within the same week or month). The trick is doing minimal work on the fires, to make sure that emergencies are handled and the business keeps chugging along, and then focusing back on the strategic work once the fire is put out.
After all, if you’re doing it right, the strategic work should be the work that matters most to the business: the things you’ve researched and know will have the highest impact in the long run if you deliver them. So put out the fires, and get back to doing what’s actually going to be important in the long term.