Make It Never Come Back

When solving a problem in a codebase, you’re not done when the symptoms stop. You’re done when the problem has disappeared and will never come back.

It’s very easy to stop solving a problem when it no longer has any visible symptoms. You’ve fixed the bug, nobody is complaining, and there seem to be other pressing issues. So why continue to do work on it? It’s fine for now, right?

No. Remember that what we care about the most in software is the future. Software companies get into unmanageable situations with their codebases precisely because they stop handling problems before those problems are really done.

This also explains why some organizations cannot get their tangled codebase back into a good state. They see one problem in the code, they tackle it until nobody’s complaining anymore, and then they move on to tackling the next symptom they see. They don’t put a framework in place to make sure the problem is never coming back. They don’t trace the problem to its source and then make it vanish. Thus their codebase never really becomes “healthy.”

This pattern of failing to fully handle problems is very common. As a result, many developers believe it is impossible for large software projects to stay well-designed–they say, “All software will eventually have to be thrown away and re-written.”

This is not true. I have spent most of my career either designing sustainable codebases from scratch or refactoring bad codebases into good ones. No matter how bad a codebase is, you can resolve its problems. However, you have to understand software design, you need enough manpower, and you have to handle problems until they will never come back.

In general, a good guideline for how resolved a problem has to be is:

A problem is resolved to the degree that no human being will ever have to pay attention to it again.

Accomplishing this in an absolute sense is impossible–you can’t predict the entire future, and so on–but that’s more of a philosophical objection than a practical one. In most practical circumstances you can effectively resolve a problem to the degree that nobody has to pay attention to it now and there’s no immediately apparent reason they’d have to pay attention to it in the future either.

Example

Let’s say you have a web page and you write a “hit counter” for the site that tracks how many people have visited it. You discover a bug in the hit counter–it’s counting 1.5 times as many visits as it should be counting. You have a few options for how you could solve this:

1. You could ignore the problem.
The rationale here would be that your site isn’t very popular and so it doesn’t matter if your hit counter is lying. Also, it’s making your site look more successful than it is, which might help you.

The reason this is a bad solution is that there are many future scenarios in which this could again become a problem–particularly if your site becomes very successful. For example, a major news publication publishes your hit numbers–but they are false. This causes a scandal, your users lose trust in you (after all, you knew about the problem and didn’t solve it) and your site becomes unpopular again. One could easily imagine other ways this problem could come back to haunt you.

2. You could hack a quick solution.
When you display the hits, just divide them by 1.5 and the number is accurate. However, you didn’t investigate the underlying cause, which turns out to be that it counts 3x as many hits from 8:00 to 11:00 in the morning. Later your traffic pattern changes and your counter is completely wrong again. You might not even notice for a while because the hack will make it harder to debug.

3. Investigate and resolve the underlying cause.
You discover it’s counting 3x hits from 8:00 to 11:00. You discover this happens because your web server deletes many old files from the disk during that time, and that interferes with the hit counter for some reason.

At this point you have another opportunity to hack a solution–you could simply disable the deletion process or make it run less frequently. But that’s not really tracing down the underlying cause. What you want to know is, “Why does it miscount just because something else is happening on the machine?”

Investigating further, you discover that if you interrupt the program and then restart it, it will count the last visit again. The deletion process was using so many resources on the machine that it was interrupting the counter two times for every visit between 8:00 and 11:00. So it counted every visit three times during that period. But actually, the bug could have added infinite (or at least unpredictable) counts depending on the load on the machine.

You redesign the counter so that it counts reliably even when interrupted, and the problem disappears.
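The weakness of the divide-by-1.5 hack can be checked with a little arithmetic. The sketch below assumes exactly the bug described above (morning visits counted three times, everything else once). The function name and the figure of 1000 real visits are made up for illustration. An overall inflation of 1.5x corresponds to a quarter of the visits arriving in the morning, so the hack is only right for that one traffic pattern.

```python
def displayed_hits(true_morning: int, true_other: int, correction: float = 1.5) -> float:
    """Return the number the page would show under the divide-by-1.5 hack.

    Assumes the bug from the example: visits between 8:00 and 11:00 are
    counted three times, and all other visits are counted once.
    """
    raw_count = 3 * true_morning + true_other
    return raw_count / correction


# The day the hack is written: a quarter of the 1000 real visits are in the morning.
print(displayed_hits(true_morning=250, true_other=750))  # 1000.0, which is correct

# Later the pattern shifts: half of the same 1000 real visits are in the morning.
print(displayed_hits(true_morning=500, true_other=500))  # roughly 1333.3, which is wrong
```

Nothing in the hack's output tells you which of the two situations you are in, which is part of why it makes the real bug harder to notice and debug.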

Obviously the right choice from that list is to investigate the underlying cause and resolve it. That causes the problem to vanish, and most developers would believe they are done there. However, there’s still more to do if you really want to be sure the problem will never again require human attention.

First off, somebody could come along and change the code of the hit counter, reverting it back to a broken state in the future. Obviously the right solution for that is to add an automated test that verifies the hit counter works correctly even when it is interrupted. Then you make sure that test runs continuously and alerts developers when it fails. Now you’re done, right?
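As a rough illustration of what such a test might look like, here is a minimal sketch. Everything in it is an assumption made for the example: the post does not say how the counter was redesigned, so this version supposes that each visit carries a unique ID and that the counter remembers which IDs it has seen. The `HitCounter` class and the plain `set` standing in for durable storage are hypothetical.

```python
class HitCounter:
    """Hypothetical redesigned counter: recording the same visit twice is harmless.

    `store` stands in for storage that survives a restart. A plain set of
    visit IDs is used here purely for illustration.
    """

    def __init__(self, store: set) -> None:
        self._seen = store

    def record(self, visit_id: str) -> None:
        # Adding an ID that is already present changes nothing, so a visit
        # that is replayed after an interruption is not counted again.
        self._seen.add(visit_id)

    @property
    def hits(self) -> int:
        return len(self._seen)


def test_visit_replayed_after_restart_is_counted_once():
    store = set()
    counter = HitCounter(store)
    counter.record("visit-1")

    # Simulate two interruptions: a fresh counter is created over the same
    # storage and sees the last visit again each time.
    restarted = HitCounter(store)
    restarted.record("visit-1")
    restarted.record("visit-1")

    assert restarted.hits == 1


def test_distinct_visits_are_all_counted():
    counter = HitCounter(set())
    for visit_id in ("visit-1", "visit-2", "visit-3"):
        counter.record(visit_id)

    assert counter.hits == 3
```

The test functions use plain `assert` statements, so they can be collected by a runner such as pytest or simply called directly. The first test reproduces the exact failure from the example (a visit seen three times because of two interruptions) and would fail if somebody reverted the counter to a version that increments blindly.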

Nope. Even at this point, there are some future risks that have to be handled.

The next issue is that the test you’ve written has to be easy to maintain. If the test is hard to maintain–it has to change a lot whenever developers change the code, the test code itself is cryptic, it would be easy for it to return a false positive if the code changes, etc.–then there’s a good chance the test will break or somebody will disable it in the future. Then the problem could again require human attention. So you have to ensure that you’ve written a maintainable test, and refactor the test if it’s not maintainable. This may lead you down another path of investigation into the test framework or the system under test, to figure out a refactoring that would make the test code simpler.

After this you have concerns like the continuous integration system (the test runner)–is it reliable? Could it fail in a way that would make your test require human attention? This could be another path of investigation.

All of these paths of investigation may turn up other problems that then have to be traced down to their sources, which may turn up more problems to trace down, and so on. You may find that you can discover (and possibly resolve) all your codebase’s major issues just by starting with a few symptoms and being very determined about tracing down underlying causes.

Does anybody really do this? Yes. It might seem difficult at first, but as you resolve more and more of these underlying issues, things really do start to get easier and you can move faster and faster with fewer and fewer problems.

Down the Rabbit Hole

Beyond all of this, if you really want to get adventurous, there’s one more question you can ask: why did the developer write buggy code in the first place? Why was it possible for a bug to ever exist? Is it a problem with the developer’s education? Was it something about their process? Should they be writing tests as they go? Was there some design problem in the system that made it hard to modify? Is the programming language too complex? Are the libraries they’re using not well-written? Is the operating system not behaving well? Was the documentation unclear?

Once you get your answer, you can ask what the underlying cause of that problem is, and continue asking that question until you’re satisfied. But beware: this can take you down a rabbit hole and into a place that changes your whole view of software development. In fact, theoretically this system is unlimited, and would eventually result in resolving the underlying problems of the entire software industry. How far you want to go is up to you.

9 Comments

  1. After doing TDD for a few years, I’ve found that it’s forced me to think this way all the time (and regressions are a distant memory)

    Great Post! Loved the hit counter example!

    • Thanks, Jeff! 🙂 That’s awesome that TDD has done that for you. I think one aspect that TDD doesn’t always inherently address is creeping complexity, though–it’s still theoretically possible to create a fairly complex beast that happens to be well-tested. 🙂 (Though the tests [if well-written themselves] tend to help with refactoring, which makes the problem less difficult to fix.)

      -Max

  2. Great example; very nicely illustrates the point. It should be noted that the example isn’t traced down all the way, however. Why does the hit counter count the last visit again when interrupted and restarted? 😉

    For every shell script bug ever, the “Down the Rabbit Hole” section describes the problem. Very, very few people writing shell scripts actually understand the shell. They copy shell script snippets from other authors, who based them on other snippets, which were written by people who also didn’t understand shells. https://unix.stackexchange.com/q/131766/135943 is a good starting point to clear up at least the most common misconception, but the point is that the “bug” is really a broad lack of correct education and correct examples for shell scripting.

    And it’s very true it’s wonderful to see a bug vanish when you fully trace it down. 🙂

    • It is traced down all the way. Redesigning it eliminates even the question. That is, you don’t have to ask the question because the object being questioned no longer exists. The example may be missing some detail that would fully answer your question though (and would show how it is in fact already answered) which is more covered in http://www.codesimplicity.com/post/the-fundamental-philosophy-of-debugging/.

      From my perspective, the actual down the rabbit hole on shell scripting is that the language is designed in such a way as to be inscrutable, unintuitive, and surprising to the average user. It’s out of agreement with most other programming languages.

      But from a practical perspective, since we aren’t here to fix bash, it’s true that you can trace it down to a failure of understanding in the programmer. In fact, I think all programming failures trace to problems in understanding, responsibility, or awareness. That’s probably a subject for a future blog.

      -Max

  3. I recently came across https://www.fastcompany.com/28121/they-write-right-stuff, which is a great example of your “Down the Rabbit Hole” section. It describes how the Lockheed-Martin team handles code accuracy for rocket flight software. When they find a bug, they don’t just fix it; they examine how the process allowed the bug to get there in the first place, look for all potential bugs that could have been allowed in by the same flaw in the process, fix the process, fix all the bugs, etc.
