The Fundamental Philosophy of Debugging

Sometimes people have a very hard time debugging. Mostly, these are people who believe that in order to debug a system, you have to think about it instead of looking at it.

Let me give you an example of what I mean. Let’s say you have a web server that is silently failing to serve pages to users 5% of the time. What is your reaction to this question: “Why?”

Do you immediately try to come up with some answer? Do you start guessing? If so, you are doing the wrong thing.

The right answer to that question is: “I don’t know.”

So this gives us the first step to successful debugging:

When you start debugging, realize that you do not already know the answer.

It can be tempting to think that you already know the answer. Sometimes you can guess and you’re right. It doesn’t happen very often, but it happens often enough to trick people into thinking that guessing the answer is a good method of debugging. However, most of the time, you will spend hours, days, or weeks guessing the answer and trying different fixes with no result other than complicating the code. In fact, some codebases are full of “solutions” to “bugs” that are actually just guesses—and these “solutions” are a significant source of complexity in the codebase.

Actually, as a side note, I’ll tell you an interesting principle. Usually, if you’ve done a good job of fixing a bug, you’ve actually caused some part of the system to go away, become simpler, have better design, etc. as part of your fix. I’ll probably go into that more at some point, but for now, there it is. Very often, the best fix for a bug is a fix that actually deletes code or simplifies the system.

But getting back to the process of debugging itself, what should you do? Guessing is a waste of time, imagining reasons for the problem is a waste of time—basically most of the activity that happens in your mind when first presented with the problem is a waste of time. The only things you have to do with your mind are:

  1. Remember how a working system behaves.
  2. Figure out what you need to look at in order to get more data.

Because you see, this brings us to the most important principle of debugging:

Debugging is accomplished by gathering data until you understand the cause of the problem.

The way that you gather data is, almost always, by looking at something. In the case of the web server that’s not serving pages, perhaps you would look at its logs. Or you could try to reproduce the problem so that you can look at what happens with the server when the problem is happening. This is why people often want a “reproduction case” (a series of steps that allow you to reproduce the exact problem)—so that they can look at what is happening when the bug occurs.
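As a rough illustration of "looking" rather than guessing, here is a minimal Python sketch that counts failed requests per path from an access log. The log format ("method path status"), the sample lines, and the function names are all made up for this example, and it assumes the failures show up in the log as 5xx statuses, which may not be true of a real server.

```python
from collections import Counter

# Hypothetical access-log lines in the form "<method> <path> <status>".
# A real server's log format will differ; adjust parse_line to match.
SAMPLE_LOG = """\
GET /index.html 200
GET /reports/big 500
GET /index.html 200
GET /reports/big 500
GET /about.html 200
"""


def parse_line(line):
    """Split one log line into (method, path, integer status)."""
    method, path, status = line.split()
    return method, path, int(status)


def failures_by_path(log_text):
    """Count failed requests (status >= 500) per path."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _, path, status = parse_line(line)
        if status >= 500:
            counts[path] += 1
    return counts


if __name__ == "__main__":
    for path, count in failures_by_path(SAMPLE_LOG).most_common():
        print(path, count)
```

A count like this doesn't tell you the cause. It only tells you where to look next, which is the point of gathering data.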

Sometimes the first piece of data you need to gather is what the bug actually is. Often users file bug reports that have insufficient data. For example, let’s say a user files the bug, “When I load the page, the web server doesn’t return anything.” That’s not sufficient information. What page did they try to load? What do they mean by “doesn’t return anything?” Is it just a white page? You might assume that’s what the user meant, but very often your assumptions will be incorrect. The less experienced your user is as a programmer or computer technician, the less well they will be able to express specifically what happened without you questioning them. In these cases, unless it’s an emergency, the first thing that I do is just send the user back specific requests to clarify their bug report, and leave it at that until they respond. I don’t look into it at all until they clarify things. If I did go off and try to solve the problem before I understood it fully, I could be wasting my time looking into random corners of the system that have nothing to do with any problem at all. It’s better to go spend my time on something productive while I wait for the user to respond, and then when I do have a complete bug report, to go research the cause of the now-understood bug.

As a note on this, though, don’t be rude or unfriendly to users just because they have filed an incomplete bug report. The fact that you know more about the system and they know less about the system doesn’t make you a superior being who should look down upon all users with disdain from your high castle on the shimmering peak of Smarter-Than-You Mountain. Instead, ask your questions in a kind or straightforward manner and just get the information. Bug filers are rarely intentionally being stupid—rather, they simply don’t know and it’s part of your job to help them provide the right information. If people frequently don’t provide the right information, you can even include a little questionnaire or form on the bug-filing page that makes them fill in the right information. The point is to be helpful to them so that they can be helpful to you, and so that you can easily resolve the issues that come in.

Once you’ve clarified the bug, you have to go and look at various parts of the system. Which parts of the system to look at is based on your knowledge of the system. Usually it’s logs, monitoring, error messages, core dumps, or some other output of the system. If you don’t have these things, you might have to launch or release a new version of the system that provides the information before you can fully debug the system. Although that might seem like a lot of work just to fix a bug, in reality it often ends up being faster to release a new version that provides sufficient information than to spend your time hunting around the system and guessing what’s going on without information. This is also another good argument for having fast, frequent releases—that way you can get out a new version that provides new debugging information quickly. Sometimes you can get a new build of your system out to just the user who is experiencing the problem, too, as a shortcut to get the information that you need.
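If the system doesn't produce enough output yet, the new version you release might only add logging. Here is a small sketch of that idea using Python's standard logging module. The logger name, the serve and render_page functions, and the failing path are all hypothetical stand-ins, not part of any real server.

```python
import logging

logger = logging.getLogger("pageserver")  # hypothetical logger name


def render_page(path):
    """Stand-in for the real page-rendering code (hypothetical)."""
    if path.startswith("/reports/"):
        raise ValueError("report data unavailable")
    return "<html>ok</html>"


def serve(path):
    """Serve a page, logging enough context to debug a failure later."""
    logger.debug("serving path=%r", path)
    try:
        body = render_page(path)
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("failed to serve path=%r", path)
        return 500, ""
    logger.debug("served path=%r bytes=%d", path, len(body))
    return 200, body


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    print(serve("/index.html")[0])
    print(serve("/reports/q3")[0])
```

The design choice here is to record what was requested and what went wrong at the point of failure, so that the next time the bug happens there is something to look at.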

Now, remember that I mentioned above that you have to know how a working system behaves? This is because there is another principle of debugging:

Debugging is accomplished by comparing the data that you have to what you know the data from a working system should look like.

When you see a message in a log, is that a normal message or is it actually an error? Maybe the log says, “Warning: all the user data is missing.” That looks like an error, but really your web server prints that every single time it starts. You have to know that a working web server does that. You’re looking for behavior or output that a working system does not display. Also, you have to understand what these messages mean. Maybe the web server optionally has some user database that you aren’t using, which is why you get that warning—because you intend for all the “user data” to be missing.
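One way to make that comparison less dependent on memory is to keep a log from a known-good run and list only the messages that never appear in it. The sketch below assumes each log line is "timestamp message"; that format, the sample lines other than the startup warning, and the function names are invented for illustration.

```python
def normalize(line):
    """Strip a leading timestamp so identical messages compare equal.

    Assumes each line looks like "<timestamp> <message>" (hypothetical format).
    """
    _, _, message = line.partition(" ")
    return message.strip()


def unexpected_messages(baseline_lines, suspect_lines):
    """Return messages in the suspect log that never appear in the baseline."""
    known = {normalize(line) for line in baseline_lines}
    seen = set()
    result = []
    for line in suspect_lines:
        message = normalize(line)
        if message and message not in known and message not in seen:
            seen.add(message)
            result.append(message)
    return result


# Log from a run that is known to work (made-up sample data).
BASELINE = [
    "10:00:01 Warning: all the user data is missing.",
    "10:00:02 Listening on port 8080",
]
# Log from the run that shows the problem (made-up sample data).
SUSPECT = [
    "11:30:01 Warning: all the user data is missing.",
    "11:30:02 Listening on port 8080",
    "11:31:17 Error: upstream connection reset",
]

if __name__ == "__main__":
    for message in unexpected_messages(BASELINE, SUSPECT):
        print(message)
```

The startup warning drops out because the working system prints it too. You still have to understand what the remaining messages mean.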

Eventually you will find something that a working system does not do. You shouldn’t immediately assume you’ve found the cause of the problem when you see this, though. For example, maybe it logs a message saying, “Error: insects are eating all the cookies.” One way that you could “fix” that behavior would be to delete the log message. Now the behavior is like normal, right? No, wrong—the actual bug is still happening. That’s a pretty stupid example, but people do less-stupid versions of this that don’t fix the bug. They don’t get down to the basic cause of the problem and instead they paper over the bug with some workaround that lives in the codebase forever and causes complexity for everybody who works on that area of the code from then on. It’s not even sufficient to say “You will know that you have found the real cause because fixing that fixes the bug.” That’s pretty close to the truth, but a closer statement is, “You will know that you have found a real cause when you are confident that fixing it will make the problem never come back.” This isn’t an absolute statement—there is a sort of scale of how “fixed” a bug is. A bug can be more fixed or less fixed, usually based on how “deep” you want to go with your solution, and how much time you want to spend on it. Usually you’ll know when you’ve found a decent cause of the problem and can now declare the bug fixed—it’s pretty obvious. But I wanted to warn you against papering over a bug by eliminating the symptoms but not handling the cause.

And of course, once you have the cause, you fix it. That’s actually the simplest step, if you’ve done everything else right.

So basically this gives us four primary steps to debugging:

  1. Familiarity with what a working system does.
  2. Understanding that you don’t already know the cause of the problem.
  3. Looking at data until you know what causes the problem.
  4. Fixing the cause and not the symptoms.

This sounds pretty simple, but I see people violate this formula all the time. In my experience, most programmers, when faced with a bug, want to sit around and think about it or talk about what might be causing it—both forms of guessing. It’s okay to talk to other people who might have information about the system or advice on where to look for data that would help you debug. But sitting around and collectively guessing what could cause the bug isn’t really any better than sitting around and doing it yourself, except perhaps that you get to chat with your co-workers, which could be good if you like them. Mostly though what you’re doing in that case is wasting a bunch of people’s time instead of just wasting your own time.

So don’t waste people’s time, and don’t create more complexity than you need to in your codebase. This debugging method works. It works every time, on every codebase, with every system. Sometimes the “data gathering” step is pretty hard, particularly with bugs that you can’t reproduce. But at the worst, you can gather data by looking at the code and trying to see if you can see a bug in it, or draw a diagram of how the system behaves and see if you can perceive a problem there. I would only recommend that as a last resort, but if you have to, it’s still better than guessing what’s wrong or assuming you already know.

Sometimes, it’s almost magical how a bug resolves just by looking at the right data until you know. Try it for yourself and see. It can actually be fun, even.

-Max

37 Comments

  1. Two things come to mind:

    1. There is no way to estimate how long it will take to fix a bug, because most of the work is diagnosis. You cannot know ahead of time what the diagnosis will entail because you cannot know what you do not yet know. After you arrive at a diagnosis supported by observation, then you should be able to give a reasonable estimate of how long it will take to fix, but by then you have already done most of the work.

    2. Before doing the diagnosis step so well described in this article, I do suggest writing automated tests that cover the cases that fail due to the bug. This not only helps you know when you have fixed the bug, but the accumulated automated tests for previous bugs validate that you have not undone the fix for some other bug. Bugs are well known to exhibit recidivism.

    • Hey Steven! Agreed on both counts. The only caveat I would provide is that when you write the tests, you shouldn’t just write “regression tests” reactively in response to bugs, but rather good-coverage and well-designed unit tests that expose the bug by the nature of being generally well-designed tests. This doesn’t mean that sometimes you don’t need to add another case to your unit tests when a new bug is found, but it means that the base tests should be designed surrounding the basic principles of testing.

      -Max

      • Unit tests seem more appropriate during the implementation of a solution, given that you may be refactoring or reworking the code in order to fix the bug.

        • Well, if they are good unit tests, they should cover the functionality of the system. If that functionality is appropriately expressed, then the bug you are encountering is a violation of that functionality and should be catchable by improving the unit tests.

          The problem with doing pure regression testing is that you develop a large suite of somewhat-random tests that may or may not continue to have as much value into the future as well-designed unit tests. There are many codebases where you can see the consequences of this. Tests are also code that has to be maintained.

          -Max

  2. Heh… I’ve been trying to teach this stuff to some of our newer staff lately, smart guys, but pretty green. And it’s a challenge, because half the time, I *know* what the bug is as soon as they describe it… it’s not really a guess, just years of experience pointing me in the right direction.

    But that intuition doesn’t help the new guys, because they’re fifteen years shorter on experience. So instead, I need to resist the urge to go straight to fixing the bug, and instead run them through the process of systematically gathering the information to find the bug for themselves.

    • Same here
      I read your post, Max, and I think it can be (and it is, to me at least) difficult to split up “think” and “look”

      Mostly because many times, I already have a lot of data about a particular system
      Thus, when a problem arises, there is not always a “lookup phase” : data has already been gathered, leaving the “think” part alone
      From all the data I know about that, I “think” that the issue is “there”

      This does not prevent me from checking my guess before fixing 🙂

      • Yeah, I definitely understand what you guys are talking about. It’s something that I’ve experienced too. But I try to go through the correct process anyway, because on the average over time, it tends to be faster (taking into account the times when you are wrong about the guess and having to try again). Experience with the system does help in knowing where to look first, though.

        -Max

    • That’s very cool. 🙂

      Yeah, it’s surprising how many people have difficulty with troubleshooting. It requires a certain amount of discipline (and sometimes patience) to keep to the level of simplicity required as you go along—simply saying, “Okay, I’m going to get this data, and when I get the data, then I will make a decision.”

      -Max

  3. I disagree with the premise that you shouldn’t ‘guess’ first. Perhaps it is just a terminology thing. I think you mean ‘decide, based on little more than gut instinct, where the problem is and make a change, hoping it will work’. To me, ‘guess’ means you use your knowledge of the system to prioritise which would be the most probably useful areas to investigate first, accepting that it might lead to a dead-end. Otherwise, your search for the solution is no better than wandering around aimlessly hoping to reach a destination; where what you really want to do is to get a map, decide a route but know that there may be road-blocks and detours.

    Using your example of a web server not serving a page, after getting further clues from the problem finder, I’d possibly start with looking at the bit that sends the page, backtracking to what creates the page. Start from the known (the page does not appear) and work backwards to the cause. I would not start with looking at (say) the part of the application that calculates income tax.

    • As an instructor I had years ago said, it’s all about applying hatchets to cut your problem size down.

      If you cannot ssh to a server, you start with seeing if you’re connected to the network, not if your hard drive is full, for example (though the latter might be the/a problem).

    • I also questioned the part about not ‘guessing’. To me the ‘guess’ work / thinking at the beginning of debugging is a way of characterising and categorising the reported behaviour. This thinking/conversation can eliminate unrelated parts of a system and allow programmers to better focus on the most likely cause of the bug.

      I suppose the overall sentiment is to not waste time wildly making stuff up, but rather use a scientific approach where we observe behaviours and adjust our theories accordingly.

    • Amen.

      I have found over many years that debugging is mainly hypothesis testing. You gain all the info you reasonably can, then postulate a theory. Then devise a way to test that theory. By way of gaining information, good debug tools are priceless. This process sounds kinda like guessing, but it works.

      As a note, I have found that about 50% of the time the solution is to delete code.

    • I have to disagree. But perhaps it IS terminology. 🙂

      What you describe isn’t “guessing.” It’s looking in the right place.

      You’re not *guessing* at what the underlying cause is. You’re just looking in the area of the problem and using all channels of information you have to investigate (observe) *that* area.

      Of course looking at the income tax calculation to debug pages not loading would be silly, but that mistake isn’t caused by failing to *guess.* It’s caused by not realizing that all the errors you have (everything that *doesn’t look like a working server*) relate to the page loading part of the code, *not* the tax calculation part.

      This is mentioned in the article as one of the only parts you have to do with your mind:

      > 2. Figure out what you need to look at in order to get more data.

      But *guessing* would be: “Maybe if I increase the database cache then the page load problem will disappear.” There is no logic to connect the two; no theory at all. Because it’s a pure *guess* based on no observation at all. People really do this all the time; I’ve seen it. THAT’S guessing.

      Looking intelligently in the sensible place to look, to gather more information, with no assumption that you already know the solution, isn’t guessing. 🙂

  4. It seems to me that there’s another principle that should be applied in any problem-solving process. We express it as “Mah nishtanah halailah hazeh mikol haleilot?” (in English, that’s “how is this night different from all other nights”). In other words, what can we distinguish about the context in which the bug appears that is different from the “normal” context where the bug DOESN’T appear. That information often helps us isolate the part of the system most likely to be causing the problem – and THEN we can proceed to your step three, with some confidence that we’re looking at the most useful data.

    • I heard it described thusly: “Every child that’s seen Sesame Street knows how to start troubleshooting: ‘one of these things is not like the other.’”

    • Sure, that’s a way to look at it!

      Also, having grown up Jewish, I’m fond of your way of phrasing the problem. Brings back many pleasant memories of Passover! 🙂

      -Max

  5. I also disagree about guessing first, though I agree that trying a fix before knowing what the bug is could politely be described as nuts.

    The people I respect most as debuggers are methodical scientists. They make hypotheses about what is going wrong. They write their hypothesis down. They look for evidence to confirm or refute the hypothesis. They perform experiments and record the results of these experiments.

    Making a hypothesis is the critical first step. It tells you which data in your big web server to look at. It tells you which of your million-line code base to inspect. If the first hypothesis is not confirmed, they make a new one and look at more data. If they need to make code changes to confirm a hypothesis, they use the revision management system to record these changes, so they can be backed out if the hypothesis is not confirmed.

    The rest of the article is right on.

    • Context is King. I’ve seen a guy who knew a system particularly well debug a significant issue that people had been working on for half a day from the bar, beer in hand with a single line of information and hit it on the head first time.
      But he knew the system back-to-front and could make a hypothesis based off very limited information using his intuition (collective past experiences).

    • Before you can make a hypothesis, you have to observe. Once you make a hypothesis, you have to observe more.

      Anyone can make a hypothesis. But if you haven’t LOOKED first, and looked searchingly and closely and followed up the looking with more looking, your hypothesis is just dreamed up and won’t relate to anything. You won’t even be able to use it to fix the problem. “Maybe there are network problems and I should try again later.” “Maybe I’m using the wrong compiler.”

      How do you differentiate a good hypothesis from a bad one? It’s not by experimentation, actually. A good hypothesis is based on observation and it applies to the area of the data where the most discrepancies were observed. A bad hypothesis doesn’t even take into account which areas have the most departures from “what a working server looks like.” So you go chasing butterflies instead of looking in the area where the most errors are coming from.

      But really, a hypothesis just tells you where to look more specifically. When you REALLY have the cause of the bug, you don’t need to make any guesses about it. “This line is missing a semicolon.” Or, “The programmer who wrote this shell snippet used shell variables in his functions as though they were locally scoped rather than global.”

      • I find that a piece of software looks pretty big if you have to look at all of it before making a hypothesis. You have some behavioral information about the bug to guide your initial hypothesis, and it’s stepwise refinement from there.

        We’re arguing over wording, not over the fundamental concept.

        • Well, I do think there is a slight difference between what Mike and I are saying and what you are saying.

          A hypothesis by definition is a proposed explanation, but I don’t even propose an explanation until I’ve looked at something. And looking at things usually generates the explanation by itself rather easily without having to do any guesswork around it. That is, usually by the time you’ve looked at the thing enough, you’re not hypothesizing—you know.

          -Max

  6. I coined a term called ULTRA debugging:

    1. Understand the problem: However works for you. I like whiteboards personally, mapping out the problem and the various sequences or systems in play.
    2. Logs: Get them. Read them.
    3. Trace: Trace the problem back to the code or configuration etc. Could also call this step theorise as at this point you are forming hypotheses.
    4. Reproduce: ideally locally but in a test environment if need be.
    5. Alter: Make a change and push it to an environment (test first please). Only change one thing at a time. Fix a bug once, so put in automation to catch this variety of issue again.

    I find it a helpful framework/acronym for an undervalued and generally underdeveloped skill in our industry.

    • Sure. 🙂

      The “change only one thing at a time” is an important concept that applies outside of debugging too.

      Not everything has logs and not every problem is debugged by logs, though. 🙂 Also, the Trace step is best done by more looking, like running the system in a debugger, adding printf statements, etc.

      -Max

  7. Two more implications:

    1. Debugging is so difficult, time-consuming, and schedule-blowing that avoiding creating the bugs in the first place is perhaps an even more valuable skill than knowing how to diagnose and remove them.

    2. Therefore, once a bug has been diagnosed and removed, the job’s not done. It is irresponsible to not reflect on what flaw in our work process or our thinking caused the bug to be created in the first place, and then adjust how we work accordingly. If we are not creating fewer bugs over time, we are not learning.

    In particular, if our bugs are so frequently being solved by deleting code, then perhaps we should conclude that we are writing more code than we need to. Perhaps this is due to writing additional code to cover anticipated needs instead of just the minimum code necessary for the functionality we are delivering at the moment. TDD is a good way to only write the code we need right now, as well as providing a foundation of unit tests to help avoid breaking intended functionality later.

  8. You say: “… guessing is the wrong thing … Figure out what you need to look at … gathering data … look at its logs … try to reproduce the problem … send the user back specific requests to clarify their bug report …” I do all of the above.

    Three points:
    1. Strictly speaking, guessing and figuring out are different but they are interchangeable terms in ordinary discourse. But one is a pejorative term and the other is laudatory, so if you’re liable to be one-upped in an important setting, replace guess with figure out.

    2. There are many costs to gathering information from the system or the user, so you must narrow your search, by … figuring out … where to look, whom to ask, how many open-ended or narrowly focused questions to ask, or whether to simply take a stab at the whole answer. Judgement is unavoidable.

    3. It might be like music. You learn with drills and maybe stick with a methodology for a long time, but once you develop a knack for it, at the appropriate time you can try to improvise. Whatever works for you.

  9. I agree with this; most times the bug is not on the surface and the code must be debugged thoroughly. My only caveat to the article is that while you should not “guess” at what the solution or even problem is, you should jog your memory for a familiar or common fix. I find this also helps in finding the direct entry point for my breakpoint.

    Great article 🙂

  10. I found your article really interesting and want it to have a Korean translation.

    Is it ok to translate your article and post it on my blog?

    Of course, you’ll be referenced as an owner of this article and link to the original article will be provided on top of the page.
