The Philosophy of Testing

Much like we gain knowledge about the behavior of the physical universe via the scientific method, we gain knowledge about the behavior of our software via a system of assertion, observation, and experimentation called “testing.”

There are many things one could desire to know about a software system. Most often, though, we want to know whether it actually behaves the way we intended it to behave. That is, we wrote some code with a particular intention in mind; does it actually do that when we run it?

In a sense, testing software is the reverse of the traditional scientific method, where you test the universe and then use the results of that experiment to refine your hypothesis. Instead, with software, if our “experiments” (tests) don’t prove out our hypothesis (the assertions the test is making), we change the system we are testing. That is, if a test fails, it hopefully means that our software needs to be changed, not that our test needs to be changed. Sometimes we do also need to change our tests in order to properly reflect the current state of our software. Such test adjustment can seem like a frustrating and useless waste of time, but in reality it’s a natural part of this two-way scientific method–sometimes we’re learning that our tests are wrong, and sometimes our tests are telling us that our system is out of whack and needs to be repaired.

This tells us a few things about testing:

  1. The purpose of a test is to deliver us knowledge about the system, and knowledge has different levels of value. For example, testing that 1 + 1 still equals 2 no matter what time of day it is doesn’t give us valuable knowledge. However, knowing that my code still works despite possible breaking changes in APIs I depend on could be very useful, depending on the context. In general, one must know what knowledge one desires before one can create an effective and useful test, and then must judge the value of that knowledge in order to decide where to put time and effort into testing.
  2. Given that we want to know something, in order for a test to be a test, it must be asserting something and then informing us about that assertion. Human testers can make qualitative assertions, such as whether or not a color is attractive. But automated tests must make assertions that computers can reliably make, which usually means asserting that some specific quantitative statement is true or false. We are trying to learn something about the system by running the test–whether the assertion is true or false is the knowledge we are gaining. A test without an assertion is not a test.
  3. Every test has certain boundaries as an inherent part of its definition. Much like you couldn’t design a single experiment to prove all the theories and laws of physics, it would be prohibitively difficult to design a single test that actually validated all the behaviors of any complex software system at once. If it seems that you have made such a test, most likely you’ve combined many tests into one and those tests should be split apart. When designing a test, you should know what it is actually testing and what it is not testing.
  4. Every test has a set of assumptions built into it, which it relies on in order to be effective within its boundaries. For example, if you are testing something that relies on access to a database, your test might make the assumption that the database is up and running (because some other test has already checked that that part of the code works). If the database is not up and running, then the test neither passes nor fails–it instead provides you no knowledge at all. This tells us that all tests have at least three results–pass, fail, and unknown. Tests with an “unknown” result must not say that they failed–otherwise they are claiming to give us knowledge when in fact they are not. (A sketch of a test that reports “unknown” instead of failing appears after this list.)
  5. Because of these boundaries and assumptions, we need to design our suite of tests in such a way that the full set, when combined, actually gives us all of the knowledge we want to gain. That is, each individual test only gives us knowledge within its boundaries and assumptions, so how do we overlap those boundaries so that they reliably inform us about the real behavior of the entire system? The answer to this question may also affect the design of the software system being tested, as some designs are harder to completely test than others.
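
To make point 4 concrete, here is a minimal sketch of a test that reports “unknown” rather than “failed” when its assumption is violated. It assumes pytest, and is_database_up() is a hypothetical helper invented purely for illustration:

```python
import socket

import pytest


def is_database_up(host="localhost", port=5432) -> bool:
    """Hypothetical assumption check: can we reach the database at all?"""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False


def test_saves_user_record():
    if not is_database_up():
        # The test's assumption is violated, so it reports "skipped"
        # ("unknown"), never "failed".
        pytest.skip("database unavailable: result is unknown, not a failure")
    # ...the real assertions against the database would go here...
```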

The last point in that list leads us into the many methods of testing being practiced today, in particular end to end testing, integration testing, and unit testing.

End to End Testing

“End to end” testing is where you make an assertion that involves one complete “path” through the logic of the system. That is, you start up the whole system, perform some action at the entry point of user input, and check the result that the system produces. You don’t care how things work internally to accomplish this goal; you care only about the input and the result. That is generally true for all tests, but here we are testing only at the outermost point of input into the system and checking only the outermost result that it produces.

An example end to end test for creating a user account in a typical web application would be to start up a web server, a database, and a web browser, and use the web browser to actually load the account creation web page, fill it in, and submit it. Then you would assert that the resulting page somehow tells us the account was created successfully.
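
As a hedged sketch of what such a test might look like in Python with Selenium (the URL, element IDs, and success message are hypothetical and depend entirely on the application):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By


def test_account_creation_end_to_end():
    # Assumes the web server and database are already running.
    driver = webdriver.Firefox()
    try:
        driver.get("http://localhost:8000/signup")  # hypothetical URL
        driver.find_element(By.ID, "username").send_keys("alice")
        driver.find_element(By.ID, "password").send_keys("s3cret")
        driver.find_element(By.ID, "submit").click()
        # Assert only on the outermost result: what the user actually sees.
        assert "Account created" in driver.page_source
    finally:
        driver.quit()
```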

The idea behind end to end testing is that we gain fully accurate knowledge about our assertions because we are testing a system that is as close to “real” and “complete” as possible. All of its interactions and all of its complexity along the path we are testing are covered by the test.

The problem with using only end to end testing is that it makes it very difficult to actually get all of the knowledge about the system that we might desire. In any complex software system, the number of interacting components and the combinatorial explosion of paths through the code make it difficult or impossible to actually cover all the paths and make all the assertions we want to make.

It can also be difficult to maintain end to end tests, as small changes in the system’s internals lead to many changes in the tests.

End to end tests are valuable, particularly as an initial stopgap for a system that entirely lacks tests. They are also good as sanity checks that your whole system behaves properly when put together. They have an important place in a test suite, but they are not, by themselves, a good long-term solution for gaining full knowledge of a complex system.

If a system is designed in such a way that it can only be tested via end-to-end tests, that is a symptom of broad architectural problems in the code. These issues should be addressed through refactoring until one of the other testing methods can be used.

Integration Testing

This is where you take two or more full “components” of a system and specifically test how they behave when “put together.” A component could be a code module, a library that your system depends on, a remote service that provides you data–essentially any part of the system that can be conceptually isolated from the rest of the system.

For example, in a web application where creating an account sends the new user an email, one might have a test that runs the account creation code (without going through a web page, just exercising the code directly) and checks that an email was sent. Or one might have a test that checks that account creation succeeds when one is using a real database–that “integrates” account creation and the database. Basically this is any test that is explicitly checking that two or more components behave properly when used together.
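
A minimal sketch of the database variant, in Python. Here create_account() is a hypothetical stand-in for the real account creation component, and an in-memory SQLite database stands in for the real database:

```python
import sqlite3


def create_account(db: sqlite3.Connection, email: str) -> int:
    """Hypothetical stand-in for the real account creation component."""
    cur = db.execute("INSERT INTO accounts (email) VALUES (?)", (email,))
    db.commit()
    return cur.lastrowid


def test_account_creation_with_real_database():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
    account_id = create_account(db, "alice@example.com")
    # Assert that the two components behave properly together: the account
    # code wrote a row that the database can actually read back.
    row = db.execute(
        "SELECT email FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    assert row[0] == "alice@example.com"
```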

Compared to end to end testing, integration testing involves a bit more isolation of components as opposed to just running a test on the whole system as a “black box.”

Integration testing doesn’t suffer as badly from the combinatorial explosion of test paths that end to end testing faces, particularly when the components being tested are simple and thus their interactions are simple. If two components are hard to integration test due to the complexity of their interactions, this indicates that perhaps one or both of them should be refactored for simplicity.

Integration testing is also usually not a sufficient testing methodology on its own, as doing an analysis of an entire system purely through the interactions of components means that one must test a very large number of interactions in order to have a full picture of the system’s behavior. There is also a maintenance burden with integration testing similar to end to end testing, though not as bad–when one makes a small change in one component’s behavior, one might then have to update the tests for all the other components that interact with it.

Unit Testing

This is where you take one component alone and test that it behaves properly. In our account creation example, we could have a series of unit tests for the account creation code, a separate series of unit tests for the email sending code, a separate series of unit tests for the web page where users fill in their account information, and so on.

Unit testing is most valuable when you have a component that presents strong guarantees to the world outside of itself and you want to validate those guarantees. For example, a function’s documentation says that it will return the number “1” if passed the parameter “0.” A unit test would pass this function the parameter “0” and assert that it returned the number “1.” It would not check how the code inside of the component behaved–it would only check that the function’s guarantees were met.
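
That example translates almost directly into code. A minimal sketch, with a hypothetical increment_from_zero() function standing in for the documented component:

```python
def increment_from_zero(n: int) -> int:
    """Hypothetical component. Documented guarantee: returns 1 when passed 0."""
    return n + 1


def test_returns_one_when_passed_zero():
    # Check only the public guarantee, not how the code inside behaves.
    assert increment_from_zero(0) == 1
```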

Usually, a unit test is testing one behavior of one function in one class/module. One creates a set of unit tests for a class/module that, when you run them all, cover all behavior that you want to verify in that module. This almost always means testing only the public API of the system, though–unit tests should be testing the behavior of the component, not its implementation.

Theoretically, if all components of the system fully define their behavior in documentation, then by testing that each component is living up to its documented behavior, you are in fact testing all possible behaviors of the entire system. When you change the behavior of one component, you only have to update a minimal set of tests around that component.

Obviously, unit testing works best when the system’s components are reasonably separate and are simple enough that it’s possible to fully define their behavior.

It is often true that if you cannot fully unit test a system, but instead have to do integration testing or end to end testing to verify behavior, some design change to the system is needed. (For example, components of the system may be too entangled and may need more isolation from each other.) Theoretically, if a system were well-isolated and had guarantees for all of the behavior of every function in the system, then no integration testing or end to end testing would be necessary. Reality is often a little different, though.

Reality

In reality, there is a scale of testing with infinite stages between unit testing and end to end testing. Sometimes a test falls somewhere between unit testing and integration testing. Sometimes it falls somewhere between an integration test and an end to end test. Real systems usually require all sorts of tests along this scale in order to understand their behavior reliably.

For example, sometimes you’re testing only one part of the system, but its internals depend on other parts of the system, so you’re implicitly testing those too. This doesn’t make your test an integration test; it just makes it a unit test that is also testing other internal components implicitly–slightly larger than a unit test, and slightly smaller than an integration test. In fact, this is the sort of testing that is often the most effective.

Fakes

Some people believe that in order to do true “unit testing” you must write code in your tests that isolates the component you are testing from every other component in the system–even that component’s internal dependencies. Some even believe that this “true unit testing” is the holy grail that all testing should aspire to. This approach is often misguided, for the following reasons:

  • One advantage of having tests for individual components is that when the system changes, you have to update fewer unit tests than you have to update with integration tests or end to end tests. If you make your tests more complex in order to isolate the component under test, that complexity could defeat this advantage, because you’re adding more test code that has to be kept up to date anyway.

    For example, imagine you want to test an email sending module that takes an object representing a user of the system and sends an email to that user. You could invent a “fake” user object–a completely separate class–just for your test, out of the belief that you should be “just testing the email sending code and not the user code.” But then when the real User class changes its behavior, you have to update the behavior of the fake User class–and a developer might even forget to do this, making your email sending test now invalid because its assumptions (the behavior of the User object) are invalid.

  • The relationships between a component and its internal dependencies are often complex, and if you’re not testing its real dependencies, you might not be testing its real behavior. This sometimes happens when developers fail to keep “fake” objects in sync with real objects, but it can also happen via failing to make a “fake” object as genuinely complex and full-featured as the “real” object.

    For example, in our email sending example above, what if real users could have seven different formats of username but the fake object only had one format, and this affected the way email sending worked? (Or worse, what if this didn’t affect email sending behavior when the test was originally written, but it did affect email sending behavior a year later and nobody noticed that they had to update the test?) Sure, you could update the fake object to have equal complexity, but then you’re adding even more of a maintenance burden for the fake object.

  • Having to add too many “fake” objects to a test indicates that there is a design problem with the system that should be addressed in the code of the system instead of being “worked around” in the tests. For example, it could be that components are too entangled–the rules of “what is allowed to depend on what” or “what are the layers of the system” might not be well-defined enough.

In general, it is not bad to have “overlap” between tests. That is, you have a test for the public APIs of the User code, and you have a test for the public APIs of the email sending code. The email sending code uses real User objects and thus also does a small bit of implicit “testing” on the User objects, but that overlap is okay. It’s better to have overlap than to miss areas that you want to test.
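
Here is a hedged sketch of that “overlap” approach in Python. The User class, send_welcome_email(), and the outbox list are hypothetical names invented for illustration–the point is only that the email test constructs a real User rather than a hand-written fake:

```python
class User:
    """The real User class, used directly by the email test below."""

    def __init__(self, username: str, email: str):
        self.username = username
        self.email = email


def send_welcome_email(user: User, outbox: list) -> None:
    """Hypothetical email component; appends to an outbox for testability."""
    outbox.append((user.email, f"Welcome, {user.username}!"))


def test_sends_welcome_email():
    outbox: list = []
    user = User("alice", "alice@example.com")  # a real User object
    send_welcome_email(user, outbox)
    # This implicitly exercises User a little as well; that overlap is okay.
    assert outbox == [("alice@example.com", "Welcome, alice!")]
```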

Isolation via “fakes” is sometimes useful, though. One has to make a judgment call, be aware of the trade-offs above, and attempt to mitigate them as much as possible via the design of the “fake” instances. In particular, fakes are worthwhile when they add two properties to a test–determinism and speed.

Determinism

If nothing about the system or its environment changes, then the result of a test should not change. If a test is passing on my system today but failing tomorrow even though I haven’t changed the system, then that test is unreliable. In fact, it is invalid as a test because its “failures” are not really failures–they’re an “unknown” result disguised as knowledge. We say that such tests are “flaky” or “non-deterministic.”

Some aspects of a system are genuinely non-deterministic. For example, you might generate a random string based on the time of day, and then show that string on a web page. In order to test this reliably, you would need two tests:

  1. A test that uses the random-string generation code over and over to make sure that it properly generates random strings.
  2. A test for the web page that uses a fake random-string generator that always returns the same string, so that the web page test is deterministic.

Of course, you would only need the fake in that second test if verifying the exact string in the web page was an important assertion. It’s not that everything about a test needs to be deterministic–it’s that the assertions it is making need to always be true or always be false if the system itself hasn’t changed. If you weren’t asserting anything about the string, the size of the web page, etc., then you would not need to make the string generation deterministic.
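
A minimal sketch of those two tests, with hypothetical names throughout. The page rendering function takes its string generator as a parameter so that the test can substitute a deterministic fake:

```python
import random
import string


def generate_random_string(length: int = 8) -> str:
    """Genuinely non-deterministic component."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def render_page(generate=generate_random_string) -> str:
    """Hypothetical web page that displays the generated string."""
    return f"<p>Your code: {generate()}</p>"


def test_generator_produces_valid_varied_strings():
    # Test 1: exercise the non-deterministic code over and over.
    results = {generate_random_string() for _ in range(100)}
    assert all(len(s) == 8 for s in results)
    assert len(results) > 1  # it is, at least, not constant


def test_page_displays_the_string():
    # Test 2: a fake generator makes the page assertion deterministic.
    page = render_page(generate=lambda: "abcdefgh")
    assert page == "<p>Your code: abcdefgh</p>"
```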

Speed

One of the most important uses of tests is that developers run them while they are editing code, to see if the new code they’ve written is actually working. As tests become slower, they become less and less useful for this purpose. Or developers continue to use them but start writing code more and more slowly because they keep having to wait for the tests to finish.

In general, a test suite should not take so long that a developer becomes distracted from their work and loses focus while they wait for it to complete. Existing research indicates this takes somewhere between 2 and 30 seconds for most developers. Thus, a test suite used by developers during code editing should take roughly that length of time to run. It might be okay for it to take a few minutes, but that wouldn’t be ideal. It would definitely not be okay for it to take ten minutes, under most circumstances.

There are other reasons to have fast tests beyond just the developer’s code editing cycle. At the extreme, slow tests can become completely useless if they only deliver their result after it is needed. For example, imagine a test that took so long, you only got the result after you had already released the product to users. Slow tests affect lots of processes in a software engineering organization–it’s simplest for them just to be fast.

Sometimes there is some behavior that is inherently slow in a test–for example, reading a large file off of a disk. It can be okay to make a test “fake” out this slow behavior–for example, by having the large file in memory instead of on the disk. As with all fakes, it is important to understand how this affects the validity of your test and how you will maintain this fake behavior properly over time.
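
In Python, for instance, a component that accepts any file-like object can be handed an in-memory buffer instead of a real file on disk. count_lines() here is a hypothetical component invented for illustration:

```python
import io


def count_lines(fileobj) -> int:
    """Hypothetical component that works on any file-like object."""
    return sum(1 for _ in fileobj)


def test_count_lines_without_touching_the_disk():
    # An in-memory "file" stands in for the large file on disk.
    fake_file = io.StringIO("first\nsecond\nthird\n")
    assert count_lines(fake_file) == 3
```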

It is sometimes also useful to have an extra suite of “slow” tests that aren’t run by developers while they edit code, but are run by an automated system after code has been checked in to the version control system, or run by a developer right before they check in their code. That way you get the advantage of a fast test suite that developers can use while editing, but also the more-complete testing of real system behavior even if testing that behavior is slow.
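
One possible arrangement, assuming pytest: mark the slow tests, register the marker in your pytest configuration, and have developers exclude them while editing (for example, with pytest -m "not slow"), while the automated system runs everything:

```python
import pytest


@pytest.mark.slow  # "slow" must be registered in pytest.ini under "markers"
def test_full_import_of_large_dataset():
    ...  # the slow, more-complete test, run after check-in


def test_import_of_small_sample():
    ...  # the fast test developers run while editing
```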

Coverage

There are tools that run a test suite and then tell you which lines of system code actually got run by the tests. They say that this tells you the “test coverage” of the system. These can be useful tools, but it is important to remember that they don’t tell you if those lines were actually tested, they only tell you that those lines of code were run. If there is no assertion about the behavior of that code, then it was never actually tested.
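
A minimal sketch of the pitfall, with a hypothetical parse_price() function: the first test below executes the code–so a coverage tool counts its lines as “covered”–but asserts nothing, so it delivers no knowledge at all:

```python
def parse_price(text: str) -> float:
    """Hypothetical function under test."""
    return float(text.strip().lstrip("$"))


def test_parse_price_runs():
    parse_price("$19.99")  # 100% "coverage" of parse_price, zero knowledge


def test_parse_price_parses():
    assert parse_price("$19.99") == 19.99  # now it is actually a test
```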

Overall

There are many ways to gain knowledge about a system, and testing is just one of them. We could also read its code, look at its documentation, talk to its developers, etc., and each of these would give us a belief about how the system behaves. However, testing validates our beliefs, and is thus particularly important among these methods.

The overall goal of testing is to gain valid knowledge about the system. This goal overrides all other principles of testing–any testing method is valid as long as it produces that result. However, some testing methods are more efficient–they make it easier to create and maintain tests which produce all the information we desire. These methods should be understood and used appropriately, as your judgment dictates and as they apply to the specific system you’re testing.

-Max

8 Comments

  1. The value of fakes and pure unit tests is also dependent on the speed of the languages and systems you’re working with. When in a fast language, such as C++ or C# (both of which I’ve been using at work lately), you can make your tests broad and deep, and only really worry about faking out the non-deterministic components (e.g. the network, databases, random numbers).

    However, we also have substantial systems in PHP and JavaScript. These languages are both slow, and PHP particularly so. I’m not going to defend the virtues of PHP, because I don’t really think it has any. Regardless, for historical reasons, we’re stuck with a large codebase in PHP, and because the language is so ridiculously slow, the full test suite would likely take hours to run on a normal dev machine (can’t say more precisely, because I don’t know anybody who has tried). Instead, we run only the tests likely to affect the bits of code we’re editing, and trust in BuildBot to check the rest (which, due to massive parallelism, can get through all the tests in “only” about 15 minutes). Even the limited subset of relevant test files can often take minutes to run, however. It’s well over the threshold required for rapid iteration.

    The situation in JavaScript isn’t quite as bad, both because JavaScript isn’t quite as slow as PHP and because our JS codebase isn’t nearly as large. But it’s still not good.

    Which is why we’ve had a strong push toward pure unit tests, at least in those languages. The tests for our newer systems, even in the atrociously slow world of PHP, are much better. They tend to run in seconds, instead of minutes, usually fitting within the window which allows for rapid iteration. This has opened up the risk of missing incompatibility at the seams, but the value we’ve gotten out of the increased iteration speed has greatly outweighed the value lost from the occasional failure to test the seams.

    I suppose one additional rule to consider is this: Automated tests of realistic, large-scale systems are never perfect. As you spend more time writing tests, your coverage increases, but like Zeno’s Arrow you will never achieve perfect coverage. You get diminishing returns on the coverage achieved per the amount of effort put in, but (roughly) linear increase in the test run time. There’s a point where you have to accept that it’s “good enough”.

  2. If you’re developing component A which depends on B, and you find yourself writing a fake implementation of B (as in “Fakes”, above), then your project is already violating a big rule of modularity, is it not?

    You should write what you know. You know about A, not B. Who knows about B? That’s who should be writing the B fake.

    Seems simple, and usually it is. We all need to violate it sometimes, but it always costs us. When you work in someone else’s domain, not only are you exposed to version problems (as you illustrate), you also have to get your own A-oriented head into the B problem domain. It’s probably more expensive for you to do it. (Plus, the developers of other dependents like C and D are probably duplicating your work.)

    So I interpret your notes a little differently. If you’re publishing a component that other components depend on, you write the spec, you write the API classes–and you write a good fake. Make it deterministic and make it fast and make it work for most common use cases. Your users will have a better time writing unit tests, will upgrade more easily, and will be less likely to abandon your product.

    The fake gets written before the test? Sure. The “write tests first” nature of TDD may or may not agree with this, depending on your TDD dialect. But I don’t think it’s incompatible.

    • Your points are all logical, based in many time-honored principles…and unfortunately incorrect.

      The two principles that actually apply are:

      * Systems should always and only be designed for present-time requirements. See these videos: https://www.youtube.com/playlist?list=PLOU2XLYxmsIJ7HGm2bv20QrtwcWemSRCI

      * Systems end up better when developers take broad responsibility for them. See http://radar.oreilly.com/2013/04/code-simplicity-the-science-of-software-design.html. Note that this doesn’t change the principles of modularity, nor does it change the principles of strong ownership. It’s just that people should be out there making the changes that they need or making those changes happen.

      -Max

    • I think more important than who writes the fake is that it gets written only once. Obviously if the owners of an API provide a good fake, others are much less likely to duplicate their work.
      If it’s left to users to write a fake, I think the work is much more likely to be duplicated across many users.
      But there are examples of users contributing fakes that get some official status and become widely used, and that doesn’t seem worse to me.

  3. Could you please introduce some vertical whitespace around your ordered lists? They are so painful to read I had to copy/paste into another doc.

    Wonderful content though!
