Last week I was discussing how one of my students could use test coverage analysis in his daily work. His observation was that his manager had little idea about coverage. Do managers need coverage information? Can they do harm with coverage numbers? What good can project leads do with coverage data?
For me, coverage analysis is first and foremost a code reviewing tool. Test coverage information helps me to see which code has not been tested. This I can use to rethink
- The test suite: How can I extend the test suite to cover the corner cases that I missed?
- The test strategy: Is there a need for me to extend my suite to cover the untested code? Why did I miss it?
- The system design: How can I refactor my code so that it becomes easier to test the missing cases?
- An incoming change (pull request): Is the contributed code well tested?
Coverage is not a management tool. While coverage information is essential to have good discussions on testing, using test coverage percentages to manage a project calls for caution.
To appreciate why managing by coverage numbers is problematic, let’s look at four common pitfalls of using software metrics in general, and see how they apply to test coverage as well.
Coverage Without Context
A coverage number gives a percentage of statements (or classes, methods, branches, blocks, …) in your program hit by the test suite. But what does such a number mean? Is 80% good or not? When is it good? Why is it good? Should I be alarmed at anything below 50%?
The key observation is that coverage numbers should not be leading: coverage targets (if used at all) should be derived from an overall project goal. For example:
- We want a well-designed system. Therefore we use coverage analysis to discover classes that are too hard to test and hence call for refactoring;
- We want to make sure we have meaningful tests in place for all critical behavior. Therefore, we analyze for which parts of this behavior no such tests are in place.
- We have observed class
Foois bug prone. Hence we monitor its coverage, and no change that decreases its coverage is allowed.
- For the avionics software system we are building, we need to comply with standard DO178b. Thus, we must demonstrate that we achieve modified condition/decision coverage;
With these higher level goals in place, it is also possible to argue when low coverage is acceptable:
- These packages are deprecated, so we do not need to rethink the test suite nor its design anymore.
- These classes take care of the user interface, which we do not consider critical for our application
- These classes are so simple that we deem automated testing unnecessary.
- This code has been stable for years, so we do not really care anymore how well it is tested.
Code coverage is a means to an end; it is not a goal in itself.
Treating The Numbers
The number one reason developers are wary of managers using coverage numbers is that it may give incentives to write useless tests.
Coverage numbers are easily tricked. All one needs to do is invoke a few methods in the test cases. No need for assertions, or any meaningful verification of the outcomes, and the coverage numbers are likely to go up.
Test cases to game coverage numbers are dangerous:
- The resulting test cases are extremely weak. The only information they provide is that a certain sequence of calls does not lead to an unexpected exception.
The faked high coverage may give a false sense of confidence;
The presence of poor tests undermines the testing culture in the team.
Once again: A code coverage percentage is not a goal in itself: It is a means to an end.
There is little information that can be derived from the mere knowledge that a test suite covers 100% of the statements. Which data values went through the statements? Was my if-then conditional exercised with a true as well as a false value? What happens if I invoke my methods in a different order?
A lot more information can be derived from the fact that the test suite achieves, say, just 50% test coverage. This means that half of your code is not even touched by the test suite. The number tells you that you cannot use your test suite to say anything about the quality of half of your code base.
As such, statement coverage can tell us when a test suite is inadequate, but it can never tell us whether a test suite is adequate.
For this reason, a narrow focus on a single type of coverage (such as statement coverage) is undesirable. Instead, a team should be aware of a range of different coverage criteria, and use them where appropriate. Example criteria include decision coverage, state transition coverage, data-flow coverage, or mutation coverage — and the only limitation is our imagination, as can be seen from Cem Kaner’s list of 101 coverage metrics, compiled in 1996.
The latter technique, mutation coverage, may help to fight gaming of coverage numbers. The idea is to generate mutations to the code under test, such as the negation of Boolean conditions or integer outcomes, or modifications of arithmetic expressions. A good test suite should trigger a failure for any such ‘erroneous’ motivation. If you want to experiment with mutation testing in Java (or Scala), you might want to take a look at the pitest tool.
A final pitfall in using coverage metrics is to become over-enthusiastic, measuring coverage in many (possibly overlapping) ways.
Even a (relatively straightforward) coverage tool like EclEmma for Java can measure instruction, branch, line, method, and type coverage, and as a bonus can compute the cyclomatic complexity. Per metric it can show covered, missed, and total, as well as the percentage of covered items. Which of those should be used? Is it a good idea to monitor all forms of coverage at all levels?
Again, what you measure should depend on your goal.
If your project is just starting to test systematically, a useful goal will be to identify classes where a lot can be gained by adding tests. Then it makes sense to start measuring at the coarse class or method level, in order to identify classes most in need of testing. If, on the other hand, most code is reasonably well tested, looking at missed branches may provide the most insight
The general rule is to start as simple as possible (just line coverage will do), and then to add alternative metrics when the simpler metric does not lead to new insights any more.
Additional forms of coverage should be different from (orthogonal to) the existing form of coverage used. For example, line coverage and branch coverage are relatively similar. Thus, measuring branch coverage besides line coverage may not lead to many additional cases to test (depending on your code). Mutation coverage, dataflow-based coverage, or state-machine based coverage, on the other hand, are much more different from line coverage. Thus, using one of those may be more effective in increasing the diversity of your test cases.
Putting Coverage to Good Use
Managers that are aware of the above pitfalls, can certainly put coverage to good use. Here are a few ways, taken from Glover’s classic Don’t be fooled by the coverage report:
- “Code without corresponding tests can be more challenging to understand, and is harder to modify safely. Therefore, knowing whether code has been tested, and seeing the actual test coverage numbers, can allow developers and managers to more accurately predict the time needed to modify existing code.” — I can only agree.
“Monitoring coverage reports helps development teams quickly spot code that is growing without corresponding tests” — This happens: see our paper on the co-evolution of test and production code)
“Given that a code coverage report is most effective at demonstrating sections of code without adequate testing, quality assurance personnel can use this data to assess areas of concern with respect to functional testing.” — A good reason to combine user story acceptance testing with coverage analysis.
Test coverage analysis is an important tool that any development team taking testing seriously should use.
More than anything else, coverage analysis is a reviewing tool that you can use to improve your own (test) code, and to to evaluate someone else’s code and test cases.
Coverage numbers can not be used as a key performance indicator (KPI) for project success. If your manager uses coverage as a KPI, make sure he is aware of the pitfalls.
Coverage analysis is useful to managers: If you are a team lead, make sure you understand test coverage. Then use coverage analysis as a way to engage in discussion with your developers about their (test) code and the team’s test strategy.
- Eric Bouwers, Joost Visser, and Arie van Deursen. Getting what you Measure: Four common pitfalls in using software metrics for project management. Communications of the ACM, 55(7): 54-59, 2012.
- Andrew Glover. In pursuit of code quality: Don’t be fooled by the coverage report. IBM Developer Works blog post, January, 2006.
- Martin Fowler. Test Coverage. http://martinfowler.com/bliki/TestCoverage.html, 17 April 2012.
- Alberto Savoia. How Much Unit Test Coverage Do You Need? – The Testivus Answer. Artima Forum, 2007. (Also listed as answer to the Stackoverlfow question “What is reasonable code coverage and why?“).
- Brian Marick. How to misuse code coverage, 1999
- Cem Kaner. Software Negligence and Testing Coverage (Appendix A lists 101 coverage criteria). Proceedings of STAR 96
(Fifth International Conference on Software Testing, Analysis, and Review), 1996.
- Zhu, Hall, May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, 1997.
- Andy Zaidman, Bart Van Rompaey, Arie van Deursen, and Serge Demeyer. Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining (open access). Empirical Software Engineering 16:325–364, 2011.
- Dimitrios Athanasiou. Constructing a Test Code Quality Model and Empirically Assessing its Relation to Issue Handling Performance (pdf). Master’s Thesis, Delft University of Technology, 2011.
© Arie van Deursen, 2013.
Image credits: Flickr.
Source featured image: flickr