Semantic Versioning? In Maven Central? Breaking Changes!

I love the use of semantic versioning. It provides clear guidelines on when an API version number should be bumped. In particular, if a change to an API breaks backward compatibility, the major version identifier must be updated. Thus, when I, as a client of such an API, see a minor or patch update, I can safely upgrade, since such updates will never break my build.

But to what extent do people actually follow the guidelines of semantic versioning? Has the adoption of semantic versioning increased over time? Does the use of semantic versioning speed up library upgrades? And if there are breaking changes, are these properly announced by means of deprecation tags?

To better understand semantic versioning, we studied seven years of versioning history in the Maven Central Repository. Here is what we found.

Versioning Policies

Semantic Versioning is a systematic approach to assigning version numbers to API releases. The approach was developed by Tom Preston-Werner and is advocated by GitHub. It is “based on but not necessarily limited to pre-existing widespread common practices in use in both closed and open-source software.”

In a nutshell, semantic versioning is based on MAJOR.MINOR.PATCH identifiers with the following meaning:

  • MAJOR: increment when you make incompatible API changes

  • MINOR: increment when you add functionality in a backward-compatible manner

  • PATCH: increment when you make backward-compatible bug fixes
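
As a small illustration (not part of the study), the sketch below parses two MAJOR.MINOR.PATCH version strings and reports which kind of release the second one is relative to the first. The class name SemVer and the method bumpKind are made up for this example.

// Minimal sketch: classify an upgrade as MAJOR, MINOR, or PATCH.
final class SemVer {
    final int major, minor, patch;

    SemVer(String version) {
        String[] parts = version.split("\\.");
        this.major = Integer.parseInt(parts[0]);
        this.minor = Integer.parseInt(parts[1]);
        this.patch = Integer.parseInt(parts[2]);
    }

    static String bumpKind(SemVer from, SemVer to) {
        if (to.major != from.major) return "MAJOR";
        if (to.minor != from.minor) return "MINOR";
        return "PATCH";
    }

    public static void main(String[] args) {
        System.out.println(bumpKind(new SemVer("1.4.2"), new SemVer("1.5.0"))); // prints MINOR
    }
}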

Versioning in Maven Central

To understand how developers actually use versioning policies, we studied seven years of Maven Central's history. Our study comprises around 130,000 released jar files, of which 100,000 come with source code. On average, we found around 7 versions per library in our data set.

The data set runs from 2006 to 2011, so it predates the semantic versioning manifesto. Since the manifesto claims to be based on existing practices, it is interesting to investigate to what extent API developers releasing via Maven have adhered to these practices.

As shown below, in Maven Central, the major.minor.patch scheme is adopted by the majority of projects.

Frequency of semver identifiers in maven

Breaking Changes

To understand to what extent version increments correspond to breaking changes, we identified all “update pairs”: a version and its immediate successor of a given Maven package.

To assess semantic versioning compliance, we need to determine backward compatibility for such update pairs. Since this is undecidable in general, we use binary compatibility as a proxy. Two library versions are binary compatible if interchanging them does not force clients to recompile. Examples of incompatible changes are removing a public method or class, or adding parameters to a public method.
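
As a rough illustration of the kind of check such tools perform (this is a sketch, not how any particular tool is actually implemented), the snippet below uses reflection to report public methods that exist in an old version of a class but are missing, or have a changed signature, in the new one. OldFoo and NewFoo are hypothetical stand-ins for two releases of the same class.

import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of a binary-compatibility check: public methods present in the
// old version but absent (or changed) in the new version are reported.
public class BinaryCompatSketch {

    static class OldFoo {                        // stands in for "version 1.2.0"
        public void write(String report) { }
    }

    static class NewFoo {                        // stands in for "version 1.3.0"
        public void write(String report, boolean append) { }
    }

    static Set<String> publicSignatures(Class<?> c) {
        return Arrays.stream(c.getDeclaredMethods())
                .filter(m -> Modifier.isPublic(m.getModifiers()))
                .map(m -> m.getName() + Arrays.toString(m.getParameterTypes()))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> removed = new HashSet<>(publicSignatures(OldFoo.class));
        removed.removeAll(publicSignatures(NewFoo.class));
        // Prints write[class java.lang.String]: clients compiled against the old
        // version fail with a NoSuchMethodError when run against the new one.
        removed.forEach(System.out::println);
    }
}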

We used the Clirr tool to detect such binary incompatibilities automatically for Java. For those interested, there is also a SonarQube Clirr plugin for compatibility checking in a continuous integration setting. An alternative tool is japi.

Major versus Minor/Patch Releases

For major releases, breaking changes are permitted in semantic versioning. This is also what we see in our data set: a little over one third of the major releases introduce binary incompatibilities:

Breaking changes in major releases

Interestingly, we see a very similar figure for minor releases: a little over one third of the minor releases introduce binary incompatibilities too. This, however, violates the semantic versioning guidelines.

Breaking changes in minor releases

The situation is somewhat better for patch releases, which introduce breaking changes in around a quarter of the cases. This nevertheless conflicts with semantic versioning, which insists that patches be backward compatible.

Breaking changes in patch releases

Overall, we see that minor and major releases are equally likely to introduce breaking changes.

The graph below shows that these trends are fairly stable over time. Minor (~40%) and patch releases (~45%) are most common, and major releases (~15%) are least common.

Adherence over time.

The orange line indicates releases with breaking changes: Around 30% of the releases introduce breaking changes — a number that is slightly decreasing over time, but still at 29%.

Of these releases with breaking changes, the vast majority is non-major (the crossed green line at ~25%). Thus, a quarter of the releases do not comply with semantic versioning.

Deprecation to the Rescue?

The above analysis demonstrates a deep need to introduce breaking changes: apparently, developers wish to introduce breaking changes in 30% of their API releases.

Apart from requiring that such changes take place in major releases, semantic versioning insists on using deprecation to announce such changes. In particular:

Before you completely remove the functionality in a new major release there should be at least one minor release that contains the deprecation so that users can smoothly transition to the new API.
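
In Java terms, the recommended workflow would look roughly like the sketch below. This is an illustration of the guideline only; HttpClient, Request, and Response are made-up names.

// Version 2.3.0 (a minor release): the old method still works, but is
// marked deprecated and points to its replacement.
class Request { }
class Response { }

public class HttpClient {

    /** @deprecated scheduled for removal in 3.0.0; use {@link #send(Request)} instead. */
    @Deprecated
    public Response execute(Request request) {
        return send(request);
    }

    public Response send(Request request) {
        return new Response(); // placeholder implementation
    }
}

// Version 3.0.0 (the next major release) may then remove execute(Request)
// entirely; clients have had at least one minor release to migrate.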

Given the many breaking changes we saw, how many deprecation tags would there be?

At the library level, 1,000 out of the 20,000 libraries (5%) we studied include at least one deprecation tag.

When we look at the individual method level, we find the following:

  • Over 86,000 public methods are removed in a minor release, without any deprecation tag.
  • Almost 800 public methods receive a deprecation tag in their life span; these methods, however, are never removed.
  • 16 public methods receive a deprecation tag that is subsequently undone in a later release (the method is “undeprecated”).

That is all we found. In other words, we did not find a single method that was deprecated according to the guidelines of semantic versioning (deprecate in minor, then delete in major).

Furthermore, the worst behavior (deletion in a minor release without deprecation, over 86,000 methods) occurs over a hundred times more often than the best behavior witnessed (deprecation without deletion, almost 800 methods).

Discussion

Our findings are based on a somewhat older data set obtained from Maven Central. It might therefore be that adherence to semantic versioning is currently higher than what we report.

Our findings also take breaking changes to any public method or class into account. In practice, certain packages will be considered as API only, whereas others are internal only — yet declared public due to the limitations of the Java modularization system. We are working on a further analysis in which we explicitly measure to what extent the changed public methods are used. Initial findings within the Maven data set suggest that these breaking changes cause hundreds of thousands of compilation problems in clients — indicating that it is not just internal modules that contain these changes.

Some non-Java package managers have explicitly adopted semantic versioning. A notable example is npm, and NuGet has also initiated a discussion. While npm has a dedicated library to determine version orderings according to the Semantic Versioning standard, I am not aware of JavaScript compatibility tools that are used to check for breaking changes. For .NET (NuGet), tools similar to Clirr exist, such as LibCheck, ApiChange, and the NDepend build comparison tools. I do not know, however, how widespread their use is.

Last but not least, binary incompatibility, as measured by Clirr, is just one form of backward incompatibility. An API may change the behavior of methods as well, in breaking ways. A way to detect those might be by applying old test suites to new code — which we have not done yet.
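
As a sketch of what such a check could look like, the hypothetical JUnit test below encodes behavior of an old release; running it unchanged against a new release would flag a behavioral break even when the two versions are binary compatible. TextUtil stands for a class of the library under test and is not defined here.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical test written against version 1.2.0 of a library. If version
// 1.3.0 silently changes trim() to also collapse inner whitespace, this test
// fails when run against the new jar, even though the method signature (and
// thus binary compatibility) is unchanged.
public class TextUtilTest {

    @Test
    public void trimOnlyRemovesLeadingAndTrailingWhitespace() {
        assertEquals("a  b", TextUtil.trim("  a  b  "));
    }
}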

Conclusion

Breaking API changes are a fact of life. Therefore, meaningful version identifiers are a cause worth fighting for. Thus:

  1. If you adopt semantic versioning for your API, double check that your minor and patch releases are indeed backward compatible. Consider using tools such as Clirr, or a dedicated test suite.

  2. If you rely on an API that claims to be following semantic versioning, don’t put too much faith in backward compatibility promises.

  3. If you want to follow semantic versioning, be prepared to bump the major version identifier often.

  4. Should you need to introduce breaking changes, do not forget to announce these with deprecation tags in minor releases first.

  5. If you are in research, note that the problems involved in preventing, encapsulating, detecting, and handling breaking changes are tough and real. Backward compatibility deserves your research attention.

The full details of our study are published in our paper presented (slides) at the IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), held in Victoria, BC, September 2014.

This post is based on joint work by Steven Raemaekers and Joost Visser, both from the Software Improvement Group.

© Copyright Arie van Deursen, October 2014.

EDIT, November 2014

If you are a Java developer and serious about semantic versioning, then check out this open source semantic versioning checking tool. It is available on Maven Central and on GitHub, and it can check differences, validate version numbers, and suggest proper version numbers for your new jar release.

Think Twice Before Using the “Maintainability Index”

Code metrics results in VS2010

This is a quick note about the “Maintainability Index”, a metric aimed at assessing software maintainability, as I recently ran into developers and researchers who are (still) using it.

The Maintainability Index was introduced at the International Conference on Software Maintenance in 1992. To date, it is included in Visual Studio (since 2007), in the recent (2012) JSComplexity and Radon metrics reporters for JavaScript and Python, and in older metric tool suites such as verifysoft.

At first sight, this sounds like a great success of knowledge transfer from academic research to industry practice. Upon closer inspection, the Maintainability Index turns out to be problematic.

The Original Index

The Maintainability Index was introduced in 1992 by Paul Oman and Jack Hagemeister, originally presented at the International Conference on Software Maintenance (ICSM 1992) and later refined in a paper that appeared in IEEE Computer. It is a blend of several metrics, including Halstead’s Volume (HV), McCabe’s cyclomatic complexity (CC), lines of code (LOC), and percentage of comments (COM). For these metrics, the average per module is taken, and combined into a single formula:

Maintainability Index =
  171 - 5.2 * ln(avg HV)
      - 0.23 * avg CC
      - 16.2 * ln(avg LOC)
      + 50 * sin(sqrt(2.4 * avg COM))

To arrive at this formula, Oman and Hagemeister started with a number of systems from Hewlett-Packard (written in C and Pascal in the late 80s, “ranging in size from 1000 to 10,000 lines of code”). For each system, engineers provided a rating (between 1 and 100) of its maintainability. Subsequently, 40 different metrics were calculated for these systems. Finally, statistical regression analysis was applied to find the best way to combine (a selection of) these metrics to fit the experts’ opinion. This eventually resulted in the given formula. The higher its value, the more maintainable a system is deemed to be.

The maintainability index attracted quite some attention, also because the Software Engineering Institute (SEI) promoted it, for example in their 1997 C4 Software Technology Reference Guide. This report describes the Maintainability Index as “good and sufficient predictors of maintainability”, and “potentially very useful for operational Department of Defense systems”. Furthermore, they suggest that “it is advisable to test the coefficients for proper fit with each major system to which the MI is applied.”

Use in Visual Studio

Visual Studio Code Metrics were announced in February 2007. A November 2007 blogpost clarifies the specifics of the maintainability index included in it. The formula Visual Studio uses is slightly different, based on the 1994 version:

Maintainability Index =
  MAX(0, (171 - 5.2 * ln(Halstead Volume)
             - 0.23 * Cyclomatic Complexity
             - 16.2 * ln(Lines of Code)
         ) * 100 / 171)

As you can see, the constants are literally the same as in the original formula. The new definition merely rescales the index to a number between 0 and 100. Also, the comment metric has been removed.
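
For illustration, the Visual Studio variant can be computed as in the sketch below, a direct transcription of the formula above; the class and method names and the example inputs are made up.

// Sketch: the Visual Studio variant of the Maintainability Index,
// transcribed from the formula above and rescaled to 0..100.
public class MaintainabilityIndex {

    static double compute(double halsteadVolume, double cyclomaticComplexity, double linesOfCode) {
        double raw = 171
                - 5.2 * Math.log(halsteadVolume)
                - 0.23 * cyclomaticComplexity
                - 16.2 * Math.log(linesOfCode);
        return Math.max(0, raw * 100 / 171);
    }

    public static void main(String[] args) {
        // e.g. a module with Halstead Volume 1000, complexity 10, and 200 lines of code
        System.out.println(compute(1000, 10, 200)); // prints roughly 27.5
    }
}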

Furthermore, Visual Studio provides an interpretation:

MI >= 20: High Maintainability
10 <= MI < 20: Moderate Maintainability
MI < 10: Low Maintainability

I have not been able to find a justification for these thresholds. The 1994 IEEE Computer paper used 85 and 65 (instead of 20 and 10) as thresholds, describing them as a good “rule of thumb”.

The metrics are available within Visual Studio, and are part of the code metrics power tools, which can also be used in a continuous integration server.

Concerns

I encountered the Maintainability Index myself in 2003, when working on Software Risk Assessments in collaboration with SIG. Later, researchers from SIG published a thorough analysis of the Maintainability Index (first when introducing their practical model for assessing maintainability and later as section 6.1 of their paper on technical quality and issue resolution).

Based on this, my key concerns about the Maintainability Index are:

  1. There is no clear explanation for the specific derived formula.
  2. The only explanation that can be given is that all underlying metrics (Halstead, Cyclomatic Complexity, Lines of Code) are directly correlated with size (lines of code). But then just measuring lines of code and taking the average per module is a much simpler metric.
  3. The Maintainability Index is based on the average per file of, e.g., cyclomatic complexity. However, as emphasized by Heitlager et al, these metrics follow a power law, and taking the average tends to mask the presence of high-risk parts (see the small illustration after this list).
  4. The set of programs used to derive the metric and evaluate it was small, and contained small programs only.
  5. Furthermore, the programs were written in C and Pascal, which may have rather different maintainability characteristics than current object-oriented languages such as C#, Java, or Javascript.
  6. For the experiments conducted, only a few programs were analyzed, and no statistical significance was reported. Thus, the results might as well be due to chance.
  7. Tool smiths and vendors used the exact same formula and coefficients as the 1994 experiments, without any recalibration.

One could argue that any of these concerns is reason enough not to use the Maintainability Index.

These concerns are consistent with a recent (2012) empirical study, in which one application was independently built by four different companies. The researchers used these systems to compare maintainability and several metrics, including the Maintainability Index. Their findings include that size as a measure of maintainability has been underrated, and that the “sophisticated” maintenance metrics are overrated.

Think Twice

In summary, if you are a researcher, think twice before using the maintainability index in your experiments. Make sure you study and fully understand the original papers published about it.

If you are a tool smith or tool vendor, there is not much point in having several metrics that are all confounded by size. Check correlations between the metrics you offer, and if any of them are strongly correlated, pick the one with the clearest and simplest explanation.

Last but not least, if you are a developer and are wondering whether to use the Maintainability Index: most likely, you’ll be better off looking at lines of code, as it gives easier-to-understand information on maintainability than a formula computed over averaged metrics confounded by size.

Further Reading

  1. Paul Oman and Jack Hagemeister. “Metrics for assessing a software system’s maintainability”. Proceedings International Conference on Software Maintenance (ICSM), 1992, pp. 337-344. (doi)
  2. Paul W. Oman, Jack R. Hagemeister: Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software 24(3), 1994, pp. 251-266. (doi).
  3. Don M. Coleman, Dan Ash, Bruce Lowther, Paul W. Oman. Using Metrics to Evaluate Software System Maintainability. IEEE Computer 27(8), 1994, pp. 44-49. (doi, postprint)
  4. Kurt Welker. The Software Maintainability Index Revisited. CrossTalk, August 2001, pp 18-21. (pdf)
  5. Maintainability Index Range and Meaning. Code Analysis Team Blog, blogs.msdn, 20 November 2007.
  6. Ilja Heitlager, Tobias Kuipers, Joost Visser. A practical model for measuring maintainability. Proceedings 6th International Conference on the Quality of Information and Communications Technology, 2007. QUATIC 2007. (scholar)
  7. Dennis Bijlsma, Miguel Alexandre Ferreira, Bart Luijten, and Joost Visser. Faster Issue Resolution with Higher Technical Quality of Software. Software Quality Journal 20(2): 265-285 (2012). (doi, preprint). Page 14 addresses the Maintainability Index.
  8. Khaled El Emam, Saida Benlarbi, Nishith Goel, and Shesh N. Rai. The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics. IEEE Transactions on Software Engineering, 27(7):630-650, 2001. (doi, preprint)
  9. Dag Sjøberg, Bente Anda, and Audris Mockus. Questioning software maintenance metrics: a comparative case study. Proceedings of the ACM-IEEE international symposium on Empirical software engineering and measurement (ESEM), 2012, pp. 107-110. (doi, postprint).

Edit September 2014

Included discussion on Sjøberg’s paper, the thresholds in Visual Studio, and the problems following from averaging in a power law.


© Arie van Deursen, August 2014.

Learning from Apple’s #gotofail Security Bug

Yesterday, Apple announced iOS 7.0.6, a critical security update for iOS 7 — and an update for OS X / Safari is likely to follow soon (if you haven’t updated iOS yet, do it now).

The problem turns out to be caused by a seemingly simple programming error, now widely discussed as #gotofail on Twitter.

What can we, as software engineers and educators of software engineers, learn from this high impact bug?

The Code

A careful analysis of the underlying problem is provided by Adam Langley. The root cause is in the following code:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
{
    OSStatus        err;
    ...

    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
    ...

fail:
    SSLFreeBuffer(&signedHashes);
    SSLFreeBuffer(&hashCtx);
    return err;
}

To any software engineer, the two consecutive goto fail lines will be suspicious. They are, and more than that. To quote Adam Langley:

The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

Not verifying a signature is exploitable. In the upcoming months, there will remain plenty of devices running older versions of iOS and MacOS. These will remain vulnerable, and exploitable.

Brittle Software Engineering

When first seeing this code, I was once again caught by how incredibly brittle programming is. Just adding a single line of code can bring a system to its knees.

For seasoned software engineers this will not be a real surprise. But students and aspiring software engineers will have a hard time believing it. Therefore, sharing problems like this is essential, in order to create sufficient awareness among students that code quality matters.

Code Formatting is a Security Feature

When reviewing code, I try to be picky on details, including white spaces, tabs, and new lines. Not everyone likes me for that. I often wondered whether I was just annoyingly pedantic, or whether it was the right thing to do.

The case at hand shows that white space is a security concern. The correct indentation immediately shows something fishy is going on, as the final check now has become unreachable:

    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
    goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;

Insisting on curly braces would highlight the fault even more:

    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
        goto fail;
    }
    goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;

Indentation Must be Automatic

Because code formatting is a security feature, we must not do it by hand, but use tools to get the indentation right automatically.

A quick inspection of the sslKeyExchange.c source code reveals that it is not routinely formatted automatically: There are plenty of inconsistent spaces, tabs, and code in comments. With modern tools, such as Eclipse-format-on-save, one would not be able to save code like this.

Yet just forcing developers to use formatting tools may not be enough. We must also invest in improving the quality of such tools. In some cases, hand-made layout can make a substantial difference in the understandability of code. Perhaps current tools do not sufficiently acknowledge such needs, leading to under-use of today’s formatting tools.

Code Review to the Rescue?

Besides automated code formatting, critical reviews might also help. In the words of Adam Langley:

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I’ve no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

While I fully subscribe to the importance of reviews, a word of caution is in place. My colleague Alberto Bacchelli has investigated how code review is applied today at Microsoft. His findings (published as a paper at ICSE 2013, and nicely summarized by Alex Nederlof as The Truth About Code Reviews) include:

There is a mismatch between the expectations and the actual outcomes of code reviews. From our study, review does not result in identifying defects as often as project members would like and even more rarely detects deep, subtle, or “macro” level issues. Relying on code review in this way for quality assurance may be fraught.

Automated Checkers to the Rescue?

If manual reviews won’t find the problem, perhaps tools can find it? Indeed, the present mistake is a simple example of a problem caused by unreachable code. Any computer science student will be able to write a (basic) “unreachable code detector” that would warn about the unguarded goto fail followed by more (unreachable) code (assuming parsing C is a ‘solved problem’).
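
As a toy illustration of that idea (nothing like a real C analyzer, and the names are made up), the sketch below scans a flat list of statements and flags everything that follows an unconditional goto, up to the next label.

import java.util.List;

// Toy sketch of unreachable-code detection on a flat statement list: a goto
// that is not the body of an if is an unconditional jump, so every statement
// after it, up to the next label, can never execute. Real tools work on a
// control-flow graph of parsed C, but the idea is the same.
public class UnreachableCodeSketch {

    static void report(List<String> statements) {
        boolean unreachable = false;
        boolean previousWasIf = false;
        for (String stmt : statements) {
            String s = stmt.trim();
            if (s.endsWith(":")) {
                unreachable = false;            // a label: control can re-enter here
            } else if (unreachable) {
                System.out.println("unreachable: " + s);
            } else if (s.startsWith("goto ") && !previousWasIf) {
                unreachable = true;             // unconditional jump: what follows is dead
            }
            previousWasIf = s.startsWith("if ");
        }
    }

    public static void main(String[] args) {
        report(List.of(
            "if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)",
            "    goto fail;",   // guarded by the if above
            "    goto fail;",   // unconditional: everything up to fail: is unreachable
            "if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)",
            "    goto fail;",
            "fail:",
            "    return err;"
        ));
    }
}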

Therefore, it is no surprise that plenty of commercial and open source tools exist to check for such problems automatically: Even using the compiler with the right options (presumably -Weverything for Clang) would warn about this problem.

Here, again, the key question is why such tools are not applied. The big problem with tools like these is their lack of precision, leading to too many false alarms. Forcing developers to wade through long lists of irrelevant warnings will do little to prevent bugs like these.

Unfortunately, this lack of precision is a direct consequence of the fact that unreachable code detection (like many other program properties of interest) is a fundamentally undecidable problem. As such, an analysis always needs to make a tradeoff between completeness (covering all suspicious cases) and precision (covering only cases that are certainly incorrect).

To understand the sweet spot in this trade off, more research is needed, both concerning the types of errors that are likely to occur, and concerning techniques to discover them automatically.

Testing is a Security Concern

As an advocate of unit testing, I wonder how the code could have passed a unit test suite.

Unfortunately, the testability of the problematic source code is very poor. In the current code, functions are long, and they cover many cases in different conditional branches. This makes it hard to invoke specific behavior, and to bring the functions into a state in which the given behavior can be tested. Furthermore, observability is low, especially since parts of the code deliberately obfuscate results to protect against certain types of attacks.

Thus, given the current code structure, unit testing will be difficult. Nevertheless, the full outcome can be tested, albeit at the system (not unit) level. Quoting Adam Langley again:

I coded up a very quick test site at https://www.imperialviolet.org:1266. […] If you can load an HTTPS site on port 1266 then you have this bug.

In other words, while the code may be hard to unit test, the system luckily has well defined behavior that can be tested.

Coverage Analysis

Once there is a set of (system or unit level) tests, the coverage of these tests can be used as an indicator for (the lack of) completeness of the test suite.

For the bug at hand, even the simplest form of coverage analysis, namely line coverage, would have helped to spot the problem: Since the problematic code results from unreachable code, there is no way to achieve 100% line coverage.

Therefore, any serious attempt to achieve full statement coverage should have revealed this bug.

Note that trying to achieve full statement coverage, especially for security or safety critical code, is not a strange thing to require. For aviation software, statement coverage is required for criticality “level C”:

Level C:

Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.

This is one of the weaker categories: For the catastrophic ‘level A’, even stronger test coverage criteria are required. Thus, achieving substantial line coverage is neither impossible nor uncommon.

The Return-Error-Code Idiom

Finally, looking at the full sources of the affected file, one of the key things to notice is the use of the return code idiom to mimic exception handling.

Since C has no built-in exception handling support, a common idiom is to insist that every function returns an error code. Subsequently, every caller must check this returned code, and include an explicit jump to the error handling functionality if the result is not OK.

This is exactly what is done in the present code: The global err variable is set, checked, and returned for every function call, and if not OK followed by (hopefully exactly one) goto fail.

Almost 10 years ago, together with Magiel Bruntink and Tom Tourwe, we conducted a detailed empirical analysis of this idiom in a large code base of an embedded system.

One of the key findings of our paper is that in the code we analyzed we found a defect density of 2.1 deviations from the return code idiom per 1000 lines of code. Thus, in spite of very strict guidelines at the organization involved, we found many examples of not just unchecked calls, but also incorrectly propagated return codes, or incorrectly handled error conditions.

Based on that, we concluded:

The idiom is particularly error prone, due to the fact that it is omnipresent as well as highly tangled, and requires focused and well-thought programming.

It is sad to see our results from 2005 confirmed today.


© Arie van Deursen, February 22, 2014.


EDIT February 24, 2014: Fixed incorrect use of ‘dead code detection’ terminology into correct ‘unreachable code detection’, and weakened the claim that any computer science student can write such a detector based on the reddit discussion.