Test Automation Day 2014

One of the software testing events in The Netherlands I like a lot is the Test Automation Day. It is a one-day event, which in 2014 will take place on June 19 in Rotterdam. The day will be packed with (international) speakers, with presentations targeting developers, testers, and management alike.

Test automation is a broad topic: it aims at offering tool support for software testing processes wherever this is possible. This raises a number of challenging questions:

  • What are the costs and benefits of software test automation? When and why should I apply a given technique?
  • What tools and techniques are available? What are their limitations and how can they be improved?
  • How can automation be applied in such areas as test execution, test case design, test input generation, test output evaluation, and test suite optimization?


The Test Automation Day was initiated four years ago by Squerist, and organized by CKC Seminars. Each year, it has brought together over 200 test experts, and during those years we have built up some good traditions:

  • We have been able to attract some wonderful keynote speakers, including Scott Barber (performance testing, 2011, 2012), Elfriede Dustin (test automation for the DoD, 2012), Walter Belgers (security, 2012), Mieke Gevers (Agile processes, 2013), Lionel Briand (model-based testing, 2013), and Emily Bache (future of test automation, 2013). For 2014, we are in the process of selecting and inviting keynote speakers.

  • A mixture of Dutch and international speakers.

  • Coverage of a broad range of topics in parallel tracks.

Workshop with Scott Barber

  • A mixture of different formats, including workshops, keynotes, a hands-on test lab, and tutorials, by such experts as Dorothy Graham, Steve Freeman, and Matt Heusser.

  • Presentations from industry (testers, developers, client organizations) and universities (new research results, tools and techniques), with a focus on practical applicability.

  • A mixture of invited (keynote) presentations, open contributions that can be submitted by anyone working in the area of software test automation, and sponsored presentations.

  • Pre-conference events, which this year will be in the form of a masterclass on Wednesday June 18.

This year’s call for contributions is still open. If you do something cool in the area of software test automation, please submit a proposal for a presentation! The deadline is Monday December 2, 2013!

You are very much welcome to submit your proposal. Your proposal will be evaluated by the international TADNL program committee, who will look at the timeliness, innovation, technical content, and inspiration that the audience will draw from your presentation. Also, if you have suggestions for a keynote you would like to hear at TADNL2014, please send me an email!

This year’s special theme is test automation innovation: so if you work with, or are working on, some exciting new test automation method or technique, make sure you submit a paper!

We look forward to your contributions, and to seeing you in Rotterdam in June 2014!

Test Automation Day 2013

Test Coverage: Not for Managers?

Last week I was discussing how one of my students could use test coverage analysis in his daily work. His observation was that his manager had little idea about coverage. Do managers need coverage information? Can they do harm with coverage numbers? What good can project leads do with coverage data?

For me, coverage analysis is first and foremost a code reviewing tool. Test coverage information helps me to see which code has not been tested. I can use this information to rethink:

  1. The test suite: How can I extend the test suite to cover the corner cases that I missed?
  2. The test strategy: Is there a need for me to extend my suite to cover the untested code? Why did I miss it?
  3. The system design: How can I refactor my code so that it becomes easier to test the missing cases?
  4. An incoming change (pull request): Is the contributed code well tested?

Coverage is not a management tool. While coverage information is essential to have good discussions on testing, using test coverage percentages to manage a project calls for caution.

To appreciate why managing by coverage numbers is problematic, let’s look at four common pitfalls of using software metrics in general, and see how they apply to test coverage as well.

Coverage Without Context

A coverage number gives a percentage of statements (or classes, methods, branches, blocks, …) in your program hit by the test suite. But what does such a number mean? Is 80% good or not? When is it good? Why is it good? Should I be alarmed at anything below 50%?

The key observation is that coverage numbers should not be leading: coverage targets (if used at all) should be derived from an overall project goal. For example:

  • We want a well-designed system. Therefore we use coverage analysis to discover classes that are too hard to test and hence call for refactoring.
  • We want to make sure we have meaningful tests in place for all critical behavior. Therefore, we analyze for which parts of this behavior no such tests are in place.
  • We have observed that class Foo is bug-prone. Hence we monitor its coverage, and no change that decreases its coverage is allowed.
  • For the avionics software system we are building, we need to comply with the DO-178B standard. Thus, we must demonstrate that we achieve modified condition/decision coverage.

With these higher level goals in place, it is also possible to argue when low coverage is acceptable:

  • These packages are deprecated, so we do not need to rethink the test suite nor its design anymore.
  • These classes take care of the user interface, which we do not consider critical for our application.
  • These classes are so simple that we deem automated testing unnecessary.
  • This code has been stable for years, so we do not really care anymore how well it is tested.

Code coverage is a means to an end; it is not a goal in itself.

Treating The Numbers

The number one reason developers are wary of managers using coverage numbers is that doing so may create incentives to write useless tests.

Measuring Snow (src: flickr)

Coverage numbers are easily gamed. All one needs to do is invoke a few methods in the test cases. With no assertions, nor any meaningful verification of the outcomes, the coverage numbers are still likely to go up.
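
To make this concrete, below is a minimal JUnit 4 sketch (the PriceCalculator class is a hypothetical example of my own, not from a real project). The first test drives coverage up without checking anything; the second actually verifies the outcome.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Hypothetical class under test.
    class PriceCalculator {
        double priceWithDiscount(double price, double discount) {
            return price * (1.0 - discount);
        }
    }

    public class PriceCalculatorTest {

        // Inflates the coverage numbers: every line of priceWithDiscount
        // is executed, but nothing is verified. This test only fails if
        // an unexpected exception is thrown.
        @Test
        public void gamesTheCoverageNumbers() {
            new PriceCalculator().priceWithDiscount(100.0, 0.25);
        }

        // A meaningful test: it states the expected outcome explicitly.
        @Test
        public void discountIsSubtractedFromPrice() {
            assertEquals(75.0,
                new PriceCalculator().priceWithDiscount(100.0, 0.25), 0.001);
        }
    }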

Test cases to game coverage numbers are dangerous:

  1. The resulting test cases are extremely weak. The only information they provide is that a certain sequence of calls does not lead to an unexpected exception.

  2. The faked high coverage may give a false sense of confidence.

  3. The presence of poor tests undermines the testing culture in the team.

Once again: A code coverage percentage is not a goal in itself: It is a means to an end.

One-Track Coverage

There is little information that can be derived from the mere knowledge that a test suite covers 100% of the statements. Which data values went through the statements? Was my if-then conditional exercised with a true as well as a false value? What happens if I invoke my methods in a different order?
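
As a small illustration (a contrived example of my own), the single test below achieves 100% statement coverage of max, yet the conditional is only ever exercised with a true value, so a faulty version that always returned b would pass unnoticed.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    class MathUtil {
        static int max(int a, int b) {
            int result = a;
            if (b > a) {      // the test below only takes the true branch
                result = b;
            }
            return result;
        }
    }

    public class MaxTest {
        // Executes every statement of max, yielding 100% statement
        // coverage, but the case b <= a is never tested.
        @Test
        public void coversAllStatements() {
            assertEquals(5, MathUtil.max(3, 5));
        }
    }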

Marten track in snow (src: flickr)

A lot more information can be derived from the fact that the test suite achieves, say, just 50% test coverage. This means that half of your code is not even touched by the test suite. The number tells you that you cannot use your test suite to say anything about the quality of half of your code base.

As such, statement coverage can tell us when a test suite is inadequate, but it can never tell us whether a test suite is adequate.

For this reason, a narrow focus on a single type of coverage (such as statement coverage) is undesirable. Instead, a team should be aware of a range of different coverage criteria, and use them where appropriate. Example criteria include decision coverage, state transition coverage, data-flow coverage, or mutation coverage — and the only limitation is our imagination, as can be seen from Cem Kaner’s list of 101 coverage metrics, compiled in 1996.

The latter technique, mutation coverage, may help to fight gaming of coverage numbers. The idea is to generate mutations of the code under test, such as the negation of Boolean conditions or integer outcomes, or modifications of arithmetic expressions. A good test suite should trigger a failure for any such ‘erroneous’ mutation. If you want to experiment with mutation testing in Java (or Scala), you might want to take a look at the pitest tool.
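
To illustrate the idea by hand (a toy Account class of my own; pitest generates and runs such mutants automatically):

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    class Account {
        static boolean canWithdraw(int balance, int amount) {
            // A typical mutant negates this condition: amount > balance.
            return amount <= balance;
        }
    }

    public class AccountTest {
        // An assertion-free test would pass on both the original and the
        // mutant, so the mutant would survive and expose the weak suite.
        // The two tests below fail on the negated-conditional mutant and
        // thereby 'kill' it.
        @Test
        public void withdrawalWithinBalanceIsAllowed() {
            assertTrue(Account.canWithdraw(100, 40));
        }

        @Test
        public void withdrawalBeyondBalanceIsRefused() {
            assertFalse(Account.canWithdraw(100, 140));
        }
    }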

Coverage Galore

A final pitfall in using coverage metrics is to become over-enthusiastic, measuring coverage in many (possibly overlapping) ways.

Even a (relatively straightforward) coverage tool like EclEmma for Java can measure instruction, branch, line, method, and type coverage, and as a bonus can compute the cyclomatic complexity. Per metric it can show covered, missed, and total, as well as the percentage of covered items. Which of those should be used? Is it a good idea to monitor all forms of coverage at all levels?

Again, what you measure should depend on your goal.

If your project is just starting to test systematically, a useful goal will be to identify classes where a lot can be gained by adding tests. Then it makes sense to start measuring at the coarse class or method level, in order to identify classes most in need of testing. If, on the other hand, most code is reasonably well tested, looking at missed branches may provide the most insight.

The general rule is to start as simple as possible (just line coverage will do), and then to add alternative metrics when the simpler metric does not lead to new insights any more.

Additional forms of coverage should be different from (orthogonal to) the existing form of coverage used. For example, line coverage and branch coverage are relatively similar, so measuring branch coverage besides line coverage may not lead to many additional cases to test (depending on your code). Mutation coverage, data-flow coverage, or state-machine-based coverage, on the other hand, differ much more from line coverage. Thus, using one of those may be more effective in increasing the diversity of your test cases.

Putting Coverage to Good Use

Managers who are aware of the above pitfalls can certainly put coverage to good use. Here are a few ways, taken from Glover’s classic Don’t be fooled by the coverage report:

Sun and Snow (src: flickr)

  1. “Code without corresponding tests can be more challenging to understand, and is harder to modify safely. Therefore, knowing whether code has been tested, and seeing the actual test coverage numbers, can allow developers and managers to more accurately predict the time needed to modify existing code.” — I can only agree.

  2. “Monitoring coverage reports helps development teams quickly spot code that is growing without corresponding tests” — This happens: see our paper on the co-evolution of test and production code.

  3. “Given that a code coverage report is most effective at demonstrating sections of code without adequate testing, quality assurance personnel can use this data to assess areas of concern with respect to functional testing.” — A good reason to combine user story acceptance testing with coverage analysis.

Summary

Test coverage analysis is an important tool that any development team taking testing seriously should use.

More than anything else, coverage analysis is a reviewing tool that you can use to improve your own (test) code, and to evaluate someone else’s code and test cases.

Coverage numbers cannot be used as a key performance indicator (KPI) for project success. If your manager uses coverage as a KPI, make sure he or she is aware of the pitfalls.

Coverage analysis is useful to managers: If you are a team lead, make sure you understand test coverage. Then use coverage analysis as a way to engage in discussion with your developers about their (test) code and the team’s test strategy.

Further reading

  1. Eric Bouwers, Joost Visser, and Arie van Deursen. Getting what you Measure: Four common pitfalls in using software metrics for project management. Communications of the ACM, 55(7): 54-59, 2012.
  2. Andrew Glover. In pursuit of code quality: Don’t be fooled by the coverage report. IBM Developer Works blog post, January, 2006.
  3. Martin Fowler. Test Coverage. http://martinfowler.com/bliki/TestCoverage.html, 17 April 2012.
  4. Alberto Savoia. How Much Unit Test Coverage Do You Need? – The Testivus Answer. Artima Forum, 2007. (Also listed as an answer to the Stack Overflow question “What is reasonable code coverage and why?”.)
  5. Brian Marick. How to misuse code coverage, 1999.
  6. Cem Kaner. Software Negligence and Testing Coverage (Appendix A lists 101 coverage criteria). Proceedings of STAR 96 (Fifth International Conference on Software Testing, Analysis, and Review), 1996.
  7. Hong Zhu, Patrick Hall, and John May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, 29(4), 1997.
  8. Andy Zaidman, Bart Van Rompaey, Arie van Deursen, and Serge Demeyer. Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining (open access). Empirical Software Engineering 16:325–364, 2011.
  9. Dimitrios Athanasiou. Constructing a Test Code Quality Model and Empirically Assessing its Relation to Issue Handling Performance (pdf). Master’s Thesis, Delft University of Technology, 2011.

© Arie van Deursen, 2013.
Image credits (including featured image): Flickr.

Dimensions of Innovation

ECSS Amsterdam

As an academic in software engineering, I want to make the engineering of software more effective so that society can benefit even more from the amazing potential of software.

This requires not just good research, but also success in innovation: ensuring that research ideas are adopted in society. Why is innovation so hard? How can we increase innovation in software engineering and computer science alike? What can universities do to increase their innovation success rate?

I was prompted to rethink these questions when Informatics Europe contacted me to co-organize this year’s European Computer Science Summit (ECSS 2013), with the special theme Informatics and Innovation.

Informatics Europe is an organization of Computer Science Departments in Europe, founded in 2005. Its mission is to foster the development of quality research and teaching in information and computer sciences, also known as Informatics. In its yearly summit, deans and heads of department get together, to share experience in leading academic groups, and to join forces when undertaking new activities at a European level.

When Informatics Europe asked me to be program chair of their two-day summit, my only answer could be “yes”. I have a long-standing interest in innovation, and here I had the opportunity and freedom to compile a full-fledged program on innovation as I saw fit, with various keynotes, panels, and presentations by participants — a wonderful assignment.

In compiling the program I took the words of Peter Denning as a starting point: “an idea that changes no one’s behavior is only an invention, not an innovation.” In other words, innovation is about changing people, which is much harder than coming up with a clever idea.

In the end, I came up with the following “Dimensions of Innovation” that guided the composition of the program.

  1. Startups

    Innovation needs optimists who believe they can change the world. One of the best ways to bring a crazy new idea to sustainable success is by starting a new company dedicated to creating and conquering markets that had not existed before.

    Many of the activities at ECSS 2013 relate to startups. The opening keynote is by François Bancilhon, a serial entrepreneur currently working in the field of open data. Furthermore, we have Heleen Kist, a strategy consultant in the area of access to finance. Last but not least, the first day includes a panel specifically devoted to entrepreneurship, and one of the pre-summit workshops is entirely devoted to entrepreneurship for faculty.

  2. Patents

    Patents are traditionally used to protect (possibly large) investments that may be required for successful innovation. In computer science, patents are more than problematic, as evidenced by patent trolls, fights between giants such as Oracle and Google, and the differences in regulations in the US and in Europe. Yet at the same time (software) patents can be crucial, for example to attract investors for a startup.

    Several of the ECSS keynote speakers have concrete experience with patents — Pamela Zave at AT&T, and Erik Meijer from his time at Microsoft, when he co-authored hundreds of patents. Furthermore, Yannis Skulikaris of the European Patent Office will survey patenting of software-related inventions.

  3. Open Source, Open Data

    An often overlooked dimension of innovation is formed by open source and open data. How much money can be made by giving away software, or by allowing others to freely use your data? Yet many enterprises are immensely successful based on open source and open data.

    At ECSS, keynote speaker Erik Meijer is actively working on a series of open source projects (related to his work on reactive programming). In the area of open data, we have entrepreneur François Bancilhon, and semantic web specialist Frank van Harmelen, who is head of the Network Institute of the Vrije Universiteit in Amsterdam.

  4. Teaching for Innovation

    How can universities use education to strengthen innovation? What should students learn so that they can become ambassadors of change? How innovative should students be so that they can become successful in society? At the same time, how can outreach and education be broadened so that new groups of students are reached, for example via online learning?

    To address these questions, at ECSS we have Anka Mulder, member of the executive board of Delft University of Technology, and former president of the OpenCourseWare Consortium. She is responsible for the TU Delft strategy on Massive Open Online Courses (MOOCs), and she will share TU Delft experiences in setting up their MOOCs.

    Furthermore, ECSS will host a panel discussion, in which experienced yet non-conformist teachers and managers will share their experience in innovative teaching to foster innovation.

  5. Fostering Innovation

    Policy makers and university management are often at a loss about how to encourage their stubborn academics to contribute to innovation, the “third pillar” of academia.

    Therefore, ECSS is a place for university managers to meet, as evidenced by the pre-summit Workshop for Deans, Department Chairs and Research Directors. Furthermore, we have executive board member Anka Mulder as speaker.

    Last but not least, we have Carl-Cristian Buhr, member of the cabinet of Digital Agenda Commissioner and EU Commission Vice-President Neelie Kroes, who will speak about the EU Horizon 2020 programme and its relevance to computer science research, education, and innovation.

  6. Inspirational Content

    All talk about innovation is void without inspirational content. Therefore, throughout the conference, exciting research insights and new course examples will be interwoven in the presentations.

    For example, we have Pamela Zave speaking on The Power of Abstraction, Frank van Harmelen addressing progress in his semantic web work at the Network Institute, and Felienne Hermans on how to reach thousands of people through social media. Last but not least we have Erik Meijer, who is never scared to throw both math and source code into his presentations.

The summit will take place October 7–9, 2013, in Amsterdam. You are all welcome to join!


Update:

  • The slides I used for opening ECSS 2013

Some Research Paper Writing Recommendations

Last week, I received an email from Alex Orso and Sebastian Uchitel, who had been asked to give a talk on “How to get my papers accepted at top SE conferences” at the Latin American School on Software Engineering. Here’s their question:

We hope you can spare a few minutes to share with us the key recommendations you would give to PhD students that have not yet had successful submissions to top software engineering conferences, such as ICSE.

An interesting request, and I certainly looked forward to receiving the advice my fellow researchers would provide; you can now see that advice in a presentation by Alex Orso.

When working with my students on papers, I must admit I sometimes repeat myself. Below are some of the things I hear myself say most often.

Explain the Innovation

The first thing to keep in mind is that a research paper should explain the innovation. This makes it quite different from a text book chapter, or from a hands-on tutorial. The purpose is not to explain a technique so that others can use it. Instead, the purpose of a research paper is to explain what is new about the proposed technique.

Identify the Contributions

Explaining novelty is driven by contributions. A contribution is anything the world did not know before this paper, but which we now do know thanks to this paper.

I tend to insist on an explicit list of contributions, which I usually put at the end of the paper.

“The contributions of this paper are …”

Each contribution is an outcome, not the process of doing something. Contributions are things, not effort. Thus, “we spent 6 months manually analyzing 500,000 commit messages” is not a contribution. This effort, though, hopefully has resulted in a useful contribution, which may be that “for projects claiming to do test-driven development, to our surprise we found that 75% of the code commits are not accompanied by a commit in the test code.”

Usually, when thinking about the structure of a paper, quite a bit of actual research has been done already. It is then important to reassess everything that has been done, in order to see what the real contributions of the research are. Contributions can include a new experimental design, a novel technique, a shared data set or open source tool, as well as new empirical evidence contradicting, confirming, or enriching existing theories.

Structure the Paper

With the contributions laid out, the structure of the paper appears naturally: Each contribution corresponds to a section.

This does not hold for the introductory and concluding sections, but it does hold for each of the core sections.

Furthermore, it is essential to separate background material from your own contributions. Clearly, most papers will rely on existing theories or techniques. These must be explained. Since the goal of the paper is to explain the innovation, all material that is not new should be clearly isolated. In this way, it is easiest for the reader (and the reviewer) to see what is new, and what is not new, about this paper.

As an example, take a typical structure of a research paper:

  1. Introduction
  2. Background: Cool existing work that you build upon.
  3. Problem statement: The deficiency you spotted.
  4. Conceptual solution: A new way to deal with that problem!
  5. Open source implementation: Available for everyone!
  6. Experimental design for evaluation: Trickier than we thought!
  7. Evaluation results: It wasn’t easy to demonstrate, but yes, we have good evidence that this may work!
  8. Discussion: What can we do with these results? Promising ideas for future research or applications? And: a critical analysis of the threats to the validity of our results.
  9. Related work
  10. Concluding remarks.

In such a setup, sections 4-7 can each correspond to a contribution (and sometimes to more than one). The discussion section (8) is much more speculative, and usually does not contribute solid new knowledge.

Communicate the Contributions

Contributions do not just help in structuring a paper. They are also the key aspect program committees look at when deciding about the acceptance of a paper.

When reviewers evaluate a paper, they try to identify and interpret the contributions. Are these contributions really new? Are they important? Are they trivial? Did the authors provide sufficient evaluation for their claims? The paper should help the reviewer by being very explicit about the contributions and the claims to fame of these contributions.

When program committee members discuss a paper, they do so in terms of contributions. Thus, contributions should not just be strong, they should also be communicable.

For smaller conferences, it is safe to assume that all reviewers are experts. For large conferences, such as ICSE, the program committee is broad. Some of the reviewers will be genuine experts on the topic of the paper, and these reviewers should be truly excited about the results. Other reviewers, however, will be experts in completely different fields, and may have little understanding of the paper’s topic. When submitting to prestigious yet broad conferences, it is essential to make sure that any reviewer can understand and appreciate the contributions.

The ultimate non-expert is the program chair. The chair has to make a decision on every paper. If the program chair cannot understand a paper’s contributions, it is highly unlikely that the paper will get accepted.

Share Contributions Early

Getting a research paper, including its contributions, right is hard, especially since the contributions have to be understandable by non-experts.

Therefore, it is crucial to offer help to others, volunteering to read preliminary drafts of papers and assessing the strength of their contributions. In return, you’ll have other people, possibly non-experts, assess the drafts you are producing, in this way helping each other to publish a paper at this prestigious conference.

But wait. Isn’t helping others a bad idea for highly competitive conferences? Doesn’t it reduce one’s own chances?

No. Software engineering conferences, including ICSE and FSE, accept any paper that is good. Such conferences do not work with acceptance rates that are fixed in advance. Thus, helping each other may increase the acceptance rate, but will not negatively affect any author.

Does This Help?

I hope some of these guidelines will be useful to “PhD students that have not yet had successful submissions to top software engineering conferences, such as ICSE.”

A lot more advice is available on the Internet on how to write a research paper. I do not have a list of useful resources available at the time of writing, but perhaps in the near future I will extend this post with additional pointers.

Luckily, this post is not a research paper. None of the ideas presented here is new. But they have worked for me, and I hope they’ll work for you too.


Image credits: Pencils in the Air, by Peter Logan. Photo by Mira66, flickr.

Green Open Access and Preprint Linking

ICSE 2013

One of the most useful changes to the ICSE International Conference on Software Engineering this year was that the program website contained links to preprints of many of the papers presented.

As ICSE is a large event (over 1600 people attended in 2013), it is worth taking a look at what happened. What is preprint linking? How many authors actually provided a preprint link? What about other conferences? What are the wider implications for open access publishing in software engineering?

Self-Archiving

Preprint linking is based on the idea that authors, who do all the work in both writing and formatting the paper, have the right to self-archive the paper they created themselves (also called green open access). Authors can do this on their personal home page, in the institutional repositories of, e.g., the universities where they work, or in public preprint repositories such as arXiv.

Sharing preprints has been around in science for decades (if not ages). As an example, my ‘alma mater’ CWI was founded in 1947, and has a technical report series dating back to that year. These technical reports were exchanged (without costs) with other mathematical research institutes: first by plain old mail, then by email, later via ftp, and now through http.

While commercial publishers may dislike the idea that a free preprint is available for papers they publish in their journals or conference proceedings, 69% of the publishers do in fact allow (some form of) self-archiving. For example, ACM, IEEE, Springer, and Elsevier (the publishers I work most with) explicitly permit it, albeit always under specific conditions. These conditions can usually be met, and include such requirements as providing a note that the paper has been accepted for publication, a pointer to the URL where the published article can be found, and a copyright notice indicating the publisher now owns the copyright.

Preprint links as shown on the ICSE site.

Preprint Linking

All preprint linking does is ask authors of accepted conference papers whether they happen to have a link to a preprint available. If so, the conference includes a link to this preprint in the program listed on its web site.

For ICSE, doing full preprint linking at the conference site was proposed and conducted by Dirk Beyer, after an earlier set of preprint links was collected on a separate github gist by Adrian Kuhn.

Dirk Beyer runs Conference Publishing Consulting, the organization hired by ICSE to collect all material to be published, and get it ready for inclusion in the ACM/IEEE Digital Libraries. As part of this collection process, ICSE asked the authors to provide a link to a preprint, which, if provided, was included in the ICSE on line program.

The ICSE 2013 proceedings were published by IEEE. In their recently updated policy, they indicate that “IEEE will make available to each author a preprint version of that person’s article that includes the Digital Object Identifier, IEEE’s copyright notice, and a notice showing the article has been accepted for publication.” Thus, for ICSE, authors were provided with a possibility to download this version, which they then could self-archive.

Preprints @ ICSE 2013

With a preprint mechanism setup at ICSE, the next question is how many researchers actually made use of it. Below are some statistics I collected from the ICSE conference site:

Track / Conference    #Papers presented    #Preprints    Percentage
Research Track                  85               49           57%
ICSE NIER                       31               19           61%
ICSE SEIP                       19                6           31%
ICSE Education                  13                3           23%
ICSE Tools                      16                7           43%
MSR                             64               36           56%
Total                          228              120           53%

In other words, a little over half of the authors (53%) provided a preprint link. And, almost half of the authors decided not to.

I hope and expect that for upcoming ICSE conferences, more authors will submit their preprint links. As a comparison, at the recent FORTE conference, 75% of the authors submitted a preprint link.

For ICSE, this year was the first time preprint linking was available. Authors may not have been familiar with the phenomenon, may not have realized in advance how wonderful a program with links to freely available papers is, may have missed the deadline for submitting the link, or may have missed the email asking for a link altogether as it ended up in their spam folder. And, in all honesty, even I managed to miss the opportunity to send in my link in time for some of my MSR 2013 papers. But that won’t happen again.

Preprint Link Sustainability

An issue of some concern is the “sustainability” of the preprint links — what happens, for example, to homepages with preprints once the author changes jobs (once the PhD student finishes)?

The natural solution is to publish preprints not just on individual home pages, but to submit them to repositories that are likely to have a longer lifetime, such as arXiv, or your own technical report series.

An interesting route is taken by ICPC, which instead of preprint links simply provides a dedicated preprint search on Google Scholar, with author and title already filled in. If a preprint has been published somewhere, and the author/title combination is sufficiently unique, this works remarkably well. MSR uses a mixture of both approaches, providing a search link for presentations for which no preprint link was supplied.

Implications

Open access, and hence preprint publishing, is of utmost importance for software engineering.

Software engineering research is unique in that it has a potentially large target audience of developers and software engineering practitioners that is on line continually. Software engineering research cannot afford to dismiss this audience by hiding research results behind paywalls.

For this reason, it is inevitable that in the long run, software engineering researchers will transform their professional organizations (ACM and IEEE) so that their digital libraries make all software engineering results available via open access.

Irrespective of this long term development, the software engineering research community must hold on to the new preprint linking approach to leverage green open access.

Thus:

  1. As an author, self-archive your paper as a preprint or technical report. Consider your paper unpublished if the preprint is not available.
  2. If you are a professor leading a research group, inspire your students and group members to make all of their publications available as preprints.
  3. If you are a reviewer for a conference, insist that your chairs ensure that preprint links are collected and made available on the conference web site.
  4. If you are a conference organizer or program chair, convince all authors to publish preprints, and make these links permanently available on the conference web site.
  5. If you are on a hiring committee for new university staff members, demand that candidates have their publications available as preprints.

Much of this has been possible for years. Maybe one of the reasons these practices have not been adopted in full so far, is that they cost some time and effort — from authors, professors, and conference organizers alike — time that cannot be used for creative work, and effort that does not immediately contribute to tenure or promotion. But it is time well spent, as it helps to disseminate our research to a wider audience.

Thanks to the ICSE move, there now may be momentum to make a full transition to green open access in the software engineering community. I look forward to 2014, when all software engineering conferences will have adopted preprint linking, and 100% of the authors will have submitted their preprint links. Let us not miss this unique opportunity.

Acknowledgments

I am grateful to Dirk Beyer, for setting up preprint linking at ICSE, and for providing feedback on this post.

Update (Summer 2013)

David Notkin on Why We Publish

This week David Notkin (1955-2013) passed away, after a long battle against cancer. He was one of my heroes. He did great research on discovering invariants, reflexion models, software architecture, clone analysis, and more. His 1986 Gandalf paper was one of the first I studied when starting as a PhD student in 1990.

In December 2011, David sent me an email in which he expressed interest in doing a sabbatical in our TU Delft Software Engineering Research Group in 2013–2014. I was as proud as one can be. Unfortunately, half a year later he had to cancel his plans due to his health.

David was loved by many, as he had a genuine interest in people: developers, software users, researchers, you. And he was a great (friendly and persistent) organizer — 3 weeks ago he still answered my email on ICSE 2013 organizational matters.

In February 2013, he wrote a beautiful editorial for the ACM Transactions on Software Engineering and Methodology, entitled Looking Back. His opening reads: “It is bittersweet to pen my final editorial”. Then David continues to address the question why it is that we publish:

“… I’d like very much for each and every reader, contributor, reviewer, and editor to remember that the publications aren’t primarily for promotions, or for citation counts, or such.

Rather, the intent is to make the engineering of software more effective so that society can benefit even more from the amazing potential of software.

It is sometimes hard to see this on a day-to-day basis given the many external pressures that we all face. But if we never see this, what we do has little value to society. If we care about influence, as I hope we do, then adding value to society is the real measure we should pursue.

Of course, this isn’t easy to quantify (as are many important things in life, such as romance), and it’s rarely something a single individual can achieve even in a lifetime. But BHAGs (Big Hairy Audacious Goals) are themselves of value, and we should never let them fade far from our minds and actions.”

Dear David, we will miss you very much.


See also:


Speaking in Irvine on Metrics and Architecture

At the end of last year I was honored to receive an invitation to present in the Distinguished Speaker Series at the Institute for Software Research at the University of California, Irvine.

I quickly decided that the topic to discuss would be our research on software architecture, and in particular our work on metrics for maintainability.

Irvine is one of the world’s leading centers of research in software architecture. The Institute for Software Research is headed by Richard Taylor, who supervised Roy Fielding when he wrote his PhD thesis covering the REST architectural style, and Nenad Medvidovic during his work on architecture description languages. Current topics investigated at Irvine include design and collaboration (André van der Hoek, and David Redmiles of ArgoUML fame), software analysis and testing (James Jones), and programming languages (Cristina Lopes), to name a few. An overview of the group’s vision on software architecture can be found in their recently published textbook. In short, I figured that if there is one place to present our software architecture research, it must be Irvine.

The talk (90 minutes) itself will be loosely based on my keynote at the Brazilian Software Engineering Symposium (SBES 2012), which in turn is based on joint research with Eric Bouwers and Joost Visser (both from SIG). The full slides are available on SpeakerDeck, but here’s the storyline along with some references.

A Software Risk Assessment (source: ICSM 2009)

The context of this research is a software risk assessment, in which a client using a particular system seeks independent advice (from a consultant) on the technical quality of the system as created by an external supplier.

How can the client be sure that the system made for him is of good quality? In particular, will it be sufficiently maintainable if the business context of the system in question changes? Will it be easy to adapt the system to the ever-changing world?

In situations like these, it is essential to be able to make objective, evidence-based statements about the maintainability of the system in question.

Is this possible? What role can metrics play? What are their inherent limitations? How can we know that a metric indeed captures certain aspects of maintainability? How should metric values be interpreted? How should proposals for new metrics be evaluated?

Simple answers to these questions do not exist. In this talk, I will summarize our current progress in answering these questions.

I will start out by summarizing four common pitfalls of using metrics in a software development project. Then, I will describe a metrics framework in which metrics are put into context by means of benchmarking and a quality model. Subsequently, I’ll zoom in on architectural metrics, focusing on metrics for encapsulation. I will discuss a proposal for a new metric, as well as its evaluation. The evaluation comprises both a quantitative assessment (using repository mining) of its construct validity (does it measure encapsulation?) and qualitative assessments of its usefulness in practice (by interviewing consultants who applied the metrics in their day-to-day work).

Based on this, I will reflect on the road ahead for empirical research in software metrics and architecture, emphasizing the need for shared datasets, as well as the use of qualitative research methods to evaluate practical impact.

The talk is scheduled for Friday March 15, in Irvine — I sincerely hope to see you there!

If you can’t make it, Eric Bouwers and I will present a 3.5-hour tutorial based on this same material at ICSE 2013, in May in San Francisco. The tutorial will be more interactive, taking your experience into account where possible, and it will have a stronger emphasis on metrics (based on SIG’s 10 years of experience with using metrics in industry). Register now for our tutorial; we look forward to seeing you there!


See also: