
The curse of failing builds

Ruchir Sanghavi, Principal Developer, describes a strategy for reducing the impact of software builds that fail

18 October 2017

I haven't met a single developer who likes failing builds. It's the same old story: you check some code in and, after a few hours or even days, a colleague comes over and points out that you have broken a build. At first you refuse to believe it and instinctively want to challenge them, until eventually the proof is right in front of you. You accept responsibility and take on the task of fixing it as soon as possible so that everyone forgets about it.

That used to be my thinking, but now this has changed. In fact, it has changed quite radically. Why? Well, through years of experience, I've come to realise that builds will always fail - no matter how hard you try. And when you think about it, isn't that the point of a build? To catch errors or faults, notify developers of any issues and, if everything looks good, only then build the product?

So the important thing, in my view, is that we embrace build failures and react to them quickly, with the tools and techniques in place to let the team do so effectively. But therein lies the challenge.

Identifying the main problems

At a high level, three underlying causes usually lead to a lack of build discipline:

  • Long builds
  • Lack of tools
  • Not using the right techniques

We'll examine each of these separately...

Long builds

As a developer, I want to know as soon as possible that I've caused a build to fail. This means having builds that are really quick to run - and when I say quick, I mean 'less than five minutes' quick. The reason is that, after five minutes, chances are I will be busy with something else - so the quicker I receive feedback, the quicker I can fix the problem. We want to reduce developers' 'wait time' on builds. The effect really comes into play when lots of builds are running and lots of developers are continuously checking in code - if something is broken, you need to know quickly what has gone wrong, otherwise other developers will be held up. Long builds also hurt productivity because they force developers to 'context switch': by the time I receive feedback on what caused the build failure, chances are I will be deeply embroiled in some other task.

This means I will have to spend time refocusing and mentally switching back to the build failure. Long builds should also be seen as 'technical debt' in the product, on which developers pay a high rate of interest (by waiting around!). Traditionally, technical debt is seen as code within the product, but any code that is used to build the product should also perform well and be kept up to date. As such, it is important that any strategy for reducing technical debt identifies slow builds and aims to improve build performance.
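To make the 'less than five minutes' target something a team can track, here is a minimal sketch of a check that lists the builds that have gone over budget. It is only an illustration: the `BuildRun` type, the job names and the way durations are obtained are all assumptions, not a description of our actual tooling.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed budget, taken from the 'less than five minutes' target above.
BUILD_BUDGET = timedelta(minutes=5)


@dataclass(frozen=True)
class BuildRun:
    name: str            # hypothetical build job name
    duration: timedelta  # wall-clock time of the most recent run


def over_budget(runs, budget=BUILD_BUDGET):
    """Return the runs that took longer than the budget, slowest first."""
    slow = [run for run in runs if run.duration > budget]
    return sorted(slow, key=lambda run: run.duration, reverse=True)


if __name__ == "__main__":
    # Made-up durations, purely for illustration.
    runs = [
        BuildRun("core-compile", timedelta(minutes=3, seconds=40)),
        BuildRun("plsql-package", timedelta(minutes=12)),
        BuildRun("ui-bundle", timedelta(minutes=7, seconds=5)),
    ]
    for run in over_budget(runs):
        print(f"{run.name}: {run.duration - BUILD_BUDGET} over budget")
```

A list like this, slowest first, is one way of deciding which builds to tackle first when paying down build-related technical debt.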

Lack of tools

A good mantra we've been following is 'The right tool for the right job'. Aquila Heywood has invested heavily in new tools to improve build quality. We have moved away from SVN and adopted Git (just for checkouts and clones - these used to take 25 minutes and are now down to seconds!). Alongside this, we have adopted SonarQube Quality Gates, which are evaluated as part of the builds, so that any issues found by SonarQube will fail the quality gate for both Java and PL/SQL projects.
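As a rough illustration of a quality gate failing a build, the sketch below turns a gate result into an exit code for a build step. It assumes a payload shaped like the response of SonarQube's `api/qualitygates/project_status` web service (a `projectStatus` object with a `status` and a list of `conditions`). The sample payload is made up, and fetching it from a server is deliberately left out.

```python
import json


def failed_conditions(payload):
    """List the metric keys of the conditions that failed the gate."""
    conditions = payload["projectStatus"].get("conditions", [])
    return [c["metricKey"] for c in conditions if c["status"] == "ERROR"]


def gate_exit_code(payload):
    """Return 0 if the gate passed and 1 otherwise.

    Anything other than "OK" is treated as a failure here; that is a
    design choice for this sketch, not SonarQube's own rule.
    """
    return 0 if payload["projectStatus"]["status"] == "OK" else 1


if __name__ == "__main__":
    # Made-up sample in the assumed response shape.
    sample = json.loads("""
    {
      "projectStatus": {
        "status": "ERROR",
        "conditions": [
          {"status": "ERROR", "metricKey": "new_coverage"},
          {"status": "OK", "metricKey": "new_reliability_rating"}
        ]
      }
    }
    """)
    print("failed conditions:", failed_conditions(sample))
    print("exit code for the build step:", gate_exit_code(sample))
```

A build step that exits with a non-zero code is what turns a red quality gate into a red build.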

So you have all sorts of builds, and they may even be quick, but who is going to monitor them? Developers are too busy and need a quick way of finding out if something has failed. Traditional emails don't work - you get too many of those! So we've gone on to set up 'build radiators' for each team - a build radiator is essentially a big monitor that can be seen by the entire team and gives a direct view of the status of all builds and Quality Gates. Each tile has either a green or a red background, showing the entire team at a glance if something is broken. If a tile is red and a developer has already started to work on it, the build radiator shows that the failure is assigned to them, so everyone else is aware that it's being taken care of.

Example of a build radiator
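The tile logic of a radiator like the one described above can be sketched in a few lines. This is a text-only illustration under stated assumptions: the `Tile` type, the build names and the assignee field are hypothetical, and a real radiator would pull its statuses from the build server rather than from a hard-coded list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tile:
    name: str                       # hypothetical build or quality gate name
    passing: bool                   # True means green, False means red
    assignee: Optional[str] = None  # who has picked up a failure, if anyone


def render_tile(tile):
    """Render one tile as a line of text."""
    if tile.passing:
        return f"[GREEN] {tile.name}"
    owner = tile.assignee or "unassigned"
    return f"[RED]   {tile.name} ({owner})"


def render_radiator(tiles):
    """Render all tiles, red ones first so that failures stand out."""
    # False sorts before True, so failing tiles come first; the sort is
    # stable, so tiles otherwise keep their original order.
    ordered = sorted(tiles, key=lambda tile: tile.passing)
    return "\n".join(render_tile(tile) for tile in ordered)


if __name__ == "__main__":
    tiles = [
        Tile("team-a/compile", passing=True),
        Tile("team-a/unit-tests", passing=False, assignee="dev1"),
        Tile("team-a/quality-gate", passing=False),
    ]
    print(render_radiator(tiles))
```

Showing the assignee on a red tile is what tells the rest of the team that a failure has already been picked up.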


Using the right techniques

All these tools are great, but they are effective only if they are adopted properly. This is where we rely on our people and the lightweight process that they put in place to ensure that the right techniques are applied. For example, we have added 'all builds are green' and 'quality gate is green' to our Scrum Definition of Done for stories taken into a Sprint. If something is red, your story can't get to the Done stage, which means there is nothing to show in the Sprint Review! Now that will catch everyone's attention.

Another technique that has helped is the branching model that we have created. We know that developers are not perfect: we can easily break things, and it happens - that's a fact. So we give each team its own branch and development builds to play with. Outside the team, no-one else cares whether the builds on those branches are green or red (each team even has its own build radiator to monitor them!). The area that must be green, however, is our Integration environment: this is where we carry out the testing, and we endeavour to keep it as green as possible.
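The two rules above (a story is Done only when builds and quality gate are green, and only Integration has to stay green) can be expressed as a small check. The following is a sketch under assumptions: the `StoryChecks` type and the branch names are hypothetical, and a branch with no known status is treated as red.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryChecks:
    builds_green: bool        # 'all builds are green'
    quality_gate_green: bool  # 'quality gate is green'


def is_done(checks):
    """A story can reach Done only if both checks are green."""
    return checks.builds_green and checks.quality_gate_green


def blocking_branches(branch_status, must_be_green=("integration",)):
    """Return the branches that must be green but currently are not.

    branch_status maps a branch name to True (green) or False (red).
    Team branches are ignored unless listed in must_be_green.
    """
    return [b for b in must_be_green if not branch_status.get(b, False)]


if __name__ == "__main__":
    print(is_done(StoryChecks(builds_green=True, quality_gate_green=False)))
    status = {"team-a": False, "team-b": True, "integration": True}
    print(blocking_branches(status))
```

Here a red team branch blocks nothing outside the team, while a red Integration branch would show up as blocking.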

It takes time...

Overall, I have noticed that it takes a combined team effort to achieve build discipline. What matters is that individuals take responsibility for the code they are checking in. That responsibility extends from individuals, to teams, to the entire department and across the Company. Hence it is vital to make build discovery and build semantics (that is, how the builds work, what they produce and how different builds interact with each other) transparent to all developers, which is best achieved through knowledge transfer. Builds should no longer be a 'black box' that only the privileged few understand: they should be something that we all embrace and continuously try to improve. At Aquila Heywood, we're always trying to find ways of doing this - long may that continue!

Ruchir Sanghavi is a Principal Developer at Aquila Heywood, the largest supplier of life and pensions administration software solutions in the UK.
