Reflections on System Downtime on 30 October

As you may have experienced, we had an unplanned suspension of service on Wednesday last week, when the Debitoor application was unavailable for 2.5 hours and Quotes remained unavailable for a further 4.5 hours.

As we aim to have Debitoor available for you around the clock, this is not something we are proud of, and we deeply apologise for any inconvenience this downtime caused.

In the following, I’ll detail what happened, what we did to address the issue and what actions we’re taking to prevent it from happening again.

Timeline of events

10:40 am GMT: Two developers prepared and ran an update against the production system. The update was not intended to interrupt service and is a routine job that we do almost every day to bring you new features and keep things running smoothly. This time the update did not run as expected and it was immediately discovered that Quotes were no longer available in the application. Red Alert!

10:45 am GMT: A decision was made to take the application offline to investigate further.

1:15 pm GMT: Confident that no damage could be done, we put Debitoor back online, but with Quotes functionality disabled. The investigation continued.

1:41 pm GMT: It was discovered that Expenses also suffered from a similar problem, and that its cause was another update made to the system a day earlier.

2:30 pm GMT: The problem with Expenses was fixed and rolled out to production.

5:30 pm GMT: After having been disabled for 7 hours, Quotes were finally re-enabled and Debitoor was back to normal operations.

Root cause

The process of deploying changes to Debitoor is highly automated and we rely on a set of deployment tools to do it as often as needed, without human interaction and - most important of all - without any downtime for you.

A week before this incident, we had made changes to one of our deployment tools, along with updates to Debitoor that depended on the changes in the deployment tool. These changes had gone through both a peer review and a testing process.

Obviously, something failed this time, so we have since spent a lot of time analyzing exactly what went wrong, in order to avoid making the same mistake again.

Two failures have been identified during our retrospective:

  • Due to a human error, the update to Debitoor was rolled out without the changes to the deployment tool.

  • Review and testing of the new functionality were mistakenly done in an environment that had the wrong version of the deployment tool.

Learning from mistakes

Learning in general - and in particular from mistakes - is at the heart of running an online service like Debitoor. We aim to enforce the policy of “Don't let downtime happen twice for the same reason” and as a consequence of last week’s events, we have taken a number of actions which will prevent this from happening in the future.

This includes:

  • improving our manual procedures for applying this type of change

  • implementing additional automated checks to support the procedures that caused problems.

We are confident that these actions will prevent the mistake from happening again.
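
For the technically curious, here is a rough idea of what an automated check of this kind could look like. It is a minimal sketch in Python, not our actual tooling: the function names and version numbers are made up for illustration, and it assumes the deployment tool reports a plain dotted version number. The idea is simply that a rollout is refused when the deployment tool in the target environment is older than the version the update depends on.

```python
from typing import Tuple


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.4.1' into a tuple, padded with zeros.

    Padding makes '2.4' and '2.4.0' compare as equal.
    """
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def check_tool_version(installed: str, required_minimum: str) -> None:
    """Raise RuntimeError if the installed tool is older than required.

    Both arguments are hypothetical inputs: in practice 'installed' would
    come from the target environment and 'required_minimum' from the update.
    """
    if parse_version(installed) < parse_version(required_minimum):
        raise RuntimeError(
            f"Deployment tool {installed} is older than the required "
            f"{required_minimum}; aborting rollout."
        )


def safe_to_deploy(installed: str, required_minimum: str) -> bool:
    """Return True when the rollout may proceed, False otherwise."""
    try:
        check_tool_version(installed, required_minimum)
    except RuntimeError:
        return False
    return True
```

Running the same check in the test environment and in production would also flag the second failure above, where testing happened against the wrong tool version.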

Finally

We can’t apologise enough for what happened. We know how it feels to see your work go offline and are working very hard to ensure that it does not happen again.