Incident Management (was: Re: pointless mail, (was Re: android-build's are failing...))

11 May 2012


On Fri, 11 May 2012 00:30:26 +0200, Alexander Sack <asac@linaro.org> wrote:
...
On Fri, May 11, 2012 at 12:24 AM, Ricardo Salveti
...
Sure, I just think there are better places for it :-) Based on issues
we had with LAVA and Jenkins at the previous cycle, if I had one email
for every issue, I'd send at least 20 of them, which is useful but
that still doesn't make me send them to the list.
Actually, I think LAVA outage was announced. I poked for getting more
status updates, so more mails would have been great.
Same goes for ci.linaro.org ... if our CI service used for everything
but android is not available, I want to get a mail that this is the
case.
So, what this discussion points to is: we need a process for handling
disruptions to the services we provide.  When the **** hits the fan, the
last thing you want people to be doing is _thinking_, or at least,
thinking about things that could have been thought through ahead of
time and are not totally specific to the incident at hand.
Just recently within the LAVA team, we've started following such a
process:
https://wiki.linaro.org/Internal/LAVA/Incidents
(apologies to the non-Linaro insiders for the internal link).  The
process will look very familiar to anyone who works at Canonical...
Creating a wiki page for each incident can feel a bit heavyweight, but
having some kind of defined place for recording details has two massive
values:
1. It means there's a canonical place to go for information while the
    incident is still in progress.[1]
2. It means that at the end of the month or quarter or whatever you can
    look back and have _actual data_ for how often various issues come
    up, rather than relying on vague feelings like "it seems we run out
    of disk space a lot".
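As a rough sketch of the second point, assuming each incident is
recorded with a date, a service and a short cause label (the field
names and the sample records below are made up for illustration, not
taken from the LAVA incident pages), a periodic tally could look like:

```python
from collections import Counter
from datetime import date

# Hypothetical incident records; dates, services and causes are invented.
incidents = [
    {"day": date(2012, 4, 3), "service": "lava", "cause": "disk full"},
    {"day": date(2012, 4, 19), "service": "android-build", "cause": "network"},
    {"day": date(2012, 5, 2), "service": "lava", "cause": "disk full"},
    {"day": date(2012, 7, 9), "service": "lava", "cause": "network"},
]


def causes_in_period(records, start, end):
    """Count incidents per cause for records with start <= day <= end."""
    return Counter(r["cause"] for r in records if start <= r["day"] <= end)


# Tally one quarter and list the most frequent causes first.
counts = causes_in_period(incidents, date(2012, 4, 1), date(2012, 6, 30))
for cause, n in counts.most_common():
    print(f"{n:3d}  {cause}")
```

The same tally works whether the records come from wiki pages or a
spreadsheet export, as long as every incident ends up in one place.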
I created a Google spreadsheet & form for adding details to it in an
attempt to reduce the overhead of recording an incident, but after
exactly two incidents, we already have an incident that was recorded in
a wiki page but not the spreadsheet, so maybe that was premature
optimization on my part.
It's only early days but I already feel happier for having this process
in place.  I'm happy to donate this policy to the wider set of services
Linaro runs if there is consensus it would be useful :-)
There is already a page on a related topic:
https://wiki.linaro.org/Internal/Process/DealingWithCrisis
but that seems to me to be aimed at bigger issues than android-build or
LAVA being unreachable for an hour.
One thing this thread points out to me though is that our policy does
not really cover communication, either within the team or with our
users.  I'll work on a proposal for that today.
Cheers,
mwh
[1] In particular, if an incident goes on for long enough to require
    handovers between people working on it, then a wiki page like this
    is downright essential.
