On Fri, 11 May 2012 00:30:26 +0200, Alexander Sack asac@linaro.org wrote:
On Fri, May 11, 2012 at 12:24 AM, Ricardo Salveti
Sure, I just think there are better places for it :-) Based on issues we had with LAVA and Jenkins in the previous cycle, if I had one email for every issue, I'd send at least 20 of them, which is useful but that still doesn't make me send them to the list.
Actually, I think the LAVA outage was announced. I poked to get more status updates, so more mails would have been great.
Same goes for ci.linaro.org ... if our CI service used for everything but android is not available, I want to get a mail that this is the case.
So, what this discussion points to is: we need a process for handling disruptions to the services we provide. When the **** hits the fan, the last thing you want people to be doing is _thinking_, or at least, thinking about things that could have been thought through ahead of time and are not totally specific to the incident at hand.
Just recently within the LAVA team, we've started following such a process:
https://wiki.linaro.org/Internal/LAVA/Incidents
(apologies to the non-Linaro insiders for the internal link). The process will look very familiar to anyone who works at Canonical...
Creating a wiki page for each incident can feel a bit heavyweight, but having some kind of defined place for recording details has two major benefits:
1. It means there's a canonical place to go for information while the incident is still in progress.[1]
2. It means that at the end of the month or quarter or whatever you can look back and have _actual data_ for how often various issues come up, rather than relying on vague feelings like "it seems we run out of disk space a lot".
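To make the second point concrete, here is a minimal sketch of the kind of tally a consistent incident log allows. It assumes incidents are available as simple records; the field names (`day`, `service`, `cause`), the function name and the sample entries are all hypothetical and made up for illustration, not taken from the LAVA incident pages.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records; every entry here is invented for illustration.
INCIDENTS = [
    {"day": date(2012, 3, 15), "service": "example-service", "cause": "power"},
    {"day": date(2012, 4, 3), "service": "example-service", "cause": "disk full"},
    {"day": date(2012, 4, 10), "service": "other-service", "cause": "network"},
    {"day": date(2012, 4, 20), "service": "example-service", "cause": "disk full"},
]


def causes_by_frequency(incidents, since):
    """Count incidents per cause on or after `since`, most common first."""
    counts = Counter(i["cause"] for i in incidents if i["day"] >= since)
    return counts.most_common()


if __name__ == "__main__":
    # Report causes for everything recorded since the start of April.
    for cause, count in causes_by_frequency(INCIDENTS, date(2012, 4, 1)):
        print(f"{cause}: {count}")
```

The point is only that once each incident is written down in the same shape, a question like "how often do we run out of disk space?" becomes a count rather than a feeling.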
I created a Google spreadsheet & form for adding details to it in an attempt to reduce the overhead of recording an incident, but after exactly two incidents, we already have an incident that was recorded in a wiki page but not the spreadsheet, so maybe that was premature optimization on my part.
It's only early days but I already feel happier for having this process in place. I'm happy to donate this policy to the wider set of services Linaro runs if there is consensus it would be useful :-)
There is already a page on a related topic:
https://wiki.linaro.org/Internal/Process/DealingWithCrisis
but that seems to me to be aimed at bigger issues than android-build or LAVA being unreachable for an hour.
One thing this thread points out to me though is that our policy does not really cover communication, either within the team or with our users. I'll work on a proposal for that today.
Cheers, mwh
[1] In particular, if an incident goes on for long enough to require handovers between people working on it, then a wiki page like this is downright essential.