Hi all,
I'm going to test some scripts I wrote to fail the LAVA database over to another server in a couple of hours (we will use this during the upgrade of precise to control too). This will cause two forms of disruption:
1) I've already offlined all boards and am waiting for the jobs to finish, so you might have to wait a little longer for your LAVA jobs to finish.
2) There will be some very short moments of complete outage as the failover happens.
Apologies in advance if this causes you difficulties -- but I hope having better disaster recovery for the lab is a good goal :-)
Cheers, mwh
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Hi all,
I'm going to test some scripts I wrote to fail the LAVA database over to another server in a couple of hours (we will use this during the upgrade of precise to control too). This will cause two forms of disruption:
1) I've already offlined all boards and am waiting for the jobs to finish, so you might have to wait a little longer for your LAVA jobs to finish.
2) There will be some very short moments of complete outage as the failover happens.
Apologies in advance if this causes you difficulties -- but I hope having better disaster recovery for the lab is a good goal :-)
So this didn't quite work out -- the outage was probably a few minutes in total, and the failover didn't succeed. Three problems:
1) trivial syntax mistakes in my script (not a problem really)
2) my scripts only changed http traffic to point at the failover node, not https
3) the failover node was configured to serve lava at /, not /lava-server/ as we currently do for production (for extremely hysterical raisins)
All the above is easily enough fixed, and I'll try again tomorrow.
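Problems 2 and 3 are the kind of thing a pre-flight check against the failover node could catch before any traffic is switched. What follows is only a minimal sketch, not the actual failover scripts: the host name `failover.example` and the helper names `urls_to_check`, `preflight` and `real_fetch_status` are made up for illustration, and it assumes a plain HTTP 200 on the production prefix means the node is serving lava there.

```python
import urllib.error
import urllib.request
from urllib.parse import urlunsplit


def urls_to_check(host, prefix="/lava-server/", schemes=("http", "https")):
    """Return one URL per scheme, all under the given path prefix."""
    # Normalise the prefix so it always starts and ends with a slash.
    stripped = prefix.strip("/")
    path = "/" + stripped + "/" if stripped else "/"
    return [urlunsplit((scheme, host, path, "", "")) for scheme in schemes]


def preflight(host, fetch_status, prefix="/lava-server/"):
    """Map each URL to True if fetch_status(url) returned HTTP 200."""
    return {url: fetch_status(url) == 200 for url in urls_to_check(host, prefix)}


def real_fetch_status(url, timeout=5):
    """Fetch url and return the HTTP status, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return err.code
    except OSError:
        # Covers URLError, refused connections, timeouts and TLS failures.
        return None


if __name__ == "__main__":
    # Hypothetical host name; substitute the real failover node.
    for url, ok in preflight("failover.example", real_fetch_status).items():
        print(("OK   " if ok else "FAIL ") + url)
```

Passing the fetcher in as an argument keeps the URL logic testable without touching the network; a node serving only http, or serving lava at / instead of /lava-server/, would show up as a FAIL line.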
Apologies again for the disruption.
Cheers, mwh
Hi Michael,
First, let me introduce Matt Hart, our new Lava Lab Engineer. He and I are going to work on this failover and upgrade in the next week, and if possible, I'd like to find out where we are with this. If you haven't had time to look at this, maybe Matt could pick it up?
Thanks
Dave
On 17 Apr 2013, at 01:35, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Hi all,
I'm going to test some scripts I wrote to fail the LAVA database over to another server in a couple of hours (we will use this during the upgrade of precise to control too). This will cause two forms of disruption:
- I've already offlined all boards and am waiting for the jobs to finish, so you might have to wait a little longer for your LAVA jobs to finish.
- There will be some very short moments of complete outage as the failover happens.
Apologies in advance if this causes you difficulties -- but I hope having better disaster recovery for the lab is a good goal :-)
So this didn't quite work out -- the outage was probably a few minutes in total, and the failover didn't succeed. Three problems:
- trivial syntax mistakes in my script (not a problem really)
- my scripts only changed http traffic to point at the failover node, not https
- the failover node was configured to serve lava at /, not /lava-server/ as we currently do for production (for extremely hysterical raisins)
All the above is easily enough fixed, and I'll try again tomorrow.
Apologies again for the disruption.
Cheers, mwh
linaro-validation mailing list linaro-validation@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-validation
Dave Pigott dave.pigott@linaro.org writes:
Hi Michael,
First, let me introduce Matt Hart, our new Lava Lab Engineer. He and I are going to work on this failover and upgrade in the next week, and if possible, I'd like to find out where we are with this. If you haven't had time to look at this, maybe Matt could pick it up?
So where we are: I ran the scripts, but quite late at night, and I forgot to shut down lava on the other nodes, so things didn't quite go right. That said, they did seem to work overall. Another test would be good (although it takes a long time for the currently running jobs to finish once you offline the boards), but I'm about 90% confident they will work during an upgrade.
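The step that got forgotten (shutting down lava on the other nodes) could be turned into a guard that runs before the failover proper. This is a rough sketch only, not part of the existing scripts: the node names, the "stopped"/"running" state strings and the function names are all hypothetical, and how the states and the job count are actually gathered is left out.

```python
def nodes_still_running(node_states):
    """Return the sorted names of nodes whose lava is not stopped."""
    return sorted(
        name for name, state in node_states.items() if state != "stopped"
    )


def failover_blockers(node_states, running_jobs):
    """Return a list of reasons not to fail over yet (empty means go ahead)."""
    blockers = []
    running = nodes_still_running(node_states)
    if running:
        blockers.append("lava still running on: " + ", ".join(running))
    if running_jobs > 0:
        blockers.append(f"{running_jobs} job(s) still running")
    return blockers


if __name__ == "__main__":
    # Hypothetical snapshot of the nodes other than the failover target.
    states = {"node-a": "stopped", "node-b": "running"}
    for reason in failover_blockers(states, running_jobs=2):
        print("refusing to fail over:", reason)
```

Returning the reasons rather than a bare yes/no means the script can say exactly what was missed, which helps when it is being run late at night.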
On the third hand, trying to do the upgrade with minimal downtime is perhaps not worth the effort. Maybe we should just pick a date, announce it and do the upgrade with downtime (especially now that oneiric is officially unsupported).
Cheers, mwh