I've started a google doc on this subject: https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb...
Team members, feel free to edit as appropriate. Anyone else with thoughts can add comments.
Cheers, mwh
Michael Hudson-Doyle michael.hudson@canonical.com writes:
I've started a google doc on this subject: https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb...
I've updated this to include thoughts on how to upgrade postgres to 9.1. I'd like to do this soon (my next Monday, perhaps? They're usually quiet. There will be a few minutes of total LAVA downtime).
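For reference, on Ubuntu the Debian packaging can drive this kind of version bump; a minimal sketch, assuming the existing cluster is the default 8.4/main one (the cluster name and version are assumptions, not confirmed from the doc):

```shell
# Stop LAVA first so nothing writes to the database mid-upgrade
sudo service apache2 stop

# Debian/Ubuntu helper: creates a new 9.1 cluster, migrates the data,
# and moves the old 8.4 cluster off the default port
sudo pg_upgradecluster 8.4 main

# Once the new cluster checks out, the old one can be dropped
sudo pg_dropcluster 8.4 main

sudo service apache2 start
```

This is an ops sketch against live system services, not something to run blindly; the actual steps are in the google doc.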
Setting up streaming replication to enable us to cut over to the failover system is starting to seem like too much work for this. We could just rsync the postgres data across after shutting down LAVA on control and rsync it back again just before restarting it after the upgrade.
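The rsync approach might look something like this; a rough sketch, assuming the data directory is in the default location and the standby is reachable as "fallback" (both names are assumptions):

```shell
# On control, with LAVA already stopped:
sudo service postgresql stop

# Push the whole data directory to the standby; -a preserves
# ownership/permissions, --delete keeps the two copies identical
sudo rsync -a --delete /var/lib/postgresql/ fallback:/var/lib/postgresql/

# ...the standby accepts jobs while the upgrade runs on control...

# Just before restarting LAVA on control, pull any changes back:
sudo rsync -a --delete fallback:/var/lib/postgresql/ /var/lib/postgresql/
sudo service postgresql start
```

Note this only works cleanly if postgres is stopped on whichever side is being copied from.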
The other option is to just take the downtime while the upgrade happens. It really shouldn't take that long, but I can't really think of a way to gauge the potential cost ahead of time.
Team members, feel free to edit as appropriate. Anyone else with thoughts can add comments.
Still true!
Cheers, mwh
On 01/06/2013 06:21 PM, Michael Hudson-Doyle wrote:
Michael Hudson-Doyle michael.hudson@canonical.com writes:
I've started a google doc on this subject: https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb...
I've updated this to include thoughts on how to upgrade postgres to 9.1. I'd like to do this soon (my next Monday, perhaps? They're usually quiet. There will be a few minutes of total LAVA downtime).
that's cool and including the script is really great.
Setting up streaming replication to enable us to cut over to the failover system is starting to seem like too much work for this. We could just rsync the postgres data across after shutting down LAVA on control and rsync it back again just before restarting it after the upgrade.
The other option is to just take the downtime while the upgrade happens. It really shouldn't take that long, but I can't really think of a way to gauge the potential cost ahead of time.
From what I read the downtime would probably be less than 3 minutes. I hate to waste developer *days* to save 3 minutes of uptime.
Team members, feel free to edit as appropriate. Anyone else with thoughts can add comments.
Still true!
Andy Doan andy.doan@linaro.org writes:
On 01/06/2013 06:21 PM, Michael Hudson-Doyle wrote:
Michael Hudson-Doyle michael.hudson@canonical.com writes:
I've started a google doc on this subject: https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb...
I've updated this to include thoughts on how to upgrade postgres to 9.1. I'd like to do this soon (my next Monday, perhaps? They're usually quiet. There will be a few minutes of total LAVA downtime).
that's cool and including the script is really great.
OK. I'll send an announcement about the downtime today.
Setting up streaming replication to enable us to cut over to the failover system is starting to seem like too much work for this. We could just rsync the postgres data across after shutting down LAVA on control and rsync it back again just before restarting it after the upgrade.
The other option is to just take the downtime while the upgrade happens. It really shouldn't take that long, but I can't really think of a way to gauge the potential cost ahead of time.
From what I read the downtime would probably be less than 3 minutes. I hate to waste developer *days* to save 3 minutes of uptime.
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Team members, feel free to edit as appropriate. Anyone else with thoughts can add comments.
Still true!
Cheers, mwh
On 7 Jan 2013, at 20:38, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Andy Doan andy.doan@linaro.org writes:
On 01/06/2013 06:21 PM, Michael Hudson-Doyle wrote:
Michael Hudson-Doyle michael.hudson@canonical.com writes:
I've started a google doc on this subject: https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb...
I've updated this to include thoughts on how to upgrade postgres to 9.1. I'd like to do this soon (my next Monday, perhaps? They're usually quiet. There will be a few minutes of total LAVA downtime).
that's cool and including the script is really great.
OK. I'll send an announcement about the downtime today.
+1 - Made a lot of sense when I read through the script last night.
Setting up streaming replication to enable us to cut over to the failover system is starting to seem like too much work for this. We could just rsync the postgres data across after shutting down LAVA on control and rsync it back again just before restarting it after the upgrade.
The other option is to just take the downtime while the upgrade happens. It really shouldn't take that long, but I can't really think of a way to gauge the potential cost ahead of time.
From what I read the downtime would probably be less than 3 minutes. I hate to waste developer *days* to save 3 minutes of uptime.
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
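By way of illustration, keeping dnsmasq away from that range just means the DHCP pool never includes it; a hypothetical config excerpt (the filename and the actual ranges are assumptions):

```
# /etc/dnsmasq.d/lab.conf (hypothetical): the DHCP range stays inside
# 192.168.1.x, so addresses in 192.168.0.x are never handed out
dhcp-range=192.168.1.10,192.168.1.250,12h
```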
Thanks
Dave
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
On 9 Jan 2013, at 02:18, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
No problem. I'll email when it's done.
Dave
Dave Pigott dave.pigott@linaro.org writes:
On 9 Jan 2013, at 02:18, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
No problem. I'll email when it's done.
I can't remember the status of this, but in any case I've set up a fallback instance on dispatcher01 (I'd forgotten about the control-backup vm! That can be deleted now, I guess). You can access this instance at http://fallback.validation.linaro.org/ (Apache / LAVA on the server think they are serving "validation.linaro.org" and mod_headers trickery on gateway is mapping that from and to fallback.validation.linaro.org -- this seems like better preparation for cutting between control and this node).
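For the curious, the gateway-side mapping could be done with a vhost along these lines; this is a hypothetical sketch rather than the actual config (hostnames and the proxy target are assumptions):

```
<VirtualHost *:80>
    ServerName fallback.validation.linaro.org
    # The backend believes it serves validation.linaro.org, so proxy to
    # that name (resolved to dispatcher01's address, e.g. via /etc/hosts)
    ProxyPreserveHost Off
    ProxyPass / http://validation.linaro.org/
    ProxyPassReverse / http://validation.linaro.org/
    # mod_headers: map absolute redirects back to the public fallback name
    Header edit Location ^http://validation\.linaro\.org/ http://fallback.validation.linaro.org/
</VirtualHost>
```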
I've updated the google doc at https://docs.google.com/a/linaro.org/document/d/1K_FrpM0qaDCKd6fRHyt_NDf10lb... to have some very specific instructions around shuffling database data around (maybe these should be run via salt from gateway? Might eliminate some potential for slip-ups).
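Driving those steps from gateway via salt might look like the following; minion ids and the exact commands are hypothetical, the point is just that each step is logged and runs on the intended host:

```shell
# Hypothetical cutover driven from gateway via salt's cmd.run module
salt 'control' cmd.run 'service postgresql stop'
salt 'control' cmd.run 'rsync -a --delete /var/lib/postgresql/ dispatcher01:/var/lib/postgresql/'
salt 'dispatcher01' cmd.run 'service postgresql start'
```

Running them by hand over ssh works too, of course; salt mostly buys repeatability.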
I've also updated the work items a bit on https://blueprints.launchpad.net/lava-lab/+spec/control-12.04. I wonder about these two:
Run full backup of control using something like "partimage": TODO
Do a backup of known important files (/usr/local/bin, /etc, /srv/lava, etc): TODO
I guess the former is a good idea, although if the upgrade explodes I favour installing precise from scratch over restoring control from a partition backup. For the latter, I think all these files are in salt now.
I'd like to practice the "cut between control and fallback" steps next Monday, which will mean some small downtime. I'll send an announcement soon.
Cheers, mwh
On 9 Jan 2013, at 02:18, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
OK. Instance created. Just doing all the config. However, there's something odd. We've run out of RAM quota, although the total amount we've used isn't anywhere near the total available. I suspect (but don't know yet) that this is due to the number of instances running on individual nodes. I may need to update and reboot the cloud. :(
The RAM I could allocate to control-backup was only 2GB. Hope it's enough.
Dave
On 9 Jan 2013, at 09:22, Dave Pigott dave.pigott@linaro.org wrote:
On 9 Jan 2013, at 02:18, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
OK. Instance created. Just doing all the config. However, there's something odd. We've run out of RAM quota, although the total amount we've used isn't anywhere near the total available. I suspect (but don't know yet) that this is due to the number of instances running on individual nodes. I may need to update and reboot the cloud. :(
The RAM I could allocate to control-backup was only 2GB. Hope it's enough.
Dave
All configured. LAVA team have ssh and sudo.
Thanks
Dave
Dave Pigott dave.pigott@linaro.org writes:
On 9 Jan 2013, at 02:18, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Dave Pigott dave.pigott@linaro.org writes:
OK. I'd still like to have a system ready to go as a standby in case the upgrade implodes so that we can continue to accept jobs at least.
Do you want me to create a cloud node called something like control-backup? I widened the pool of public IPs for the cloud yesterday to add 192.168.0.x so this *should* work - the reason we were hitting the stops before was that we had run out of public floating IPs. As luck would have it the private pool has always been wide enough, because I had planned for when we widened the IP address space. And just to reassure you: dnsmasq is configured to never serve from the 192.168.0.x address range.
Hm. I guess that makes sense, yes -- can you do that tomorrow?
Cheers, mwh
OK. Instance created. Just doing all the config. However, there's something odd. We've run out of RAM quota, although the total amount we've used isn't anywhere near the total available. I suspect (but don't know yet) that this is due to the number of instances running on individual nodes. I may need to update and reboot the cloud. :(
The RAM I could allocate to control-backup was only 2GB. Hope it's enough.
Hm, sounds pretty marginal to me... maybe using dispatcher01 would be better.
Cheers, mwh
On 01/09/2013 03:03 PM, Michael Hudson-Doyle wrote:
OK. Instance created. Just doing all the config. However, there's something odd. We've run out of RAM quota, although the total amount we've used isn't anywhere near the total available. I suspect (but don't know yet) that this is due to the number of instances running on individual nodes. I may need to update and reboot the cloud. :(
The RAM I could allocate to control-backup was only 2GB. Hope it's enough.
Hm, sounds pretty marginal to me... maybe using dispatcher01 would be better.
It's not much - but how much do we need to run postgres and an instance of LAVA with all boards disabled?
Andy Doan andy.doan@linaro.org writes:
On 01/09/2013 03:03 PM, Michael Hudson-Doyle wrote:
OK. Instance created. Just doing all the config. However, there's something odd. We've run out of RAM quota, although the total amount we've used isn't anywhere near the total available. I suspect (but don't know yet) that this is due to the number of instances running on individual nodes. I may need to update and reboot the cloud. :(
The RAM I could allocate to control-backup was only 2GB. Hope it's enough.
Hm, sounds pretty marginal to me... maybe using dispatcher01 would be better.
It's not much - but how much do we need to run postgres and an instance of LAVA with all boards disabled?
I don't know to be honest. Some things abuse postgres pretty hard, but currently it isn't using much RAM at all...
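If RAM on the standby turns out to be tight, postgres can be kept well under 2GB with conservative settings; a hypothetical postgresql.conf excerpt (values are illustrative, not measured against our workload):

```
# Hypothetical settings for a small 2GB standby box
shared_buffers = 256MB       # roughly 1/8 of RAM on a small machine
work_mem = 4MB               # per-sort/hash allocation; keep low here
maintenance_work_mem = 64MB
effective_cache_size = 1GB   # planner hint about available OS cache
```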
linaro-validation@lists.linaro.org