Hi guys,
I've deployed a new compute node in the cloud, and upped the VCPU allocation in the project, but I'm seeing some oddities.
The "nova-manage service list" command tells me that nova-network and nova-compute are not running on 02 (I knew that and am working on restoring it) and 03 and 05. However, the instances that are running on those nodes are still accessible. Additionally, when I try to create an instance, it immediately errors.
I've had similar problems in the past, and typically I've had to reboot all nodes, starting with the control node (lava-cloud01). I also believe (note wording) I need to do an update/upgrade on every node to bring the cloud up to the latest revisions of nova and its services.
Obviously, if I do this, it will disrupt staging, dogfood and fastmodels, so this is a warning that, absent any dissent, I will reboot the cloud tomorrow morning. If you would rather defer this, let's discuss when to defer it to. Obviously, getting the v8 fast model instance up is paramount.
Thanks
Dave
Dave Pigott wrote:
The "nova-manage service list" command tells me that nova-network and nova-compute are not running on nodes 02, 03 and 05. However, the instances running on those nodes are still accessible.
That makes sense: once the VMs were started with KVM, they will continue to run independently of the processes that started them (nova-*).
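You can double-check that on the compute nodes themselves: the guests are managed by libvirt/KVM rather than by the nova daemons, so even with nova-compute down, something like this should still list them as running:

    # on the affected compute node: ask libvirt directly for its guests
    virsh list --all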
Obviously, if I do this, it will disrupt staging, dogfood and fastmodels, so this is a warning that, absent any dissent, I will reboot the cloud tomorrow morning.
Interrupting staging and dogfood is fine, I think. For fastmodels, if you put the devices offline in the scheduler before the interruption and put them back online after it's done (i.e. control will still be accepting jobs and queuing them), it should be OK.
On 17 Oct 2012, at 14:39, Antonio Terceiro antonio.terceiro@linaro.org wrote:
That makes sense: once the VMs were started with KVM, they will continue to run independently of the processes that started them (nova-*).
Yeah, I kind of figured something like that was going on. The trouble is we've seen so many different types of failure in the process of getting the cloud running that I'm never *quite* sure how broken things have got. :)
Interrupting staging and dogfood is fine, I think. For fastmodels, if you put the devices offline in the scheduler before the interruption and put them back online after it's done (i.e. control will still be accepting jobs and queuing them), it should be OK.
Of course. Was planning on that, but good to be reminded, and I should have listed it in my e-mail.
Thanks
Dave
On 10/17/2012 08:51 AM, Dave Pigott wrote:
Of course. Was planning on that, but good to be reminded, and I should have listed it in my e-mail.
Also - can we make sure that we do a snapshot of each system so that we don't have to re-create the instance?
On 17 Oct 2012, at 15:10, Andy Doan andy.doan@linaro.org wrote:
Also - can we make sure that we do a snapshot of each system so that we don't have to re-create the instance?
Good idea, although the way things should work is that the instances will come up exactly as they were when the nodes were shut down. I'm not proposing to delete the nodes and re-create them.
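For the record, I'll take the snapshots with the nova client from the control node, something along these lines (the snapshot names here are made up):

    # one snapshot per instance; poll afterwards until the images go ACTIVE
    nova image-create FastModels01 fastmodels01-pre-reboot
    nova image-create FastModels03 fastmodels03-pre-reboot
    nova image-list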
Dave
So, to summarise, here is what I'll do, and the order in which I plan to do it:
1) Take fast models offline
2) Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
3) Update/upgrade all cloud nodes
4) Reboot the cloud
5) Work on fastmodels02
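Steps 3 and 4 will be roughly the following, run on the control node (lava-cloud01) first and then on each compute node - a sketch, assuming the nodes are stock Ubuntu:

    # bring packages up to date, then restart the node
    sudo apt-get update && sudo apt-get dist-upgrade -y
    sudo reboot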
Thanks
Dave
Dave Pigott dave.pigott@linaro.org writes:
So, to summarise, here is what I'll do, and the order in which I plan to do it:
- Take fast models offline
- Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
- update/upgrade all cloud nodes
- reboot the cloud
- Work on fastmodels02
+1. Hopefully we don't have to go through all this very often!
Cheers, mwh
OK. Progress (or not) update:
- Take fast models offline
The two fast models were stuck running health jobs (and had been since yesterday afternoon), with nothing in the log. I cancelled the jobs; they got stuck in cancelling. I did a kill -2 on the processes; still stuck. In the end I manually set the board status to offline.
- Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
Sigh. Because of the state of the cloud nodes, the snapshotting is stuck: although the instances are running, the control node can't see them.
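They show up via the nova client but never leave the saving state, i.e. (names illustrative):

    # pending snapshots sit in SAVING and never go ACTIVE
    nova image-list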
So my plan from here is to update and reboot all the cloud nodes.
Thanks
Dave
On 18 Oct 2012, at 09:48, Dave Pigott dave.pigott@linaro.org wrote:
Sigh. Because of the state of the cloud nodes, the snapshotting is stuck: although the instances are running, the control node can't see them.
Just to be on the safe side, I thought I'd wait a little longer, and the snapshots have started clearing. One left, so I'll leave it for a little while.
Dave
OK, the good news is that the whole cloud came up cleanly. The bad news is that when I try to create the v8 instance I still get an error. I'm working on tracking that down now, but dogfood et al. are all back again.
Thanks
Dave
So, by inspecting the database, I found a table called "instances" that was huge, but contained a field called "display_name", with one row set to FastModels02, and a field called "deleted", which was set to 0. I set it to 1 and hey presto: FastModels02 is gone, and the compute node came back up!
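In other words, roughly this against the nova database (sanity-check the row first, obviously; the mysql invocation assumes root access on the control node):

    # find the stale row for the broken instance
    mysql nova -e "SELECT id, display_name, deleted FROM instances WHERE display_name LIKE 'FastModels%';"
    # mark it deleted so nova stops tripping over it
    mysql nova -e "UPDATE instances SET deleted = 1 WHERE display_name = 'FastModels02' AND deleted = 0;"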
Now to figure out why I can't start the new instance. :-/
Dave
\o/
We now have a fastmodelsv8 at 192.168.1.54. I'll update gateway hosts accordingly.
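i.e. something like this on the gateway (the host name is just what I plan to use):

    # map the new instance in the gateway's hosts file
    echo "192.168.1.54  fastmodelsv8" | sudo tee -a /etc/hosts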
This is hopefully the end of my e-mails on this subject. :)
Thanks
Dave