Hi guys,
I've deployed a new compute node in the cloud, and upped the VCPU allocation in the project, but I'm seeing some oddities.
The "nova-manage service list" command tells me that nova-network and nova-compute are not running on 02 (I knew that and am working on restoring it) and 03 and 05. However, the instances that are running on those nodes are still accessible. Additionally, when I try to create an instance, it immediately errors.
I've had similar problems in the past, and typically I've had to reboot all nodes, starting with the control node (lava-cloud01). I also believe (note wording) I need to do an update/upgrade on every node to bring the cloud up to the latest revisions of nova and its services.
Obviously, if I do this, it will disrupt staging, dogfood and fastmodels, so this is a warning that, absent any dissent, I will reboot the cloud tomorrow morning. If you would rather defer this, let's discuss when to defer it to. Obviously, getting the v8 fast model instance up is paramount.
Thanks
Dave
Dave Pigott wrote:
The "nova-manage service list" command tells me that nova-network and nova-compute are not running on nodes 02, 03 and 05. However, the instances running on those nodes are still accessible.
That makes sense: once the VMs were started with KVM, they will continue to run independently of the processes that started them (nova-*).
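You can double-check that on the compute nodes themselves: the guests are managed by libvirt/KVM rather than by the nova daemons, so even with nova-compute down, something like this should still list them as running:

    # on the affected compute node: ask libvirt directly for its guests
    virsh list --all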
Obviously, if I do this, it will disrupt staging, dogfood and fastmodels, so this is a warning that, absent any dissent, I will reboot the cloud tomorrow morning.
Interrupting staging and dogfood is fine, I think. For fastmodels, if you put the devices offline in the scheduler before the interruption and put them back online after it's done (i.e. control will still be accepting jobs and queuing them), it should be OK.
On 17 Oct 2012, at 14:39, Antonio Terceiro antonio.terceiro@linaro.org wrote:
That makes sense: once the VMs were started with KVM, they will continue to run independently of the processes that started them (nova-*).
Yeah, I kind of figured something like that was going on. The trouble is we've seen so many different types of failure in the process of getting the cloud running that I'm never *quite* sure how broken things have got. :)
Interrupting staging and dogfood is fine, I think. For fastmodels, if you put the devices offline in the scheduler before the interruption and put them back online after it's done (i.e. control will still be accepting jobs and queuing them), it should be OK.
Of course. Was planning on that, but good to be reminded, and I should have listed it in my e-mail.
Thanks
Dave
On 10/17/2012 08:51 AM, Dave Pigott wrote:
Of course. Was planning on that, but good to be reminded, and I should have listed it in my e-mail.
Also - can we make sure that we do a snapshot of each system so that we don't have to re-create the instance?
On 17 Oct 2012, at 15:10, Andy Doan andy.doan@linaro.org wrote:
Also - can we make sure that we do a snapshot of each system so that we don't have to re-create the instance?
Good idea, although the way things should work is that the instances will come up exactly as they were when the nodes were shut down. I'm not proposing to delete the nodes and re-create them.
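For the record, I'll take the snapshots with the nova client from the control node, something along these lines (the snapshot names here are made up):

    # one snapshot per instance; poll afterwards until the images go ACTIVE
    nova image-create FastModels01 fastmodels01-pre-reboot
    nova image-create FastModels03 fastmodels03-pre-reboot
    nova image-list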
Dave
So, to summarise, here is what I'll do, and the order in which I plan to do it:
1) Take fast models offline
2) Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
3) Update/upgrade all cloud nodes
4) Reboot the cloud
5) Work on fastmodels02
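Steps 3 and 4 will be roughly the following, run on the control node (lava-cloud01) first and then on each compute node - a sketch, assuming the nodes are stock Ubuntu:

    # bring packages up to date, then restart the node
    sudo apt-get update && sudo apt-get dist-upgrade -y
    sudo reboot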
Thanks
Dave
Dave Pigott dave.pigott@linaro.org writes:
So, to summarise, here is what I'll do, and the order in which I plan to do it:
- Take fast models offline
- Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
- update/upgrade all cloud nodes
- reboot the cloud
- Work on fastmodels02
+1. Hopefully we don't have to go through all this very often!
Cheers, mwh
OK. Progress (or not) update:
- Take fast models offline
The two fast models were stuck running health jobs (and had been since yesterday afternoon), with nothing in the log. I cancelled the jobs; they got stuck in cancelling. I did a kill -2 on the processes; still stuck. In the end I manually set the board status to offline.
- Take snapshots of dogfood, staging and fastmodels01/03 (can't do 02 as it's broken)
Sigh. Because of the state of the cloud nodes, the snapshotting is stuck: although the instances are running, the control node can't see them.
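They show up via the nova client but never leave the saving state, i.e. (names illustrative):

    # pending snapshots sit in SAVING and never go ACTIVE
    nova image-list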
So my plan from here is to update and reboot all the cloud nodes.
Thanks
Dave
On 18 Oct 2012, at 09:48, Dave Pigott dave.pigott@linaro.org wrote:
Sigh. Because of the state of the cloud nodes, the snapshotting is stuck: although the instances are running, the control node can't see them.
Just to be on the safe side, I thought I'd wait a little longer, and the snapshots have started clearing. One left, so I'll leave it for a little while.
Dave
OK, the good news is that the whole cloud came up cleanly. The bad news is that when I try to create the v8 instance I still get an error. I'm working on tracking that down now, but dogfood et al. are all back again.
Thanks
Dave
So, by inspecting the database, I found a table called "instances" that was huge, but contained a field called "display_name", with one row set to FastModels02, and a field called "deleted", which was set to 0. I set it to 1 and hey presto: FastModels02 is gone, and the compute node came back up!
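In other words, roughly this against the nova database (sanity-check the row first, obviously; the mysql invocation assumes root access on the control node):

    # find the stale row for the broken instance
    mysql nova -e "SELECT id, display_name, deleted FROM instances WHERE display_name LIKE 'FastModels%';"
    # mark it deleted so nova stops tripping over it
    mysql nova -e "UPDATE instances SET deleted = 1 WHERE display_name = 'FastModels02' AND deleted = 0;"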
Now to figure out why I can't start the new instance. :-/
Dave
\o/
We now have a fastmodelsv8 at 192.168.1.54. I'll update gateway hosts accordingly.
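i.e. something like this on the gateway (the host name is just what I plan to use):

    # map the new instance in the gateway's hosts file
    echo "192.168.1.54  fastmodelsv8" | sudo tee -a /etc/hosts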
This is hopefully the end of my e-mails on this subject. :)
Thanks
Dave