Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
There were a few wrinkles along the way. Some were due to the fact that changing which package provides a certain module (lava.utils.interface in this case, which moved from lava-server to its own package) appears to confuse things a bit.
Once the new code was in place, I prepared to run health check jobs. I used the new admin action to mark all 'passing' boards as 'unknown' health state, cleared out /linaro/image enough to have plenty of disk space while the health jobs ran and then turned the scheduler back on.
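(For the curious, the admin action is roughly this shape -- a sketch only, the model and field names are illustrative rather than necessarily what lava-server really uses:)

    # Sketch of a Django admin action that resets device health, so the
    # scheduler queues a fresh health job for each board. Model and field
    # names are illustrative; the real lava-server models may differ.
    from django.contrib import admin

    def mark_health_unknown(modeladmin, request, queryset):
        queryset.filter(health_status='passing').update(health_status='unknown')
    mark_health_unknown.short_description = "Mark selected passing boards as health unknown"

    class DeviceAdmin(admin.ModelAdmin):
        actions = [mark_health_unknown]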
As desired, all boards started running health check jobs. This meant that there were 30 odd processes uncompressing disk images, loopback mounting them and tarring bits of them up -- load went up to 90, and stayed there for a few hours! Although the machine was still quite usable in this state, it's probably not a good situation. I filed a couple of bugs about this:
https://bugs.launchpad.net/lava-scheduler/+bug/967849
https://bugs.launchpad.net/lava-dispatcher/+bug/967877
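(One low-tech mitigation, thinking aloud, would be to have each dispatcher grab one of a fixed pool of lock files before doing the expensive decompress/mount/tar step, so only a few run at once. The sketch below only illustrates that idea -- it is not what either bug actually proposes, and the paths and numbers are invented.)

    # Illustrative only: limit how many dispatcher processes do image
    # preparation at once by competing for a fixed pool of lock files.
    import fcntl
    import os
    import time

    LOCK_DIR = "/var/lock/lava-image-prep"   # hypothetical path
    MAX_CONCURRENT = 4                       # tune to what the host can take

    def acquire_prep_slot():
        """Block until one of MAX_CONCURRENT slots is free; return its fd.

        The caller holds the slot until it closes the returned fd.
        """
        if not os.path.isdir(LOCK_DIR):
            os.makedirs(LOCK_DIR)
        while True:
            for i in range(MAX_CONCURRENT):
                fd = os.open(os.path.join(LOCK_DIR, "slot-%d" % i),
                             os.O_CREAT | os.O_RDWR)
                try:
                    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    return fd
                except (IOError, OSError):
                    os.close(fd)
            time.sleep(30)   # every slot busy; wait and try again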
Thankfully, the large majority of health jobs passed. We had a few issues:
3 jobs on panda hit the serial truncation thing:
http://validation.linaro.org/lava-server/scheduler/job/16932/log_file
http://validation.linaro.org/lava-server/scheduler/job/16946/log_file
http://validation.linaro.org/lava-server/scheduler/job/16956/log_file
The last was a different command though -- "mount /dev/disk/by-label/testbo" -- could it be getting worse?
1 job on origen02 had its deployment time out for reasons that are not clear:
http://validation.linaro.org/lava-server/scheduler/job/16961/log_file
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started, I haven't actually loaded enough of the log to find the problem yet... (maybe we should hard reset the board when we terminate a badly behaving job?).
Cheers, mwh
On Wed, Mar 28, 2012 at 8:31 PM, Michael Hudson-Doyle <michael.hudson@linaro.org> wrote:
Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
First off, thanks a LOT for doing this.
There were a few wrinkles along the way. Some were due to the fact that changing which package provides a certain module (lava.utils.interface in this case, which moved from lava-server to its own package) appears to confuse things a bit.
Once the new code was in place, I prepared to run health check jobs. I used the new admin action to mark all 'passing' boards as 'unknown' health state, cleared out /linaro/image enough to have plenty of disk space while the health jobs ran and then turned the scheduler back on.
As desired, all boards started running health check jobs. This meant that there were 30 odd processes uncompressing disk images, loopback mounting them and tarring bits of them up -- load went up to 90, and stayed there for a few hours! Although the machine was still quite usable in this state, it's probably not a good situation. I filed a couple of bugs about this:
Keep in mind that in 24 hours from then they will all kick off health jobs again. Perhaps we should set the state to unknown for 1 board every 10 min. or so over the next day so that we can fan them out a bit? Another thing that should help with this is the cloudification stuff that zyga has been doing.
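(The staggering could be as dumb as the loop below -- just a sketch, with invented Device field names.)

    # Sketch: reset one board's health every ten minutes instead of all at
    # once, so the health jobs fan out over the day. Field and method names
    # are illustrative.
    import time

    def stagger_health_resets(devices, interval_seconds=600):
        for device in devices:
            device.health_status = 'unknown'
            device.save()
            time.sleep(interval_seconds)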
https://bugs.launchpad.net/lava-scheduler/+bug/967849
https://bugs.launchpad.net/lava-dispatcher/+bug/967877
Thankfully, the large majority of health jobs passed. We had a few issues:
3 jobs on panda hit the serial truncation thing:
http://validation.linaro.org/lava-server/scheduler/job/16932/log_file
http://validation.linaro.org/lava-server/scheduler/job/16946/log_file
http://validation.linaro.org/lava-server/scheduler/job/16956/log_file
The last was a different command though -- "mount /dev/disk/by-label/testbo" -- could it be getting worse?
1 job on origen02 had its deployment time out for reasons that are not clear:
http://validation.linaro.org/lava-server/scheduler/job/16961/log_file
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started, I haven't actually loaded enough of the log to find the problem yet... (maybe we should hard reset the board when we terminate a badly behaving job?).
+1 to that. In general, hard reset is a better option. As we migrate toward having a separate hardreset function for the board, we should create a simple script that does something annoying like beeping and putting up a message on the console asking you to push the reset button. This could be used for testing locally, while the lab machines would simply always use hard reset. Alex has been pushing for this, and we've discussed it before. I think we have now filled in some of the things we need in order to do it, though.
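(The local stand-in could be as small as the sketch below: ring the bell, print a message, and wait for a human. Only an illustration; in the lab this would be replaced by the board's real hard-reset command.)

    # Illustrative stand-in for a hard reset when testing locally: ring the
    # terminal bell and ask whoever is at the desk to push the reset button.
    import sys

    def fake_hard_reset(board_name):
        sys.stdout.write('\a')   # terminal bell: the "annoying" part
        sys.stdout.flush()
        input("*** Push the reset button on %s, then press Enter *** " % board_name)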
Thanks, Paul Larson
On Wed, 28 Mar 2012 22:49:51 -0500, Paul Larson <paul.larson@linaro.org> wrote:
On Wed, Mar 28, 2012 at 8:31 PM, Michael Hudson-Doyle <michael.hudson@linaro.org> wrote:
Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
First off, thanks a LOT for doing this.
Zygmunt did most of the heavy lifting this time, fwiw :)
There were a few wrinkles along the way. Some were due to the fact that changing which package provides a certain module (lava.utils.interface in this case, which moved from lava-server to its own package) appears to confuse things a bit.
Once the new code was in place, I prepared to run health check jobs. I used the new admin action to mark all 'passing' boards as 'unknown' health state, cleared out /linaro/image enough to have plenty of disk space while the health jobs ran and then turned the scheduler back on.
As desired, all boards started running health check jobs. This meant that there were 30 odd processes uncompressing disk images, loopback mounting them and tarring bits of them up -- load went up to 90, and stayed there for a few hours! Although the machine was still quite usable in this state, it's probably not a good situation. I filed a couple of bugs about this:
Keep in mind that in 24 hours from then they will all kick off health jobs again.
Yes, that's an excellent point. Although they will spread out a bit over time.
Perhaps we should set the state to unknown for 1 board every 10 min. or so over the next day so that we can fan them out a bit?
I guess we should ask the usual question: what are we looking to achieve here?
Basically, we're trying to make sure that the new code we've deployed doesn't have any terrible bugs that mean all jobs will fail.
Running the health checks we have defined today on each board is both too much work and too little: too much, because there's no reason to run this on _every_ board (just one board of each type would be enough), and too little, because the existing health jobs do not exercise large parts of the dispatcher, since they do not actually run any tests.
(totally thinking aloud here) We could add a device _type_ health status, where no jobs would run on a device type until a health job had completed on one device of that type... hmm feels complicated. Maybe we should just be stricter about testing things out on staging.
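(In scheduler terms the gate might look something like the sketch below; all of the model and field names there are invented for the sake of illustration.)

    # Sketch of a device-type health gate: ordinary jobs for a device type
    # only become schedulable once one board of that type has passed a
    # health job since the last deployment. Names are invented.
    def device_type_is_healthy(device_type, last_deploy_time):
        return device_type.devices.filter(
            health_status='passing',
            last_health_report__gte=last_deploy_time,
        ).exists()

    def schedulable_jobs(pending_jobs, last_deploy_time):
        return [job for job in pending_jobs
                if device_type_is_healthy(job.requested_device_type,
                                          last_deploy_time)]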
Another thing that should help with this is the cloudification stuff that zyga has been doing.
Yeah. I do also think though that we should make the health checks as cheap to run as possible -- it's just a waste of time and electricity to uncompress these images each time.
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started, I haven't actually loaded enough of the log to find the problem yet... (maybe we should hard reset the board when we terminate a badly behaving job?).
+1 to that. In general, hard reset is a better option. As we migrate toward having a separate hardreset function for the board, we should create a simple script that does something annoying like beeping and putting up a message on the console asking you to push the reset button. This could be used for testing locally, while the lab machines would simply always use hard reset. Alex has been pushing for this, and we've discussed it before. I think we have now filled in some of the things we need in order to do it, though.
Yeah. Actually maybe power cycling a board should just be the first thing the dispatcher does...
Cheers, mwh
On 30.03.2012 01:20, Michael Hudson-Doyle wrote:
On Wed, 28 Mar 2012 22:49:51 -0500, Paul Larson <paul.larson@linaro.org> wrote:
On Wed, Mar 28, 2012 at 8:31 PM, Michael Hudson-Doyle <michael.hudson@linaro.org> wrote:
Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
First off, thanks a LOT for doing this.
Zygmunt did most of the heavy lifting this time, fwiw :)
There were a few wrinkles along the way. Some were due to the fact that changing which package provides a certain module (lava.utils.interface in this case, which moved from lava-server to its own package) appears to confuse things a bit.
Once the new code was in place, I prepared to run health check jobs. I used the new admin action to mark all 'passing' boards as 'unknown' health state, cleared out /linaro/image enough to have plenty of disk space while the health jobs ran and then turned the scheduler back on.
As desired, all boards started running health check jobs. This meant that there were 30 odd processes uncompressing disk images, loopback mounting them and tarring bits of them up -- load went up to 90, and stayed there for a few hours! Although the machine was still quite usable in this state, it's probably not a good situation. I filed a couple of bugs about this:
Keep in mind that in 24 hours from then they will all kick off health jobs again.
Yes, that's an excellent point. Although they will spread out a bit over time.
Maybe we could just start them at board_number offsets?
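(Roughly like the sketch below, say -- only an illustration, with invented names:)

    # Sketch of the board_number offset idea: each board's daily health
    # check fires board_number * spacing minutes after a common base time,
    # so they never all start at once.
    from datetime import timedelta

    HEALTH_CHECK_PERIOD = timedelta(hours=24)

    def next_health_check(board_number, base_time, spacing_minutes=10):
        offset = timedelta(minutes=(board_number * spacing_minutes) % (24 * 60))
        return base_time + HEALTH_CHECK_PERIOD + offset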
Perhaps we should set the state to unknown for 1 board every 10 min. or so over the next day so that we can fan them out a bit?
I guess we should ask the usual question: what are we looking to achieve here?
Basically, we're trying to make sure that the new code we've deployed doesn't have any terrible bugs that mean all jobs will fail.
Running the health checks we have defined today on each board is both too much work and too little: too much, because there's no reason to run this on _every_ board (just one board of each type would be enough), and too little, because the existing health jobs do not exercise large parts of the dispatcher, since they do not actually run any tests.
(totally thinking aloud here) We could add a device _type_ health status, where no jobs would run on a device type until a health job had completed on one device of that type... hmm feels complicated. Maybe we should just be stricter about testing things out on staging.
Well, I think that to some extent all health jobs are useless; they only show that our implementation has a high failure rate. What they also show is random hardware-dependent issues, though. For that reason, until we get to a high degree of confidence, I'd rather run them everywhere.
Still, I like the idea of a device type health job. We should just ensure that each device class really has the same master.
Another thing that should help with this is the cloudification stuff that zyga has been doing.
Yeah. I do also think though that we should make the health checks as cheap to run as possible -- it's just a waste of time and electricity to uncompress these images each time.
I agree, it sucks. What can we do to save most of the cost without spending too much time on this?
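(One fairly cheap option might be to cache the decompressed image keyed by a hash of the compressed file, so repeat health jobs on the same image skip the decompression step. The sketch below is only an illustration; the paths are made up and it assumes gzipped images.)

    # Illustrative cache of decompressed images, keyed by the sha1 of the
    # compressed file. Paths are hypothetical.
    import gzip
    import hashlib
    import os
    import shutil

    CACHE_DIR = "/linaro/images/cache"

    def decompressed_image(compressed_path):
        sha1 = hashlib.sha1()
        with open(compressed_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha1.update(chunk)
        cached = os.path.join(CACHE_DIR, sha1.hexdigest() + ".img")
        if not os.path.exists(cached):
            if not os.path.isdir(CACHE_DIR):
                os.makedirs(CACHE_DIR)
            with gzip.open(compressed_path, "rb") as src, \
                 open(cached + ".tmp", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.rename(cached + ".tmp", cached)
        return cached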
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started, I haven't actually loaded enough of the log to find the problem yet... (maybe we should hard reset the board when we terminate a badly behaving job?).
+1 to that. In general, hard reset is a better option. As we migrate toward having a separate hardreset function for the board, we should create a simple script that does something annoying like beeping and putting up a message on the console asking you to push the reset button. This could be used for testing locally, while the lab machines would simply always use hard reset. Alex has been pushing for this, and we've discussed it before. I think we have now filled in some of the things we need in order to do it, though.
Yeah. Actually maybe power cycling a board should just be the first thing the dispatcher does...
That will kill our master images faster than we can get Dave to replace them. No, that's a terribly bad idea in our current architecture.
Run a simple test: make a master, boot your panda, yank the power cord, and do that for 24 hours (oh, you don't have a scriptable power cord). Your master will be hosed well before you start thinking this is getting old.
I want to start from a known state but we just cannot do this today.
ZK