Now that we have to spend less time looking at failed health jobs, we should start looking at stuck jobs:
------------
origen02
------------
http://validation.linaro.org/lava-server/scheduler/job/39388
Been running since Nov. 20, 2012, 3:57 a.m.
Submitted its bundle, but just never actually stopped. At the end, it had failed because the Android home screen never displayed, i.e. bootanim never stopped. Don't know if that is relevant or not.
I cancelled the job but, as often happens in these cases, it ended up stuck in a continual cancelling state. Went onto control and did a kill -2.
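For reference, a rough sketch of that manual cleanup step; the pgrep pattern below is my assumption about how the stuck dispatcher shows up on control, not an exact recipe:

    # Hypothetical helper mirroring the manual cleanup: find the dispatcher
    # process for a stuck job on control and send it SIGINT (kill -2).
    # The "lava-dispatch" pattern is an assumption about the process name.
    import os
    import signal
    import subprocess

    def kill_stuck_dispatcher(job_id):
        try:
            out = subprocess.check_output(["pgrep", "-f", "lava-dispatch.*%d" % job_id])
        except subprocess.CalledProcessError:
            print("no dispatcher process found for job %d" % job_id)
            return
        for raw in out.split():
            pid = int(raw.decode())
            print("sending SIGINT to pid %d" % pid)
            os.kill(pid, signal.SIGINT)

    if __name__ == "__main__":
        kill_stuck_dispatcher(39388)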
----------------
snowball08
----------------
http://validation.linaro.org/lava-server/scheduler/job/39394
Running since Nov. 20, 2012, 4:44 a.m.
Again, it failed to get into the Android test image and submitted its results, but is stuck. Looked on control - no process. Cancelled the job; again, stuck. Did a kill -2.
----------------
snowball03
----------------
http://validation.linaro.org/lava-server/scheduler/job/39894
Running since Nov. 24, 2012, 5:54 p.m.
Same as the others. Failed to get into Android test image, stuck in cancelling. Kill -2.
----------------
snowball07
----------------
http://validation.linaro.org/lava-server/scheduler/job/39940
Running since Nov. 25, 2012, 4:39 a.m.
Same.
----------------
snowball04
----------------
http://validation.linaro.org/lava-server/scheduler/job/39970
Running since Nov. 25, 2012, 8:46 a.m.
Same.
Thanks
Dave
------------
panda12
------------
http://validation.linaro.org/lava-server/scheduler/job/39734
Failed to boot the test image, with lots of errors coming out. Went onto the board, booted into the test image and set the proxy, all fine, so a "reboot the test image and retry" fix would have sorted this one. Put back online.
Thanks
Dave
Hi all,
Two last night, which means we're averaging approximately one health check failure per day, i.e. roughly a 95% pass rate. Not great.
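Working backwards from those figures, one failure a day at a 95% pass rate puts us at somewhere around 20 health check runs a day; the daily run count below is inferred from those two numbers, not measured.

    # Back-of-the-envelope check of the 95% figure; the daily run count
    # is inferred from the stated numbers, not measured.
    failures_per_day = 1
    health_checks_per_day = 20  # inferred, not measured
    pass_rate = 100.0 * (health_checks_per_day - failures_per_day) / health_checks_per_day
    print("%.0f%% pass rate" % pass_rate)  # prints: 95% pass rate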
------------
panda04
------------
http://validation.linaro.org/lava-server/scheduler/job/39484
When it got into the test image, the device was spewing out lots of weird error messages. Went onto the board and rebooted the test image: same problem. The shell prompt was also corrupted. I wasn't sure whether this was a board, SD card or corrupt image deployment problem, so I booted the master image, and that seems fine. Putting it back online to see if it was a one-off corruption.
If the board passes this time, then the only way to fix this problem would be to arrange that, if things fail in the test image for some reason, we go round and do it all again, including deployment, because just rebooting the test image wouldn't have worked.
------------
panda06
------------
http://validation.linaro.org/lava-server/scheduler/job/39477
wget weirdness. It kept getting "Connection reset by peer" and then retrying. Putting it back online to see if it's a one-off glitch.
If the board passes this time, then the way to fix this problem is: if deployment fails, reboot to the master image and try again.
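For both boards the fix boils down to the same retry loop in the dispatcher; a minimal sketch of the idea, where the board methods are hypothetical placeholders rather than real dispatcher APIs:

    # Sketch of the "go round and do it all again" fix: on any failure,
    # fall back to the master image, redeploy and retry. The board methods
    # are hypothetical placeholders, not lava-dispatcher APIs.
    MAX_ATTEMPTS = 3

    def run_with_retries(board, image_url):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                board.boot_master_image()      # known-good starting point
                board.deploy_image(image_url)  # covers the wget/deployment failures
                board.boot_test_image()        # covers the corrupt test image case
                return True
            except Exception as exc:
                print("attempt %d failed (%s), redeploying from scratch" % (attempt, exc))
        return False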
Thanks
Dave
Hi Andy & Michael,
About the problem where the telnet process consumes CPU (bug 1034218 <https://bugs.launchpad.net/linaro-android/+bug/1034218>):
For now I have tried two ways to verify it:
1. Run the CTS test by submitting a LAVA job.
In this case, the process that consumes CPU is telnet.
2. Run the CTS test from the command line with "lava-android-test run cts".
In this case, no process consumes 100% CPU, even though I also opened a telnet session at the same time.
So I guess the problem is the way we call the telnet command in lava-dispatcher.
From my investigation, it's the select syscall in telnet that consumes the CPU, so I suspect there is some place in lava-dispatcher that reads the output of telnet in a loop without sleeping in the loop, but I did not find such a place in lava-dispatcher.
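To show the kind of loop I am looking for, here is a contrived illustration; this is not lava-dispatcher code, just the suspected pattern:

    # Not lava-dispatcher code; just an illustration of the suspected pattern.
    import os
    import select

    def busy_read(fd):
        # Polling with a zero timeout and never sleeping: select() returns
        # immediately every time, so the loop spins and pins a CPU core.
        while True:
            readable, _, _ = select.select([fd], [], [], 0)
            if fd in readable:
                if not os.read(fd, 4096):
                    return

    def patient_read(fd):
        # Blocking in select() until output arrives (or a timeout expires)
        # lets the process sleep in the kernel, so CPU usage stays low.
        while True:
            readable, _, _ = select.select([fd], [], [], 1.0)
            if fd in readable:
                if not os.read(fd, 4096):
                    return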
What do you think?
Finally, I feel that this is a problem in lava-dispatcher, not in lava-android-test or CTS, so can we change it to be a lava-dispatcher bug?
--
Thanks,
Yongqin Liu
---------------------------------------------------------------
#mailing list
linaro-android(a)lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-android
linaro-validation(a)lists.linaro.org
http://lists.linaro.org/pipermail/linaro-validation
Hi all,
David Zinman contacted me with an urgent request. Apparently Mathieu's TC2 is playing up and he urgently needs a replacement. Could we liberate one from the lab while Mathieu returns his, and then deal with fixing it (if it's the motherboard, I have a spare; if it's the tile, we'll get ARM to swap it out)?
Looking at the TC2 load, it's not so bad that we're going to block on it.
Thanks
Dave
Just the one, although the report says two. Did someone else put one online that had failed?
---------------------
vexpress-tc2-01
---------------------
http://validation.linaro.org/lava-server/scheduler/job/39042
This *looks* like there was an existing telnet session on the board that hadn't closed. I went on the board and it was fine, so whatever/whoever it was had disconnected.
Rebooted and put back online.
Thanks
Dave
Hey Guys,
We currently use ubuntu-desktop RFSes for our daily health checks in LAVA. It's my understanding that the dev-platform team will be sunsetting the ubuntu-desktop images.
Due to this, I think we (the LAVA team) need to think about moving to a new set of images for our health checks so that they better reflect reality. 2012.11 should be producing some pre-built images for both server and nano. I think picking the pre-built server images probably makes more sense, since it gives us a little more coverage than nano, but I don't have a strong sense of whether it makes much difference, since we basically just do boot testing in our health checks.
= So what does this mean?
I think we'll need to file a 12.12 blueprint for the health check investigation work. Essentially, we should build a job for each device-type in production based on its health job, update the image URL to point at the new candidate, submit it 100 times and do a failure analysis. At that point we cross our fingers and hope the failures aren't worse than what we currently see. If they aren't, we can switch the image over; if they are, we'll need to work with the dev-platform team on getting the new issues addressed.
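For the submission step, a rough sketch of what I have in mind, assuming the scheduler's XML-RPC interface (scheduler.submit_job), with the endpoint URL and job file name as placeholders and auth token handling omitted:

    # Rough sketch: submit the candidate health check job N times so we can
    # do failure analysis on the results. The endpoint URL, job file name
    # and token handling are placeholders/assumptions, not a recipe.
    import xmlrpc.client

    SERVER = "https://validation.linaro.org/lava-server/RPC2/"  # placeholder endpoint
    JOB_FILE = "panda-health-candidate.json"                     # hypothetical candidate job
    RUNS = 100

    def main():
        server = xmlrpc.client.ServerProxy(SERVER)
        with open(JOB_FILE) as f:
            job_json = f.read()
        job_ids = []
        for i in range(RUNS):
            job_id = server.scheduler.submit_job(job_json)
            job_ids.append(job_id)
            print("run %3d -> job %s" % (i + 1, job_id))
        print("submitted %d jobs for failure analysis" % len(job_ids))

    if __name__ == "__main__":
        main()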
make sense?
who wants to help? :)
-andy