On 5 November 2012 17:35, David Long wrote:
> It looks to me like our TI 3.4 nightly LAVA runs are broken for all test
> cases (not just the TILT stuff). Is someone investigating this?
I'm looking at the ci-LT-TI-working-tree-3_4 stream.
It seems the tests have been failing to run since 2012-10-24.
lava-test doesn't install successfully:
OperationFailed: executing u'apt-get -o
Acquire::http::proxy=http://192.168.1.10:3128/ update' failed with
code 100
I see other errors on other streams since the same date...
OK. We've had 3 failures out of 250 in 24 hours - 98.8% - better than the 16 or so failures we were getting before, but still...
----------------
snowball10
----------------
http://staging.validation.linaro.org/scheduler/job/35883
Test image kernel died on boot.
----------------
beaglexm04
----------------
http://staging.validation.linaro.org/scheduler/job/35901
and
http://staging.validation.linaro.org/scheduler/job/35838
Corrupted serial output in the test image. Michael's new code caught it, but the output was so corrupted that the job still failed. The next check went OK.
Conclusion: All 3 of these could have been fixed by attempting a reboot of the test image, just like we're now doing on the master.
Of course, one could take the view that a corrupted serial line means a broken board, but given that it came back and ran consistently afterwards, we most likely just hit some edge case.
I'll file a bug. I strongly believe we are close to achieving the 99.9%+ target set by Alexander.
Thanks
Dave
Hi all,
I've just checked in and deployed a more general fix for the network failures on staging, and it's running through its paces now. Essentially, the change now tries up to three reboots if *anything* goes wrong, not just if the network fails to come up. Looking back at the failures of the last 24 hours (2 out of 158), both of those would have been fixed by this change.
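In outline, the logic is now something like this (a minimal sketch; the target object and its boot()/wait_for_prompt() methods are illustrative assumptions, not the actual dispatcher API):

    import logging

    MAX_BOOT_ATTEMPTS = 3

    def boot_with_retries(target):
        for attempt in range(1, MAX_BOOT_ATTEMPTS + 1):
            try:
                target.boot()             # power-cycle and boot the test image
                target.wait_for_prompt()  # wait for a usable shell prompt
                return
            except Exception as exc:      # *anything* wrong triggers a reboot
                logging.warning("boot attempt %d failed: %s", attempt, exc)
        raise RuntimeError("board did not come up after %d attempts"
                           % MAX_BOOT_ATTEMPTS)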
We'll know how well we've done in 24 hours' time.
Thanks
Dave
Although munin is working out fine for us currently, an option I'd
prefer in some ways is graphite. This document:
http://www.aosabook.org/en/graphite.html
explains how graphite works in a way that made way more sense to me
than anything else I'd read.
Cheers,
mwh
Hi,
I've made code changes to support YAML-based testdefs. I shall show you a
demo tomorrow of what I have, and we can discuss the YAML structure in
more detail and get more stuff in.
My sample testdef looks like the following (testdef.yaml):
<snip1>
metadata:
    name: simple
    version: 1.0
    format: lava-test v1.0
    environment:
        image-type: [beagle]
install:
    url:
    steps:
run:
    steps:
        - /bin/echo cache-coherency-switching - PASS
        - ls
        - pwd
parse:
    pattern: (?P<test_case_id>.*-*)\\s+:\\s+(?P<result>(PASS|FAIL))
    fixupdict:
        PASS: pass
        FAIL: fail
</snip1>
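For illustration, here is a rough sketch of how the parse section gets applied to the output. The simplified pattern below (matching "name - PASS" lines like the sample output) and the standalone function are my assumptions, not the dispatcher's actual code:

    import re

    # Illustrative pattern and fixupdict, simplified from the testdef above.
    PATTERN = re.compile(r"(?P<test_case_id>\S+)\s+-\s+(?P<result>PASS|FAIL)")
    FIXUPDICT = {"PASS": "pass", "FAIL": "fail"}

    def parse_output(lines):
        results = []
        for line in lines:
            match = PATTERN.search(line)
            if match:
                result = match.groupdict()
                # Map raw result tokens to canonical values via the fixupdict.
                result["result"] = FIXUPDICT.get(result["result"],
                                                 result["result"])
                results.append(result)
        return results

    print(parse_output(["cache-coherency-switching - PASS"]))
    # -> [{'test_case_id': 'cache-coherency-switching', 'result': 'pass'}]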
The following is a snippet from a sample run:
<snip2>
cache-coherency-switching - PASS
install.sh
run.sh
testdef.yaml
/lava/tests/0_simple
<LAVA_TEST_RUNNER>: 0_simple exited with: 0
0_simple-1351674922 build.txt cpuinfo.txt meminfo.txt pkgs.txt
<LAVA_TEST_RUNNER>: exiting<LAVA_DISPATCHER>2012-10-31 02:44:54 PM INFO:
lava_test_shell seems to have completed
<LAVA_DISPATCHER>2012-10-31 02:44:54 PM INFO: attempting a filesystem
sync before power_off
linaro-test [rc=0]# sync
sync
linaro-test [rc=0]# <LAVA_DISPATCHER>2012-10-31 02:44:57 PM INFO:
[ACTION-E] lava_test_shell is finished successfully.
<LAVA_DISPATCHER>2012-10-31 02:44:57 PM INFO: Submitting the test result
with parameters = {u'stream': u'/anonymous/stylesen/', u'server':
u'http://10.155.13.219/RPC2/'}
dashboard-put-result:
http://10.155.13.219/dashboard/permalink/bundle/9fdccd73c7e825c2eec7850e61d…
<LAVA_DISPATCHER>2012-10-31 02:44:57 PM INFO: Dashboard :
http://10.155.13.219/dashboard/permalink/bundle/9fdccd73c7e825c2eec7850e61d…
</snip2>
Thank You.
--
Senthil Kumaran S
http://www.stylesen.org/
http://www.sasenthilkumaran.com/
Hi all,
Just a heads-up: I started to create three cloud nodes today, two for toolchain and one for the backup production system (so we can seamlessly upgrade control), and I hit a problem. It's reporting (not very clearly - you have to dig) that we have run out of floating IPs. I allocated 192.168.1.48/29, which should give us 8 (just enough, as it happens), and I can't see why it's holding onto the other two. I'm investigating, but I also suggest moving it up to 192.168.1.48/28 so that we can have 16 instances.
Obviously, once we move to 192.1.0.0/16 this problem will go away and we'll be able to allocate a whole tranche of IPs to cloud nodes.
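For what it's worth, the subnet sizes are easy to sanity-check with Python's ipaddress module:

    from ipaddress import ip_network

    # /29 gives 8 addresses, /28 gives 16.
    for cidr in ("192.168.1.48/29", "192.168.1.48/28"):
        print(cidr, "->", ip_network(cidr).num_addresses, "addresses")
    # 192.168.1.48/29 -> 8 addresses
    # 192.168.1.48/28 -> 16 addresses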
The upshot is that I may have to restart some services and do a db sync, so when I know how the land lies, I'll schedule some cloud downtime. In theory there shouldn't be any, but it's best to be on the safe side.
Thanks
Dave
Hi all,
Andy put me onto a good idea for forcing the edge case that triggers my code. He suggested I attach to the board's console with conmux and interfere with it. The problem is that timing your keystrokes just right turns out to be practically impossible, and you end up making the job fail, but in the wrong way. However, thinking about it, the obvious thing to do is to take the board offline, attach with conmux, disable eth0, and put the board back online.
I'm just about to do that. As long as it passes, I'll put it back to looping.
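For future reference, if the manual route proves fiddly, a pexpect sketch along these lines could drive the console instead (the board name and the "#" prompt are my assumptions):

    import pexpect

    console = pexpect.spawn("conmux-console beaglexm04")
    console.expect("#")                     # wait for a root shell prompt
    console.sendline("ifconfig eth0 down")  # take the network away mid-job
    console.expect("#")
    console.close()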
Thanks
Dave
Hi all,
I deployed an update onto staging, and I can see that it's there, but somehow it seems to be running the old code. Take a look at:
http://staging.validation.linaro.org/scheduler/job/34608
This was run after my update, but it still complains about "target". What have I done wrong? I did an su as the instance manager, and then ran "$HOME/bin/update-staging".
What have I missed?
Thanks
Dave