Hi all,
Just a heads up: I started creating three cloud nodes today (two for the toolchain and one for the backup production system, so we can seamlessly upgrade control) and hit a problem. It's reporting (not very clearly - you have to dig) that we have run out of floating IPs. I allocated 192.168.1.48/29, which should give us 8 addresses (just enough, as it happens), and I can't see why it's holding onto the other two. I'm investigating, but I also suggest moving up to 192.168.1.48/28 so that we can have 16 instances.
Obviously, once we move to 192.1.0.0/16 this problem will go away and we'll be able to allocate a whole tranche of IPs to cloud nodes.
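For reference, a quick sketch of the subnet arithmetic with the stdlib ipaddress module (and a guess at the missing two: clouds typically reserve a few addresses per subnet for things like the network, broadcast, and gateway, which would eat into the 8):

```python
import ipaddress

# A /29 carries 8 addresses, a /28 carries 16.
for cidr in ("192.168.1.48/29", "192.168.1.48/28"):
    net = ipaddress.ip_network(cidr)
    print(cidr, "->", net.num_addresses, "addresses")
# 192.168.1.48/29 -> 8 addresses
# 192.168.1.48/28 -> 16 addresses
```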
Upshot is, I may have to restart some services and do a db sync, so when I know how the land lies, I'll schedule some cloud downtime. In theory there shouldn't be any, but just to be on the safe side.
Thanks
Dave
Hi all,
Andy put me onto a good idea for forcing the edge case to trigger my code. He suggested I go on with conmux and interfere with it. The problem is that timing your keystrokes just right turns out to be practically impossible, and you end up making the job fail, but in the wrong way. Thinking about it, though, the obvious thing to do is to take the board offline, go on with conmux, disable eth0, and then put the board back online.
I'm just about to do that. As long as it passes, I'll put it back to looping.
Thanks
Dave
Hi all,
I deployed an update onto staging, and I can see that it's there, but somehow it seems to be running the old code. Take a look at:
http://staging.validation.linaro.org/scheduler/job/34608
This was run after my update, but it still complains about "target". What have I done wrong? I did an su as the instance manager, and then ran "$HOME/bin/update-staging".
What have I missed?
Thanks
Dave
Hi all,
I just ran a CTS job on staging and found that the CPU usage of the telnet process is nearly 100%.
Does anyone have any idea what might be causing that?
Below is a link to the collectd information installed on staging:
http://staging-metrics.validation.linaro.org:8080/collectd/bin/index.cgi?ho…
The high CPU usage started at 13:40 on CPU3 and moved to CPU1 at 14:30.
Below is the output of the top command:
top - 15:18:26 up 7 days, 5:27, 1 user, load average: 1.81, 1.80, 1.46
Tasks: 144 total, 2 running, 142 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.5%us, 23.1%sy, 0.0%ni, 74.0%id, 0.3%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 8178504k total, 4866776k used, 3311728k free, 66408k buffers
Swap: 0k total, 0k used, 0k free, 3870820k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31383 root 20 0 27516 1504 1224 R 100 0.0 97:09.40 telnet
3626 root 20 0 2437m 92m 8496 S 0 1.2 0:22.77 java
13383 lava-sta 20 0 198m 47m 5180 S 0 0.6 0:18.85 uwsgi
1 root 20 0 24460 2336 1244 S 0 0.0 0:00.98 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:08.43 ksoftirqd/0
Thanks,
Yongqin Liu
Hey Guys,
I just looked into the Panda10 health check failures over the past 24 hours.
The good news is my bad code isn't to blame. The bad news is that the SD
card appears to be having some issues.
Hi all,
I've just created a very basic munin installation for the lab:
http://munin.validation.linaro.org/
The monitoring wonk I consulted said that munin is perhaps not the
greatest way of getting graphs of your system but that it's probably the
easiest to set up. Better than nothing :-)
To add a system to munin you need to:
1) apt-get install munin-node on the system
2) Edit /etc/munin/munin-node.conf on the system to contain:
host_name XXX.validation.linaro.org
allow ^192\.168\.1\.32$
3) sudo service munin-node restart on the system
4) Add the following to /etc/munin/munin.conf on linaro-gateway:
[XXX.validation.linaro.org]
address 192.168.1.YYY
use_node_name yes
and that's it! The data viewable at http://munin.validation.linaro.org/
is generated by a */5 cron, so it takes a while for a new host to
appear. If someone wants to add dogfood, the compute nodes, the fast
model instances etc etc be my guest...
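As an aside, the allow line in step 2 is a regular expression matched against the address of the connecting munin master (assumed here to be the gateway at 192.168.1.32), which is why it needs the backslashes and anchors:

```python
import re

# munin-node's "allow" directive is a regex tested against the
# connecting master's IP address.
allow = re.compile(r"^192\.168\.1\.32$")

print(bool(allow.match("192.168.1.32")))   # True: the gateway may poll us
print(bool(allow.match("192.168.1.132")))  # False: the anchors reject other hosts
```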
Once all the systems are added, the next thing is to start looking at
adding more us-specific metrics -- scheduler queue lengths, request
numbers and duration from django or apache, various postgres stats etc
etc. It would also be nice to add "events" to the graphs such as
rollouts and job start/ends but I don't know if that is supported.
Cheers,
mwh
Hi All,
Due to a glitch with UEFI and the latest kernels, we are forced to leave the TC2s offline until the issue is resolved. Ryan Harkin and I have been working to try and resolve this, but the best we could do is to get them to pass their health check (using sticking plaster, string and a large hammer) but they would then fail every test that was submitted to them, which would be kind of pointless. We're working actively to fix this problem, and I'll let you know when we're back up and running.
Thanks, and apologies once again,
Dave
Hi all,
It's all kinds of rough but I've just crossed a milestone: I ran the
dispatcher and had DS-5 capture energy data from my host while it was
running, using this branch (lots of which Andy wrote):
https://code.launchpad.net/~mwhudson/lava-dispatcher/signals/+merge/131128
Currently the output of streamline -report is attached to the test
result as an attribute, which is just awful. Either it should be parsed
into interesting data, or the -report output should be attached to test
run in a useful way (or both). But it's a start! I'm attaching the
test definition and job file I used.
Cheers,
mwh
Hey Guys,
I just hit a really annoying issue while trying to upgrade control to
our latest lava-dispatcher code.
Everything works great in dogfood and staging. However, I guess the
python version on control is just different enough to cause a problem
with our new use of "configglue". The issue is with our "boot_cmds" that
are set by our device-type .conf files. The faulty snippet is roughly:
string_to_list(boot_cmds)
on a "normal" system, this produces an array of commands. On control we
get an encoding mess that doesn't work with u-boot, e.g.:
['m\x00\x00\x00m\x00\x00\x00c\x00\x00\x00 .......
I think the easiest fix is to change our master.py to call:
string_to_list(boot_cmds.encode('ascii'))
I'm doing another round of unit testing to prove this works before
attempting to deploy.
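For what it's worth, here's my speculative reconstruction of the symptom (not verified against control): the null-padded output looks exactly like text in a wide encoding such as UTF-32-LE being treated as a byte string, which would explain why forcing ASCII first produces something u-boot can digest:

```python
# Speculative reconstruction: a boot command run through a wide
# encoding gains three NUL bytes per character.
cmd = "mmc init"  # representative u-boot command, for illustration
wide = cmd.encode("utf-32-le")
print(repr(wide[:12]))  # b'm\x00\x00\x00m\x00\x00\x00c\x00\x00\x00'

# The proposed fix: hand u-boot plain ASCII bytes instead.
plain = cmd.encode("ascii")
print(repr(plain))      # b'mmc init'
```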
For now I've marked all the devices that execute from control as
offline. If the fix takes too long, I'll just revert to the previous
lava deployment
-andy
It's often difficult to achieve pre-planned hacking goals at Connect, but
Michael and I spent a little time thinking about this topic and wanted
to try and lay out an agenda for LAVA next week.
The general thought process we took for these was:
* Is it beneficial to work on as a group?
* Is it something that we'll benefit from even if we only wind up having
20 minutes, and could it also work well if we find the time to work on
it for 2 hours?
With that in mind we are thinking about these items:
= Galaxy Nexus Fastboot Hacking
Have a "learn fastboot" session based on the email thread from last week.
= NI battery simulator
I can show how this works
Zach can try and help with the TCP disconnect issue
= Versatile Express intro hacking
Dave can give us some education on how the VExpress works/boots/etc.
Possibly grab Ryan/Tixy to join.
= Deployment Type Improvements
I think this will roughly be a "get Antonio to teach us Chef" session.
Maybe think about how to get Chef built into open stack image with
cloud-init.
= monitoring – adding app-specific metrics to munin
Michael can talk to us a bit about Munin and how to add new custom
metrics. Then maybe we can hack on adding some, like:
* web/django stuff
* postgres stuff
* job stuff
- num flocks
- device type wait time at various %iles
- device type utilizations
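If we do hack on custom metrics, munin plugins are just executables speaking a tiny text protocol: called with "config" they print graph metadata, called with no argument they print current values. A minimal sketch, with the metric name and the placeholder value entirely made up:

```python
#!/usr/bin/env python
# Hypothetical munin plugin sketch for a LAVA metric.
# Munin invokes a plugin with "config" for metadata, and with
# no argument to fetch current values.
import sys

def emit(arg=None):
    if arg == "config":
        return ["graph_title LAVA scheduler queue length",
                "graph_vlabel jobs",
                "queued.label queued jobs"]
    # Placeholder value; a real plugin would query the LAVA database.
    return ["queued.value 0"]

if __name__ == "__main__":
    print("\n".join(emit(sys.argv[1] if len(sys.argv) > 1 else None)))
```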