On Tue, Feb 14, 2012 at 3:26 AM, Michael Hudson-Doyle
<michael.hudson(a)canonical.com> wrote:
> On Mon, 13 Feb 2012 22:27:25 +0100, Zygmunt Krynicki <zygmunt.krynicki(a)linaro.org> wrote:
>> Hi.
>>
>> Fast model support is getting better. It seems that with the excellent
>> patches by Peter Maydell we can now boot some kernels (I've only tried
>> one tree, additional trees welcome :-). I'm currently building a
>> from-scratch environment to ensure everything is accounted for and I
>> understand how pieces interact.
>>
>> Having said that I'd like to summarize how LAVA handles fast models:
>>
>> Technically the solution is not unlike QEMU which you are all familiar
>> with. The key differences are:
>> 1) Only NFS boot makes sense. There is no other sensible method that
>> I know of. We could also use an SD card (virtual, obviously) but it is
>> constrained to two gigabytes of data.
>
> As mentioned in the other thread, it would be good to at least let ARM
> know that removing this limit would help us (if we can figure out how to
> do this).
We may figure out how to do this by reading the LISA source code that
came with the model. That's a big task though (maybe grepping for mmc0
is low-hanging fruit; I did not check).
>> 2) The way we actually boot is complicated. There is no U-Boot; the
>> fast model interpreter actually starts an .axf file that can do anything
>> (some examples include running tests and benchmarks without ever
>> starting a kernel or anything like that). There is no easy way to
>> load the kernel and pass a command line. To work around that we're
>> using a special .axf file that uses fast model semihosting features to
>> load the kernel/initrd from the host filesystem, as well as to set up the
>> command line that will be passed to the booting kernel. This allows us
>> to freely configure NFS services and point our virtual kernel at the
>> appropriate IP addresses and pathnames.
>
> So I guess I'd like to understand how this works in a bit more detail.
> Can you brain dump on the topic for a few minutes? :) What is "fast
> model semihosting"?
It's a way to have "syscalls" that connect the "bare hardware" (be it
physical or emulated) to an external debugger or other monitor. You
can find a short introduction in this blog post [1]. For us it means we get
to write bare-metal assembly that does the equivalent of open(),
read(), write() and close(). The files being opened are on the
machine that runs the fast model. You can also print debugging
statements straight to the console this way (we could probably write a
semihosting console driver if no such code exists yet) to get all of
the output to the same tty that runs the model (model_shell). A more
detailed explanation of this topic can be found in [2].
Fast model semihosting simply refers to using semihosting facilities
in a fast model interpreter.
[1]: http://blogs.arm.com/software-enablement/418-semihosting-a-life-saver-durin…
[2]: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/CHDJHH…
>
>> 3) After the machine starts up we immediately open a TCP/IP connection
>> to a local TCP socket. We know which port is being used so we can
>> easily allocate ports up front. This port is now the traditional LAVA
>> serial console.
>
> I guess there is a risk here that we will miss early boot messages?
> This might not matter much.
There are other options but currently this seems to work quite okay.
> Once we've found someone at ARM we can officially complain to about fast
> models, an option to have serial comms happen on the process
> stdin/stdout would be nice.
I think the reason they don't happen on the console is that by default
we get four telnet ports to connect to (definitely more than one), so
the logical question they'll ask is "which port should we redirect?".
Maybe there is an option buried somewhere to make that happen, but so
far I have not found it.
>
>> The rest of this looks like QEMU:
>> - you can access the filesystem easily (to gather results)
>> - we can use QEMU to chroot into the NFS root to install additional
>> software (emulation via a fast model is extremely slow)
>
> In my testing, the pip install bzr+lp:lava-test step did not really work
> under QEMU. Maybe it does now, or maybe we can install a tarball or
> something.
I installed lava-test using a release tarball. That has worked pretty well.
In general I think that:
1) we need to reconsider how to do testing on very slow machines
2) we need to decide what can be invoked on the host (parts of the
installation, unless they want to build stuff, plus result parsing and tracking)
3) we need to decide what has to be invoked on the target (test code,
system probes)
It's important to make the intent very clear. If we define that
cmd_install installs something while in the "master image" on the "target",
then we should not break that. I think that it would be sensible to
add a "host_chroot" mode that applies nicely to QEMU and fast models.
Very slow things that don't care about the architecture could be
invoked in that mode without sacrificing performance.
Thanks
ZK
Hi
I took the liberty of updating the launchpad project page for
lava-project [1]. I plan to use that whenever people come asking about
bugs/features/documentation. You may want to review the description.
/me really wishes for rich text editing on launchpad projects,
especially if it could just display the README file, eh
[1]: https://launchpad.net/lava-project/
Hi, I wanted to sync up on where things are with this; as I understand it,
there's still some confusion about how we should get the available kernel
and/or images for testing fast models.
First off, it seems there is no way to just take a kernel .axf as we
thought, because the boot args are wrapped into it as well. Is there really
no way to inject that stuff after build time?
Is it worth revisiting whether we should have proper hwpacks for fast
models? I know there's the 2G max SD size issue with that, but if that's
something ARM can fix, or if we have another way around it, would that help?
Finally, I feel like we've chased a pretty messy route to get fast models
supported here, one which will ultimately break completely when we try to get
Android running as well. Please correct me if I'm wrong here, but it seems
the "recommended" approach is currently:
- LAVA takes as input: the git tree for the kernel we want to build, the
defconfig to use, and a rootfs tarball
- LAVA builds the kernel (LAVA doesn't do this currently, which complicates
things for us quite a bit; it would be better if this step could be done
externally, like in Jenkins where we do the kernel CI builds)
- LAVA pushes the built .axf to another system where we have the fast models
software installed, and provisions the rootfs for booting over NFS
- LAVA boots the system under fast models on this other machine, runs tests,
gathers results, etc.
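To make that flow concrete, here is a purely illustrative sketch of the kind
of job description it implies; every field name and URL below is invented
for illustration and is not LAVA's actual job schema:

    {
      "device_type": "fast-model",
      "actions": [
        {"step": "build_kernel",
         "git_tree": "git://example.org/linux.git",
         "defconfig": "vexpress_defconfig"},
        {"step": "provision_nfsroot",
         "rootfs": "http://example.org/rootfs.tar.gz"},
        {"step": "boot_and_test",
         "tests": ["stream"]}
      ]
    }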
Is that pretty close, Zygmunt? Is there something more straightforward and
less fragile that we can do here?
Thanks,
Paul Larson
Hi.
Fast model support is getting better. It seems that with the excellent
patches by Peter Maydell we can now boot some kernels (I've only tried
one tree, additional trees welcome :-). I'm currently building a
from-scratch environment to ensure everything is accounted for and I
understand how pieces interact.
Having said that I'd like to summarize how LAVA handles fast models:
Technically the solution is not unlike QEMU which you are all familiar
with. The key differences are:
1) Only NFS boot makes sense. There is no other sensible method that
I know of. We could also use an SD card (virtual, obviously) but it is
constrained to two gigabytes of data.
2) The way we actually boot is complicated. There is no U-Boot; the
fast model interpreter actually starts an .axf file that can do anything
(some examples include running tests and benchmarks without ever
starting a kernel or anything like that). There is no easy way to
load the kernel and pass a command line. To work around that we're
using a special .axf file that uses fast model semihosting features to
load the kernel/initrd from the host filesystem, as well as to set up the
command line that will be passed to the booting kernel. This allows us
to freely configure NFS services and point our virtual kernel at the
appropriate IP addresses and pathnames.
3) After the machine starts up we immediately open a TCP/IP connection
to a local TCP socket. We know which port is being used so we can
easily allocate ports up front. This port is now the traditional LAVA
serial console.
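To illustrate point 3, a minimal sketch of attaching to such a port from the
host side; the port number is an example, and the real connection handling
(telnet option negotiation, logging) is more involved:

    import socket

    def open_serial_console(port, host='localhost'):
        # The model exposes its UART as a telnet server on a TCP port we
        # allocated up front, so the "serial console" is just a socket.
        conn = socket.create_connection((host, port))
        return conn.makefile('r')

    # Example: stream boot messages as if reading a serial log.
    for line in open_serial_console(5000):
        print(line.rstrip())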
The rest of this looks like QEMU:
- you can access the filesystem easily (to gather results)
- we can use QEMU to chroot into the NFS root to install additional
software (emulation via a fast model is extremely slow)
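The QEMU-chroot trick in that last bullet boils down to something like this
sketch (using qemu-arm-static from its usual Debian/Ubuntu location; the
real dispatcher code may differ):

    import subprocess

    def install_in_nfsroot(nfsroot, packages):
        # Copy the static user-mode QEMU binary into the rootfs so that
        # binfmt_misc can run the ARM binaries inside it on the x86 host.
        subprocess.check_call(
            ['cp', '/usr/bin/qemu-arm-static', nfsroot + '/usr/bin/'])
        # A plain chroot then works: each ARM executable runs under QEMU
        # user emulation, far faster than the full-system model.
        subprocess.check_call(
            ['chroot', nfsroot, 'apt-get', 'install', '-y'] + packages)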
The rest of LAVA is now related to managing resources and misc tasks:
- running model_shell instances and their 'serial lines'
- unpacking and blessing NFS root directories (blessing = patching them
enough to make the model boot and work for us)
- exporting nfsroot directories to the local network (currently
hardcoded as 192.168.0.0/24; see the sketch after this list)
- managing a local library of models (building, enumerating, adding/removing)
- managing the root privileges required to do those things.
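For the exporting step, the idea is roughly the following sketch; the export
options are typical ones and the subnet is the hardcoded one mentioned above:

    import subprocess

    def export_nfsroot(path):
        # Publish the unpacked root filesystem to the model's network and
        # ask the NFS server to re-read its export table.
        entry = '%s 192.168.0.0/24(rw,no_root_squash,no_subtree_check)\n' % path
        with open('/etc/exports', 'a') as exports:
            exports.write(entry)
        subprocess.check_call(['exportfs', '-ra'])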
I'm still not done but I hope to push this to one of our servers this
week. Ideally I'd like to try it on a virtual machine to see how that
setup copes with the code. In addition I want to ensure I don't put
any magic there, and that all of the installation is managed by
automation tools.
Thanks
ZK
PS: After spending the weekend looking at my scripts crashing I took a
break. On Monday (today) I grabbed Peter to troubleshoot the issue.
The issue was gone; I have no idea what changed (perhaps my long-lived
terminal had accumulated some cruft).
Following on from the discussion earlier this week, I wanted to kick off a
discussion about how to implement one part we discussed: making health
checks something that LAVA natively understands and knows how to do,
rather than just a lump of jobs scheduled daily from cron. Here are my
thoughts, and I'll let you decide how much of it is crack.
1. extend the scheduler's device model to add a text blob for storing a
health check job
   - the downside is that we would have to add one of these for every single board
   - maybe it makes sense to do this per device type instead and have the
scheduler just start it on the proper board (see the sketch below)?
   - we should make sure that these jobs use a locally downloadable,
complete image
2. at the beginning of every day, if the board is NOT in the offline or
going-offline state, insert this job into the queue for the board
   - is there a good way to do this automatically with some internal
mechanism, not cron from the outside?
3. Link the result of this job to the health check stuff already in progress
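A rough sketch of what item 1, in its per-device-type form, could look like
in the scheduler's Django models; the names are illustrative, not the actual
LAVA schema:

    from django.db import models

    class DeviceType(models.Model):
        name = models.SlugField(primary_key=True)
        # Hypothetical field: one health check job definition shared by
        # all boards of this type; NULL means "no health check defined".
        health_check_job = models.TextField(null=True, blank=True)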
thoughts? suggestions?
Hi, I'm looking at the kernel CI results and origen seems to be running
tests pretty frequently, but always failing to boot. Sangwook, could you
take a look at the args that are getting passed and make sure there's not
something there that would explain it hanging on boot? Or is this perhaps
something in the kernel config that is being used?
Thanks,
Paul Larson
Do you find yourself asking that? If the answer is yes, check here:
http://stats.pingdom.com/o15ezbsnv8tb
Following on from the discussion earlier in the week, I checked out this
Pingdom thing that Michael mentioned. It doesn't provide data for system
monitoring, but it will at least give us external availability checking.
Thanks,
Paul Larson
Hello everyone.
I just wanted to give you a heads-up on my plans for today:
1) Review merge requests for lava-{server,dashboard,dispatcher,test}
2) Merge things that seem ready to land
3) Release everything on pypi
4) Build lava-2012-02-07.pybundle
5) Deploy that to production
I also want to figure out how to release Andy's android benchmark
extension. Andy, could you ensure that:
*) you have merged the patches I +1'd on gerrit
*) you have registered a project on pypi (ping me if you need help)
*) you have released a snapshot (setup.py sdist upload) -- ensure that
your MANIFEST.in is correct (look at the source tarball and make sure it
has all the files you wanted).
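For reference, a minimal MANIFEST.in along those lines might look like the
following; the paths are hypothetical and need adjusting to the actual
package layout:

    include README
    recursive-include doc *.rst
    recursive-include lava_android_benchmark *.html *.css *.js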
If you do that I'll just add your extension to the manifest in
lava-deployment-tool and it will be a part of the release today.
Thanks
ZK
I noticed that v.l.o was being obnoxiously slow again and top revealed that
uwsgi had gone on a memory-eating binge again:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3799 lava-pro  20   0 33.3g  24g   28 D    1 82.5   8:20.37 uwsgi
I touched /srv/lava/instances/production/etc/lava-server/uwsgi.reload and after
a few minutes it had cleared up. It's clear the uwsgi changes that were
made previously weren't helping, though. Any ideas?
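One idea, untested here: uWSGI can recycle bloated workers on its own.
Something along these lines in the instance's uwsgi ini config (the values
are guesses):

    ; recycle a worker after it has served this many requests
    max-requests = 500
    ; restart any worker whose resident memory exceeds this many MB
    reload-on-rss = 512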
-Paul Larson
Hi.
I've created a new extension for LAVA called lava-raven. This
extension wraps Raven (which is a client for Sentry) in an easy-to-install
package. The code is up on launchpad [1] and pypi [2]; pypi
also has short documentation on how to set up sentry+raven in a LAVA
instance.
Along with raven I made a few small changes to linaro-django-xmlrpc
(merge request [3]) to support raven better (those changes don't
depend on raven though) and to lava-server (merge request [4]). The latter is
mostly a small helper base class for creating headless extensions. The
MP is quite long because I took the liberty of doing documentation
cleanups at the same time. Please have a look at those so that they
can land quickly.
Lastly, with the help of the Canonical IS team, I've set up a new DNS entry
for our Sentry instance at [5]. Sadly Sentry is not easy to put under a URL
prefix so I had to resort to a separate hostname. Sentry is built
on top of Django so it has a local user database. There is no OpenID
support and currently only I have an account there. Ping me to get
another account if you need it. As we use Sentry we'll see if we can
keep this view public (I strongly doubt it, as it has all the request
data, including (probably) security tokens). Sentry currently runs as a
new system user (sentry) on the validation control server. It's currently
using the SQLite backend so we may want to create a PostgreSQL
user/database for it instead if it gets slow.
In any case, as soon as those bits land we'll have live monitoring of
various parts of LAVA. All of the things that are integrated with LAVA
Server will automatically benefit (that includes the scheduler,
_except_ for the scheduler daemon). As we progress towards using
celery, all of our stack will be transparently using Raven. For those
bits that won't quickly become celery tasks, I think it would be worth
patching in some optional raven support (I'm looking at you,
lava-dispatcher).
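As a sketch of what optional raven support in the dispatcher could look like
(the DSN below is a placeholder; Client and captureException are raven's
standard client API):

    try:
        from raven import Client
    except ImportError:
        Client = None  # raven not installed: reporting becomes a no-op

    # Placeholder DSN; the real one comes from the Sentry project settings.
    _sentry = Client('http://key:secret@sentry.validation.linaro.org/1') if Client else None

    def report_exception():
        # Send the currently-handled exception to Sentry when raven is
        # available, and stay silent otherwise.
        if _sentry is not None:
            _sentry.captureException()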
Best regards
ZK
[1] https://launchpad.net/lava-raven
[2] http://pypi.python.org/pypi/lava-raven/0.1
[3] https://code.launchpad.net/~zkrynicki/linaro-django-xmlrpc/better-sentry-su…
[4] https://code.launchpad.net/~linaro-validation/lava-server/production/+merge…
[5] http://sentry.validation.linaro.org/