Hi all,
Can anyone help Dean at ARM out here?
Thanks
Dave
On 2 Oct 2013, at 15:49, Dean Arnold Dean.Arnold@arm.com wrote:
Hi Dave/Matt,
I have recently installed a remote worker on a new server, but I am unable to get it to talk to the master properly. I just wanted to check whether what I have done is correct and whether I have missed anything.
Here are the steps I followed:
My master server (pdsw-lava) is running a "production" instance of LAVA. I have updated LAVA to the latest version using lava-deployment-tool from git.
On the new dispatcher (pdsw-lava-dispatcher01), using the latest lava-deployment-tool from git, I installed the remote worker in the following way:
$ lava-deployment-tool setup
$ lava-deployment-tool installworker production
I gave the IP address, hostname, etc. of the master when prompted, along with the database info.
When instructed to do the following:
Running installation step remote_fs
Configuring remote fileystem access production
Please add the following public key into your master node's /srv/lava/instances/production/home/.ssh/authorized_keys file
-------- WORKER NODE PUBLIC KEY STARTS HERE --------
ssh-rsa .... ssh key used by LAVA for sshfs
-------- WORKER NODE PUBLIC KEY ENDS HERE --------
I had to create the home/.ssh directories and the authorized_keys file under /srv/lava/instances/production/ on the master, as they did not exist.
On pdsw-lava-dispatcher01 I created these directories:
/srv/lava/instances/production/etc/lava-dispatcher/devices
/srv/lava/instances/production/etc/lava-dispatcher/device-types
and SCP'd the contents of /srv/lava/instances/production/etc/lava-dispatcher/device-types over from my master server.
I removed the device config file pdswlava-vetc2-06.conf from /srv/lava/instances/production/etc/lava-dispatcher/devices on the master and added this to the same directory on the remote worker.
I then restarted lava on both servers (just in case it was needed).
I submitted a job on the master using pdswlava-vetc2-06 and this stayed in the "submitted" state but never did anything.
After a while I removed the pdswlava-vetc2-06.conf file from /srv/lava/instances/production/etc/lava-dispatcher/devices on the remote worker and added this back into the same directory on the master. Instantly the job ran as expected.
Can you think of anything I have missed in my setup which would prevent the master and worker from talking to each other? Is there anything I should be seeing in the web UI of the master to indicate it is connected to a remote worker, or is there an easy way to test the connection between the two?
Apologies if I have just done something really dumb, it wouldn't be the first time :)
Cheers Dean
-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590 ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
On Mon, 7 Oct 2013 09:03:09 +0100 Dave Pigott dave.pigott@linaro.org wrote:
> Hi Dave/Matt,
> I have recently installed a remote worker on a new server, but I am unable to get it to talk to the master properly. I just wanted to check whether what I have done is correct and whether I have missed anything.
> Here are the steps I followed:
> My master server (pdsw-lava) is running a "production" instance of LAVA. I have updated LAVA to the latest version using lava-deployment-tool from git.
> On the new dispatcher (pdsw-lava-dispatcher01), using the latest lava-deployment-tool from git, I installed the remote worker in the following way:
> $ lava-deployment-tool setup
> $ lava-deployment-tool installworker production
That should have been lava-deployment-tool setupworker
You now need to remove the lava-coordinator package: sudo apt-get purge lava-coordinator
Then copy the coordinator configuration from the master, as described here: http://validation.linaro.org/static/docs/lava-dispatcher/multinode.html#lava...
That would block MultiNode jobs but wouldn't affect other jobs.
setup adds packages needed by the master (like apache) which setupworker does not install.
> On pdsw-lava-dispatcher01 I created these directories:
> /srv/lava/instances/production/etc/lava-dispatcher/devices
> /srv/lava/instances/production/etc/lava-dispatcher/device-types
> and SCP'd the contents of /srv/lava/instances/production/etc/lava-dispatcher/device-types over from my master server.
> I removed the device config file pdswlava-vetc2-06.conf from /srv/lava/instances/production/etc/lava-dispatcher/devices on the master and added this to the same directory on the remote worker.
> I then restarted lava on both servers (just in case it was needed).
> I submitted a job on the master using pdswlava-vetc2-06 and this stayed in the "submitted" state but never did anything.
To debug the scheduler, check the lava-scheduler log on both the master and the worker, and switch the scheduler to debug logging too.
The log file will be something like: /srv/lava/instances/playground/var/log/lava-scheduler.log
To change the daemon to debug mode, edit the loglevel in the command at the end of this file: /etc/init/lava-instance-scheduler.conf
--loglevel=info => --loglevel=debug
Restart lava to get the daemon to notice the config change. Cancel the pending job, if it still exists, and submit a new one. Consider using a KVM device type to isolate problems with the device configuration.
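As a rough sketch, the loglevel change could be scripted as below. It assumes GNU sed and that the option appears literally as --loglevel=info in the upstart file; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: switch --loglevel=info to --loglevel=debug in an upstart job file.
# Defaults to the path quoted above; pass another file as $1 to try it on a copy first.
set_debug_loglevel() {
    conf="${1:-/etc/init/lava-instance-scheduler.conf}"
    # Keep a backup copy in the temp directory before editing in place.
    cp "$conf" "${TMPDIR:-/tmp}/$(basename "$conf").bak" || return 1
    # Assumes GNU sed (-i edits the file in place).
    sed -i 's/--loglevel=info/--loglevel=debug/' "$conf"
}

# Usage (as root):  set_debug_loglevel && service lava restart
```

Running it against a copy of the file first is a cheap way to confirm that the substitution matches before touching the real job file.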
> After a while I removed the pdswlava-vetc2-06.conf file from /srv/lava/instances/production/etc/lava-dispatcher/devices on the remote worker and added this back into the same directory on the master. Instantly the job ran as expected.
More than likely the answer will be in the lava-scheduler log on the worker.
Just check you have the scheduler enabled on the worker:
e.g. in your equivalent of /srv/lava/instances/playground/instance.conf:
LAVA_SCHEDULER_ENABLED='yes'
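A minimal check for that setting might look like this. The function name is hypothetical, and the pattern assumes the variable is set on a line of its own, with or without quotes around the value.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the instance config enables the scheduler.
# $1 = path to instance.conf (your equivalent of the playground path above).
scheduler_enabled() {
    grep -Eq "^[[:space:]]*LAVA_SCHEDULER_ENABLED=['\"]?yes['\"]?[[:space:]]*\$" "$1"
}

# Usage: scheduler_enabled /path/to/instance.conf && echo "scheduler enabled"
```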
> Can you think of anything I have missed in my setup which would prevent the master and worker from talking to each other? Is there anything I should be seeing in the web UI of the master to indicate it is connected to a remote worker, or is there an easy way to test the connection between the two?
As long as the scheduler daemon is running on the worker, the scheduler log on the worker should give some indication of whether it can connect.
The log also shows how many configured devices the scheduler thinks it has.
Hi Neil,
Thanks for the response. Unfortunately I am still having issues.
On a fresh machine (clean install of Ubuntu Server 12.04.3 LTS, 64-bit), following your advice, I ran the setupworker command before installing the worker, like so:
~/lava-deployment-tool/lava-deployment-tool setupworker
~/lava-deployment-tool/lava-deployment-tool installworker production
I used production as the instance, as this is the instance my master is running.
I have a single board on this remote worker. If I attempt to run a job on this board the job just stalls as mentioned before. Moving the board back to the master results in the job running though as expected. This makes me think the connection to my worker is still misconfigured.
I noticed when installing the worker I got the following error:
Remote filesystem configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The remote worker needs write access to the 'media' directory on the master LAVA node so that dispatcher logs will be visible. This is managed by configuring sshfs to mount the master's media directory
Master Instance Host: 'pdsw-lava.cambridge.arm.com'
Master Instance User: 'lava-production'
Master Instance Directory: '/srv/lava/instances/production'
next - Use the master information as is
edit - Edit the master information
Please decide what to do [next]: next
./lava-deployment-tool: line 270: defaults_coordinator: command not found
Could this be what is causing my problems?
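Whether or not that error is the cause, one thing that can be checked independently is whether the sshfs mount of the master's media directory is actually present on the worker. A small sketch, assuming a Linux /proc/mounts (where sshfs mounts appear with type fuse.sshfs); the function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: list mount points whose filesystem type is fuse.sshfs.
# $1 = mounts table to read (defaults to /proc/mounts).
sshfs_mounts() {
    awk '$3 == "fuse.sshfs" { print $2 }' "${1:-/proc/mounts}"
}

# Usage: sshfs_mounts   (no output means nothing is mounted over sshfs)
```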
> You need to now remove the lava-coordinator package - sudo apt-get purge lava-coordinator.
> Then copy the coordinator configuration from the master, as described here: http://validation.linaro.org/static/docs/lava-dispatcher/multinode.html#lava-coordinator-setup
When running "sudo apt-get purge lava-coordinator" it said I didn't have this package installed.
My /etc/lava-coordinator/lava-coordinator.conf (is this the right file?) looks like this:
{
  "port": 3079,
  "blocksize": 4096,
  "poll_delay": 3,
  "coordinator_hostname": "pdsw-lava.cambridge.arm.com"
}
pdsw-lava.cambridge.arm.com is the name of my master server.
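To pull the host and port out of that file and try the port from the worker, something like the following could be used. This is only a sketch: the sed-based parsing assumes the flat, one-level JSON shown above, the function name is invented, and nc -z is assumed to be available (as in the OpenBSD netcat). Per the earlier reply, the coordinator only matters for MultiNode jobs, so a failure here would not by itself explain a stalled single-node job.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from a flat JSON config.
# $1 = config file, $2 = key name. This is not a real JSON parser.
coordinator_field() {
    # Put each key/value pair on its own line, then strip the key and any quotes.
    tr ',{}' '\n\n\n' < "$1" |
    sed -n "s/^[[:space:]]*\"$2\"[[:space:]]*:[[:space:]]*\"\{0,1\}\([^\"[:space:]]*\)\"\{0,1\}[[:space:]]*\$/\1/p"
}

# Usage (nc -z only tests that the TCP port accepts connections):
#   conf=/etc/lava-coordinator/lava-coordinator.conf
#   nc -z -w 5 "$(coordinator_field "$conf" coordinator_hostname)" "$(coordinator_field "$conf" port)"
```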
> --loglevel=info => --loglevel=debug
I have enabled this and restarted lava on the remote like so:
sudo service lava restart
This all seems to start OK.
> Restart lava to get the daemon to notice the config change. Cancel the pending job, if it still exists, and submit a new one. Consider using a KVM device type to isolate problems with the device configuration.
> More than likely the answer will be in the lava-scheduler log on the worker.
Unfortunately I don't even have a log for the scheduler. Will this only appear if the scheduler is kicked into life by the master? The only log I have is lava-uwsgi.log.
> Just check you have the scheduler enabled on the worker:
> e.g. in your equivalent of /srv/lava/instances/playground/instance.conf:
> LAVA_SCHEDULER_ENABLED='yes'
This is enabled.
Thanks Dean
On Mon, 7 Oct 2013 15:30:16 +0100 Dean Arnold Dean.Arnold@arm.com wrote:
> Thanks for the response. Unfortunately I am still having issues. When running "sudo apt-get purge lava-coordinator" it said I didn't have this package installed.
> My /etc/lava-coordinator/lava-coordinator.conf (is this the right file?) looks like this:
Yes. That looks correct.
> --loglevel=info => --loglevel=debug
> I have enabled this and restarted lava on the remote like so:
> sudo service lava restart
> This all seems to start OK.
Is the scheduler daemon running on the remote worker?
$ ps waux|grep lava-server|grep scheduler
If there is no logfile, it's quite possible that something is misconfigured and the scheduler isn't even starting on the remote worker. That would explain why jobs are not being assigned.
> More than likely the answer will be in the lava-scheduler log on the worker.
> Unfortunately I don't even have a log for the scheduler. Will this only appear if the scheduler is kicked into life by the master?
No. Once the process is using --loglevel=debug, then there will always be content in the scheduler log if the scheduler is running.
If it's not running, try executing the command directly as root - at least that way you'll see any errors.
The command on the worker will be similar to what clearly *is* running on the master, with the loglevel change.
linaro-validation@lists.linaro.org