The LAVA website lacks documentation on how to build a test system with multiple remote slaves, for example when the system has a lot of test devices. What are the general design principles for planning the deployment of remote workers, how many DUTs to connect to one worker, etc.? Is there any background information on multiple slaves and MultiNode?
On Thu, 30 Aug 2018 at 11:14, john zhang laojianghusz@163.com wrote:
The LAVA website lacks documentation on how to build a test system with multiple remote slaves, for example when the system has a lot of test devices.
What are the general design principles for planning the deployment of remote
workers, how many DUTs to connect to one worker, etc.?
We have a complete page on this topic: https://staging.validation.linaro.org/static/docs/v2/growing_your_lab.html - check the Index for "growing your lab".
If you think anyone can give you numbers now you are horribly mistaken. Everything is dependent on the specific details of your situation and nobody else's experience can provide anything more than guesses as to how your lab will behave as it grows.
Is there any background information on multiple slaves and MultiNode?
Yes. Don't run before you can walk. Exactly as in the docs, do not start thinking about the full lab at this stage. You must start with a small lab and grow that lab organically. That is the only way to get *relevant* data about:
0: How many devices you need to keep the queues under control for the amount of CI test jobs that will be created.
1: How many devices can be put onto your specific dispatcher machines, according to how your CI test jobs stress the dispatcher hardware and how that specific hardware copes with the operations used by those test jobs.
2: How many devices and what mix of devices are safe to run on a single worker, according to your specification of the dispatcher hardware and your particular mix of test jobs.
3: How many workers might be needed to get the reliability you require.
4: Do NOT even start with MultiNode until you have run several of these reliability cycles. Avoid unnecessary complexity at every step - change one thing at a time and grow the lab SLOWLY because every step increases complexity and therefore lowers reliability and makes triage much harder.
You WILL experience intermittent failures that can take months of engineering time to fix. These are inevitable but the exact nature of the problem cannot be anticipated.
Do not start the planning now; it would be pointless and an expensive waste of time.
0: Build a lab of 6 or so devices - run it for a month or more with a limited number of CI Test Plans until you get maximal usage (most devices busy most of the time) with maximum reliability (3 or 4 infrastructure failures per 1,000 test jobs across all devices).
1: Test with adding more devices to the existing worker, see if reliability is maintained.
2: Add another worker with the same number & mix of devices and repeat the reliability test.
3: The maintenance burden will likely increase more quickly than a simple proportion of the workers added. Every new worker makes the entire instance more complex, lowering reliability in a nonlinear manner.
4: Earn the in-house knowledge of how your devices, your test jobs, your infrastructure and your admin team work best to get a reliable CI framework.
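As a rough illustration of the reliability check in step 0, here is a minimal Python sketch that works out infrastructure failures per 1,000 test jobs from a list of job records. The record layout (device name, outcome), the outcome label "infrastructure_failure", the device names and the function names are all assumptions made up for this example; they are not taken from any LAVA API. The default limit of 4 comes from the "3 or 4 per 1,000" figure above.

```python
from collections import Counter

# Hypothetical outcome label, chosen for this sketch only.
INFRA_FAILURE = "infrastructure_failure"


def infra_failures_per_thousand(jobs):
    """Return infrastructure failures per 1,000 test jobs across all devices.

    `jobs` is a list of (device_name, outcome) tuples (assumed layout).
    """
    total = len(jobs)
    if total == 0:
        raise ValueError("no test jobs recorded yet")
    failures = sum(1 for _device, outcome in jobs if outcome == INFRA_FAILURE)
    return 1000.0 * failures / total


def within_target(jobs, max_per_thousand=4.0):
    """True if the measured rate is at or below the chosen limit."""
    return infra_failures_per_thousand(jobs) <= max_per_thousand


def failures_by_device(jobs):
    """Count infrastructure failures per device, to show where triage should start."""
    return Counter(device for device, outcome in jobs if outcome == INFRA_FAILURE)


if __name__ == "__main__":
    # Made-up sample data: 1,000 jobs, 3 of them infrastructure failures.
    sample = [("dut-01", "complete")] * 997 + [("dut-02", INFRA_FAILURE)] * 3
    print(infra_failures_per_thousand(sample))
    print(within_target(sample))
    print(failures_by_device(sample))
```

Running the same calculation after each change (more devices on a worker, a new worker) gives a like-for-like number to compare before taking the next step.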
This cannot be rushed because all of the important details are completely dependent on your specific situation.