Hello,
On Thu, 18 Apr 2013 11:57:43 +0530 Vishal Bhoj vishal.bhoj@linaro.org wrote:
On 16 April 2013 15:31, Paul Sokolovsky paul.sokolovsky@linaro.org wrote:
Hello,
On Tue, 16 Apr 2013 06:19:51 +0530 Vishal Bhoj vishal.bhoj@linaro.org wrote:
[]
This error is related to infrastructure. I am not sure if this will be resolved if the publishing is updated.
ChannelClosedException, as quoted below, means an EC2 instance got terminated (or otherwise "lost") behind Jenkins' back. Generally, this is a known non-deterministic failure, and it is bound to happen from time to time due to the nature of EC2 (a big, complex system with a non-zero stream of errors).
We really need to find a solution for this. We are running into this error on a regular basis nowadays.
Yes, I see that the 2 builds I tried yesterday didn't succeed either, with the same error. Builds of that job look weird, because they're still in the compile phase 3.5 hours after the start, which is too long. The last 2 builds were killed at almost the same time, but it's not the build timeout (set at 4:45mins, and it looks different), and not the EC2 monitoring script (not active now, and it never killed running builds, only zombie instances).
I'm cc:ing Phillip in case he knows of anything that might kill EC2 instances in the "old" Linaro EC2 account after 3.5 hours.
I still think the likely cause is master overload due to publishing issues, though, and would like to keep working on resolving that first. I have good results so far; a "copycat" build on a sandbox finished in less than 2 minutes:
https://ec2-107-20-93-222.compute-1.amazonaws.com/jenkins/job/pfalcon_galaxy...
That's half of the work needed, though; I'm going to deploy the needed parts on production and continue from there.
Is there any update on why we are seeing this failure? We continue to see the same failure: https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-linaro-...
So, yesterday evening (GMT), the new publishing was deployed, and the jobs previously known to have publishing failures were confirmed to be OK.
Later, with daily builds kicking in and a small zombie pile-up on ci.linaro.org, publishing testing turned into overall stress testing, and at 12 concurrent builds the Jenkins master started to choke and lose track of build slaves. Well, 12 parallel builds is too many anyway; we never tested with more than 10 previously, and normally don't have more than 6.
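To illustrate the concurrency point above, here is a minimal Python sketch of capping the number of parallel builds. It is only an illustration, not how Jenkins itself schedules work: the names `run_builds` and `MAX_CONCURRENT_BUILDS` are hypothetical, the "build" is a short sleep, and the cap of 10 is an assumption taken from the "never tested with more than 10" remark.

```python
import threading
import time

# Assumption: cap at the highest load that was tested previously.
MAX_CONCURRENT_BUILDS = 10


def run_builds(num_builds, limit=MAX_CONCURRENT_BUILDS, build_seconds=0.01):
    """Run num_builds simulated builds, never more than `limit` at once.

    Returns the peak number of builds that were running simultaneously.
    """
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    running = 0
    peak = 0

    def build():
        nonlocal running, peak
        with slots:  # blocks while `limit` builds are already running
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(build_seconds)  # stand-in for the real build work
            with lock:
                running -= 1

    threads = [threading.Thread(target=build) for _ in range(num_builds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    # 12 builds requested, but at most 10 ever run at the same time.
    print("peak concurrency:", run_builds(12))
```

With a cap like this, a burst of 12 requested builds would queue the extra ones instead of all of them hitting the master at once.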
That build was rebuilt without issues a bit later:
https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-linaro-...
An access-control issue was also discovered and fixed, so currently there are no known issues on android-build.linaro.org. I'm continuing to monitor it and am around to resolve any issues.
[]