On Mon, Oct 09, 2017 at 09:44:36AM +0000, Linaro QA wrote:
Summary
kernel: 4.9.54
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.9.y
git commit: f37eb7b586f1dd24a86c50278c65322fc6787722
git describe: v4.9.54
Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.54
Hm, I don't seem to have a test result here for 4.9.55-rc1, so I'm hijacking this thread to ask what happened to the tests there?
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes...
Heck, the simple build/boot tests that Google runs instantly found this issue when I tried to merge 4.9.55 into the android-common trees, maybe I should just rely on those tests from now on?
Because of this, I had to release 4.9.56 a few hours later fixing the issue. This doesn't make me feel like I can trust this testing effort at all just yet, given the track record it has shown this past week. We have had two known-bad kernel releases and none of this testing caught it at all (both were caught by community members...) :(
Back to booting the kernels on my laptop and running 'make allmodconfig', that's found more errors so far than anything else...
greg k-h
On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
I just looked at this for kernelci, not sure what's going on with LKFT here and haven't talked to anyone working on it but I'll bet it was the same.
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Looks like it wasn't networking in general but rather the specific non-IP protocol that DHCP userspace uses for DHCP packets that was broken. You're going to find that most people testing in labs will use static network configurations, so won't exercise DHCP by default. Quite apart from anything else, if you're working with embedded stuff the woeful unwillingness of hardware manufacturers to provide unique MAC addresses on their development boards means that trying to get a consistent IP address to the boards becomes troublesome; it's a lot easier to just provide a static configuration.
For example kernelci has a bunch of NFS boots like this one for the affected kernels:
https://storage.kernelci.org/stable/linux-4.9.y/v4.9.55/arm/multi_v7_defconf...
but as you'll see from the log, a static network configuration was passed on the kernel command line and the bug was bypassed. If this analysis is correct, I imagine Google caught this because they do use DHCP, thanks to a combination of testing with production hardware units (where the device manufacturers could be bothered to program unique MAC addresses) and being able to use that to test the default network configuration in their production software.
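To make the point about the boot log concrete, here is a minimal Python sketch that inspects a kernel command line and reports whether the `ip=` parameter supplies a static address or asks for autoconfiguration. The function name and the three labels are hypothetical, and the logic is a deliberate simplification of the `ip=` syntax described in the kernel's nfsroot documentation: any value that starts with a client address is treated as static. Note also that `ip=dhcp` selects the kernel's own boot-time autoconfiguration, which may not exercise the same code path as a userspace DHCP client.

```python
# Keywords that ask the kernel to autoconfigure the interface at boot.
AUTOCONF_KEYWORDS = {"on", "any", "dhcp", "bootp", "rarp", "both"}


def classify_ip_param(cmdline):
    """Return 'static', 'auto', or 'none' for a kernel command line.

    Hypothetical helper: a rough heuristic, not a full parser of `ip=`.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("ip="):
            value = token[len("ip="):]  # keep the last occurrence seen
    if value is None or value in ("off", "none"):
        return "none"
    if value in AUTOCONF_KEYWORDS:
        return "auto"
    # Long form: <client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
    client_ip = value.split(":")[0]
    return "static" if client_ip else "auto"
```

A lab could run something like this over its job definitions to see how many boots never request a lease at all.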
Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes...
If this is what's going on, it's probably possible to move the x86 systems being used for testing to DHCP, as they will have sensible MAC addresses programmed and so can be assigned static IPs by the DHCP server (that's just an idea, dunno if there are operational issues that make it impractical; like I say, I only looked at this for kernelci). An explicit testsuite for the mechanism that's being used for DHCP packets (I forget what the modern thing is, it's been years since I looked at that stuff) would obviously also be useful, though if it's just that and doesn't go out onto a physical link it's not going to have quite the same coverage and won't catch everything a full stack test would.
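As a starting point for such a testsuite, here is a minimal Python sketch that builds the payload of a DHCPDISCOVER message following the fixed BOOTP layout in RFC 2131. It only constructs bytes; it does not open any socket, so on its own it exercises nothing in the kernel. The thread does not name the socket mechanism a real client uses to send this, so that part is left out rather than guessed, and the function name is hypothetical.

```python
import struct

# Magic cookie that separates the fixed BOOTP header from the DHCP options.
DHCP_MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])


def build_dhcp_discover(mac, xid):
    """Build a minimal DHCPDISCOVER payload (hypothetical helper)."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte Ethernet MAC address")
    zero_ip = b"\x00" * 4
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,        # op: BOOTREQUEST
        1,        # htype: Ethernet
        6,        # hlen: length of the hardware address
        0,        # hops
        xid,      # transaction id chosen by the client
        0,        # secs
        0x8000,   # flags: ask the server to reply by broadcast
        zero_ip,  # ciaddr
        zero_ip,  # yiaddr
        zero_ip,  # siaddr
        zero_ip,  # giaddr
        mac,      # chaddr, zero-padded to 16 bytes by struct
        b"",      # sname, zero-padded to 64 bytes
        b"",      # file, zero-padded to 128 bytes
    )
    # Option 53 (message type), length 1, value 1 (DISCOVER), then end option 255.
    options = bytes([53, 1, 1, 255])
    return header + DHCP_MAGIC_COOKIE + options
```

A test that sends this over a loopback-style link would cover the packet path but, as noted above, not a physical link or a full client.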
Back to booting the kernels on my laptop and running 'make allmodconfig', that's found more errors so far than anything else...
To be fair they have caught and fixed some things like the timer bugs that John found, it's just that none of them have been boot breaks thus far.
On Fri, Oct 13, 2017 at 12:23:37PM +0100, Mark Brown wrote:
On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
I just looked at this for kernelci, not sure what's going on with LKFT here and haven't talked to anyone working on it but I'll bet it was the same.
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Looks like it wasn't networking in general but rather the specific non-IP protocol that DHCP userspace uses for DHCP packets that was broken. You're going to find that most people testing in labs will use static network configurations, so won't exercise DHCP by default. Quite apart from anything else, if you're working with embedded stuff the woeful unwillingness of hardware manufacturers to provide unique MAC addresses on their development boards means that trying to get a consistent IP address to the boards becomes troublesome; it's a lot easier to just provide a static configuration.
For example kernelci has a bunch of NFS boots like this one for the affected kernels:
https://storage.kernelci.org/stable/linux-4.9.y/v4.9.55/arm/multi_v7_defconf...
but as you'll see from the log, a static network configuration was passed on the kernel command line and the bug was bypassed. If this analysis is correct, I imagine Google caught this because they do use DHCP, thanks to a combination of testing with production hardware units (where the device manufacturers could be bothered to program unique MAC addresses) and being able to use that to test the default network configuration in their production software.
As android devices are not normally set up using static IPs, it seems natural for them to test DHCP :)
There's also a whole networking test suite that Android has, I'm not sure if the initial "smoke tests" include them, but I know they would be great to pull into the lts testing as well if at all possible.
Ok, so that makes more sense why kernelci didn't see that, I assumed those boxes were all static IPs, didn't realize that Linaro's test systems were set up the same, especially for x86 as you mention.
Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes...
If this is what's going on, it's probably possible to move the x86 systems being used for testing to DHCP, as they will have sensible MAC addresses programmed and so can be assigned static IPs by the DHCP server (that's just an idea, dunno if there are operational issues that make it impractical; like I say, I only looked at this for kernelci). An explicit testsuite for the mechanism that's being used for DHCP packets (I forget what the modern thing is, it's been years since I looked at that stuff) would obviously also be useful, though if it's just that and doesn't go out onto a physical link it's not going to have quite the same coverage and won't catch everything a full stack test would.
Back to booting the kernels on my laptop and running 'make allmodconfig', that's found more errors so far than anything else...
To be fair they have caught and fixed some things like the timer bugs that John found, it's just that none of them have been boot breaks thus far.
Yeah, but the boot and build breakages should be much easier to find, one would think :)
thanks,
greg k-h
On Fri, Oct 13, 2017 at 12:28:46PM +0000, Greg KH wrote:
On Fri, Oct 13, 2017 at 12:23:37PM +0100, Mark Brown wrote:
On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
I just looked at this for kernelci, not sure what's going on with LKFT here and haven't talked to anyone working on it but I'll bet it was the same.
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Looks like it wasn't networking in general but rather the specific non-IP protocol that DHCP userspace uses for DHCP packets that was broken. You're going to find that most people testing in labs will use static network configurations, so won't exercise DHCP by default. Quite apart from anything else, if you're working with embedded stuff the woeful unwillingness of hardware manufacturers to provide unique MAC addresses on their development boards means that trying to get a consistent IP address to the boards becomes troublesome; it's a lot easier to just provide a static configuration.
In the LKFT lab, we do use dhcp, but we do not use 'dhclient', which seems to have been required to trigger this bug. We bring up interfaces two different ways: at boot time, which worked fine, and explicitly on hikey using 'udhcpc', which also worked fine.
I am able to reproduce the problem by running 'dhclient' on x15 explicitly. The problem does not occur on 4.9.54, but it does occur on 4.9.55-rc1 as well as 4.9.55 release.
4.9.54 (pass): https://lkft.validation.linaro.org/scheduler/job/46865
4.9.55-rc1 (fail): https://lkft.validation.linaro.org/scheduler/job/46859#L1068
4.9.55 (fail): https://lkft.validation.linaro.org/scheduler/job/46866#L1078
I haven't tested 4.9.56 yet but I will shortly and expect it to pass.
We do have some existing tests that use dhclient, in both LTP and our test-definitions repository, that we don't currently run. If it's a good idea, we could add such tests to exercise code paths that we are concerned about. I just want to be sure that we add test coverage strategically and not just as a knee-jerk reaction to one-time problems.
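For illustration only, a check along these lines could be as small as the following Python sketch: run a DHCP client command once with a timeout and treat a non-zero exit, a timeout, or a missing binary as a failure. This is not taken from LTP or the test-definitions repository; the function name is hypothetical, and the interface name `eth0` in the usage note is an assumption.

```python
import subprocess


def dhcp_client_succeeds(command, timeout_s=30.0):
    """Run a DHCP client command once; True only if it exits with status 0.

    Hypothetical helper: `command` is an argument list for whichever client
    the lab uses, and a hang is reported as a failure rather than blocking.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out waiting for a lease, or the client binary is not installed.
        return False
    return result.returncode == 0
```

On a board this might be called as `dhcp_client_succeeds(["dhclient", "-1", "-v", "eth0"])` or `dhcp_client_succeeds(["udhcpc", "-i", "eth0", "-n", "-q"])`, both of which need root. Running more than one client matters here, since only dhclient triggered the failure.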
Hope that helps clarify,
Dan
On Fri, Oct 13, 2017 at 10:16:04AM -0500, Dan Rue wrote:
On Fri, Oct 13, 2017 at 12:28:46PM +0000, Greg KH wrote:
On Fri, Oct 13, 2017 at 12:23:37PM +0100, Mark Brown wrote:
On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
I just looked at this for kernelci, not sure what's going on with LKFT here and haven't talked to anyone working on it but I'll bet it was the same.
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Looks like it wasn't networking in general but rather the specific non-IP protocol that DHCP userspace uses for DHCP packets that was broken. You're going to find that most people testing in labs will use static network configurations, so won't exercise DHCP by default. Quite apart from anything else, if you're working with embedded stuff the woeful unwillingness of hardware manufacturers to provide unique MAC addresses on their development boards means that trying to get a consistent IP address to the boards becomes troublesome; it's a lot easier to just provide a static configuration.
In the LKFT lab, we do use dhcp, but we do not use 'dhclient', which seems to have been required to trigger this bug. We bring up interfaces two different ways: at boot time, which worked fine, and explicitly on hikey using 'udhcpc', which also worked fine.
I am able to reproduce the problem by running 'dhclient' on x15 explicitly. The problem does not occur on 4.9.54, but it does occur on 4.9.55-rc1 as well as 4.9.55 release.
4.9.54 (pass): https://lkft.validation.linaro.org/scheduler/job/46865
4.9.55-rc1 (fail): https://lkft.validation.linaro.org/scheduler/job/46859#L1068
4.9.55 (fail): https://lkft.validation.linaro.org/scheduler/job/46866#L1078
I haven't tested 4.9.56 yet but I will shortly and expect it to pass.
We do have some existing tests that use dhclient, in both LTP and our test-definitions repository, that we don't currently run. If it's a good idea, we could add such tests to exercise code paths that we are concerned about. I just want to be sure that we add test coverage strategically and not just as a knee-jerk reaction to one-time problems.
Hope that helps clarify,
Yes that does thanks. We just got unlucky in the specific bug here.
greg k-h
On Fri, Oct 13, 2017 at 02:28:46PM +0200, Greg KH wrote:
On Fri, Oct 13, 2017 at 12:23:37PM +0100, Mark Brown wrote:
To be fair they have caught and fixed some things like the timer bugs that John found, it's just that none of them have been boot breaks thus far.
Yeah, but the boot and build breakages should be much easier to find, one would think :)
Sort of. They do tend to be quite easy to detect if you're running the right boot test but we've acquired rather a lot of different options in our boot paths (even excluding all the possible driver combinations) so getting the coverage there isn't as trivial as you might hope. kernelci regularly finds boot problems that only manifest on some small subset of boards or subset of boot methods.
The LKFT people can correct me if I'm wrong here, but my understanding is that they're focusing more on improving coverage of things beyond boot than on the basic boot stuff, given that there's other stuff like kernelci and Olof's build farm out there covering boot. Boot does benefit from having a big broad set of boards to test on, and LKFT doesn't have that.
On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
On Mon, Oct 09, 2017 at 09:44:36AM +0000, Linaro QA wrote:
Summary
kernel: 4.9.54
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.9.y
git commit: f37eb7b586f1dd24a86c50278c65322fc6787722
git describe: v4.9.54
Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.54
Hm, I don't seem to have a test result here for 4.9.55-rc1, so I'm hijacking this thread to ask what happened to the tests there?
I broke the infrastructure with an update yesterday, and that's why no tests were triggered. All the pending jobs have been queued and are already being processed. I expect the system to be done with the backlog by the end of today.
On Oct 13, 2017, at 5:31 AM, Greg KH <gregkh@google.com> wrote:
On Mon, Oct 09, 2017 at 09:44:36AM +0000, Linaro QA wrote:
Summary
kernel: 4.9.54
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.9.y
git commit: f37eb7b586f1dd24a86c50278c65322fc6787722
git describe: v4.9.54
Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.54
Hm, I don't seem to have a test result here for 4.9.55-rc1, so I'm hijacking this thread to ask what happened to the tests there?
Was posted to linux-stable on Oct 11.
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes…
You’ve probably heard me and/or others lament about coverage. I see that as an issue that *we* the linux-stable community need to hash out.
Frankly, across the entire Android community, with all the companies that do some amount of kernel effort, we all should be looking at LTS and thinking about how to test it better. LKFT is really only a start. Failures like the ones you saw this time around illustrate that we need to go further. That’s my session idea over the next year at various open source conferences.
Putting my Android hat on, it does bug me that we’re not merging and testing the rc LTS patches as part of the rc cycle. It’s work I know but has to be done at some point. Why wait? It sounds like you might be doing this on Google’s infra already?
Amit Pundir has a public mainline tree that blends the out of tree android common patches with mainline as the rcs roll by. That sees some amount of testing. We haven’t put it into LKFT but we sure could.
Heck, the simple build/boot tests that Google runs instantly found this issue when I tried to merge 4.9.55 into the android-common trees, maybe I should just rely on those tests from now on?
You’d think that even the linux distro community (if they were watching RCs) should have picked it up as well.
Dan’s comment about dhclient needing to be involved is spot on tho.
Because of this, I had to release 4.9.56 a few hours later fixing the issue. This doesn't make me feel like I can trust this testing effort at all just yet, given the track record it has shown this past week. We have had two known-bad kernel releases and none of this testing caught it at all (both were caught by community members...) :(
Given that dhclient was required for one of the uncaught errors, and KASAN for the other, this illustrates the slippery slope that can quickly become distro testing and trying to do infinite combinations.
Gentoo finds a bug, Debian doesn’t, therefore Debian can’t be trusted. We’ve both lived in that world, Greg, and I hope you appreciate the logic when I observe that what you are saying is not a fair comment.
I hope you agree, we in the linux stable community have a coverage problem and we need to collectively get more eyeballs and own up to expanding what we’re doing.
In the android world there ought to be a fairly good swath of kernel engineers that should deeply care about 6 year (or whatever) LTS and pitch in. It won’t happen overnight.
Back to booting the kernels on my laptop and running 'make allmodconfig', that's found more errors so far than anything else...
greg k-h
On Fri, Oct 13, 2017 at 05:06:48PM -0500, Tom Gall wrote:
On Oct 13, 2017, at 5:31 AM, Greg KH <gregkh@google.com> wrote:
On Mon, Oct 09, 2017 at 09:44:36AM +0000, Linaro QA wrote:
Summary
kernel: 4.9.54
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.9.y
git commit: f37eb7b586f1dd24a86c50278c65322fc6787722
git describe: v4.9.54
Test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.54
Hm, I don't seem to have a test result here for 4.9.55-rc1, so I'm hijacking this thread to ask what happened to the tests there?
Was posted to linux-stable on Oct 11.
Ok, thanks, but was looking for an "internal" one to complain on :)
Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes…
You’ve probably heard me and/or others lament about coverage. I see that as an issue that *we* the linux-stable community need to hash out.
Frankly, across the entire Android community, with all the companies that do some amount of kernel effort, we all should be looking at LTS and thinking about how to test it better. LKFT is really only a start. Failures like the ones you saw this time around illustrate that we need to go further. That’s my session idea over the next year at various open source conferences.
I've been asking for this "help" for over a decade, and no one has cared, until now, when Google realized they were going to be the ones having to drive this, as no one else was. And hence, "LKFT". If this project isn't helping out, then I'm just going to drop the "long term stable" stuff and everyone can go back to how things used to be (i.e. crappy and insecure), as obviously no one cared enough.
Sorry if I sound grumpy about this, but the "tragedy of the commons" is really annoying to me at times.
Now if you can drum up more help and support for this, wonderful, I would love to see that happen. But first off, let's get _this_ project actually working to ensure that 4.4 will be able to be maintained over a long period of time. I think we still have a ways to go, given the other emails this week here...
Putting my Android hat on, it does bug me that we’re not merging and testing the rc LTS patches as part of the rc cycle. It’s work I know but has to be done at some point. Why wait? It sounds like you might be doing this on Google’s infra already?
I do it when I do the merge to the android-common tree. I can try to do it for the -rc releases as well, just to get some testing, but I was _hoping_ that the LKFT work would be doing all of that testing for me (the Google internal treehugger build/boot test is very simple, and not at all reliable.)
I'd rather LKFT provide a framework I could use for it, as all it "should" require is a tree I could push to somewhere {hint}.
Heck, the simple build/boot tests that Google runs instantly found this issue when I tried to merge 4.9.55 into the android-common trees, maybe I should just rely on those tests from now on?
You’d think that even the linux distro community (if they were watching RCs) should have picked it up as well.
The distros fall into two camps:
- rolling distros run by community members or a very limited number of company developers (Fedora, openSUSE, Gentoo, Arch, etc.)
- enterprise distros
The rolling distros rely on the kernel community to get this right, and provide feedback when they can, but they wait for the "real" release to happen to have a semblance of sanity. In fact, they were the ones that found this bug, so in that sense, they are doing a great job. They notified me within hours about the issue and the fix.
The enterprise distros don't care about stable kernels, or -rcs or anything else, they are of no use to us, or the community, here, other than the fact of them providing a patch stream into Linus's tree for new features that users actually want, and bugfixes that are found by their users over time (i.e. rare bugs).
Gentoo finds a bug, Debian doesn’t, therefore Debian can’t be trusted. We’ve both lived in that world, Greg, and I hope you appreciate the logic when I observe that what you are saying is not a fair comment.
Ok, fair enough, it just was a "perfect" storm last week of bugs that were missed by all testers, which made me wonder if any of the testing was actually useful. For all of us to ignore something like this, would be folly.
In the android world there ought to be a fairly good swath of kernel engineers that should deeply care about 6 year (or whatever) LTS and pitch in. It won’t happen overnight.
I can almost guarantee you that it will not happen at all, based on everything I've ever heard from all of the companies involved, Google being the exception here. People want someone else to do the work, and when I ask for help and a list of tasks of what the work is, they have never responded in the past, Linaro included...
So I might be jaded, but please, prove me wrong :)
thanks,
greg k-h