On Fri, Oct 13, 2017 at 12:31:37PM +0200, Greg KH wrote:
I've only looked at this for kernelci; I'm not sure what's going on with LKFT here and haven't talked to anyone working on it, but I'll bet it was the same.
> Turns out that 4.9.55-rc1 did not work at all for networking, yet no tests seem to have caught it. Are we not testing something with a network here? You said you were using NFS, how did that work?
Looks like it wasn't networking in general that was broken but rather the specific non-IP protocol that DHCP userspace uses for DHCP packets. You're going to find that most people testing in labs use static network configurations, so they won't exercise DHCP by default. Quite apart from anything else, if you're working with embedded stuff, the woeful unwillingness of hardware manufacturers to provide unique MAC addresses on their development boards makes getting a consistent IP address to the boards troublesome; it's a lot easier to just provide a static configuration.
For example kernelci has a bunch of NFS boots like this one for the affected kernels:
https://storage.kernelci.org/stable/linux-4.9.y/v4.9.55/arm/multi_v7_defconf...
but as you'll see from the log, a static network configuration was passed on the kernel command line and so the bug was bypassed. If this analysis is correct, I imagine Google caught this because they do use DHCP: they test with production hardware units, where the device manufacturers could be bothered to program unique MAC addresses, and that lets them test the default network configuration in their production software.
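As a rough illustration of the point about command lines, here is a minimal sketch that classifies the `ip=` option from a boot log's kernel command line. The function name and the result labels are hypothetical, and the parsing is a simplification of the kernel's documented `ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>` syntax. Note that `ip=dhcp` here means in-kernel autoconfiguration, which may not exercise the same path as a userspace DHCP client.

```python
# Keywords the kernel accepts for the autoconf field (simplified list).
KERNEL_DHCP_MODES = {"dhcp", "on", "any", "both"}
AUTOCONF_KEYWORDS = KERNEL_DHCP_MODES | {"off", "none", "bootp", "rarp"}


def classify_ip_param(cmdline: str) -> str:
    """Return 'static', 'kernel-dhcp', 'other' or 'unset' for ip=.

    Hypothetical helper: a lab could run this over boot logs to see
    how many jobs never touch DHCP at all.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("ip="):
            value = token[len("ip="):]  # assumption: the last ip= wins
    if value is None:
        return "unset"

    fields = value.split(":")
    if len(fields) == 1 and fields[0] in AUTOCONF_KEYWORDS:
        # Short form such as ip=dhcp or ip=off.
        client_ip, autoconf = "", fields[0]
    else:
        client_ip = fields[0]
        autoconf = fields[6] if len(fields) > 6 else ""

    if autoconf in KERNEL_DHCP_MODES:
        return "kernel-dhcp"
    if client_ip and autoconf in {"", "off", "none"}:
        return "static"
    return "other"
```

A job classified as 'static' gets its address without any DHCP exchange, which is consistent with the bug not showing up in such boots.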
> Anything we can do to add to the tests to verify that basic dhcp works? We should learn from our mistakes...
If this is what's going on, it's probably possible to move the x86 systems being used for testing to DHCP, as they will have sensible MAC addresses programmed and so can be assigned static IPs by the DHCP server (that's just an idea, dunno if there are operational issues that make it impractical; like I say, I only looked at this for kernelci). An explicit testsuite for the mechanism that's being used for DHCP packets (I forget what the modern thing is, it's been years since I looked at that stuff) would obviously also be useful, though if it's just that and doesn't go out onto a physical link it's not going to have quite the same coverage and won't catch everything a full stack test would.
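For what an explicit test of that mechanism might look like, here is a minimal sketch, assuming the mechanism in question is a Linux AF_PACKET raw socket (that is an assumption, not something confirmed above). It builds a DHCPDISCOVER as a complete Ethernet frame and offers a helper to push it out through a packet socket. The function names are hypothetical, sending needs CAP_NET_RAW, and a real test would also have to wait for and check a DHCPOFFER.

```python
import socket
import struct


def ip_checksum(header: bytes) -> int:
    """Standard ones'-complement checksum over an even-length header."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_dhcp_discover_frame(mac: bytes, xid: int) -> bytes:
    """Build a broadcast DHCPDISCOVER as a full Ethernet frame."""
    zero_ip = bytes(4)
    # Fixed 236-byte BOOTP header; struct pads chaddr/sname/file with NULs.
    bootp = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1, 1, 6, 0,            # op=BOOTREQUEST, htype=Ethernet, hlen, hops
        xid, 0, 0x8000,        # xid, secs, flags (broadcast bit set)
        zero_ip, zero_ip, zero_ip, zero_ip,
        mac, b"", b"",
    )
    # Magic cookie, option 53 (message type) = 1 (DISCOVER), end option.
    dhcp = bootp + b"\x63\x82\x53\x63" + bytes([53, 1, 1, 255])

    # UDP 68 -> 67; a zero checksum is allowed for UDP over IPv4.
    udp = struct.pack("!HHHH", 68, 67, 8 + len(dhcp), 0) + dhcp

    def ip_header(checksum: int) -> bytes:
        return struct.pack(
            "!BBHHHBBH4s4s",
            0x45, 0, 20 + len(udp), 0, 0, 64, socket.IPPROTO_UDP,
            checksum, zero_ip, b"\xff" * 4,
        )

    ip = ip_header(ip_checksum(ip_header(0)))
    ethernet = b"\xff" * 6 + mac + struct.pack("!H", 0x0800)
    return ethernet + ip + udp


def send_frame(ifname: str, frame: bytes) -> int:
    """Send a raw frame on ifname via AF_PACKET (Linux only, needs root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        return sock.send(frame)
```

Something like `send_frame("eth0", build_dhcp_discover_frame(mac, xid))` would exercise the packet socket send path, with "eth0" standing in for whatever interface the board has. As noted, if the frame never goes out onto a physical link the coverage is not the same as a full stack test.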
> Back to booting the kernels on my laptop and running 'make allmodconfig', that's found more errors so far than anything else...
To be fair they have caught and fixed some things like the timer bugs that John found, it's just that none of them have been boot breaks thus far.