This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
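(For illustration only - a rough sketch of that workflow on a Debian system; the package name is arbitrary, and this assumes deb-src entries in sources.list plus the usual build tooling. For packages whose debian/rules uses dh, dpkg-buildpackage runs the upstream test suite via dh_auto_test unless tests are disabled.)

  sudo apt-get install build-essential devscripts
  apt-get source iproute2              # illustrative package choice
  sudo apt-get build-dep iproute2
  cd iproute2-*/
  dpkg-buildpackage -us -uc            # builds the package; build-time tests run as part of the build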
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
Regards, Tom
On 17 October 2017 at 01:14, Tom Gall tom.gall@linaro.org wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
Debian has two types of test supported within the packaging:
* in-build tests - needs a full build environment on the buildd device and tests the binaries which have just been compiled rather than what is actually installed. (buildd.debian.org) - most of these tests are only run and maintained against Debian unstable (when the package is first built for the archive). Intermittent rebuilds are performed, again using Debian unstable. These tests are therefore only a snapshot against a particular set of packages and the focus is on ensuring that the package builds successfully. e.g. https://buildd.debian.org/status/package.php?p=dpkg
* autopkg-tests - designed to test that the installed package works against updated dependencies. Needs a bit of setup (generally a QEMU image or an LXC) (ci.debian.org) - these tests are also run against Debian unstable (as that is where newly updated dependencies turn up). Tests are run continually, whenever a dependency is updated. Rather than a build environment, these tests require the package to be installed with a few extra tools. e.g. https://ci.debian.net/packages/l/lava-server/unstable/amd64/
Neither type of test covers 100% of packages (separately or combined) - in-build tests tend to be present in many packages written in C or Perl, while autopkgtests tend to be of more interest for packages with a wide range of dependencies. (e.g. LAVA uses autopkgtest.) Of particular note is that each of these runs against Debian unstable, which is a constantly moving target, not the stable release, which is what we will tend to deploy.
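(For reference, a minimal sketch of running a package's declared autopkgtests directly on a board rather than in QEMU/LXC - the package name is illustrative, and this assumes deb-src entries so the debian/tests metadata can be fetched. The 'null' backend runs the tests against the already-installed system.)

  sudo apt-get install autopkgtest
  autopkgtest dpkg -- null             # runs the tests listed in dpkg's debian/tests/control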
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
Carefully chosen, these tests will be useful but there isn't going to be a blanket we can put over all of the functional stacks. I'm uncertain how many of these tests will exercise the kernel as many will try quite hard to isolate the build/test environment from the runtime environment in the interests of reproducibility.
Some work will be required to write new tests in the gaps - between packages and the kernel. It is likely that organisations like Debian would consider these tests useful, once created.
On 17 October 2017 at 03:14, Tom Gall tom.gall@linaro.org wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
OE has the same mechanism. It's called ptest.
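(For reference, a rough sketch of how ptest is typically enabled in an OE build and run on the target - the local.conf lines and the package name are illustrative.)

  # local.conf (or the distro config)
  DISTRO_FEATURES_append = " ptest"
  EXTRA_IMAGE_FEATURES += "ptest-pkgs"

  # on the target image
  ptest-runner                         # run every installed ptest
  ptest-runner busybox                 # or a single package's ptest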
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
I'm not convinced it would have helped in the above example.
Regards, Tom
On Oct 17, 2017, at 1:44 AM, Fathi Boudra fathi.boudra@linaro.org wrote:
On 17 October 2017 at 03:14, Tom Gall tom.gall@linaro.org wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
OE has the same mechanism. It's called ptest.
Ok cool.
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
I'm not convinced it would have helped in the above example.
I think that's what we'd want to look into: do some of the package test suites do enough to drive representative activity in the kernel to be useful for finding regressions?
I don't think we'd just universally do this for all 10,000+ packages of a distro or distros. Though if we had a huge farm of servers we could shard out to… Anyway, this seems like an interesting experiment that we could apply to 4.9.55-rc1 to see whether it would have been detected.
Regards, Tom
On 17 October 2017 at 01:14, Tom Gall tom.gall@linaro.org wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
There are 53 open bugs, at least 3 test jobs fail on every attempt, and we struggle to pinpoint the root cause of the failures we already capture. In this situation adding more tests is the worst idea possible. IMHO the current highest priority should be 'making all tests green'. Once that happens (through bug fixes or disabling tests) we can add new tests.
milosz
Regards, Tom
Hi Milosz
On Oct 17, 2017, at 3:04 AM, Milosz Wasilewski milosz.wasilewski@linaro.org wrote:
On 17 October 2017 at 01:14, Tom Gall tom.gall@linaro.org wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
Thoughts?
There are 53 open bugs, at least 3 test jobs fail on every attempt, and we struggle to pinpoint the root cause of the failures we already capture. In this situation adding more tests is the worst idea possible. IMHO the current highest priority should be 'making all tests green'. Once that happens (through bug fixes or disabling tests) we can add new tests.
The point of this email is discussion. As you’ll note in the other email thread getting to a clean state is and remains top priority.
milosz
Regards, Tom
On Tue, Oct 17, 2017 at 12:14:43AM +0000, Tom Gall wrote:
This past RC cycle I think we exposed a weakness in what we have in LKFT: the ability to execute some key functional stacks in the system to drive the kernel would probably be useful for validation.
The networking bug involving dhclient, for example.
So what if we used Debian, Gentoo, or something similar that has, as part of its packaging system, a test target for each package? At its simplest: build a package, then run 'make test' (or the equivalent) for some key packages that exercise parts of the system; that should help tickle the kernel in interesting ways and tease out regressions.
This wouldn't work on modest boards, but the Socionext or Juno boards could probably handle it fine.
I think we just need to iterate on the framework we have until we're stable for a period of time. We are presently running many tests that, once trusted, will find regressions that nobody else notices in a timely manner. My biggest concern is trust - our results need to be rock solid and stable, so that they are trusted and so that people jump when there is a reported regression. Currently, that is not the case.
Once we reach that point, we will be able to carefully introduce additional tests strategically, based on gaps in our existing coverage. There are still many tests in LTP that we don't run, for example.
Regarding packaging tests - I think Neil covered it well. I would be surprised if there's value there, from a kernel testing perspective. I don't think introducing another OS is a good idea, unless we are actually considering dropping one.
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Dan
On 17 October 2017 at 15:02, Dan Rue dan.rue@linaro.org wrote:
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Naresh did gcov with the kernel some time ago, but that wasn't trivial IIRC. It requires the sources and symbols to be present in the root filesystem. It also requires some build-time instrumentation, so it's not transparent.
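(For context, a rough sketch of what that instrumentation looks like - the config options come from the kernel's gcov documentation; the post-processing step is only outlined here.)

  # kernel .config fragment
  CONFIG_DEBUG_FS=y
  CONFIG_GCOV_KERNEL=y
  CONFIG_GCOV_PROFILE_ALL=y

  # on the target, after running the test workload
  mount -t debugfs none /sys/kernel/debug
  ls /sys/kernel/debug/gcov            # per-object coverage data mirroring the build tree
  # gather that data and process it with gcov/lcov against the matching source tree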
milosz
On Tue, Oct 17, 2017 at 09:02:18AM -0500, Dan Rue wrote:
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Please read: http://blog.ploeh.dk/2015/11/16/code-coverage-is-a-useless-target-measure/
I worked with a team of developers over a decade ago trying to help with code-coverage analysis of the Linux kernel (many of those tests ended up in LTP). I'm pretty sure the ability is still there, but it turned out, in the end, that it means nothing at all.
Heck, even when you turn on fun things like "fail kmalloc() X% of the time to exercise error paths", you still don't really test the overall system.
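(For reference, the kmalloc failure injection mentioned here is the kernel's fault-injection framework; a rough sketch of enabling it at runtime, assuming a kernel built with CONFIG_FAULT_INJECTION_DEBUG_FS and CONFIG_FAILSLAB.)

  mount -t debugfs none /sys/kernel/debug
  cd /sys/kernel/debug/failslab
  echo 10  > probability               # probability of failure, in percent
  echo 100 > interval                  # interval between failure attempts
  echo -1  > times                     # keep injecting indefinitely
  echo 1   > verbose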
So please, never think in terms of code coverage, but feature coverage, like what LTP is trying to accomplish, is a great metric to strive for.
thanks,
greg k-h
On 17 October 2017 at 16:08, Greg KH gregkh@google.com wrote:
On Tue, Oct 17, 2017 at 09:02:18AM -0500, Dan Rue wrote:
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Please read: http://blog.ploeh.dk/2015/11/16/code-coverage-is-a-useless-target-measure/
I worked with a team of developers over a decade ago trying to help with code-coverage analysis of the Linux kernel (many of those tests ended up in LTP). I'm pretty sure the ability is still there, but it turned out, in the end, that it means nothing at all.
Heck, even when you turn on fun things like "fail kmalloc() X% of the time to exercise error paths", you still don't really test the overall system.
So please, never think in terms of code coverage, but feature coverage, like what LTP is trying to accomplish, is a great metric to strive for.
It depends a lot on the focus of the upstream team, but this approach is reflected in the in-build tests and install-time tests of various userspace projects. However, package-based tests alone are not a good way to test the complete system. Equally, neither are full system tests necessarily - it is all too easy to generate a test suite of hundreds of thousands of results which becomes all but impossible to debug when something subtle goes wrong that the test suite doesn't explicitly check. A targeted package-based or feature-specific test would identify the problem much more quickly.
It needs to be a layered approach combining small and large tests, package-based and system-based, which returns to Dan's original point that we need to iterate to get to a stable platform and then step up with wider tests. Userspace still has an effect on kernel support, especially at the level of init, as we found with the systemd getty race condition issue. A wider range of devices and a wider range of userspace software (including an extra distribution at a point in the future) helps in the triage. The systemd issue was first spotted on about 3% of jobs on x86_64 but it wasn't until it was reproducible on 50% of test jobs on the X15 that it became clear that this wasn't a kernel or hardware issue.
Reproducible bugs can be easy - intermittent bugs need wider and repetitive testing and rapidly become rabbit holes which devour engineering time. I would like to see the term "coverage" including this wider, more varied, support which becomes essential with the more difficult bugs.
What we will also need is a map of which tests are stressing which features - so a sane metric for feature coverage inside and outside the kernel would be needed here.
On Tue, Oct 17, 2017 at 03:39:15PM +0000, Neil Williams wrote:
On 17 October 2017 at 16:08, Greg KH gregkh@google.com wrote:
On Tue, Oct 17, 2017 at 09:02:18AM -0500, Dan Rue wrote:
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Please read: http://blog.ploeh.dk/2015/11/16/code-coverage-is-a-useless-target-measure/
I worked with a team of developers over a decade ago trying to help with code-coverage analysis of the Linux kernel (many of those tests ended up in LTP). I'm pretty sure the ability is still there, but it turned out, in the end, that it means nothing at all.
Heck, even when you turn on fun things like "fail kmalloc() X% of the time to exercise error paths", you still don't really test the overall system.
So please, never think in terms of code coverage, but feature coverage, like what LTP is trying to accomplish, is a great metric to strive for.
<snip> What we will also need is a map of which tests are stressing which features - so a sane metric for feature coverage inside and outside the kernel would be needed here.
++ ^^
From the blog post:
Some people use it to find areas where coverage is weak. There may be good reasons that some parts of a code base are sparsely covered by tests, but doing a manual inspection once in a while is a good idea. Perhaps you find that all is good, but you may also discover that a quality effort is overdue.
The problem that I have is that I don't know where coverage is strong, or where it is weak. Before last week, if someone suggested adding a 'dhclient' test, I would have told them it is redundant. Now, I know that dhclient actually uses a different code path than both init and udhcpc. The only way I know to measure feature coverage is to look at the LTP tests that we're running, and which we're not, but that is a secondary measure.
Do you have a good suggestion for evaluating feature coverage? I don't disagree with your feedback, but it would be good to have some shared perspective on coverage analysis so that we can improve it strategically rather than based on gut feelings, or as a reaction to uncaught problems.
I also agree with Mark's response that my coverage suggestion is premature. This whole thread is premature. But it's also premature to bring in additional test suites at this time. Have to stabilize and expand on what we have, namely LTP.
Dan
On Tue, Oct 17, 2017 at 11:26:36AM -0500, Dan Rue wrote:
On Tue, Oct 17, 2017 at 03:39:15PM +0000, Neil Williams wrote:
What we will also need is a map of which tests are stressing which features - so a sane metric for feature coverage inside and outside the kernel would be needed here.
From the blog post:
Some people use it to find areas where coverage is weak. There may be good reasons that some parts of a code base are sparsely covered by tests, but doing a manual inspection once in a while is a good idea. Perhaps you find that all is good, but you may also discover that a quality effort is overdue.
You are aware of the tendencies people have to latch onto metrics, right? In this case it's not like it's going to come as a sudden revelation that there are holes in coverage.
Do you have a good suggestion for evaluating feature coverage? I don't disagree with your feedback, but it would be good to have some shared perspective on coverage analysis so that we can improve it strategically rather than based on gut feelings, or as a reaction to uncaught problems.
I made a couple of concrete suggestions on this in my prior mail - picking up existing testsuites and looking at areas where there's active development or an awareness of frequent problems (including things like what's getting a lot of attention in terms of stable fixes). We could also just look at a phone and think about the subsystems it relies on; glancing at mine, graphics, multimedia, extcon, bluetooth and networking jump out off the top of my head as having weak coverage.
I also agree with Mark's response that my coverage suggestion is premature. This whole thread is premature. But it's also premature to bring in additional test suites at this time. Have to stabilize and expand on what we have, namely LTP.
I see where you're coming from but I don't think it's quite that black and white. Getting testsuites integrated with the framework and getting them to run cleanly are two different activities which probably want to be carried out by different people so there's something to be said for looking at the next batch of testsuites to stage into production before we're ready to do that. There will also be cases where different people should be looking at different testsuites due to their domain specific knowledge or where we can pull in people from the community so that work can be parallelized as well. There are limits to how far that can go though, and we do need to be careful we're not just flinging stuff at the wall.
On Tue, Oct 17, 2017 at 11:26 AM, Dan Rue dan.rue@linaro.org wrote:
The problem that I have is that I don't know where coverage is strong, or where it is weak. Before last week, if someone suggested adding a 'dhclient' test, I would have told them it is redundant. Now, I know that dhclient actually uses a different code path than both init and udhcpc. The only way I know to measure feature coverage is to look at the LTP tests that we're running, and which we're not, but that is a secondary measure.
Do you have a good suggestion for evaluating feature coverage? I don't disagree with your feedback, but it would be good to have some shared perspective on coverage analysis so that we can improve it strategically rather than based on gut feelings, or as a reaction to uncaught problems.
I also agree with Mark's response that my coverage suggestion is premature. This whole thread is premature. But it's also premature to bring in additional test suites at this time. Have to stabilize and expand on what we have, namely LTP.
Some projects with a more disciplined testing approach ask developers to submit reasonably complete feature-based tests alongside the enablement patch, and in the future a new test is required for each encountered regression. If at least the latter is enforced, it can build reasonable coverage over time.
Is it premature to work with the test suite projects right now to make sure that these regressions (dhclient & KASAN) have a test created _somewhere_ to document them?
On Tue, Oct 17, 2017 at 04:46:22PM -0500, Ryan Arnold wrote:
Some projects with a more disciplined testing approach ask developers to submit reasonably complete feature-based tests alongside the enablement patch, and in the future a new test is required for each encountered regression. If at least the latter is enforced, it can build reasonable coverage over time.
Is it premature to work with the test suite projects right now to make sure that these regressions (dhclient & KASAN) have a test created _somewhere_ to document them?
Well, it's always possible to contribute tests to relevant testsuites. You might have trouble finding a sensible existing testsuite for some things, and there will be plenty of issues where finding a sensible test is also unreasonably difficult, so you're not going to have much chance of making it a requirement in the foreseeable future.
KASAN is just an option that needs turning on in builds; it's not something you'd write a test for. It is already covered in kernelci, though it looks like it and a bunch of the other test configurations have been blacklisted for the stable kernels, so someone ought to look at re-enabling it - there were a bunch of build fixes backported a while ago, so probably fixes to enable KASAN were part of that, or it was just blacklisted at a point where no stable kernels worked.
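(A rough sketch of turning it back on for a local build, assuming an arm64 defconfig as the starting point and using the in-tree scripts/config helper - the toolchain prefix is illustrative.)

  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
  ./scripts/config -e KASAN
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- olddefconfig
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc) Image modules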
On Tue, Oct 17, 2017 at 04:46:22PM -0500, Ryan Arnold wrote:
On Tue, Oct 17, 2017 at 11:26 AM, Dan Rue dan.rue@linaro.org wrote:
The problem that I have is that I don't know where coverage is strong, or where it is weak. Before last week, if someone suggested adding a 'dhclient' test, I would have told them it is redundant. Now, I know that dhclient actually uses a different code path than both init and udhcpc. The only way I know to measure feature coverage is to look at the LTP tests that we're running, and which we're not, but that is a secondary measure.
Do you have a good suggestion for evaluating feature coverage? I don't disagree with your feedback, but it would be good to have some shared perspective on coverage analysis so that we can improve it strategically rather than based on gut feelings, or as a reaction to uncaught problems.
I also agree with Mark's response that my coverage suggestion is premature. This whole thread is premature. But it's also premature to bring in additional test suites at this time. Have to stabilize and expand on what we have, namely LTP.
Some projects with a more disciplined testing approach ask developers to submit reasonably complete feature-based tests alongside the enablement patch, and in the future a new test is required for each encountered regression. If at least the latter is enforced, it can build reasonable coverage over time.
We try to ask for a new test to be added for every new syscall, which is how kselftest has been growing over the past few years. For other things, like networking and storage features and filesystems, there are other test suites that are managed by the community to test those functions.
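(For reference, kselftests live in the kernel tree and can be run roughly like this - the TARGETS subset is purely illustrative.)

  # from the top of a kernel source tree, on the target or a matching rootfs
  make -C tools/testing/selftests TARGETS="net timers" run_tests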
Is it premature to work with the test suite projects right now to make sure that these regressions (dhclient & KASAN) have a test created _somewhere_ to document them?
Try implementing all of our known test suites first before worrying about this.
Oh, and a simple 'make allmodconfig' please, that would have caught the KASAN issue...
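(i.e. something along these lines in the build matrix - the toolchain prefix is illustrative.)

  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- allmodconfig
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc)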
thanks,
greg k-h
On Wed, Oct 18, 2017 at 11:51 AM, Greg KH gregkh@google.com wrote:
On Tue, Oct 17, 2017 at 04:46:22PM -0500, Ryan Arnold wrote:
On Tue, Oct 17, 2017 at 11:26 AM, Dan Rue dan.rue@linaro.org wrote:
Is it premature to work with the test suite projects right now to make sure that these regressions (dhclient & KASAN) have a test created _somewhere_ to document them?
Try implementing all of our known test suites first before worrying about this.
Oh, and a simple 'make allmodconfig' please, that would have caught the KASAN issue...
I think for that it is better to integrate the build results from kernelci, which already does a really good job of build testing.
Arnd
On Wed, Oct 18, 2017 at 12:09:17PM +0200, Arnd Bergmann wrote:
On Wed, Oct 18, 2017 at 11:51 AM, Greg KH gregkh@google.com wrote:
Oh, and a simple 'make allmodconfig' please, that would have caught the KASAN issue...
I think for that it is better to integrate the build results from kernelci, which already does a really good job of build testing.
Definitely, and it's where people with new ideas for coverage tend to go to suggest things, so it's going to be less work long term. Like I said yesterday, we do need to get the coverage for things like KASAN turned back on.
On Tue, Oct 17, 2017 at 11:26:36AM -0500, Dan Rue wrote:
On Tue, Oct 17, 2017 at 03:39:15PM +0000, Neil Williams wrote:
On 17 October 2017 at 16:08, Greg KH gregkh@google.com wrote:
On Tue, Oct 17, 2017 at 09:02:18AM -0500, Dan Rue wrote:
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
Please read: http://blog.ploeh.dk/2015/11/16/code-coverage-is-a-useless-target-measure/
I worked with a team of developers over a decade ago trying to help with code-coverage analysis of the Linux kernel (many of those tests ended up in LTP). I'm pretty sure the ability is still there, but it turned out, in the end, that it means nothing at all.
Heck, even when you turn on fun things like "fail kmalloc() X% of the time to exercise error paths", you still don't really test the overall system.
So please, never think in terms of code coverage, but feature coverage, like what LTP is trying to accomplish, is a great metric to strive for.
<snip> What we will also need is a map of which tests are stressing which features - so a sane metric for feature coverage inside and outside the kernel would be needed here.
++ ^^
From the blog post:
Some people use it to find areas where coverage is weak. There may be good reasons that some parts of a code base are sparsely covered by tests, but doing a manual inspection once in a while is a good idea. Perhaps you find that all is good, but you may also discover that a quality effort is overdue.
The problem that I have is that I don't know where coverage is strong, or where it is weak. Before last week, if someone suggested adding a 'dhclient' test, I would have told them it is redundant. Now, I know that dhclient actually uses a different code path than both init and udhcpc. The only way I know to measure feature coverage is to look at the LTP tests that we're running, and which we're not, but that is a secondary measure.
Do you have a good suggestion for evaluating feature coverage? I don't disagree with your feedback, but it would be good to have some shared perspective on coverage analysis so that we can improve it strategically rather than based on gut feelings, or as a reaction to uncaught problems.
Start with LTP, we _know_ that is a good first step, combined with kselftests, to implement the basics like syscall functionality.
I also agree with Mark's response that my coverage suggestion is premature. This whole thread is premature. But it's also premature to bring in additional test suites at this time. Have to stabilize and expand on what we have, namely LTP.
Yes, it is premature, and you already have a long list of tests to add to the system after LTP is finally integrated (i.e. the list of 0-day tests). Only after you have that implemented should we start looking around for adding new tests (I have a list somewhere, but don't want to overwhelm you just yet...)
thanks,
greg k-h
On Wed, Oct 18, 2017 at 11:49:13AM +0200, Greg KH wrote:
Yes, it is premature, and you already have a long list of tests to add to the system after LTP is finally integrated (i.e. the list of 0-day tests). Only after you have that implemented should we start looking around for adding new tests (I have a list somewhere, but don't want to overwhelm you just yet...)
I've got a list as well that I keep sharing every time this gets asked (well, mostly reeling off the top of my head since it's easy enough to come up with a long enough list).
On Tue, Oct 17, 2017 at 09:02:18AM -0500, Dan Rue wrote:
I think we just need to iterate on the framework we have until we're stable for a period of time. We are presently running many tests that, once trusted, will find regressions that nobody else notices in a timely manner. My biggest concern is trust - our results need to be rock solid and stable, so that they are trusted and so that people jump when there is a reported regression. Currently, that is not the case.
Well, we also need the reporting quality to be good (I know Milosz and Antonio are working on this) and to be directing the reports outwards so that other people trust them too (once the results are stable). That helps enormously with getting people to pay attention when issues are found, and will hopefully also help motivate people to work more on testsuites.
What I would like to see, and I don't know if it is even possible, is something that actually measures test coverage based on code paths in the linux kernel so that we have a means to actually measure our effectiveness. If we knew we were testing 43% (number pulled out of thin air) of the linux kernel code paths, then we would know what areas to focus on to bring that number up, and we would know which subsystems to have some confidence in, and which are uncovered.
That's come up before. I personally feel that collecting and trying to optimize coverage numbers is really premature here and is likely to be a distraction when we inevitably sign ourselves up for metrics based targets. It's not like it is a struggle for us to identify areas where we could usefully add coverage, nor is it likely to be so for quite a while, but I have seen testing efforts failing to deliver value while showing great metrics (eg, by adding tests for things that are easy to test but rarely break so don't really help people find defects).
Instead I think we should focus on two directions for expanding coverage. One is bringing in existing testsuites. That's obviously less development cost for us and has the additional advantage of bringing the testing community more together - we can learn from other people working in the area, they feel more appreciated and it all helps push collaboration on best practices. The other direction I see as likely to bring good results is to look at where current activity that could be supported by automated testing is. That's a combination of looking at areas where people frequently report problems and looking at the things that are most actively developed (with the angle on stable that'd be areas that get the most stable backports for example).
Right now all it takes is momentary thought to find areas where we're lacking coverage so it seems much more interesting to try to prioritize where we're going to get most value from efforts to improve coverage rather than go hunting for them. As coverage improves it's going to start to become more and more useful to bring things like coverage metrics in.