-----Original Message----- From: NĂcolas F. R. A. Prado nfraprado@collabora.com On Thu, Jul 11, 2024 at 01:53:37PM -0600, Shuah Khan wrote:
On 7/10/24 15:49, Shuah Khan wrote:
On 7/10/24 07:11, Greg Kroah-Hartman wrote:
On Fri, Jul 05, 2024 at 07:29:53PM -0400, NĂcolas F. R. A. Prado wrote:
Log errors are the most widely used mechanism for reporting issues in the kernel. When an error is logged using the device helpers, eg dev_err(), it gets metadata attached that identifies the subsystem and device where the message is coming from. This series makes use of that metadata in a new test to report which devices logged errors.
The first two patches move a test and a helper script to keep things organized before this new test is added in the third patch.
It is expected that there might be many false-positive error messages throughout the drivers code which will be reported by this test. By having this test in the first place and working through the results we can address those occurrences by adjusting the loglevel of the messages that turn out to not be real errors that require the user's attention. It will also motivate additional error messages to be introduced in the code to detect real errors where they turn out to be missing, since it will be possible to detect said issues automatically.
As an example, below you can see the test result for mt8192-asurada-spherion. The single standing issue has been investigated and will be addressed in an EC firmware update [1]:
TAP version 13 1..1 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `model_name' property: -6 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `energy_full_design' property: -6 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 power_supply sbs-8-000b: driver failed to report `time_to_empty_now' property: -5 not ok 1 +power_supply:sbs-8-000b
Sorry this feedback is coming late.
The test description only includes the device identifier, and not a description of the testcase. (See below for why this might be the case).
Maybe a better testcase description might be: "check that power_supply:sbs-8-000b device has no error messages"
I note that there are 3 different error messages, but only one testcase report generated. It would be a lot of work (probably too much), for this test to interpret the error messages and convert them into testcase language. But it might be worthwhile to report a test result for each type of error message generated.
Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0
There are a few interesting issues that this 'test' raises.
It is not so much a test itself, as a way to convert an error log report into KTAP output.
The 'test' has no knowledge of what the testcases are, that are being reported. That is, the actual test has occurred at runtime, inside the driver or device manager code, which produced the error string. This 'test' just reports that the device reported an error.
As currently structured, there is no way for this to report an 'ok' testcase. I talked with Nicolas on a KernelCI conference call, it is apparently possible to try to find all possible devices on a system, and report them as 'ok'. But this would produce a LOT of noise, as all the devices that did not have errors would be reported as 'ok'. This level of verbosity is probably not desirable. (It's similar to how a compiler does not report the success of every test it conducts on the source code, as it is being compiled. Rather a compiler only outputs errors and warnings for test failures.)
In terms of tracking regressions, it would be useful to identify the different things that could cause an error message to appear or disappear. For example, some errors related to hardware might change due to flakiness of the hardware. Some errors might be introduced or go away based on a change in the kernel configuration. This is one case, where have a long list of "ok" lines would be useful - to disambiguate drivers that were present, but didn't have an error, from drivers that were not present at all.
I think this test is useful, and am not NAK-ing it. I'm just giving some feedback on the general approach, and hopefully raising some issues to consider, to make the test more useful.
-- Tim