[4.14] Failing selftest timer/adjtick


Joerg Vehlow

10 Feb 2021

12:43 p.m.

Hi,

we found that the selftest timer/adjtick fails quite frequently on arm64 (tested on some Renesas board and in qemu). By bisecting the kernel I found that it stopped failing after commit 78b98e3c5a66 ("timekeeping/ntp: Determine the multiplier directly from NTP tick length"). Should this patch be applied to 4.14? Is that even possible, or could it break something else?

Thanks, Joerg


Greg KH

10 Feb

1 p.m.

On Wed, Feb 10, 2021 at 01:43:10PM +0100, Joerg Vehlow wrote:

> Hi,
>
> we found that the selftest timer/adjtick fails quite frequently on arm64 (tested on some Renesas board and in qemu). By bisecting the kernel I found that it stopped failing after commit 78b98e3c5a66 ("timekeeping/ntp: Determine the multiplier directly from NTP tick length"). Should this patch be applied to 4.14? Is that even possible, or could it break something else?

Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.

But, why not just use 4.19 or newer on that system?

thanks,

greg k-h

Joerg Vehlow

1:07 p.m.

Hi Greg,

On 2/10/2021 2:00 PM, Greg KH wrote:

> Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.

It can be applied without any changes and fixes the problem, but since I don't know much about this subsystem, I can't tell whether it breaks anything or requires other patches to be applied first. Maybe the authors of the patch can check this easily, or already know. That's why I added them to the initial mail.

> But, why not just use 4.19 or newer on that system?

Why does an LTS version of 4.14 exist? Because the customer demands it :) If the failing test were not one of the kernel selftests, I wouldn't bother you with this...

Joerg

Miroslav Lichvar

1:19 p.m.

On Wed, Feb 10, 2021 at 02:07:21PM +0100, Joerg Vehlow wrote:

> On 2/10/2021 2:00 PM, Greg KH wrote:
>> Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.
>
> It can be applied without any changes and fixes the problem, but since I don't know much about this subsystem, I can't tell whether it breaks anything or requires other patches to be applied first. Maybe the authors of the patch can check this easily, or already know. That's why I added them to the initial mail.

That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.

c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")

My suggestion for a fix would be to increase the limit in the failing test.

-- Miroslav Lichvar
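
A quick way to see which of the commits named above are already present in a given tree is to compare subject lines, since stable backports get new hashes. The following is only a sketch: the function names and the `repo`/`rev` parameters are hypothetical, and the list covers only the commits mentioned in this thread ("There may be others").

```python
import subprocess

# Upstream IDs and subjects exactly as quoted in this thread.
PREREQUISITES = {
    "c2cda2a5bda9": "timekeeping/ntp: Don't align NTP frequency adjustments to ticks",
    "aea3706cfc4d": "timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD",
    "d4d1fc61eb38": "ia64: Update fsyscall gettime to use modern vsyscall_update",
    "78b98e3c5a66": "timekeeping/ntp: Determine the multiplier directly from NTP tick length",
}


def missing_prerequisites(log_subjects):
    """Return the upstream IDs whose subject line does not occur in log_subjects."""
    present = {subject.strip() for subject in log_subjects}
    return [sha for sha, subject in PREREQUISITES.items() if subject not in present]


def branch_subjects(repo, rev):
    """List commit subjects of `rev` in the local checkout `repo` (both hypothetical inputs)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%s", rev],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

For example, `missing_prerequisites(branch_subjects("/path/to/linux", "linux-4.14.y"))` would list what still needs backporting, assuming such a checkout and branch exist locally.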

Naresh Kamboju

6:59 p.m.

I have tested adjtick on an arm64 juno-r2 device and it passed. Here is the test output on Linux version 4.14.221-rc1.

+ ./adjtick
Each iteration takes about 15 seconds
Estimating tick (act: 9000 usec, -100000 ppm): 9000 usec, -100000 ppm [OK]
Estimating tick (act: 9250 usec, -75000 ppm): 9250 usec, -75000 ppm [OK]
Estimating tick (act: 9500 usec, -50000 ppm): 9500 usec, -50000 ppm [OK]
Estimating tick (act: 9750 usec, -25000 ppm): 9750 usec, -25001 ppm [OK]
Estimating tick (act: 10000 usec, 0 ppm): 10000 usec, 0 ppm [OK]
Estimating tick (act: 10250 usec, 25000 ppm): 10249 usec, 24999 ppm [OK]
Estimating tick (act: 10500 usec, 50000 ppm): 10500 usec, 50000 ppm [OK]
Estimating tick (act: 10750 usec, 75000 ppm): 10750 usec, 75000 ppm [OK]
Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
1..0

Output link: https://lkft.validation.linaro.org/scheduler/job/2254102#L1255

- Naresh
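
For readers unfamiliar with the output format: the "act" tick length and the ppm figure in each line appear to be two views of the same offset from a nominal 10000 usec tick. The sketch below reproduces the numbers in the log under that assumption. The helper names are hypothetical, and the truncating division is inferred from the "10249 usec, 24999 ppm" line rather than taken from the test's source.

```python
NOMINAL_TICK_USEC = 10_000  # assumption: matches the "10000 usec, 0 ppm" line


def tick_to_ppm(tick_usec, nominal=NOMINAL_TICK_USEC):
    """Frequency offset, in parts per million, implied by a tick length."""
    return (tick_usec - nominal) * 1_000_000 // nominal


def ppm_to_tick(ppm, nominal=NOMINAL_TICK_USEC):
    """Tick length implied by a ppm offset, truncating toward zero like C integer division."""
    scaled = nominal * ppm
    quotient = abs(scaled) // 1_000_000
    return nominal + (quotient if scaled >= 0 else -quotient)
```

With these definitions, `tick_to_ppm(9250)` gives -75000 and `ppm_to_tick(24999)` gives 10249, which is why a 1 ppm estimation error shows up as a 1 usec difference in that line.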

Joerg Vehlow

11 Feb

10:34 a.m.

Hi,

On 2/10/2021 7:59 PM, Naresh Kamboju wrote:

> I have tested adjtick on an arm64 juno-r2 device and it passed. Here is the test output on Linux version 4.14.221-rc1.

Interesting. Is this vanilla 4.14.221 or are there some o-e patches applied? I just tried again on qemu arm with 4.14.222 from the kernel.org stable tree and still get failures like the one below every time I try. The failing test step differs, but it always fails.

Each iteration takes about 15 seconds
Estimating tick (act: 9000 usec, -100000 ppm): 9000 usec, -100000 ppm [OK]
Estimating tick (act: 9250 usec, -75000 ppm): 9250 usec, -75001 ppm [OK]
Estimating tick (act: 9500 usec, -50000 ppm): 9501 usec, -49995 ppm [OK]
Estimating tick (act: 9750 usec, -25000 ppm): 9750 usec, -25003 ppm [OK]
Estimating tick (act: 10000 usec, 0 ppm): 9996 usec, -463 ppm [FAILED]
Bail out!
Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
1..0

Joerg
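
To make the earlier suggestion of raising the test's limit concrete, the sketch below replays the requested and estimated ppm values from the failing run against a configurable tolerance. The thread does not state the limit the selftest actually uses, so the values 10 and 500 in the usage note are purely hypothetical, as are the function names.

```python
# (requested ppm, estimated ppm) pairs copied from the failing run in this message.
FAILING_RUN = [
    (-100000, -100000),
    (-75000, -75001),
    (-50000, -49995),
    (-25000, -25003),
    (0, -463),
]


def within_limit(requested_ppm, estimated_ppm, limit_ppm):
    """True if the estimate deviates from the request by at most limit_ppm."""
    return abs(estimated_ppm - requested_ppm) <= limit_ppm


def first_failure(run, limit_ppm):
    """Return the first (requested, estimated) pair outside the limit, or None."""
    for requested, estimated in run:
        if not within_limit(requested, estimated, limit_ppm):
            return requested, estimated
    return None
```

With a hypothetical limit of 10 ppm, `first_failure(FAILING_RUN, 10)` returns `(0, -463)`, the step that failed above; with a hypothetical limit of 500 ppm it returns `None`. A wider limit hides the symptom but also makes the test less sensitive to real regressions.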

Joerg Vehlow

10:33 a.m.

Hi Miroslav,

On 2/10/2021 2:19 PM, Miroslav Lichvar wrote:

> That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.
>
> c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
> aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
> d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")
>
> My suggestion for a fix would be to increase the limit in the failing test.

Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?

Joerg

Greg KH

10:45 a.m.

On Thu, Feb 11, 2021 at 11:33:05AM +0100, Joerg Vehlow wrote:

> Hi Miroslav,
>
> On 2/10/2021 2:19 PM, Miroslav Lichvar wrote:
>> That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.
>>
>> c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
>> aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
>> d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")
>>
>> My suggestion for a fix would be to increase the limit in the failing test.
>
> Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?

Does it work on a real system? That's the proper test...

Miroslav Lichvar

10:59 a.m.

On Thu, Feb 11, 2021 at 11:33:05AM +0100, Joerg Vehlow wrote:

>> My suggestion for a fix would be to increase the limit in the failing test.
>
> Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?

I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock was there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.

-- Miroslav Lichvar
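
One way to observe the kind of instability described above is to compare how far an NTP-adjusted clock and an unadjusted clock advance over the same interval. The sketch below does this with `CLOCK_MONOTONIC` and `CLOCK_MONOTONIC_RAW` on Linux. It assumes, without confirmation from this thread, that this resembles what the selftest measures; the function names and the default interval are hypothetical.

```python
import time


def ppm_offset(adjusted_delta_ns, raw_delta_ns):
    """ppm by which the adjusted clock ran fast (+) or slow (-) relative to the raw clock."""
    return (adjusted_delta_ns - raw_delta_ns) * 1_000_000 / raw_delta_ns


def sample_offset(interval_s=15.0):
    """Measure the offset over one interval (Linux only: needs CLOCK_MONOTONIC_RAW)."""
    adjusted_start = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    raw_start = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)
    time.sleep(interval_s)
    adjusted_end = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    raw_end = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)
    return ppm_offset(adjusted_end - adjusted_start, raw_end - raw_start)
```

Calling `sample_offset()` repeatedly on an idle guest and again under load would show whether the spread of the estimates shrinks as kernel activity increases. The two clock reads at each end are not atomic, so a few ppm of noise from the measurement itself is to be expected with short intervals.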

Joerg Vehlow

18 Feb

7:05 a.m.

Hi Miroslav,

On 2/11/2021 11:59 AM, Miroslav Lichvar wrote:

> I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock was there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.

Thank you for that explanation. I did create some background load (copy from urandom to null) and ran the test. This made the test pass every time.

Jörg
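
The workaround described in this message can be scripted so that the load starts before the test and is always cleaned up afterwards. This is a minimal sketch with a hypothetical helper name. The exact command used for the background load is not given in the thread; `dd if=/dev/urandom of=/dev/null` in the usage note is an assumption based on "copy from urandom to null".

```python
import subprocess


def run_with_load(test_cmd, load_cmd):
    """Run test_cmd while load_cmd runs in the background; return the test's exit code."""
    load = subprocess.Popen(
        load_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return subprocess.run(test_cmd).returncode
    finally:
        # Stop the background load whether the test passed, failed, or raised.
        load.terminate()
        load.wait()
```

For example, `run_with_load(["./adjtick"], ["dd", "if=/dev/urandom", "of=/dev/null"])` would run the selftest binary from the current directory under load. Note that this keeps the guest busy rather than fixing the underlying clock behaviour.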

Joerg Vehlow

1 Mar

7:04 a.m.

Hi Miroslav,

On 2/11/2021 11:59 AM, Miroslav Lichvar wrote:

> I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock was there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.

Thanks for the hint towards NO_HZ. Running the tests with some background load makes them pass reliably.

Jörg

linux-stable-mirror@lists.linaro.org
participants (4)

Greg KH
Joerg Vehlow
Miroslav Lichvar
Naresh Kamboju