Hi,
we found that the selftest timer/adjtick fails quite frequently on arm64 (tested on a Renesas board and in qemu). By bisecting the kernel, I found that it stopped failing after commit 78b98e3c5a66 ("timekeeping/ntp: Determine the multiplier directly from NTP tick length"). Should this patch be applied to 4.14? Is that even possible, or could it break something else?
Thanks, Joerg
On Wed, Feb 10, 2021 at 01:43:10PM +0100, Joerg Vehlow wrote:
> Hi,
> we found that the selftest timer/adjtick fails quite frequently on arm64 (tested on a Renesas board and in qemu). By bisecting the kernel, I found that it stopped failing after commit 78b98e3c5a66 ("timekeeping/ntp: Determine the multiplier directly from NTP tick length"). Should this patch be applied to 4.14? Is that even possible, or could it break something else?
Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.
But, why not just use 4.19 or newer on that system?
thanks,
greg k-h
Hi Greg,
On 2/10/2021 2:00 PM, Greg KH wrote:
> Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.
It can be applied without any changes and fixes the problem. But since I don't know this subsystem well, I can't tell whether it breaks anything or whether other patches need to be applied first. Maybe the authors of the patch can check this easily or already know. That's why I added them to the initial mail.
> But, why not just use 4.19 or newer on that system?
Why does an LTS version of 4.14 exist? Because the customer demands it :) If the failing test were not one of the kernel selftests, I wouldn't bother you with this...
Joerg
On Wed, Feb 10, 2021 at 02:07:21PM +0100, Joerg Vehlow wrote:
> On 2/10/2021 2:00 PM, Greg KH wrote:
> > Have you tried applying it to that tree to see if it solves your problem and works properly? If so, please feel free to provide a working backported copy, with your signed-off-by and we can consider it.
> It can be applied without any changes and fixes the problem. But since I don't know this subsystem well, I can't tell whether it breaks anything or whether other patches need to be applied first. Maybe the authors of the patch can check this easily or already know. That's why I added them to the initial mail.
That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.
c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")
My suggestion for a fix would be to increase the limit in the failing test.
I have tested adjtick on an arm64 juno-r2 device and it passed. Here is the test output on Linux version 4.14.221-rc1.
+ ./adjtick
Each iteration takes about 15 seconds
Estimating tick (act: 9000 usec, -100000 ppm): 9000 usec, -100000 ppm [OK]
Estimating tick (act: 9250 usec, -75000 ppm): 9250 usec, -75000 ppm [OK]
Estimating tick (act: 9500 usec, -50000 ppm): 9500 usec, -50000 ppm [OK]
Estimating tick (act: 9750 usec, -25000 ppm): 9750 usec, -25001 ppm [OK]
Estimating tick (act: 10000 usec, 0 ppm): 10000 usec, 0 ppm [OK]
Estimating tick (act: 10250 usec, 25000 ppm): 10249 usec, 24999 ppm [OK]
Estimating tick (act: 10500 usec, 50000 ppm): 10500 usec, 50000 ppm [OK]
Estimating tick (act: 10750 usec, 75000 ppm): 10750 usec, 75000 ppm [OK]
Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
1..0
Output link: https://lkft.validation.linaro.org/scheduler/job/2254102#L1255
- Naresh
Hi,
On 2/10/2021 7:59 PM, Naresh Kamboju wrote:
> I have tested adjtick on an arm64 juno-r2 device and it passed. Here is the test output on Linux version 4.14.221-rc1.
Interesting. Is this vanilla 4.14.221, or are there some o-e patches applied? I just tried again on qemu arm with 4.14.222 from the kernel.org stable tree and still get failures like the one below every time I try. The failing test step differs, but it always fails.
Each iteration takes about 15 seconds
Estimating tick (act: 9000 usec, -100000 ppm): 9000 usec, -100000 ppm [OK]
Estimating tick (act: 9250 usec, -75000 ppm): 9250 usec, -75001 ppm [OK]
Estimating tick (act: 9500 usec, -50000 ppm): 9501 usec, -49995 ppm [OK]
Estimating tick (act: 9750 usec, -25000 ppm): 9750 usec, -25003 ppm [OK]
Estimating tick (act: 10000 usec, 0 ppm): 9996 usec, -463 ppm [FAILED]
Bail out!
Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
1..0
Joerg
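The figures in the adjtick output above follow from simple integer arithmetic, and a small sketch may help when reading them. The Python below is only an illustration, not the selftest's code: it assumes a nominal tick of 10000 usec (USER_HZ = 100) and a 100 ppm tolerance, and all names in it are hypothetical. Neither assumed value is stated in this thread, although both are consistent with every output line quoted here.

```python
MILLION = 1_000_000
NOMINAL_TICK_USEC = 10_000  # assumption: USER_HZ = 100
TOLERANCE_PPM = 100         # assumption: error bound applied by the test


def trunc_div(a: int, b: int) -> int:
    """Integer division truncating toward zero (C semantics, unlike Python's //)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def expected_ppm(tick_usec: int) -> int:
    """Frequency offset requested by setting the tick to tick_usec."""
    return trunc_div(tick_usec * MILLION, NOMINAL_TICK_USEC) - MILLION


def estimated_tick(measured_ppm: int) -> int:
    """Tick length implied by a measured drift given in ppm."""
    return NOMINAL_TICK_USEC + trunc_div(NOMINAL_TICK_USEC * measured_ppm, MILLION)


def verdict(tick_usec: int, measured_ppm: int,
            tolerance_ppm: int = TOLERANCE_PPM) -> str:
    """Compare the measured drift against the requested one."""
    error = abs(measured_ppm - expected_ppm(tick_usec))
    return "OK" if error <= tolerance_ppm else "FAILED"
```

With these assumptions, `estimated_tick(-463)` gives 9996 and `verdict(10000, -463)` gives "FAILED", matching the failing line. Passing a larger `tolerance_ppm` (for example 500) turns that result into "OK", which is the effect of increasing the limit in the test.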
Hi Miroslav,
On 2/10/2021 2:19 PM, Miroslav Lichvar wrote:
> That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.
> c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
> aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
> d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")
> My suggestion for a fix would be to increase the limit in the failing test.
Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?
Joerg
On Thu, Feb 11, 2021 at 11:33:05AM +0100, Joerg Vehlow wrote:
> Hi Miroslav,
> On 2/10/2021 2:19 PM, Miroslav Lichvar wrote:
> > That patch cannot be applied alone. It would break the timekeeping in not so obvious ways as there will be unexpected sources of the NTP tracking error. IIRC, at least the following changes would need to be included with it. There may be others.
> > c2cda2a5bda9 ("timekeeping/ntp: Don't align NTP frequency adjustments to ticks")
> > aea3706cfc4d ("timekeeping: Remove CONFIG_GENERIC_TIME_VSYSCALL_OLD")
> > d4d1fc61eb38 ("ia64: Update fsyscall gettime to use modern vsyscall_update")
> > My suggestion for a fix would be to increase the limit in the failing test.
> Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?
Does it work on a real system? That's the proper test...
On Thu, Feb 11, 2021 at 11:33:05AM +0100, Joerg Vehlow wrote:
> > My suggestion for a fix would be to increase the limit in the failing test.
> Thanks, that's what I expected. But I still wonder why the test fails almost 100% of the time for me on qemu-arm64 (running on x86). Is this a regression in 4.14 that was working at some point, or was it never tested on arm?
I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock has been there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.
Hi Miroslav,
On 2/11/2021 11:59 AM, Miroslav Lichvar wrote:
> I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock has been there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.
Thank you for that explanation. I created some background load (copying from /dev/urandom to /dev/null) and ran the test. This made the test pass every time.
Jörg
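A background load of the kind described above can be sketched as follows. This is a minimal Python illustration, assuming a Linux system with /dev/urandom; the function name, chunk size, and duration are hypothetical and not taken from the thread.

```python
import os
import time


def burn_random(seconds: float, chunk_size: int = 1 << 16) -> int:
    """Copy /dev/urandom to the null device for about `seconds`.

    Returns the number of bytes copied. The point is only to keep the
    kernel busy so that the system is not idle while the test runs.
    """
    copied = 0
    deadline = time.monotonic() + seconds
    with open("/dev/urandom", "rb") as src, open(os.devnull, "wb") as dst:
        while time.monotonic() < deadline:
            copied += dst.write(src.read(chunk_size))
    return copied
```

Calling `burn_random(150)` in a second shell before starting ./adjtick would cover a run of eight iterations at about 15 seconds each; the duration is a guess and may need adjusting.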
Hi Miroslav,
On 2/11/2021 11:59 AM, Miroslav Lichvar wrote:
> I don't think it is specific to arm or that it is a regression. I think the virtual machine just happens to be too idle for the test. There may be unrelated changes, maybe in the kernel, qemu, or applications, that caused the rate of the clock updates to decrease so much that the instability now triggers the failure in the test. The issue with the clock has been there since NO_HZ was introduced, but it becomes more severe as the activity of the kernel decreases.
Thanks for the hint towards NO_HZ. Running the tests with some background load makes them pass reliably.
Jörg
linux-stable-mirror@lists.linaro.org