Kernel folks,
Sandia is setting up their new Astra cluster[1] based on ThunderX2, and they
found an issue in Red Hat's kernel (4.14, I know, not upstream) related to
Mellanox drivers. Coincidentally, Huawei also found the same problem (see
the conversation below).
The attached patch is Sandia's attempt to solve the problem, but it
apparently only mitigates the issue rather than fixing it. Here's what they
said about it:
"The only difference compared to stock rhel7.6 is the following patch.
This mitigates the issue, but does not fix the root cause. The original
bug is rare, requiring, for example, repeated HPL runs on 288 nodes to
trigger it on ~4 nodes overnight. When the bug hits, one CPU gets stuck
100% in kworker. This is because the rht_shrink_below_30() call in
rht_deferred_worker() returns -EEXIST indefinitely, causing the work to be
requeued at the end of rht_deferred_worker() (i.e., the deferred work in
rht_deferred_worker() fails so it requeues itself to try again later, only
it always fails later, hence the infinite loop)."
"I suspect there is a subtle race condition in the Linux rhashtable code
and/or RCU code on aarch64, perhaps due to memory consistency model
differences compared to x86. It may be fixed in kernel.org mainline, as
there have been a lot of changes compared to what's in the rhel 7.6 kernel."
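For reference, the deferred worker they're describing looks roughly like
this in 4.14-era lib/rhashtable.c (paraphrased from my reading of the
upstream code of that vintage, not the exact RHEL source):

static void rht_deferred_worker(struct work_struct *work)
{
	struct rhashtable *ht;
	struct bucket_table *tbl;
	int err = 0;

	ht = container_of(work, struct rhashtable, run_work);
	mutex_lock(&ht->mutex);

	tbl = rht_dereference(ht->tbl, ht);
	tbl = rhashtable_last_table(ht, tbl);

	/* Grow, shrink or rehash as needed; these can fail, e.g. with
	 * -EEXIST when another rehash is already in flight. */
	if (rht_grow_above_75(ht, tbl))
		err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
	else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
		err = rhashtable_shrink(ht);
	else if (tbl->nest)
		err = rhashtable_rehash_alloc(ht, tbl, tbl->size);

	if (!err)
		err = rhashtable_rehash_table(ht);

	mutex_unlock(&ht->mutex);

	/* Any leftover error requeues the work; if the error never clears,
	 * this is the loop that pins one kworker at 100%. */
	if (err)
		schedule_work(&ht->run_work);
}

If I'm reading that code right, rhashtable_rehash_alloc() fails with -EEXIST
when a rehash is already pending, and rhashtable_rehash_table() returns
-EAGAIN while a further table is still queued, which would line up with both
error codes mentioned in this thread.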
I have pointed them to the ERP kernel, which works well with the ThunderX2s
in our lab, and we'll see how that goes, but I just wanted to make sure there
isn't some known issue around rhashtable, networking, or InfiniBand.
cheers,
--renato
[1]
https://share-ng.sandia.gov/news/resources/news_releases/arm_supercomputer/
---------- Forwarded message ---------
From: Pedretti, Kevin T <ktpedre(a)sandia.gov>
Date: Fri, 11 Jan 2019 at 19:16
Subject: Re: [EXTERNAL] Re: [Linaro Collaborate] HPC SIG > Weekly Sync
Minutes
To: Pak Lui <pak.lui(a)linaro.org>
Cc: Renato Golin <renato.golin(a)linaro.org>
Yes, this is exactly what we saw as well. Our workaround was to return
early from rht_deferred_worker() on one specific error, -EAGAIN: if the err
is -EAGAIN, we just don't requeue the work. It seems that once -EAGAIN is
returned, it will be returned forever, causing the infinite requeueing
loop. This likely leaks something, but it works around the issue well
enough for our purposes.
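That is, roughly the following change at the tail of rht_deferred_worker()
(a sketch of the idea against the 4.14-era code above, not the exact patch):

	mutex_unlock(&ht->mutex);

	/*
	 * Workaround sketch: if the deferred work failed with -EAGAIN, drop
	 * it instead of requeueing. This may leave a pending rehash behind,
	 * but it breaks the requeue loop that pins a kworker at 100%.
	 */
	if (err && err != -EAGAIN)
		schedule_work(&ht->run_work);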
Kevin
From: Pak Lui <pak.lui(a)linaro.org>
Date: Friday, January 11, 2019 at 12:11 PM
To: Kevin Pedretti <ktpedre(a)sandia.gov>
Cc: Renato Golin <renato.golin(a)linaro.org>
Subject: Re: [EXTERNAL] Re: [Linaro Collaborate] HPC SIG > Weekly Sync
Minutes
I have to fly now, but here's what we see; I'll check email again later. Not
sure if the images show up correctly.
The following is the call trace: rht_deferred_worker() calls queue_work_on(),
and if rht_deferred_worker() gets an error, it keeps requeueing the work, so
the CPU is never released.
We suspect some scheduling problem between the kernel workqueue and the CPU.
[root@node0 tracing]# echo 0 > trace
[root@node0 tracing]# head trace
# tracer: function
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
    kworker/15:1-161   [015] .... 14053.007473: queue_work_on <-rht_deferred_worker
On Sat, Jan 12, 2019 at 2:55 AM Pedretti, Kevin T <ktpedre(a)sandia.gov>
wrote:
Hi Pak,
We saw this same issue with the RHEL7.5 kernel as well and were hoping that
RHEL7.6 fixed it. Eventually we bit the bullet and started adding printks
all over the place. The rhashtable it was hanging on was not Mellanox-related;
rather, it was almost always the built-in netlink rhashtable.
That's not to say it wasn't induced by MOFED HCOLL somehow, and that seems
plausible. Glad to hear you have a reproducer and are working with
Mellanox. Our reproducer takes an overnight run on lots of nodes to hit it.
Kevin