On 4/27/20 9:01 AM, Sven Eckelmann wrote:
On Monday, 27 April 2020 01:14:26 CEST Sasha Levin wrote:
On Sun, Apr 26, 2020 at 09:06:17AM +0200, Sven Eckelmann wrote:
From: Vlastimil Babka vbabka@suse.cz
commit 0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5 upstream.
[...]
The original problem is explained in the patch description as a performance problem. And maybe this could also be one reason why it was never submitted for a stable kernel.
But tests on mips ath79 (OpenWrt ar71xx target) showed that it is most likely related to "random" data bus errors. At least applying this patch seemed to have solved it for Matthias Schiffer mschiffer@universe-factory.net and some other people who were debugging/testing this problem with him.
More details about it can be found in https://github.com/freifunk-gluon/gluon/issues/1982
Hmm, doesn't explain much how the fix was eventually found, but nevermind, good job.
Interesting... I wonder why this issue has started only now.
Unfortunately, I don't know the details. So I (actually we) would love to get some feedback from the slub experts. We want to rule out that there is another problem which we just don't grasp yet.
I think the prefetch may go to an address that would cause a real fetch to page fault. Under normal circumstances that could only be the NULL pointer that terminates a freelist, otherwise the address should be valid.
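To make the point about the terminating NULL pointer concrete, here is a minimal userspace sketch of a freelist pop that prefetches the next object's free pointer slot. It is not the kernel's SLUB code: struct toy_cache, toy_init() and toy_alloc() are hypothetical names, and __builtin_prefetch() (a GCC/Clang builtin) merely stands in for the kernel's prefetch(). The NULL guard is a choice of this sketch to keep the pointer arithmetic well defined in plain C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab cache; not the kernel's struct kmem_cache. */
struct toy_cache {
    void *freelist;  /* first free object, or NULL when the list is empty */
    size_t offset;   /* byte offset of the "next free" pointer inside an object */
};

/* Read the "next free" pointer stored inside a free object. */
static void *get_freepointer(const struct toy_cache *c, void *object)
{
    return *(void **)((char *)object + c->offset);
}

/* Store the "next free" pointer inside a free object. */
static void set_freepointer(const struct toy_cache *c, void *object, void *next)
{
    *(void **)((char *)object + c->offset) = next;
}

/* Thread `count` objects of `obj_size` bytes in `buf` into a NULL-terminated list. */
static void toy_init(struct toy_cache *c, void *buf, size_t obj_size,
                     size_t count, size_t offset)
{
    assert(count > 0 && offset + sizeof(void *) <= obj_size);
    c->freelist = buf;
    c->offset = offset;
    for (size_t i = 0; i < count; i++) {
        char *obj = (char *)buf + i * obj_size;
        void *next = (i + 1 < count) ? (void *)(obj + obj_size) : NULL;
        set_freepointer(c, obj, next);
    }
}

/* Pop the first free object and hint the CPU about the next one. */
static void *toy_alloc(struct toy_cache *c)
{
    void *object = c->freelist;
    void *next;

    if (!object)
        return NULL;

    next = get_freepointer(c, object);
    c->freelist = next;

    /*
     * Prefetch the slot that holds the next object's free pointer, because
     * the following allocation will read it. `next` is NULL at the end of
     * the list; this sketch skips the hint in that case. Without the guard
     * the hint would target the address `offset`, which a real load could
     * not access.
     */
    if (next)
        __builtin_prefetch((char *)next + c->offset);

    return object;
}
```

With three objects in the pool, three calls to toy_alloc() return them in address order and the fourth returns NULL. A prefetch is only a hint and is normally expected not to fault even on a bad address; the question in this thread is whether that holds on the affected hardware.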
So that could mean:
1) prefetch() on mips is implemented/compiled wrong?
2) the CPU really has issues with prefetch causing a page fault
3) the prefetch gets reordered between LL/SC and there's some bug similar to this one described in arch/mips/include/asm/sync.h:
/*
 * Some Loongson 3 CPUs have a bug wherein execution of a memory access (load,
 * store or prefetch) in between an LL & SC can cause the SC instruction to
 * erroneously succeed, breaking atomicity. Whilst it's unusual to write code
 * containing such sequences, this bug bites harder than we might otherwise
 * expect due to reordering & speculation:
Just some background information about the "why" from freifunk-gluon's perspective:
OpenWrt 19.07 was released (despite its name) at the beginning of 2020. And it was the first release using kernel 4.14 on the most used target: ar71xx (ath79). The wireless community network firmware projects (freifunk-gluon in this example) updated their frameworks to this OpenWrt release in recent months and have only now started to roll it out on their networks.
And while the wireless community networks around here usually don't track the connected clients, the health of the APs is often tracked on some central system. And some people then just noticed a sudden spike of reboots on their APs. Since ar71xx is (often) the most used architecture at the moment, this could be spotted rather easily if you spend some time looking at graphs.
Kind regards, Sven