Re: [PATCH 6.6 000/122] 6.6.28-rc1 review

18 Apr 2024


On Tue, 16 Apr 2024 18:28:10 +0100,
Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Tue, Apr 16, 2024 at 02:22:07PM +0100, Marc Zyngier wrote:
> > On Tue, 16 Apr 2024 14:07:30 +0100,
> > Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > > On Tue, 16 Apr 2024 at 16:04, Mark Brown <broonie@kernel.org> wrote:
> > > > On Mon, Apr 15, 2024 at 04:19:25PM +0200, Greg Kroah-Hartman wrote:
> > > > > This is the start of the stable review cycle for the 6.6.28 release.
> > > > > There are 122 patches in this series, all will be posted as a response
> > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > let me know.
> > > >
> > > > The bisect of the boot issue that's affecting the FVP in v6.6 (only)
> > > > landed on c9ad150ed8dd988 (arm64: tlb: Fix TLBI RANGE operand),
> > > > e3ba51ab24fdd in mainline, as being the first bad commit - it's also in
> > > > the -rc for v6.8 but that seems fine.  I've done no investigation beyond
> > > > the bisect and looking at the commit log to pull out people to CC and
> > > > note that the fix was explicitly targeted at v6.6.
> > >
> > > Anders investigated the reported issues, bisected, and also found
> > > that the missing commit for stable-rc 6.6 is
> > > e2768b798a19 ("arm64/mm: Modify range-based tlbi to decrement scale")
> >
> > Which is definitely *not* a stable candidate. We need to understand why
> > the invalidation goes south when the scale goes up instead of down.
>
> If you backport e3ba51ab24fd ("arm64: tlb: Fix TLBI RANGE operand")
> which fixes 117940aa6e5f ("KVM: arm64: Define
> kvm_tlb_flush_vmid_range()") but without the newer e2768b798a19
> ("arm64/mm: Modify range-based tlbi to decrement scale"), it looks like
> "scale" in __flush_tlb_range_op() goes out of range to 4. Tested on my
> CBMC model, not on the actual kernel. It may be worth adding some
> WARN_ONs in __flush_tlb_range_op() if scale is outside the 0..3 range or
> num greater than 31.
>
> I haven't investigated properly (and I'm off tomorrow, back on Thu) but
> it's likely the original code was not very friendly to the maximum
> range, never tested. Anyway, if one figures out why it goes out of
> range, I think the solution is to also backport e2768b798a19 to stable.
I looked into this, and I came to the conclusion that this patch is
pretty much incompatible with the increasing scale (even if you cap
num to 30).

The number of pages to invalidate is a 20 bit quantity, a 5 bit slice
per scale. With the 6.6 approach (limit of num=30 and increasing
scale), we invalidate each 5 bit slice independently. After each
scale round, the corresponding slice is guaranteed to be 0.

With the 6.9 method, we invalidate the maximum possible for a given
scale. With a decreasing scale, we converge towards 0 or 1 on each
round.  With an increasing scale, this breaks spectacularly, because
the strong guarantee that the remaining page count is "aligned" to
2^(5*scale+1) is not valid anymore (the low bits may not be 0).

As a result, we don't converge because we never consider these low
bits anymore, the page count doesn't decrease, scale goes past 3, and
everything catches fire.
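To make the convergence argument concrete, here is a minimal userspace sketch of the loop, not the kernel code itself. It assumes a stride of one page and that range invalidation is always available, and it only tracks the page count instead of issuing any TLBI. The names range_pages(), num_masked(), num_clamped() and simulate() are made up for illustration; the single-page conditions (odd count when scale increases, count of exactly 1 when it decreases) reflect my reading of the two loop variants.

```c
/* Pages covered by one range operation: (num + 1) * 2^(5*scale + 1). */
static unsigned long range_pages(int num, int scale)
{
	return (unsigned long)(num + 1) << (5 * scale + 1);
}

/* 6.6 style: use only the 5-bit slice for this scale, so num <= 30. */
static int num_masked(unsigned long pages, int scale)
{
	return (int)((pages >> (5 * scale + 1)) & 0x1f) - 1;
}

/* Newer style: cover as much as this scale allows, so num <= 31. */
static int num_clamped(unsigned long pages, int scale)
{
	unsigned long max = range_pages(31, scale);
	unsigned long p = pages < max ? pages : max;

	return (int)(p >> (5 * scale + 1)) - 1;
}

/*
 * Model of the invalidation loop. Returns the number of operations
 * issued, or -1 if scale leaves 0..3 while pages are still pending.
 */
static int simulate(unsigned long pages,
		    int (*get_num)(unsigned long, int), int decreasing)
{
	int scale = decreasing ? 3 : 0;
	int ops = 0;
	int num;

	while (pages > 0) {
		/* Single-page invalidation. */
		if (decreasing ? pages == 1 : pages % 2 == 1) {
			pages -= 1;
			ops++;
			continue;
		}

		/* No valid scale left but pages remain: no convergence. */
		if (scale < 0 || scale > 3)
			return -1;

		num = get_num(pages, scale);
		if (num >= 0) {
			pages -= range_pages(num, scale);
			ops++;
		}
		scale += decreasing ? -1 : 1;
	}

	return ops;
}
```

With 66 pages, simulate(66, num_masked, 0) and simulate(66, num_clamped, 1) both finish in 2 operations. simulate(66, num_clamped, 0) returns -1: scale 0 takes 64 pages, the 2 left over are below what scales 1 to 3 can express, and scale reaches 4.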
So despite my earlier comment, it looks like picking e2768b798a19 is
the right thing to do *if* we're taking e3ba51ab24fd into 6.6-stable.
Otherwise, we need a separate fix, which Ryan was initially advocating
for.

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.
