On Fri, May 31, 2024 at 1:31 PM Yu Zhao <yuzhao@google.com> wrote:
On Fri, May 31, 2024 at 1:24 AM Oliver Upton <oliver.upton@linux.dev> wrote:
On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
On Wed, May 29, 2024 at 12:05 PM James Houghton <jthoughton@google.com> wrote:
Secondary MMUs are currently consulted for access/age information only at eviction time; before then, we don't get accurate age information for them. That is, pages that are mostly accessed through a secondary MMU (like guest memory, used by KVM) simply proceed down to the oldest generation, and then at eviction time, if KVM reports the page to be young, the page is activated/promoted back to the youngest generation.
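To make that concrete, here is a toy model in plain C of the behavior described above (this is not kernel code; struct folio_model, age(), and try_evict() are made up for illustration): without secondary-MMU participation in aging, a guest-mapped page drifts to the oldest generation, and only the eviction-time young check can rescue it.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_GENS 4

struct folio_model {
	int gen;			/* 0 = youngest, MAX_NR_GENS - 1 = oldest */
	bool secondary_mmu_young;	/* e.g. the guest touched it through KVM's stage-2 tables */
};

/* Aging that never consults the secondary MMU sees no guest accesses. */
static void age(struct folio_model *f)
{
	if (f->gen < MAX_NR_GENS - 1)
		f->gen++;
}

/* Eviction-time check: ask the secondary MMU and promote if young. */
static bool try_evict(struct folio_model *f)
{
	if (f->secondary_mmu_young) {
		f->secondary_mmu_young = false;
		f->gen = 0;	/* activated back to the youngest generation */
		return false;
	}
	return true;		/* actually reclaimed */
}

int main(void)
{
	struct folio_model f = { .gen = 0, .secondary_mmu_young = true };

	for (int i = 0; i < MAX_NR_GENS; i++)
		age(&f);	/* drifts to the oldest generation untouched */

	printf("evicted: %s\n", try_evict(&f) ? "yes" : "no, promoted instead");
	return 0;
}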
Correct, and as I explained offline, this is the only reasonable behavior if we can't locklessly walk secondary MMUs.
Just for the record, the (crude) analogy I used was: imagine a large room with many bills ($1, $5, $10, ...) on the floor, but you are only allowed to pick up 10 of them (and put them in your pocket). A smart move would be to survey the room *first and then* pick up the largest ones. But if you are carrying a 500 lb backpack, you would rather just pick up whichever one is in front of you than walk the entire room.
MGLRU should only scan (or lookaround) secondary MMUs if it can be done lockless. Otherwise, it should just fall back to the existing approach, which existed in previous versions but is removed in this version.
Grabbing the MMU lock for write to scan sucks, no argument there. But can you please be specific about the impact of read lock v. RCU in the case of arm64? I had asked about this before and you never replied.
My concern remains that adding support for software table walkers outside of the MMU lock entirely requires more work than just deferring the deallocation to an RCU callback. Walkers that previously assumed 'exclusive' access while holding the MMU lock for write must now cope with volatile PTEs.
Yes, this problem already exists when hardware sets the AF, but the lock-free walker implementation needs to be generic so it can be applied for other PTE bits.
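For what it's worth, here is a standalone C11 sketch of what such a walker has to cope with (this is not the kernel's page-table walker; pte_t and test_and_clear_young() below are simplified stand-ins): the PTE is volatile, so clearing a bit must be a compare-and-exchange that retries or gives up if the entry changed underneath us. Deferring the deallocation of freed table pages to RCU is a separate requirement that this sketch does not show.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_AF	(UINT64_C(1) << 10)	/* arm64 access flag bit */

typedef _Atomic uint64_t pte_t;

static bool test_and_clear_young(pte_t *ptep)
{
	uint64_t old = atomic_load_explicit(ptep, memory_order_relaxed);

	do {
		if (!(old & PTE_AF))
			return false;	/* not young, nothing to clear */
		/*
		 * The entry can change under us (hardware AF updates,
		 * another walker, an unmap): only clear AF if the whole
		 * PTE is still what we read; otherwise re-examine it.
		 */
	} while (!atomic_compare_exchange_weak_explicit(ptep, &old,
							old & ~PTE_AF,
							memory_order_relaxed,
							memory_order_relaxed));
	return true;
}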
Direct reclaim is multi-threaded, and on arm64 each reclaimer can take the mmu lock for read (to test the A-bit) or for write (to unmap before paging out). The fundamental problem with using a reader-writer lock in this case is priority inversion: the readers have lower priority than the writers, so ideally we don't want the readers to block the writers at all.
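To spell out the locking shape being described (the function names below are placeholders, not the actual arm64 KVM call chains; on arm64, kvm->mmu_lock is an rwlock_t):

#include <linux/kvm_host.h>

/* Aging path: a low-priority reader testing/clearing the A-bit. */
static void age_gfn_range_sketch(struct kvm *kvm)
{
	read_lock(&kvm->mmu_lock);
	/* walk the stage-2 tables and test/clear the access flag */
	read_unlock(&kvm->mmu_lock);
}

/* Reclaim path: a higher-priority writer unmapping before pageout. */
static void unmap_gfn_range_sketch(struct kvm *kvm)
{
	write_lock(&kvm->mmu_lock);	/* must wait for every A-bit reader */
	/* unmap the range so the folio can be paged out */
	write_unlock(&kvm->mmu_lock);
}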
Using my previous (crude) analogy: putting the bill right in front of you into your pocket (the writers) pays off immediately, whereas searching for the largest bill (the readers) can be futile.
As I said earlier, I prefer we drop the arm64 support for now, but I will not object to taking the mmu lock for read when clearing the A-bit, as long as we fully understand the problem here and document it clearly.
FWIW, Google Cloud has been doing proactive reclaim and kstaled-based aging (kstaled is a Google-internal page aging daemon, for those outside of Google) for many years on x86 VMs, with A-bit harvesting done under the write lock. So I'm skeptical that making arm64 lockless is necessary for secondary MMUs to participate in MGLRU aging with acceptable performance for Cloud use cases. I don't even think it's necessary on x86, but it's a simple enough change that we might as well just do it.
I suspect that under pathological conditions (a host under intense memory pressure with a high rate of reclaim), making A-bit harvesting lockless will perform better. But under such conditions, VM performance is likely going to suffer regardless. In a Cloud environment we deal with that through other mechanisms that reduce the rate of reclaim and make the host healthy.
For these reasons, I think there's value in giving users the option to enable secondary MMU participation in MGLRU aging even when A-bit test/clearing is not done locklessly. I believe this was James' intent with the Kconfig. Perhaps a default-off writable module parameter would be better, to avoid distros accidentally turning it on?
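Something along these lines is what I have in mind; the parameter name and helper below are hypothetical, not part of the posted series:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Default off, writable at runtime via /sys/module/.../parameters/. */
static bool mglru_secondary_aging;
module_param(mglru_secondary_aging, bool, 0644);
MODULE_PARM_DESC(mglru_secondary_aging,
		 "Let secondary MMUs (e.g. KVM) participate in MGLRU aging");

/* Aging code would check the knob before scanning secondary MMUs. */
static inline bool secondary_aging_enabled(void)
{
	return READ_ONCE(mglru_secondary_aging);
}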
If and when there is a use case for optimizing VM performance under pathological reclaim conditions on ARM, we can make it lockless then.