On Fri, Jul 11, 2025 at 07:16:00PM +0800, wang lian wrote:
Hi Lorenzo Stoakes,
- This test deterministically validates process_madvise() with MADV_COLLAPSE
- on a remote process, other advices are difficult to verify reliably.
- The test verifies that a memory region in a child process, initially
- backed by small pages, can be collapsed into a Transparent Huge Page by a
- request from the parent. The result is verified by parsing the child's
- /proc/<pid>/smaps file.
- */
This is clever and you've put a lot of effort in, but this just seems absolutely prone to flaking and you're essentially testing something that's highly automated.
I think you're also going way outside of the realms of testing process_madvise() and are getting into testing essentially MADV_COLLAPSE here.
We have to try to keep the test specific to what it is you're testing - which is process_madvise() itself.
So for me, and I realise you've put a ton of work into this and I'm really sorry to say it, I think you should drop this specific test.
For me simply testing the remote MADV_DONTNEED is enough.
My motivation for this complex test came from the need to verify that the process_madvise operation was actually successful. Without checking the outcome, the test would only validate that the syscall returns the correct number of bytes, not that the advice truly took effect on the target process's memory.
For remote calls, process_madvise is intentionally limited to non-destructive advice: MADV_COLD, MADV_PAGEOUT, MADV_WILLNEED, and MADV_COLLAPSE. However, verifying the effects of COLD, PAGEOUT, and WILLNEED is very difficult to do reliably in a selftest. This left MADV_COLLAPSE as what seemed to be the only verifiable option.
But, as you correctly pointed out, MADV_COLLAPSE is too dependent on the system's THP state and prone to races with khugepaged. This is the very issue I tried to work around in v4 after the v3 test failures. So I think this test is necessary. As for your other opinions, I completely agree.
MADV_COLLAPSE is not a reliable test and we're going to end up with flakes. The implementation as-is is unreliable, and I"m not sure there's any way to make it not-unreliable.
This is especially true as we change THP behaviour over time. I don't want to see failed test reports because of this.
I think it might be best to simply assert that the operation succesfully completes without checking whether it actually executes the requested task - it would render this functionality completely broken if it were not to actually do what was requested.
Best regards, Wang Lian