On Tue, Mar 19, 2024 at 06:38:23PM +0000, Matthew Wilcox wrote:
The syscall design is mostly being posted right now to collaborate via public channels, but if the idea is so fundamentally offensive then I'll drop it and relay the opinion accordingly.
The syscall design is wrong. Exposing physical addresses to userspace is never the right answer. Think rowhammer.
1) The syscall does not expose physical address information, it consumes it.
2) The syscall does not allow the user to select a target physical address, only the target node. Now, that said, if source pages are zeroed on migration, that's definitely a concern. I did not see this to be the case, however, and the frequency of writes required to make use of that for rowhammer seems to be a mitigating factor.
3) There exist 4 interfaces which do expose physical address information:
   - /proc/pid/pagemap
   - perf / IBS and PEBS
   - zoneinfo
   - /sys/kernel/mm/page_idle (PFNs)
4) The syscall requires CAP_SYS_ADMIN because these other sources require the same, though as v1/v2 discussed there could be an argument for CAP_SYS_NICE.
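To illustrate point 3, here is a minimal sketch of how /proc/pid/pagemap hands a PFN to userspace. The bit layout follows the kernel's pagemap documentation (bits 0-54 PFN, bit 62 swapped, bit 63 present); the function names are made up for this example and are not part of any existing tool or of the proposed syscall.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE
PFN_MASK = (1 << 55) - 1   # bits 0-54: PFN when the page is present in RAM
SWAPPED_BIT = 1 << 62      # bit 62: page is swapped out
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Split one 64-bit pagemap entry into (present, swapped, pfn)."""
    present = bool(entry & PRESENT_BIT)
    swapped = bool(entry & SWAPPED_BIT)
    # The low bits only hold a PFN for a present, non-swapped page.
    pfn = entry & PFN_MASK if present and not swapped else 0
    return present, swapped, pfn


def read_pagemap_entry(vaddr, pid="self"):
    """Read the raw pagemap entry covering virtual address vaddr."""
    offset = (vaddr // PAGE_SIZE) * 8  # one 8-byte entry per virtual page
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(8)
    if len(data) != 8:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", data)[0]  # native byte order, 64-bit
```

Without CAP_SYS_ADMIN, kernels since 4.2 report the PFN field as zero, which is the same privilege boundary point 4 refers to.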
I'm vehemently opposed to all of the bullshit around CXL. However, if you are going to propose something, it should be based around an abstraction. Say "We have 8 pools of memory. This VMA is backed by memory from pools 3 & 6. The relative hotness of the 8 pools is <vector>. The quantities of memory in the 8 pools are <vector>". And then you can say "migrate this range of memory to pool 2".
That's just an initial response to the idea. I refuse to invest a serious amount of time in a dead-end idea like CXL memory pooling.
Who said anything about pools? Local memory expanders are capable of hosting hotness tracking offload.
~Gregory