On Fri, Oct 13, 2023 at 09:49:10AM -0700, Lokesh Gidra wrote:
On Fri, Oct 13, 2023 at 9:08 AM Peter Xu peterx@redhat.com wrote:
On Fri, Oct 13, 2023 at 11:56:31AM +0200, David Hildenbrand wrote:
Hi Peter,
Hi, David,
I used to have the same thought with David on whether we can simplify the design to e.g. limit it to single mm. Then I found that the trickiest is actually patch 1 together with the anon_vma manipulations, and the problem is that's not avoidable even if we restrict the api to apply on single mm.
What else we can benefit from single mm? One less mmap read lock, but probably that's all we can get; IIUC we need to keep most of the rest of the code, e.g. pgtable walks, double pgtable lockings, etc.
No existing mechanisms move anon pages between unrelated processes, that naturally makes me nervous if we're doing it "just because we can".
IMHO that's also the potential, when guarded with userfaultfd descriptor being shared between two processes.
See below with more comment on the raised concerns.
Actually, even though I have no solid clue, but I had a feeling that there can be some interesting way to leverage this across-mm movement, while keeping things all safe (by e.g. elaborately requiring other proc to create uffd and deliver to this proc).
Okay, but no real use cases yet.
I can provide a "not solid" example. I didn't mention it because it's really something that just popped into my mind when thinking cross-mm, so I never discussed with anyone yet nor shared it anywhere.
Consider VM live upgrade in a generic form (e.g., no VFIO), we can do that very efficiently with shmem or hugetlbfs, but not yet anonymous. We can do extremely efficient postcopy live upgrade now with anonymous if with REMAP.
Basically I see it a potential way of moving memory efficiently especially with thp.
Considering Andrea's original version already contains those bits and all above, I'd vote that we go ahead with supporting two MMs.
You can do nasty things with that, as it stands, on the upstream codebase.
If you pin the page in src_mm and move it to dst_mm, you successfully broke an invariant that "exclusive" means "no other references from other processes". That page is marked exclusive but it is, in fact, not exclusive.
It is still exclusive to the dst mm? I see your point, but I think you're taking exclusiveness altogether with pinning, and IMHO that may not be always necessary?
Once you achieved that, you can easily have src_mm not have MMF_HAS_PINNED,
(I suppose you meant dst_mm here)
so you can just COW-share that page. Now you successfully broke the invariant that COW-shared pages must not be pinned. And you can even trigger VM_BUG_ONs, like in sanity_check_pinned_pages().
Yeah, that's really unfortunate. But frankly, I don't think it's the fault of this new feature, but the rest.
Let's imagine if the MMF_HAS_PINNED wasn't proposed as a per-mm flag, but per-vma, which I don't see why we can't because it's simply a hint so far. Then if we apply the same rule here, UFFDIO_REMAP won't even work for single-mm as long as cross-vma. Then UFFDIO_REMAP as a whole feature will be NACKed simply because of this..
And I don't think anyone can guarantee a per-vma MMF_HAS_PINNED can never happen, or any further change to pinning solution that may affect this. So far it just looks unsafe to remap a pin page to me.
I don't have a good suggestion here if this is a risk.. I'd think it risky then to do REMAP over pinned pages no matter cross-mm or single-mm. It means probably we just rule them out: folio_maybe_dma_pinned() may not even be enough to be safe with fast-gup. We may need page_needs_cow_for_dma() with proper write_protect_seq no matter cross-mm or single-mm?
Can it all be fixed? Sure, with more complexity. For something without clear motivation, I'll have to pass.
I think what you raised is a valid concern, but IMHO it's better fixed no matter cross-mm or single-mm. What do you think?
In general, pinning lose its whole point here to me for an userspace either if it DONTNEEDs it or REMAP it. What would be great to do here is we unpin it upon DONTNEED/REMAP/whatever drops the page, because it loses its coherency anyway, IMHO.
Once there is real demand, we can revisit it and explore what else we would have to take care of (I don't know how memcg behaves when moving between completely unrelated processes, maybe that works as expected, I don't know and I have no time to spare on reviewing features with no real use cases) and announce it as a new feature.
Good point. memcg is probably needed..
So you reminded me to do a more thorough review against zap/fault paths, I think what's missing are (besides page pinning):
mem_cgroup_charge()/mem_cgroup_uncharge():
(side note: I think folio_throttle_swaprate() is only for when allocating new pages, so not needed here)
check_stable_address_space() (under pgtable lock)
tlb flush
Hmm???????????????? I can't see anywhere we did tlb flush, batched or not, either single-mm or cross-mm should need it. Is this missing?
IIUC, ptep_clear_flush() flushes tlb entry. So I think we are doing unbatched flushing. Possibly a nice performance improvement later on would be to try doing it batched. Suren can throw more light on it.
Oh yeah.. thanks.
One thing I was wondering is don't we need cache flush for the src pages? mremap's move_page_tables() does it. IMHO, it's required here as well.
I commented in my reply, I also think it's needed. Otherwise for some arches I think we can have page containing stall data if not fully flushed before the movement. x86 is probably fine, though.
Note: that (with only reading the documentation) it also kept me wondering how the MMs are even implied from
struct uffdio_move { __u64 dst; /* Destination of move */ __u64 src; /* Source of move */ __u64 len; /* Number of bytes to move */ __u64 mode; /* Flags controlling behavior of move */ __s64 move; /* Number of bytes moved, or negated error */ };
That probably has to be documented as well, in which address space dst and src reside.
Agreed, some better documentation will never hurt. Dst should be in the mm address space that was bound to the userfault descriptor. Src should be in the current mm address space.
Thanks,
-- Peter Xu