On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
On 12/2/25 12:50, Nikita Kalyazin wrote:
On 01/12/2025 20:57, Peter Xu wrote:
On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
On 01/12/2025 18:35, Peter Xu wrote:
On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
I believe I found the precise point where we convinced ourselves that minor support was sufficient: [1]. If at this moment we don't find that reasoning valid anymore, then indeed implementing missing is the only option.
Now after I re-read the discussion, I may have made a wrong statement there, sorry. I could have got slightly confused on when the write() syscall can be involved.
I agree if you want to get an event when cache missed with the current uffd definitions and when pre-population is forbidden, then MISSING trap is required. That is, with/without the need of UFFDIO_COPY being available.
Do I understand it right that UFFDIO_COPY is not allowed in your case, but only write()?
No, UFFDIO_COPY would work perfectly fine. We will still use write() whenever we resolve stage-2 faults as they aren't visible to UFFD. When a userfault occurs at an offset that already has a page in the cache, we will have to keep using UFFDIO_CONTINUE so it looks like both will be required:
- user mapping major fault -> UFFDIO_COPY (fills the cache and sets up userspace PT) - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT) - stage-2 fault -> write() (only fills the cache)
Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?
Yes, that's the one ([1]).
[1] https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com
It looks fine indeed, but it looks slightly weird then, as you'll have two ways to populate the page cache. Logically here atomicity is indeed not needed when you trap both MISSING + MINOR.
I reran the test based on the UFFDIO_COPY prototype I had using your series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB: 237 vs 202 ms (+17%). Even though UFFDIO_COPY alone is functionally sufficient, I would prefer to have an option to use write() where possible and only falling back to UFFDIO_COPY for userspace faults to have better performance.
Just so I understand correctly: we could even do without UFFDIO_COPY for that scenario by using write() + minor faults?
We still need major fault notifications as well (which we were accidentally generating until this version). But we can resolve them with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.
But what you are saying is that there might be a performance benefit in using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault overhead?
UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a single syscall instead of two, but the amount of userspace faults, at least in our scenario, is negligible compared to the amount of stage-2 faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY if it can be avoided.
-- Cheers
David