From: Nadav Amit namit@vmware.com
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed on a certain page:
mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE)
At this point the user would expect to still get UFFD notification when the page is accessed for write, but the user would not get one, since the PTE was not marked as UFFD_WP during UFFDIO_COPY.
Fix it by always marking PTEs as UFFD_WP regardless on the write-permission in the VMA flags.
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Cc: Mike Rapoport rppt@linux.vnet.ibm.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Peter Xu peterx@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit namit@vmware.com --- mm/userfaultfd.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..885e5adb0168 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -72,12 +72,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, _dst_pte = pte_mkdirty(_dst_pte); if (page_in_cache && !vm_shared) writable = false; - if (writable) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } + + /* + * Always mark a PTE as write-protected when needed, regardless of + * VM_WRITE, which the user might change. + */ + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else if (writable) + _dst_pte = pte_mkwrite(_dst_pte);
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
On Feb 17, 2022, at 1:16 PM, Nadav Amit nadav.amit@gmail.com wrote:
From: Nadav Amit namit@vmware.com
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
One more note - if you want you can try the following reproducer:
#define _GNU_SOURCE #include <linux/userfaultfd.h> #include <errno.h> #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <pthread.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <sys/mman.h> #include <sys/types.h> #include <stdio.h>
volatile int ok;
static void * fault_handler_thread(void *arg) { struct uffd_msg msg = {0}; int nread; int uffd = (int)(long)arg; struct uffdio_writeprotect uffd_wp = {0};
nread = read(uffd, &msg, sizeof(msg)); if (nread == 0) { printf("EOF on userfaultfd!\n"); exit(EXIT_FAILURE); }
ok = 1;
uffd_wp.range.start = msg.arg.pagefault.address & ~(4095ull); uffd_wp.range.len = 4096;
if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_wp) == -1) { perror("uffd-w-unprotect"); exit(EXIT_FAILURE); } return NULL; }
char page[4096];
int main(void) { struct uffdio_api uffdio_api = {0}; struct uffdio_register uffdio_register = {0}; struct uffdio_writeprotect uffdio_wp = {0}; struct uffdio_copy uffdio_copy = {0}; volatile char *pc; pthread_t thr; int err; int uffd; void *p;
uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); if (uffd == -1) { perror("userfaultfd"); exit(-1); }
uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) { perror("api"); exit(EXIT_FAILURE); }
p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
uffdio_register.range.start = (unsigned long)p; uffdio_register.range.len = 4096; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING|UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) { perror("register"); exit(EXIT_FAILURE); }
uffdio_copy.src = (unsigned long)page; uffdio_copy.dst = (unsigned long)p; uffdio_copy.len = 4096; uffdio_copy.mode = UFFDIO_COPY_MODE_WP; uffdio_copy.copy = 0; if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1) { perror("uffd-copy"); exit(EXIT_FAILURE); }
err = pthread_create(&thr, NULL, fault_handler_thread, (void *)(long)uffd); if (err != 0) { exit(EXIT_FAILURE); }
if (mprotect(p, 4096, PROT_READ|PROT_WRITE) < 0) { perror("mprotect(PROT_READ|PROT_WRITE)"); exit(EXIT_FAILURE); }
pc = (volatile char *)p; *pc = 1; printf("%s\n", ok ? "ok" : "failed”); return 0; }
Hello, Nadav,
On Thu, Feb 17, 2022 at 09:16:02PM +0000, Nadav Amit wrote:
From: Nadav Amit namit@vmware.com
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed on a certain page:
mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE)
No objection to the patch, however I'm wondering why this is a valid use case because mprotect seems to be conflict with uffd, because AFAICT mprotect(PROT_READ|PROT_WRITE) can already grant write bit.
In change_pte_range():
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
Because when VM_SOFTDIRTY is cleared it means soft dirty enabled. I wanted to post a patch but I never yet.
Could I ask why you need mprotect() with uffd?
Thanks,
On Feb 17, 2022, at 5:58 PM, Peter Xu peterx@redhat.com wrote:
Hello, Nadav,
On Thu, Feb 17, 2022 at 09:16:02PM +0000, Nadav Amit wrote:
From: Nadav Amit namit@vmware.com
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed on a certain page:
mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE)
No objection to the patch, however I'm wondering why this is a valid use case because mprotect seems to be conflict with uffd, because AFAICT mprotect(PROT_READ|PROT_WRITE) can already grant write bit.
In change_pte_range():
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I think you are right, and an additional patch is needed to prevent mprotect() from making an entry writable if the PTE has _PAGE_UFFD_WP set and uffd_wp_resolve was not provided. I missed that.
I’ll post another patch for this one.
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
Because when VM_SOFTDIRTY is cleared it means soft dirty enabled. I wanted to post a patch but I never yet.
Seems that you are right. Yet, having this wrong code around for some time raises the concern whether something will break. By the soft-dirty I saw so far, it seems that it is not commonly used.
Could I ask why you need mprotect() with uffd?
Sure. I think I mentioned it before, that I want to use userfaultfd for other processes [1], by having one monitor UFFD for multiple processes that handles their swap/prefetch activities based on custom policies.
I try to set the least amount of constraints on what these processes might do, and mprotect() is something they are allowed to do.
I would hopefully send the patches that are required for all of that and open source my code soon. In the meanwhile I try to upstream the least controversial parts.
[1] https://lore.kernel.org/linux-mm/YWZCClDorCCM7KMG@t490s/t/
On Thu, Feb 17, 2022 at 06:23:14PM -0800, Nadav Amit wrote:
On Feb 17, 2022, at 5:58 PM, Peter Xu peterx@redhat.com wrote:
Hello, Nadav,
On Thu, Feb 17, 2022 at 09:16:02PM +0000, Nadav Amit wrote:
From: Nadav Amit namit@vmware.com
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed on a certain page:
mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE)
No objection to the patch, however I'm wondering why this is a valid use case because mprotect seems to be conflict with uffd, because AFAICT mprotect(PROT_READ|PROT_WRITE) can already grant write bit.
In change_pte_range():
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I think you are right, and an additional patch is needed to prevent mprotect() from making an entry writable if the PTE has _PAGE_UFFD_WP set and uffd_wp_resolve was not provided. I missed that.
Perhaps we can simply make this "if" to be "else" so as to connect with the previous "if"? After all the three (wp, wp_resolv, dirty_acct) are never used with more than one flag set.
I’ll post another patch for this one.
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
Because when VM_SOFTDIRTY is cleared it means soft dirty enabled. I wanted to post a patch but I never yet.
Seems that you are right. Yet, having this wrong code around for some time raises the concern whether something will break. By the soft-dirty I saw so far, it seems that it is not commonly used.
I'll see whether I should prepare a patch and a test, maybe after yours.
Could I ask why you need mprotect() with uffd?
Sure. I think I mentioned it before, that I want to use userfaultfd for other processes [1], by having one monitor UFFD for multiple processes that handles their swap/prefetch activities based on custom policies.
I try to set the least amount of constraints on what these processes might do, and mprotect() is something they are allowed to do.
I see. I didn't expect mprotect() can work well with uffd, but it seems fine at least in this case.
Have you thought about other use of mprotect() other than RO? Say, I only know a valid use case of PROT_NONE for region reservation purpose, which normally will be followed up by a munmap() and remap on the same address. That sounds okay. But not sure whether this patch will cover all the possible mprotect() uses in the tracee.
I would hopefully send the patches that are required for all of that and open source my code soon. In the meanwhile I try to upstream the least controversial parts.
Sure, I'm always happy to learn it. Thanks,
[1] https://lore.kernel.org/linux-mm/YWZCClDorCCM7KMG@t490s/t/
On Feb 17, 2022, at 6:23 PM, Nadav Amit nadav.amit@gmail.com wrote:
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I know it is off-topic (not directly related to my patch), but I tried to understand the logic - both of the existing code and of your suggested change - and I failed.
IIUC dirty_accountable (whose value is taken from vma_wants_writenotify()) means that the writes *should* be tracked, and therefore the page should remain read-only.
So basically the condition should have been based on !dirty_accountable, i.e. the inverted value of dirty_accountable.
The problem is that dirty_accountable also reflects VM_SOFTDIRTY considerations, so looking on the PTE does not tell you whether the PTE should remain write-protected even if it is dirty.
Am I missing something?
On Feb 17, 2022, at 8:00 PM, Nadav Amit nadav.amit@gmail.com wrote:
On Feb 17, 2022, at 6:23 PM, Nadav Amit nadav.amit@gmail.com wrote:
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I know it is off-topic (not directly related to my patch), but I tried to understand the logic - both of the existing code and of your suggested change - and I failed.
IIUC dirty_accountable (whose value is taken from vma_wants_writenotify()) means that the writes *should* be tracked, and therefore the page should remain read-only.
So basically the condition should have been based on !dirty_accountable, i.e. the inverted value of dirty_accountable.
The problem is that dirty_accountable also reflects VM_SOFTDIRTY considerations, so looking on the PTE does not tell you whether the PTE should remain write-protected even if it is dirty.
Just as I pressed send, I understood your suggestion.
It is still confusing for me how setting write-permissions for dirty_accountable PTEs is safe (as long as they are dirty), and yet in the general case it is not safe. I need to further think of it.
As I understand it, this patch (below) is still good to merge upstream, although Peter hasn't acked it (please).
And a whole bunch of followup patches are being thought about, but have yet to eventuate.
Do we feel that this patch warrants the cc:stable? I'm suspecting "no", as it isn't clear that the use-case is really legitimate at this time?
Thanks.
From: Nadav Amit namit@vmware.com Subject: userfaultfd: mark uffd_wp regardless of VM_WRITE flag
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed on a certain page:
mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE)
At this point the user would expect to still get UFFD notification when the page is accessed for write, but the user would not get one, since the PTE was not marked as UFFD_WP during UFFDIO_COPY.
Fix it by always marking PTEs as UFFD_WP regardless on the write-permission in the VMA flags.
Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Signed-off-by: Nadav Amit namit@vmware.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Mike Rapoport rppt@linux.vnet.ibm.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Peter Xu peterx@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org ---
mm/userfaultfd.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)
--- a/mm/userfaultfd.c~userfaultfd-mark-uffd_wp-regardless-of-vm_write-flag +++ a/mm/userfaultfd.c @@ -72,12 +72,15 @@ int mfill_atomic_install_pte(struct mm_s _dst_pte = pte_mkdirty(_dst_pte); if (page_in_cache && !vm_shared) writable = false; - if (writable) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } + + /* + * Always mark a PTE as write-protected when needed, regardless of + * VM_WRITE, which the user might change. + */ + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else if (writable) + _dst_pte = pte_mkwrite(_dst_pte);
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
_
Hi, Andrew,
On Wed, Mar 16, 2022 at 03:05:53PM -0700, Andrew Morton wrote:
As I understand it, this patch (below) is still good to merge upstream, although Peter hasn't acked it (please).
Thanks for asking. I didn't ack because I saw that it's queued a long time ago into -mm, and also it's in -next for a long time too (my new uffd-wp patchset is rebased to this one already).
I normally assume that means you read and ack that patch already, so if I didn't see anything obviously wrong I'll just keep silent. Please let me know if that's not the expected behavior..
So here it is..
Acked-by: Peter Xu peterx@redhat.com
And a whole bunch of followup patches are being thought about, but have yet to eventuate.
Is there a pointer/subject?
Do we feel that this patch warrants the cc:stable? I'm suspecting "no", as it isn't clear that the use-case is really legitimate at this time?
Right. Uffd-wp+mprotect usage is IMHO not a major use case for uffd-wp per my knowledge. At least I didn't even expect both work together, not until Nadav started working on some of the problems..
Said that, for this specific case it should be literally only changing the behavior of anonymous UFFD-WP && !WRITE case but nothing else (please correct me otherwise..), then IMHO it's pretty safe to copy stable too especially for the cleanly applicable branches.
Thanks,
On Thu, 17 Mar 2022 08:11:44 +0800 Peter Xu peterx@redhat.com wrote:
Hi, Andrew,
On Wed, Mar 16, 2022 at 03:05:53PM -0700, Andrew Morton wrote:
As I understand it, this patch (below) is still good to merge upstream, although Peter hasn't acked it (please).
Thanks for asking. I didn't ack because I saw that it's queued a long time ago into -mm, and also it's in -next for a long time too (my new uffd-wp patchset is rebased to this one already).
I normally assume that means you read and ack that patch already, so if I didn't see anything obviously wrong I'll just keep silent. Please let me know if that's not the expected behavior..
Acks and reviews are always welcome. If they come in late, git tree maintainer might not want to update and rebase, but it's still there in the archives for people who click on the Link:.
So here it is..
Acked-by: Peter Xu peterx@redhat.com
Thanks.
And a whole bunch of followup patches are being thought about, but have yet to eventuate.
Is there a pointer/subject?
The messages in this thread. Several followup patches were discussed.
Do we feel that this patch warrants the cc:stable? I'm suspecting "no", as it isn't clear that the use-case is really legitimate at this time?
Right. Uffd-wp+mprotect usage is IMHO not a major use case for uffd-wp per my knowledge. At least I didn't even expect both work together, not until Nadav started working on some of the problems..
Said that, for this specific case it should be literally only changing the behavior of anonymous UFFD-WP && !WRITE case but nothing else (please correct me otherwise..), then IMHO it's pretty safe to copy stable too especially for the cleanly applicable branches.
OK, thanks.
On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote:
On Feb 17, 2022, at 6:23 PM, Nadav Amit nadav.amit@gmail.com wrote:
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I know it is off-topic (not directly related to my patch), but I tried to understand the logic - both of the existing code and of your suggested change - and I failed.
IIUC dirty_accountable (whose value is taken from vma_wants_writenotify()) means that the writes *should* be tracked, and therefore the page should remain read-only.
Right.
So basically the condition should have been based on !dirty_accountable, i.e. the inverted value of dirty_accountable.
The problem is that dirty_accountable also reflects VM_SOFTDIRTY considerations, so looking on the PTE does not tell you whether the PTE should remain write-protected even if it is dirty.
My understanding is that the dirty bits (especially if both set) means we've tracked dirty on this pte already so we don't need to, hence we can set the dirty bit here. E.g., continuous mprotect(RO), mprotect(RW) upon a full dirty pte.
When something wants to enable tracking again, it needs to clear the dirty bit, either the real one or soft-dirty one. So it's a pure performance enhancement to conditionally set write bit here, when we're sure we won't need any further tracking on this pte.
One thing to mention is that this path only applies to VM_SHARED|VM_WRITE, because that's what checked the first in vma_wants_writenotify():
/* If it was private or non-writable, the write bit is already clear */ if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) return 0;
IOW private mappings are not optimized in current tree yet.
Peter Collingbourne proposed a patch some time ago to optimize it but it didn't get merged somehow. Meanwhile even with his latest version it should still miss the thp case, so if to reference the private optimization Andrea's tree would be the best:
https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6
Hope it clarifies things a bit. Thanks,
On Feb 20, 2022, at 10:23 PM, Peter Xu peterx@redhat.com wrote:
On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote:
On Feb 17, 2022, at 6:23 PM, Nadav Amit nadav.amit@gmail.com wrote:
PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || (vma->vm_flags & VM_SOFTDIRTY))) { ptent = pte_mkwrite(ptent); }
I know it is off-topic (not directly related to my patch), but I tried to understand the logic - both of the existing code and of your suggested change - and I failed.
IIUC dirty_accountable (whose value is taken from vma_wants_writenotify()) means that the writes *should* be tracked, and therefore the page should remain read-only.
Right.
So basically the condition should have been based on !dirty_accountable, i.e. the inverted value of dirty_accountable.
The problem is that dirty_accountable also reflects VM_SOFTDIRTY considerations, so looking on the PTE does not tell you whether the PTE should remain write-protected even if it is dirty.
My understanding is that the dirty bits (especially if both set) means we've tracked dirty on this pte already so we don't need to, hence we can set the dirty bit here. E.g., continuous mprotect(RO), mprotect(RW) upon a full dirty pte.
When something wants to enable tracking again, it needs to clear the dirty bit, either the real one or soft-dirty one. So it's a pure performance enhancement to conditionally set write bit here, when we're sure we won't need any further tracking on this pte.
One thing to mention is that this path only applies to VM_SHARED|VM_WRITE, because that's what checked the first in vma_wants_writenotify():
/* If it was private or non-writable, the write bit is already clear */ if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) return 0;
IOW private mappings are not optimized in current tree yet.
Peter Collingbourne proposed a patch some time ago to optimize it but it didn't get merged somehow. Meanwhile even with his latest version it should still miss the thp case, so if to reference the private optimization Andrea's tree would be the best:
https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6
Hope it clarifies things a bit. Thanks,
Thanks for the clarification. That’s what I suspected - I did not encounter it since I only used private anonymous mappings. I will try to create a test-case and send an additional fix for this issue.
Regards, Nadav
linux-stable-mirror@lists.linaro.org