On Mon, 12 Apr 2021, Peter Xu wrote:
On Tue, Apr 06, 2021 at 11:14:30PM -0700, Hugh Dickins wrote:
+static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
enum mcopy_atomic_mode mode, bool wp_copy)
+{
[...]
- if (writable) {
_dst_pte = pte_mkdirty(_dst_pte);
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
else
_dst_pte = pte_mkwrite(_dst_pte);
- } else if (vm_shared) {
/*
* Since we didn't pte_mkdirty(), mark the page dirty or it
* could be freed from under us. We could do this
* unconditionally, but doing it only if !writable is faster.
*/
set_page_dirty(page);
I do not remember why Andrea or I preferred set_page_dirty() here to pte_mkdirty(); but I suppose there might somewhere be a BUG_ON(pte_dirty) which this would avoid. Risky to change it, though it does look odd.
Is any of the possible BUG_ON(pte_dirty) going to trigger because the pte has write bit cleared? That's one question I was not very sure, e.g., whether one pte is allowed to be "dirty" if it's not writable.
To me it's okay, it's actually very suitable for UFFDIO_COPY case, where it is definitely dirty data (so we must never drop it) even if it's installed as RO, however to achieve that we can still set the dirty on the page rather than the pte as what we do here. It's just a bit awkward as you said.
Meanwhile today I just noticed this in arm64 code:
static inline pte_t pte_wrprotect(pte_t pte) { /* * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY * clear), set the PTE_DIRTY bit. */ if (pte_hw_dirty(pte)) pte = pte_mkdirty(pte);
pte = clear_pte_bit(pte, __pgprot(PTE_WRITE)); pte = set_pte_bit(pte, __pgprot(PTE_RDONLY)); return pte; }
So arm64 will explicitly set the dirty bit (from the HW dirty bit) when wr-protect. It seems to prove that at least for arm64 it's very valid to have !write && dirty pte.
I did not mean to imply that it's wrong to have pte_dirty without pte_write: no, I agree with you, I believe that there are accepted and generic ways in which we can have pte_dirty without pte_write (and we could each probably add a warning somewhere which would very quickly prove that - but those would not prove that there are not BUG_ONs on some other path, which had been my fear).
I wanted now to demonstrate that by pointing to change_pte_range() in mm/mprotect.c, showing that it does not clear pte_dirty when it clears pte_write. But alarmingly found rather the reverse: that it appears to set pte_write when it finds pte_dirty - if dirty_accountable.
That looks very wrong, but if I spent long enough following up dirty_accountable in detail, I think I would be reassured to find that it is only adding the pte_write there when it had removed it from the prot passed down, for dirty accounting reasons (which apply !VM_SHARED protections in the VM_SHARED case, so that page_mkwrite() is called and dirty accounting done when necessary).
What I did mean to imply is that changing set_page_dirty to pte_mkdirty, to make that userfaultfd code block look nicer, is not a change to be done lightly: by all means try it out, test it, and send a patch after Axel's series is in, but please do not ask Axel to make that change as a part of his series - it would be taking a risk, just for a cleanup.
Now, I have also looked up the mail exchange with Andrea which led to his dcf7fe9d8976 ("userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set") - it had to be off-list at the time. And he was rather led to that set_page_dirty by following old patterns left over in shmem_getpage_gfp(); but when I said "or it could be done with pte_mkdirty without pte_mkwrite", he answered "I explicitly avoided that because pte_dirty then has side effects on mprotect to decide pte_write. It looks safer to do set_page_dirty and not set dirty bits in not writable ptes unnecessarily".
Haha: I think Andrea is referring to exactly the dirty_accountable code in change_pte_protection() which worried me above. Now, I think that will turn out okay (shmem does not have a page_mkwrite(), and does not participate in dirty accounting), but you will have to do some work to assure us all of that, before sending in a cleanup patch.
Hugh