On Thu, 8 Apr 2021, Axel Rasmussen wrote:
On Tue, Apr 6, 2021 at 4:49 PM Peter Xu peterx@redhat.com wrote:
On Mon, Apr 05, 2021 at 10:19:17AM -0700, Axel Rasmussen wrote:
...
--- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c
...
if (is_file_backed) {
/* The shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
Frankly I don't really know why this must be put into pgtable lock.. Since if not required then it can be moved into UFFDIO_COPY path, as CONTINUE doesn't need it iiuc. Just raise it up as a pure question.
It's not clear to me either. shmem_getpage_gfp() does check this twice kinda like we're doing, but it doesn't ever touch the PTL. What it seems to be worried about is, what happens if a concurrent FALLOC_FL_PUNCH_HOLE happens somewhere in the middle of whatever manipulation we're doing? From looking at shmem_fallocate(), I think the basic point is that truncation happens while "inode_lock(inode)" is held, but neither shmem_mcopy_atomic_pte() or the new mcopy_atomic_install_ptes() take that lock.
I'm a bit hesitant to just remove it, run some tests, and then declare victory, because it seems plausible it's there to catch some semi-hard-to-induce race. I'm not sure how to prove that *isn't* needed, so my inclination is to just keep it?
I'll send a series addressing the feedback so far this afternoon, and I'll leave this alone for now - at least, it doesn't seem to hurt anything. Maybe Hugh or someone else has some more advice about it. If so, I'm happy to remove it in a follow-up.
It takes some thinking about, but the i_size check is required to be under the pagetable lock, for the MAP_PRIVATE UFFDIO_COPY path, where it is inserting an anonymous page into the file-backed vma (skipping actually inserting a page into page cache, as an ordinary fault would).
Not because of FALLOC_FL_PUNCH_HOLE (which makes no change to i_size; and it's okay if a race fills in the hole immediately afterwards), but because of truncation (which must remove all beyond i_size).
In the MAP_SHARED case, with a locked page inserted into page cache, the page lock is enough to exclude concurrent truncation. But even in that case the second i_size check (I'm looking at 5.12-rc's shmem_mfill_atomic_pte(), rather than recent patches which might differ) is required: because the first i_size check was done before the page became visible in page cache, so a concurrent truncation could miss it).
Maybe that first check is redundant, though I'm definitely for doing it; or maybe shmem_add_to_page_cache() would be better if it made that check itself, under xas_lock (I think the reason it does not is historical). The second check, in the MAP_SHARED case, does not need to be under pagetable lock - the page lock on the page cache page is enough - but probably Andrea placed it there to resemble the anonymous case.
You might then question, how come there is no i_size check in all of mm/memory.c, where ordinary faulting is handled. I'll answer that the pte_same() check, under pagetable lock in wp_page_copy(), is where the equivalent to userfaultfd's MAP_PRIVATE UFFDIO_COPY check is made: if the page cache page has already been truncated, that pte will have been cleared.
Or, if the page cache page is truncated an instant after wp_page_copy() drops page table lock, then the unmap_mapping_range(,,, even_cows = 1) which follows truncation has to clean it up. Er, does that mean that the i_size check I started off insisting is required, actually is not required? Um, maybe, but let's just keep it and say goodnight!
Hugh