On Fri, 24 Apr 2020 15:58:29 -0700 John Hubbard jhubbard@nvidia.com wrote:
On 2020-04-24 13:15, Alex Williamson wrote:
On Fri, 24 Apr 2020 12:20:03 -0700 John Hubbard jhubbard@nvidia.com wrote:
On 2020-04-24 11:18, Alex Williamson wrote: ...
Hi John,
I'm seeing a regression bisected back to this commit (3faa52c03f44 mm/gup: track FOLL_PIN pages). I've attached some vfio-pci test code that reproduces this by mmap'ing a page of MMIO space of a device and then tries to map that through the IOMMU, so this should be attempting a gup/pin of a PFNMAP page. Previously this failed gracefully (-EFAULT), but now results in:
Hi Alex,
Thanks for this report, and especially for source code to test it, seeing as how I can't immediately spot the problem just from the crash data so far. I'll get set up and attempt a repro.
Actually this looks like it should be relatively easier than the usual sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call, good luck finding which one" report that I fear the most. :) This one looks more like a crash that happens directly, when calling into the pin_user_pages_remote() code. Which should be a lot easier to solve...
btw, if you are set up for it, it would be nice to know what source file and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22) below. But if not, no problem, because I've likely got to do the repro in any case.
Hey John,
TBH I'm feeling a lot less confident about this bisect. This was readily reproducible to me on a clean tree a bit ago, but now it eludes me. Let me go back and figure out what's going on before you spend any more time on it. Thanks,
OK. But I'm keeping the repro program! :) It made it quick and easy to set up a vfio test, so it was worth doing in any case.
Great, because I've traced my steps, re-bisected and came back to the same upstream commit with the same test program. The major difference is that I thought I was seeing this on pure upstream, but some vfio code that I'm trying to prepare for upstream snuck in, so this isn't a pure upstream regression, but the changes I was making worked on v5.6 and does not work with this commit. Maybe still a latent regression, maybe a bug in my changes.
Anyway, I wanted to double check this just out of paranoia, and so now I have a data point for you: your test program runs and passes for me using today's linux.git kernel, with an NVIDIA GPU as the vfio device, and the kernel log is clean. No hint of any problems.
Yep, I agree. The vfio change I'm experimenting with is to move the remap_pfn_range() from vfio_pci_mmap() to a vm_ops.fault handler. This is why I have the test program creating an mmap of the device mmio space and then immediately mapping it through the iommu without touching it. If the vma gets faulted in the dma mapping path via pin_user_pages_remote(), I see the crash I reported initially. If the test program is modified to access the mmap before doing the dma mapping, everything works normally. In either case, the fault handler is called and satisfies the fault with remap_pfn_range() and returns VM_FAULT_NOPAGE (vfio patch attached).
Here's the crash I'm seeing with some further debugging:
BUG: unable to handle page fault for address: ffffa5b8bfe14f38 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 70 PID: 3343 Comm: vfio-pci-dma-ma Not tainted 5.6.0-3faa52c03f44+ #20 Hardware name: AMD Corporation Diesel/Diesel, BIOS TDL100CB 03/17/2020 RIP: 0010:get_pfnblock_flags_mask+0x22/0x70 Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 05 bc e1 d9 01 48 89 f7 49 89 c8 48 c1 ef 0f 48 85 c0 74 48 48 89 f1 48 c1 e9 17 <48> 8b 04 c8 48 85 c0 74 0b 40 0f b6 ff 48 c1 e7 04 48 01 f8 48 c1 RSP: 0018:ffffb2e3c910fcc8 EFLAGS: 00010216 RAX: ffff95b8bff50000 RBX: 0000000000000001 RCX: 000001fffffd89e7 RDX: 0000000000000002 RSI: fffffec4f3a899ba RDI: 0001fffffd89e751 RBP: ffffb2e3c910fd88 R08: 0000000000000007 R09: ffff95a4aa79fce8 R10: 0000000000000000 R11: ffffb2e3c910f840 R12: 0000000100000000 R13: 0000000000000000 R14: 0000000000000001 R15: ffff959caa266e80 FS: 00007f1a95023740(0000) GS:ffff959caf180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa5b8bfe14f38 CR3: 0000000462a1e000 CR4: 00000000003406e0 Call Trace: __gup_longterm_locked+0x274/0x620 vaddr_get_pfn+0x74/0x110 [vfio_iommu_type1] vfio_pin_pages_remote+0x6e/0x370 [vfio_iommu_type1] vfio_iommu_type1_ioctl+0x8e5/0xaac [vfio_iommu_type1]
get_pfnblock_flags_mask+0x22/0x70 is here:
include/linux/mmzone.h:1254 static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME if (!mem_section) return NULL; #endif ====> if (!mem_section[SECTION_NR_TO_ROOT(nr)]) return NULL; return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; }
The call trace is:
__gup_longterm_locked check_and_migrate_cma_pages is_migrate_cma_page get_pageblock_migratetype get_pfnblock_flags_mask __get_pfnblock_flags_mask get_pageblock_bitmap __pfn_to_section __nr_to_section
Any ideas why we see this difference between the vma being faulted in outside of the page pinning path versus faulted by it?
FWIW, here's the stack dump for getting to the fault handler in the latter case:
vfio_pci_mmap_fault+0x22/0x130 [vfio_pci] __do_fault+0x38/0xd0 __handle_mm_fault+0xd4b/0x1380 handle_mm_fault+0xe2/0x1f0 __get_user_pages+0x188/0x820 __gup_longterm_locked+0xc8/0x620
Thanks! Alex