Hi,
Christoph Hellwig would prefer a somewhat different approach to the devmap cleanup in patch 5 ("mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages"). That came up in a different review thread, because the patch is out for review in two locations. Here's that review thread:
https://lore.kernel.org/r/20191118070826.GB3099@infradead.org
...and I'm hoping that we can defer that request, because it would otherwise derail this series, which is starting to look like it could be ready for 5.5.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v6
Changes since v5:
* Fixed the refcounting for huge pages: in most cases, it was only taking one GUP_PIN_COUNTING_BIAS's worth of refs, when it should have been taking one GUP_PIN_COUNTING_BIAS for each subpage.
(Much thanks to Jan Kara for spotting that one!)
* Renamed user_page_ref_inc() to try_pin_page(), and added a new try_pin_compound_head(). This definitely improves readability.
* Factored out some more duplication in the FOLL_PIN and FOLL_GET cases, in gup.c.
* Fixed up some straggling "get_" --> "pin_" references in the comments.
* Added reviewed-by tags.
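
To make the huge page fix above concrete, here is a minimal user-space sketch of the arithmetic (not the gup.c code). The bias value of 1024, struct model_page, and the model_*() helpers are all assumptions made up for illustration; the real definitions live in the patches themselves.

```c
#include <assert.h>

/* Assumed value, for illustration only; see include/linux/mm.h in the series. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for the head page of a compound (huge) page. */
struct model_page {
	int refcount;
};

/* Refs that a pin of nr_subpages subpages should add to the head page. */
static int model_pin_refs_needed(int nr_subpages)
{
	assert(nr_subpages > 0);
	/* One bias per subpage, not one bias for the whole huge page. */
	return nr_subpages * GUP_PIN_COUNTING_BIAS;
}

/* Model of pinning nr_subpages subpages via the compound head. */
static void model_pin_compound_head(struct model_page *head, int nr_subpages)
{
	head->refcount += model_pin_refs_needed(nr_subpages);
}

/* Model of releasing a single pinned subpage: drops exactly one bias. */
static void model_unpin_subpage(struct model_page *head)
{
	assert(head->refcount >= GUP_PIN_COUNTING_BIAS);
	head->refcount -= GUP_PIN_COUNTING_BIAS;
}
```

In this model, each subpage release drops one bias from the head page. If the pin side had added only a single bias for the whole huge page, the second release would already underflow the pin count, which is why the pin side has to scale with the number of subpages.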
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page*().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now: get_user_pages --> pin_user_pages. I think we've found the right balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"), so as to make pin_user_pages*() calls act as placeholders for the corresponding get_user_pages*() calls, until a later patch fully implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing, unnecessary release and reacquire of mmap_sem. Moved the DAX vma checks from the vfio call site, to gup internals, and added comments (and commit log) to clarify.
* Due to the above, made a corresponding fix to pin_longterm_pages_remote(), which was actually calling the wrong gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This is also posted separately, in order to hopefully get some runtime testing.
* Changed the page devmap code to be a little clearer, thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages() semantics. Therefore, revert the put_user_page() calls to put_page(), and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly from put_user_pages(), to release_pages(). I'd feel better if someone agrees that this is the best way. It uses the more efficient release_pages(), instead of put_page() in a loop, and keeps the change to just a few characters on one line, but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the __get_user_pages_locked() implementation, and used that in order to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches (drm/via, fs/io_uring, net/xdp) that already had put_user_page() calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2 email threads.
* Lots of churn from v2 --> v3, so it's possible that new bugs sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for buggy callers who failed to set FOLL_GET. Instead, assert that FOLL_GET is set (and fail if it's not).
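
As a rough illustration of what "assert and fail" could look like, here is a user-space sketch. This is not the planned kernel change: model_gup_locked() and the MODEL_FOLL_GET value are hypothetical, and the real function takes many more arguments.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag value for this model only. */
#define MODEL_FOLL_GET 0x04u

/*
 * Hypothetical model of the entry check: rather than quietly adding the
 * missing flag on behalf of the caller, reject the call outright.
 * Returns the number of pages "pinned", or a negative errno.
 */
static long model_gup_locked(unsigned int gup_flags, long nr_pages)
{
	assert(nr_pages >= 0);

	if (!(gup_flags & MODEL_FOLL_GET))
		return -EINVAL;	/* buggy caller: fail instead of compensating */

	return nr_pages;
}
```

In the kernel, a check like this would presumably be paired with a one-time warning so that buggy callers get noticed, but that part is a guess and is left out of the sketch.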
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to solving the larger problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN, is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this:
    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a conversion similar to the one for FOLL_PIN was applied. The change was from this:

    get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
    put_page()

to this:
    pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
    put_user_page()
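
The pairing rule behind these conversions can be sketched in user space as follows. Everything here is a made-up model: struct model_page, the model_*() helpers, and the bias value of 1024 are assumptions for illustration, not the kernel API or its signatures.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed per-pin refcount bias, for this model only. */
#define MODEL_PIN_BIAS 1024

/* Hypothetical stand-in for struct page: only the refcount is modeled. */
struct model_page {
	int refcount;
};

/* Old style (FOLL_GET): each page gains one ordinary reference. */
static long model_get_user_pages(struct model_page **pages, long nr)
{
	for (long i = 0; i < nr; i++)
		pages[i]->refcount += 1;
	return nr;
}

/* Release for the old style; pairs only with model_get_user_pages(). */
static void model_put_page(struct model_page *page)
{
	assert(page->refcount >= 1);
	page->refcount -= 1;
}

/* New style (FOLL_PIN): each page gains a whole bias worth of references. */
static long model_pin_user_pages(struct model_page **pages, long nr)
{
	for (long i = 0; i < nr; i++)
		pages[i]->refcount += MODEL_PIN_BIAS;
	return nr;
}

/* Release for the new style; pairs only with model_pin_user_pages(). */
static void model_put_user_page(struct model_page *page)
{
	assert(page->refcount >= MODEL_PIN_BIAS);
	page->refcount -= MODEL_PIN_BIAS;
}

/* In this model, a page "looks dma-pinned" once its count reaches the bias. */
static bool model_page_dma_pinned(const struct model_page *page)
{
	return page->refcount >= MODEL_PIN_BIAS;
}
```

The point of the sketch is that the acquire and release calls have to be converted together at each call site: mixing a FOLL_PIN style acquire with a plain put_page() style release would leave the count off by nearly a full bias. Note that in this model a page holding 1024 ordinary references would also look pinned.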
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (23):
  mm/gup: pass flags arg to __gup_device_* functions
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  IB/umem: use get_user_pages_fast() to pin DMA pages
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  mm/gup: track FOLL_PIN pages
  media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
  mm, tree-wide: rename put_user_page*() to unpin_user_page*()
 Documentation/core-api/index.rst            |   1 +
 Documentation/core-api/pin_user_pages.rst   | 233 +++++++++
 arch/powerpc/mm/book3s64/iommu_api.c        |  12 +-
 drivers/gpu/drm/via/via_dmablit.c           |   6 +-
 drivers/infiniband/core/umem.c              |  19 +-
 drivers/infiniband/core/umem_odp.c          |  13 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   4 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   8 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   4 +-
 drivers/infiniband/sw/siw/siw_mem.c         |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   8 +-
 drivers/nvdimm/pmem.c                       |   6 -
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c             |  35 +-
 fs/io_uring.c                               |   6 +-
 include/linux/mm.h                          | 168 +++++-
 include/linux/mmzone.h                      |   2 +
 include/linux/page_ref.h                    |  10 +
 mm/gup.c                                    | 548 +++++++++++++++-----
 mm/gup_benchmark.c                          |  74 ++-
 mm/huge_memory.c                            |  54 +-
 mm/hugetlb.c                                |  39 +-
 mm/memremap.c                               |  76 ++-
 mm/process_vm_access.c                      |  28 +-
 mm/vmstat.c                                 |   2 +
 net/xdp/xdp_umem.c                          |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  21 +-
 tools/testing/selftests/vm/run_vmtests      |  22 +
 30 files changed, 1104 insertions(+), 350 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst