Changes since v1 [1]: * Cleanup local 'vmas' argument (Christoph) * Replace inline IS_ENABLED(CONFIG_FS_DAX) in C code with ifdef versions of get_user_pages_longterm() for the FS_DAX on/off cases (Christoph) * Add a new patch for the get_vaddr_frames() case, this impacts users like V4L2, and the Exynos driver. * Collect Christoph's reviewed-by for the rdma change
[1]: https://lwn.net/Articles/738323/
---
Andrew,
Here is a new get_user_pages api for cases where a driver intends to keep an elevated page count indefinitely. This is distinct from usages like iov_iter_get_pages where the elevated page counts are transient. The iov_iter_get_pages cases immediately turn around and submit the pages to a device driver which will put_page when the i/o operation completes (under kernel control).
In the longterm case userspace is responsible for dropping the page reference at some undefined point in the future. This is untenable for filesystem-dax case where the filesystem is in control of the lifetime of the block / page and needs reasonable limits on how long it can wait for pages in a mapping to become idle.
Fixing filesystems to actually wait for dax pages to be idle before blocks from a truncate/hole-punch operation are repurposed is saved for a later patch series.
Also, allowing longterm registration of dax mappings is a future patch series that introduces a "map with lease" semantic where the kernel can revoke a lease and force userspace to drop its page references.
I have also tagged these for -stable to purposely break cases that might assume that longterm memory registrations for filesystem-dax mappings were supported by the kernel. The behavior regression this policy change implies is one of the reasons we maintain the "dax enabled. Warning: EXPERIMENTAL, use at your own risk" notification when mounting a filesystem in dax mode.
It is worth noting the device-dax interface does not suffer the same constraints since it does not support file space management operations like hole-punch.
---
Dan Williams (4): mm: introduce get_user_pages_longterm mm: fail get_vaddr_frames() for filesystem-dax mappings [media] v4l2: disable filesystem-dax mapping support IB/core: disable memory registration of fileystem-dax vmas
drivers/infiniband/core/umem.c | 2 - drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +- include/linux/fs.h | 14 ++++++ include/linux/mm.h | 13 ++++++ mm/frame_vector.c | 4 ++ mm/gup.c | 64 +++++++++++++++++++++++++++++ 6 files changed, 99 insertions(+), 3 deletions(-)
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Suggested-by: Christoph Hellwig hch@lst.de Signed-off-by: Dan Williams dan.j.williams@intel.com --- include/linux/fs.h | 14 +++++++++++ include/linux/mm.h | 13 +++++++++++ mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 91 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h index 57added3201d..0bacccc9461e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3175,6 +3175,20 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); }
+static inline bool vma_is_fsdax(struct vm_area_struct *vma) +{ + struct inode *inode; + + if (!vma->vm_file) + return false; + if (!vma_is_dax(vma)) + return false; + inode = file_inode(vma->vm_file); + if (inode->i_mode == S_IFCHR) + return false; /* device-dax */ + return true; +} + static inline int iocb_flags(struct file *file) { int res = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 8d9f52a84f77..34ff8830f717 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1369,6 +1369,19 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +#ifdef CONFIG_FS_DAX +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); +#else +static inline long get_user_pages_longterm(unsigned long start, + unsigned long nr_pages, unsigned int gup_flags, + struct page **pages, struct vm_area_struct **vmas) +{ + return get_user_pages(start, nr_pages, gup_flags, pages, vmas); +} +#endif /* CONFIG_FS_DAX */ + int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c index b2b4d4263768..ad9d13987f14 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, } EXPORT_SYMBOL(get_user_pages);
+#ifdef CONFIG_FS_DAX +/* + * This is the same as get_user_pages() in that it assumes we are + * operating on the current task's mm, but it goes further to validate + * that the vmas associated with the address range are suitable for + * longterm elevated page reference counts. For example, filesystem-dax + * mappings are subject to the lifetime enforced by the filesystem and + * we need guarantees that longterm users like RDMA and V4L2 only + * establish mappings that have a kernel enforced revocation mechanism. + * + * "longterm" == userspace controlled elevated page count lifetime. + * Contrast this to iov_iter_get_pages() usages which are transient. + */ +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas_arg) +{ + struct vm_area_struct **vmas = vmas_arg; + struct vm_area_struct *vma_prev = NULL; + long rc, i; + + if (!pages) + return -EINVAL; + + if (!vmas) { + vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages, + GFP_KERNEL); + if (!vmas) + return -ENOMEM; + } + + rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas); + + for (i = 0; i < rc; i++) { + struct vm_area_struct *vma = vmas[i]; + + if (vma == vma_prev) + continue; + + vma_prev = vma; + + if (vma_is_fsdax(vma)) + break; + } + + /* + * Either get_user_pages() failed, or the vma validation + * succeeded, in either case we don't need to put_page() before + * returning. + */ + if (i >= rc) + goto out; + + for (i = 0; i < rc; i++) + put_page(pages[i]); + rc = -EOPNOTSUPP; +out: + if (vmas != vmas_arg) + kfree(vmas); + return rc; +} +EXPORT_SYMBOL(get_user_pages_longterm); +#endif /* CONFIG_FS_DAX */ + /** * populate_vma_page_range() - populate a range of pages in the vma. * @vma: target vma
On Tue, 14 Nov 2017 11:56:34 -0800 Dan Williams dan.j.williams@intel.com wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
Sounds like that will be unpleasant. Do we really need it to be that complex? Can we get away with simply failing the get_user_pages() request? Or are there significant usecases for RDMA and V4L to play with DAX memory?
On Tue, Nov 21, 2017 at 3:05 PM, Andrew Morton akpm@linux-foundation.org wrote:
On Tue, 14 Nov 2017 11:56:34 -0800 Dan Williams dan.j.williams@intel.com wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
Sounds like that will be unpleasant. Do we really need it to be that complex? Can we get away with simply failing the get_user_pages() request? Or are there significant usecases for RDMA and V4L to play with DAX memory?
V4L plus DAX is indeed dubious, but RDMA to persistent memory is something the RDMA community is interested in supporting [1].
[1]: http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/C...
On Tue, 14 Nov 2017 11:56:34 -0800 Dan Williams dan.j.williams@intel.com wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
--- a/mm/gup.c +++ b/mm/gup.c @@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_FS_DAX +/*
- This is the same as get_user_pages() in that it assumes we are
- operating on the current task's mm, but it goes further to validate
- that the vmas associated with the address range are suitable for
- longterm elevated page reference counts. For example, filesystem-dax
- mappings are subject to the lifetime enforced by the filesystem and
- we need guarantees that longterm users like RDMA and V4L2 only
- establish mappings that have a kernel enforced revocation mechanism.
- "longterm" == userspace controlled elevated page count lifetime.
- Contrast this to iov_iter_get_pages() usages which are transient.
- */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas_arg)
+{
- struct vm_area_struct **vmas = vmas_arg;
- struct vm_area_struct *vma_prev = NULL;
- long rc, i;
- if (!pages)
return -EINVAL;
- if (!vmas) {
vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages,
GFP_KERNEL);
if (!vmas)
return -ENOMEM;
- }
...
I'll do this:
--- a/mm/gup.c~mm-introduce-get_user_pages_longterm-fix +++ a/mm/gup.c @@ -1120,8 +1120,8 @@ long get_user_pages_longterm(unsigned lo return -EINVAL;
if (!vmas) { - vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages, - GFP_KERNEL); + vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *), + GFP_KERNEL); if (!vmas) return -ENOMEM; } _
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow V4L2, Exynos, and other frame vector users to create long standing / irrevocable memory registrations against filesytem-dax vmas.
Cc: Inki Dae inki.dae@samsung.com Cc: Seung-Woo Kim sw0312.kim@samsung.com Cc: Joonyoung Shim jy0922.shim@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-media@vger.kernel.org Cc: Jan Kara jack@suse.cz Cc: Mel Gorman mgorman@suse.de Cc: Vlastimil Babka vbabka@suse.cz Cc: Andrew Morton akpm@linux-foundation.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams dan.j.williams@intel.com --- mm/frame_vector.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c index 72ebec18629c..d2fdbeaadc8b 100644 --- a/mm/frame_vector.c +++ b/mm/frame_vector.c @@ -52,6 +52,10 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, ret = -EFAULT; goto out; } + + if (vma_is_fsdax(vma)) + return -EOPNOTSUPP; + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) { vec->got_ref = true; vec->is_pfns = false;
On Tue 14-11-17 11:56:39, Dan Williams wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow V4L2, Exynos, and other frame vector users to create long standing / irrevocable memory registrations against filesytem-dax vmas.
Cc: Inki Dae inki.dae@samsung.com Cc: Seung-Woo Kim sw0312.kim@samsung.com Cc: Joonyoung Shim jy0922.shim@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-media@vger.kernel.org Cc: Jan Kara jack@suse.cz Cc: Mel Gorman mgorman@suse.de Cc: Vlastimil Babka vbabka@suse.cz Cc: Andrew Morton akpm@linux-foundation.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams dan.j.williams@intel.com
Makes sense. I'd just note that in principle get_vaddr_frames() is no more long-term than get_user_pages(). It is just so that all the users of get_vaddr_frames() currently want a long-term reference. Maybe could you add here also a comment that the vma_is_fsdax() check is there because all users of this function want a long term page reference? With that you can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
mm/frame_vector.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c index 72ebec18629c..d2fdbeaadc8b 100644 --- a/mm/frame_vector.c +++ b/mm/frame_vector.c @@ -52,6 +52,10 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, ret = -EFAULT; goto out; }
- if (vma_is_fsdax(vma))
return -EOPNOTSUPP;
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) { vec->got_ref = true; vec->is_pfns = false;
On Mon, Nov 27, 2017 at 8:15 AM, Jan Kara jack@suse.cz wrote:
On Tue 14-11-17 11:56:39, Dan Williams wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow V4L2, Exynos, and other frame vector users to create long standing / irrevocable memory registrations against filesytem-dax vmas.
Cc: Inki Dae inki.dae@samsung.com Cc: Seung-Woo Kim sw0312.kim@samsung.com Cc: Joonyoung Shim jy0922.shim@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-media@vger.kernel.org Cc: Jan Kara jack@suse.cz Cc: Mel Gorman mgorman@suse.de Cc: Vlastimil Babka vbabka@suse.cz Cc: Andrew Morton akpm@linux-foundation.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams dan.j.williams@intel.com
Makes sense. I'd just note that in principle get_vaddr_frames() is no more long-term than get_user_pages(). It is just so that all the users of get_vaddr_frames() currently want a long-term reference. Maybe could you add here also a comment that the vma_is_fsdax() check is there because all users of this function want a long term page reference? With that you can add:
Ok, will do.
Reviewed-by: Jan Kara jack@suse.cz
Thanks.
V4L2 memory registrations are incompatible with filesystem-dax that needs the ability to revoke dma access to a mapping at will, or otherwise allow the kernel to wait for completion of DMA. The filesystem-dax implementation breaks the traditional solution of truncate of active file backed mappings since there is no page-cache page we can orphan to sustain ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to hold a file lease or use some other mechanism so that the kernel can coordinate revoking DMA access when the filesystem needs to truncate mappings.
Reported-by: Jan Kara jack@suse.cz Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-media@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 0b5c43f7e020..f412429cf5ba 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages);
- err = get_user_pages(data & PAGE_MASK, dma->nr_pages, + err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages, flags, dma->pages, NULL);
if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; - dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages); + dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err, + dma->nr_pages); return err < 0 ? err : -EINVAL; } return 0;
On Tue 14-11-17 11:56:45, Dan Williams wrote:
V4L2 memory registrations are incompatible with filesystem-dax that needs the ability to revoke dma access to a mapping at will, or otherwise allow the kernel to wait for completion of DMA. The filesystem-dax implementation breaks the traditional solution of truncate of active file backed mappings since there is no page-cache page we can orphan to sustain ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to hold a file lease or use some other mechanism so that the kernel can coordinate revoking DMA access when the filesystem needs to truncate mappings.
Reported-by: Jan Kara jack@suse.cz Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: linux-media@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams dan.j.williams@intel.com
Looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 0b5c43f7e020..f412429cf5ba 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages);
- err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
- err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages, flags, dma->pages, NULL);
if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0;
dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
return err < 0 ? err : -EINVAL; } return 0;dma->nr_pages);
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow RDMA to create long standing memory registrations against filesytem-dax vmas.
Cc: Sean Hefty sean.hefty@intel.com Cc: Doug Ledford dledford@redhat.com Cc: Hal Rosenstock hal.rosenstock@gmail.com Cc: Jeff Moyer jmoyer@redhat.com Cc: Ross Zwisler ross.zwisler@linux.intel.com Cc: Jason Gunthorpe jgunthorpe@obsidianresearch.com Cc: linux-rdma@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Reported-by: Christoph Hellwig hch@lst.de Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Dan Williams dan.j.williams@intel.com --- drivers/infiniband/core/umem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 21e60b1e2ff4..130606c3b07c 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -191,7 +191,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, sg_list_start = umem->sg_head.sgl;
while (npages) { - ret = get_user_pages(cur_base, + ret = get_user_pages_longterm(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), gup_flags, page_list, vma_list);
linux-stable-mirror@lists.linaro.org