Changes since v2 [1]:

* Add a comment for the vma_is_fsdax() check in get_vaddr_frames() (Jan)
* Collect Jan's Reviewed-by.
* Rebased on v4.15-rc1
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013295.html
The summary text below is unchanged from v2.
---
Andrew,
Here is a new get_user_pages API for cases where a driver intends to keep an elevated page count indefinitely. This is distinct from usages like iov_iter_get_pages where the elevated page counts are transient. The iov_iter_get_pages cases immediately turn around and submit the pages to a device driver which will put_page() when the I/O operation completes (under kernel control).
In the longterm case userspace is responsible for dropping the page reference at some undefined point in the future. This is untenable for the filesystem-dax case where the filesystem is in control of the lifetime of the block / page and needs reasonable limits on how long it can wait for pages in a mapping to become idle.
Fixing filesystems to actually wait for dax pages to be idle before blocks from a truncate/hole-punch operation are repurposed is saved for a later patch series.
Also, allowing longterm registration of dax mappings is a future patch series that introduces a "map with lease" semantic where the kernel can revoke a lease and force userspace to drop its page references.
I have also tagged these for -stable to purposely break cases that might assume that longterm memory registrations for filesystem-dax mappings were supported by the kernel. The behavior regression this policy change implies is one of the reasons we maintain the "dax enabled. Warning: EXPERIMENTAL, use at your own risk" notification when mounting a filesystem in dax mode.
It is worth noting the device-dax interface does not suffer the same constraints since it does not support file space management operations like hole-punch.
---
Dan Williams (4):
      mm: introduce get_user_pages_longterm
      mm: fail get_vaddr_frames() for filesystem-dax mappings
      [media] v4l2: disable filesystem-dax mapping support
      IB/core: disable memory registration of filesystem-dax vmas
 drivers/infiniband/core/umem.c            |    2 -
 drivers/media/v4l2-core/videobuf-dma-sg.c |    5 +-
 include/linux/fs.h                        |   14 ++++++
 include/linux/mm.h                        |   13 ++++++
 mm/frame_vector.c                         |   12 +++++
 mm/gup.c                                  |   64 +++++++++++++++++++++++++++++
 6 files changed, 107 insertions(+), 3 deletions(-)
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesystem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h |   14 ++++++++++++
 include/linux/mm.h |   13 +++++++++++++
 mm/gup.c           |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2995a271ec46..8a9f6d048487 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3194,6 +3194,20 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
 
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+	if (!vma_is_dax(vma))
+		return false;
+	inode = file_inode(vma->vm_file);
+	if (inode->i_mode == S_IFCHR)
+		return false; /* device-dax */
+	return true;
+}
+
 static inline int iocb_flags(struct file *file)
 {
 	int res = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3b6a7e313e9..ea818ff739cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1380,6 +1380,19 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+		unsigned long nr_pages, unsigned int gup_flags,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index 85cc822fd403..3dc8a7807ea0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas_arg)
+{
+	struct vm_area_struct **vmas = vmas_arg;
+	struct vm_area_struct *vma_prev = NULL;
+	long rc, i;
+
+	if (!pages)
+		return -EINVAL;
+
+	if (!vmas) {
+		vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages,
+			       GFP_KERNEL);
+		if (!vmas)
+			return -ENOMEM;
+	}
+
+	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+	for (i = 0; i < rc; i++) {
+		struct vm_area_struct *vma = vmas[i];
+
+		if (vma == vma_prev)
+			continue;
+
+		vma_prev = vma;
+
+		if (vma_is_fsdax(vma))
+			break;
+	}
+
+	/*
+	 * Either get_user_pages() failed, or the vma validation
+	 * succeeded, in either case we don't need to put_page() before
+	 * returning.
+	 */
+	if (i >= rc)
+		goto out;
+
+	for (i = 0; i < rc; i++)
+		put_page(pages[i]);
+	rc = -EOPNOTSUPP;
+out:
+	if (vmas != vmas_arg)
+		kfree(vmas);
+	return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
  * @vma:   target vma
On Wed 29-11-17 10:05:35, Dan Williams wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesystem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed.
This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
One thing is not clear to me. Who is allowed to pin pages for ever? Is it possible to pin LRU pages that way as well? If yes then there absolutely has to be a limit for that. Sorry I could have studied the code much more but from a quick glance it seems to me that this is not limited to dax (or non-LRU in general) pages.
On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko mhocko@kernel.org wrote:
One thing is not clear to me. Who is allowed to pin pages for ever? Is it possible to pin LRU pages that way as well? If yes then there absolutely has to be a limit for that. Sorry I could have studied the code much more but from a quick glance it seems to me that this is not limited to dax (or non-LRU in general) pages.
I would turn this question around. "who can not tolerate a page being pinned forever?". In the case of filesystem-dax a page is one and the same object as a filesystem-block, and a filesystem expects that its operations will not be blocked indefinitely. LRU pages can continue to be pinned indefinitely because operations can continue around the pinned page, i.e. every agent, save for the dma agent, drops their reference to the page and it's tolerable that the final put_page() never arrives. As far as I can tell it's only filesystems and dax that have this collision of wanting to revoke dma access to a page combined with not being able to wait indefinitely for dma to quiesce.
On Thu 30-11-17 08:39:51, Dan Williams wrote:
I would turn this question around. "who can not tolerate a page being pinned forever?".
Any struct page on the movable zone or anything that is living on the LRU list because such a memory is unreclaimable.
In the case of filesystem-dax a page is one and the same object as a filesystem-block, and a filesystem expects that its operations will not be blocked indefinitely. LRU pages can continue to be pinned indefinitely because operations can continue around the pinned page, i.e. every agent, save for the dma agent, drops their reference to the page and it's tolerable that the final put_page() never arrives.
I do not understand. Are you saying that a user triggered IO can pin LRU pages indefinitely? This would be _really_ wrong. It would be basically an mlock without any limit. So I must be misreading you here.
On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko mhocko@kernel.org wrote:
I do not understand. Are you saying that a user triggered IO can pin LRU pages indefinitely? This would be _really_ wrong. It would be basically an mlock without any limit. So I must be misreading you here.
You're not misreading. See ib_umem_get() for example: it pins pages in response to the userspace library call ibv_reg_mr() (memory registration), and will not release those pages unless/until a call to ibv_dereg_mr() is made. The current plan to fix this is to create something like an ibv_reg_mr_lease() call that registers the memory with an F_SETLEASE semantic so that the kernel can notify userspace that a memory registration is being forcibly revoked by the kernel. A previous attempt at something like this was the proposed MAP_DIRECT mmap flag [1].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012815.html
On Thu 30-11-17 10:03:26, Dan Williams wrote:
You're not misreading. See ib_umem_get() for example, it pins pages in response to the userspace library call ibv_reg_mr() (memory registration), and will not release those pages unless/until a call to ibv_dereg_mr() is made.
Who is allowed to pin LRU pages that way, how many can be pinned, and how do you prevent nasty users from DoSing systems this way?
I remember PeterZ wanted to address a similar issue with a vmpin syscall that would be subject to rlimit control. Sorry, I cannot find a reference here, but if this is at the g-u-p level without any accounting then it smells quite broken to me.
[ adding linux-rdma ]
On Thu, Nov 30, 2017 at 10:17 AM, Michal Hocko mhocko@kernel.org wrote:
Who is allowed to pin LRU pages that way, how many can be pinned, and how do you prevent nasty users from DoSing systems this way?
I assume this is something the RDMA community has had to contend with? I'm not an RDMA person, I'm just here to fix dax.
I remember PeterZ wanted to address a similar issue with a vmpin syscall that would be subject to rlimit control. Sorry, I cannot find a reference here
https://lwn.net/Articles/600502/
but if this is at g-u-p level without any accounting then it smells quite broken to me.
It's certainly broken with respect to filesystem-dax and if there is other breakage we should get it all on the table.
On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote:
Who is allowed to pin LRU pages that way, how many can be pinned, and how do you prevent nasty users from DoSing systems this way?
I assume this is something the RDMA community has had to contend with? I'm not an RDMA person, I'm just here to fix dax.
The RDMA implementation respects the mlock rlimit
Jason
On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote:
The RDMA implementation respects the mlock rlimit
OK, so then I am kind of lost as to why we need a special g-u-p variant. The documentation doesn't say, and quite the contrary, it assumes that the caller knows what he is doing. This cannot be the right approach.
In other words, what does V4L2 does in the same context? Does it account the pinned memory or it allows user to pin arbitrary amount of memory.
On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote:
OK, so then I am kind of lost as to why we need a special g-u-p variant. The documentation doesn't say, and quite the contrary, it assumes that the caller knows what he is doing. This cannot be the right approach.
I thought it was because get_user_pages_longterm is supposed to fail on DAX mappings?
And maybe we should think about moving the rlimit accounting into this new function too someday?
Jason
On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe jgg@ziepe.ca wrote:
I thought it was because get_user_pages_longterm is supposed to fail on DAX mappings?
Correct, the rlimit checks are a separate issue, get_user_pages_longterm is only there to avoid open coding vma lookup and vma_is_fsdax() checks in multiple code paths.
And maybe we should think about moving the rlimit accounting into this new function too someday?
DAX pages are not accounted in any rlimit because they are statically allocated reserved memory regions.
On Fri, Dec 01, 2017 at 08:29:53AM -0800, Dan Williams wrote:
And maybe we should think about moving the rlimit accounting into this new function too someday?
DAX pages are not accounted in any rlimit because they are statically allocated reserved memory regions.
I mean, unrelated to DAX, any user of get_user_pages_longterm should respect the memlock rlimit and that check is shared code.
Jason
On Fri 01-12-17 08:29:53, Dan Williams wrote:
Correct, the rlimit checks are a separate issue, get_user_pages_longterm is only there to avoid open coding vma lookup and vma_is_fsdax() checks in multiple code paths.
Then it is a terrible misnomer. One would expect this to be the proper way to get a longterm pin on a page.
And maybe we should think about moving the rlimit accounting into this new function too someday?
DAX pages are not accounted in any rlimit because they are statically allocated reserved memory regions.
Which is OK, but how do you prevent anybody from calling this function on normal LRU pages?
On Mon, Dec 4, 2017 at 1:31 AM, Michal Hocko mhocko@kernel.org wrote:
Then it is a terrible misnomer. One would expect this to be the proper way to get a longterm pin on a page.
Yes, I can see that. The "get_user_pages_longterm" symbol name encodes the lifetime expectations of the caller rather than properly implementing 'longterm' pinning. However, the proper interface to establish a long term pin does not currently exist and ultimately needs more coordination with userspace: we need a way for the kernel to explicitly revoke the pin. So, this get_user_pages_longterm change is only a stop-gap to prevent data corruption and to keep userspace from growing further expectations that filesystem-dax supports long term pinning through the legacy interfaces.
And maybe we should think about moving the rlimit accounting into this new function too someday?
DAX pages are not accounted in any rlimit because they are statically allocated reserved memory regions.
Which is OK, but how do you prevent anybody from calling this function on normal LRU pages?
I don't, and didn't consider this angle as it's a consideration that is missing from the existing gup interfaces. It is an additional gap we need to fill.
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow V4L2, Exynos, and other frame vector users to create long standing / irrevocable memory registrations against filesystem-dax vmas.
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-media@vger.kernel.org
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/frame_vector.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 2f98df0d460e..297c7238f7d4 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -53,6 +53,18 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 		ret = -EFAULT;
 		goto out;
 	}
+
+	/*
+	 * While get_vaddr_frames() could be used for transient (kernel
+	 * controlled lifetime) pinning of memory pages all current
+	 * users establish long term (userspace controlled lifetime)
+	 * page pinning. Treat get_vaddr_frames() like
+	 * get_user_pages_longterm() and disallow it for filesystem-dax
+	 * mappings.
+	 */
+	if (vma_is_fsdax(vma))
+		return -EOPNOTSUPP;
+
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
V4L2 memory registrations are incompatible with filesystem-dax, which needs the ability to revoke dma access to a mapping at will, or otherwise allow the kernel to wait for completion of DMA. The filesystem-dax implementation breaks the traditional solution of truncating active file backed mappings, since there is no page-cache page we can orphan to sustain ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to hold a file lease or use some other mechanism so that the kernel can coordinate revoking DMA access when the filesystem needs to truncate mappings.
Reported-by: Jan Kara <jack@suse.cz>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-media@vger.kernel.org
Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..f412429cf5ba 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+	err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
 			     flags, dma->pages, NULL);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+		dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}
 	return 0;
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow RDMA to create long standing memory registrations against filesystem-dax vmas.
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: linux-rdma@vger.kernel.org
Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..130606c3b07c 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -191,7 +191,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages(cur_base,
+		ret = get_user_pages_longterm(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     gup_flags, page_list, vma_list);
On Wed, Nov 29, 2017 at 10:05:51AM -0800, Dan Williams wrote:
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow RDMA to create long standing memory registrations against filesytem-dax vmas.
No problem here with drivers/rdma. This will go through another tree with the rest of the series? In which case here is a co-maintainer ack for this patch:
Acked-by: Jason Gunthorpe jgg@mellanox.com
Dan, can you please update my address to jgg@ziepe.ca, thanks :)
Jason