From: Zi Yan <zi.yan(a)cs.rutgers.edu>
Subject: mm: migrate: fix an incorrect call of prep_transhuge_page()
In https://lkml.org/lkml/2017/11/20/411, Andrea reported that during
memory hotplug/hot remove prep_transhuge_page() is called incorrectly on
non-THP pages for migration, when THP is on but THP migration is not
enabled. This leads to a bad state of target pages for migration.
By inspecting the code, if called on a non-THP, prep_transhuge_page() will
1) change the value of the mapping of (page + 2), since it is used
for THP deferred list;
2) change the lru value of (page + 1), since it is used for THP's
dtor.
Both can lead to data corruption of these two pages.
Andrea said:
: Pragmatically and from the point of view of the memory_hotplug subsys, the
: effect is a kernel crash when pages are being migrated during a memory hot
: remove offline and migration target pages are found in a bad state.
This patch fixes it by only calling prep_transhuge_page() when we are
certain that the target page is THP.
Link: http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Signed-off-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Reported-by: Andrea Reale <ar(a)linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/migrate.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN include/linux/migrate.h~mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page include/linux/migrate.h
--- a/include/linux/migrate.h~mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page
+++ a/include/linux/migrate.h
@@ -54,7 +54,7 @@ static inline struct page *new_page_node
new_page = __alloc_pages_nodemask(gfp_mask, order,
preferred_nid, nodemask);
- if (new_page && PageTransHuge(page))
+ if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
return new_page;
_
From: chenjie <chenjie6(a)huawei.com>
Subject: mm/madvise.c: fix madvise() infinite loop under special circumstances
MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation. The calling
convention is quite subtle there. madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never advance
to the next vma and it will keep looping for ever without a way to get out
of the kernel.
It seems this has been broken since introduction. Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.
[mhocko(a)suse.com: rewrite changelog]
Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
Signed-off-by: chenjie <chenjie6(a)huawei.com>
Signed-off-by: guoxuenan <guoxuenan(a)huawei.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: zhangyi (F) <yi.zhang(a)huawei.com>
Cc: Miao Xie <miaoxie(a)huawei.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Shaohua Li <shli(a)fb.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Anshuman Khandual <khandual(a)linux.vnet.ibm.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Carsten Otte <cotte(a)de.ibm.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff -puN mm/madvise.c~mmmadvise-bugfix-of-madvise-systemcall-infinite-loop-under-special-circumstances mm/madvise.c
--- a/mm/madvise.c~mmmadvise-bugfix-of-madvise-systemcall-infinite-loop-under-special-circumstances
+++ a/mm/madvise.c
@@ -276,15 +276,14 @@ static long madvise_willneed(struct vm_a
{
struct file *file = vma->vm_file;
+ *prev = vma;
#ifdef CONFIG_SWAP
if (!file) {
- *prev = vma;
force_swapin_readahead(vma, start, end);
return 0;
}
if (shmem_mapping(file->f_mapping)) {
- *prev = vma;
force_shm_swapin_readahead(vma, start, end,
file->f_mapping);
return 0;
@@ -299,7 +298,6 @@ static long madvise_willneed(struct vm_a
return 0;
}
- *prev = vma;
start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
if (end > vma->vm_end)
end = vma->vm_end;
_
From: Kees Cook <keescook(a)chromium.org>
Subject: exec: avoid RLIMIT_STACK races with prlimit()
While the defense-in-depth RLIMIT_STACK limit on setuid processes was
protected against races from other threads calling setrlimit(), I missed
protecting it against races from external processes calling prlimit().
This adds locking around the change and makes sure that rlim_max is set
too.
Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reported-by: Ben Hutchings <ben.hutchings(a)codethink.co.uk>
Reported-by: Brad Spengler <spender(a)grsecurity.net>
Acked-by: Serge Hallyn <serge(a)hallyn.com>
Cc: James Morris <james.l.morris(a)oracle.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Jiri Slaby <jslaby(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/exec.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff -puN fs/exec.c~exec-avoid-rlimit_stack-races-with-prlimit fs/exec.c
--- a/fs/exec.c~exec-avoid-rlimit_stack-races-with-prlimit
+++ a/fs/exec.c
@@ -1340,10 +1340,15 @@ void setup_new_exec(struct linux_binprm
* avoid bad behavior from the prior rlimits. This has to
* happen before arch_pick_mmap_layout(), which examines
* RLIMIT_STACK, but after the point of no return to avoid
- * needing to clean up the change on failure.
+ * races from other threads changing the limits. This also
+ * must be protected from races with prlimit() calls.
*/
+ task_lock(current->group_leader);
if (current->signal->rlim[RLIMIT_STACK].rlim_cur > _STK_LIM)
current->signal->rlim[RLIMIT_STACK].rlim_cur = _STK_LIM;
+ if (current->signal->rlim[RLIMIT_STACK].rlim_max > _STK_LIM)
+ current->signal->rlim[RLIMIT_STACK].rlim_max = _STK_LIM;
+ task_unlock(current->group_leader);
}
arch_pick_mmap_layout(current->mm);
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: IB/core: disable memory registration of filesystem-dax vmas
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow RDMA to create long standing memory registrations against
filesytem-dax vmas.
Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwilli…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/infiniband/core/umem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN drivers/infiniband/core/umem.c~ib-core-disable-memory-registration-of-fileystem-dax-vmas drivers/infiniband/core/umem.c
--- a/drivers/infiniband/core/umem.c~ib-core-disable-memory-registration-of-fileystem-dax-vmas
+++ a/drivers/infiniband/core/umem.c
@@ -191,7 +191,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
sg_list_start = umem->sg_head.sgl;
while (npages) {
- ret = get_user_pages(cur_base,
+ ret = get_user_pages_longterm(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
gup_flags, page_list, vma_list);
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: v4l2: disable filesystem-dax mapping support
V4L2 memory registrations are incompatible with filesystem-dax that needs
the ability to revoke dma access to a mapping at will, or otherwise allow
the kernel to wait for completion of DMA. The filesystem-dax
implementation breaks the traditional solution of truncate of active file
backed mappings since there is no page-cache page we can orphan to sustain
ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.
Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff -puN drivers/media/v4l2-core/videobuf-dma-sg.c~v4l2-disable-filesystem-dax-mapping-support drivers/media/v4l2-core/videobuf-dma-sg.c
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~v4l2-disable-filesystem-dax-mapping-support
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
- err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+ err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
flags, dma->pages, NULL);
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
- dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+ dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+ dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
return 0;
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: fail get_vaddr_frames() for filesystem-dax mappings
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow V4L2, Exynos, and other frame vector users to create long
standing / irrevocable memory registrations against filesytem-dax vmas.
[dan.j.williams(a)intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwill…
Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/frame_vector.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff -puN mm/frame_vector.c~mm-fail-get_vaddr_frames-for-filesystem-dax-mappings mm/frame_vector.c
--- a/mm/frame_vector.c~mm-fail-get_vaddr_frames-for-filesystem-dax-mappings
+++ a/mm/frame_vector.c
@@ -53,6 +53,18 @@ int get_vaddr_frames(unsigned long start
ret = -EFAULT;
goto out;
}
+
+ /*
+ * While get_vaddr_frames() could be used for transient (kernel
+ * controlled lifetime) pinning of memory pages all current
+ * users establish long term (userspace controlled lifetime)
+ * page pinning. Treat get_vaddr_frames() like
+ * get_user_pages_longterm() and disallow it for filesystem-dax
+ * mappings.
+ */
+ if (vma_is_fsdax(vma))
+ return -EOPNOTSUPP;
+
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
vec->got_ref = true;
vec->is_pfns = false;
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: introduce get_user_pages_longterm
Patch series "introduce get_user_pages_longterm()", v2.
Here is a new get_user_pages api for cases where a driver intends to keep
an elevated page count indefinitely. This is distinct from usages like
iov_iter_get_pages where the elevated page counts are transient. The
iov_iter_get_pages cases immediately turn around and submit the pages to a
device driver which will put_page when the i/o operation completes (under
kernel control).
In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime of
the block / page and needs reasonable limits on how long it can wait for
pages in a mapping to become idle.
Fixing filesystems to actually wait for dax pages to be idle before blocks
from a truncate/hole-punch operation are repurposed is saved for a later
patch series.
Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.
I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings were
supported by the kernel. The behavior regression this policy change
implies is one of the reasons we maintain the "dax enabled. Warning:
EXPERIMENTAL, use at your own risk" notification when mounting a
filesystem in dax mode.
It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.
This patch (of 4):
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow long standing memory registrations against filesytem-dax
vmas. Device-dax vmas do not have this problem and are explicitly
allowed.
This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
[akpm(a)linux-foundation.org: use kcalloc()]
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jason Gunthorpe <jgg(a)mellanox.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/fs.h | 14 +++++++++
include/linux/mm.h | 13 ++++++++
mm/gup.c | 64 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
diff -puN include/linux/fs.h~mm-introduce-get_user_pages_longterm include/linux/fs.h
--- a/include/linux/fs.h~mm-introduce-get_user_pages_longterm
+++ a/include/linux/fs.h
@@ -3194,6 +3194,20 @@ static inline bool vma_is_dax(struct vm_
return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
}
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+ struct inode *inode;
+
+ if (!vma->vm_file)
+ return false;
+ if (!vma_is_dax(vma))
+ return false;
+ inode = file_inode(vma->vm_file);
+ if (inode->i_mode == S_IFCHR)
+ return false; /* device-dax */
+ return true;
+}
+
static inline int iocb_flags(struct file *file)
{
int res = 0;
diff -puN include/linux/mm.h~mm-introduce-get_user_pages_longterm include/linux/mm.h
--- a/include/linux/mm.h~mm-introduce-get_user_pages_longterm
+++ a/include/linux/mm.h
@@ -1380,6 +1380,19 @@ long get_user_pages_locked(unsigned long
unsigned int gup_flags, struct page **pages, int *locked);
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+ unsigned long nr_pages, unsigned int gup_flags,
+ struct page **pages, struct vm_area_struct **vmas)
+{
+ return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
diff -puN mm/gup.c~mm-introduce-get_user_pages_longterm mm/gup.c
--- a/mm/gup.c~mm-introduce-get_user_pages_longterm
+++ a/mm/gup.c
@@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start,
}
EXPORT_SYMBOL(get_user_pages);
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas_arg)
+{
+ struct vm_area_struct **vmas = vmas_arg;
+ struct vm_area_struct *vma_prev = NULL;
+ long rc, i;
+
+ if (!pages)
+ return -EINVAL;
+
+ if (!vmas) {
+ vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
+ GFP_KERNEL);
+ if (!vmas)
+ return -ENOMEM;
+ }
+
+ rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+ for (i = 0; i < rc; i++) {
+ struct vm_area_struct *vma = vmas[i];
+
+ if (vma == vma_prev)
+ continue;
+
+ vma_prev = vma;
+
+ if (vma_is_fsdax(vma))
+ break;
+ }
+
+ /*
+ * Either get_user_pages() failed, or the vma validation
+ * succeeded, in either case we don't need to put_page() before
+ * returning.
+ */
+ if (i >= rc)
+ goto out;
+
+ for (i = 0; i < rc; i++)
+ put_page(pages[i]);
+ rc = -EOPNOTSUPP;
+out:
+ if (vmas != vmas_arg)
+ kfree(vmas);
+ return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
/**
* populate_vma_page_range() - populate a range of pages in the vma.
* @vma: target vma
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, hugetlbfs: introduce ->split() to vm_operations_struct
Patch series "device-dax: fix unaligned munmap handling"
When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and fail attempts to split vmas into unaligned ranges. It would
be messy to teach the munmap path about device-dax alignment constraints
in the same (hstate) way that hugetlbfs communicates this constraint.
Instead, these patches introduce a new ->split() vm operation.
This patch (of 2):
The device-dax interface has similar constraints as hugetlbfs in that it
requires the munmap path to unmap in huge page aligned units. Rather than
add more custom vma handling code in __split_vma() introduce a new vm
operation to perform this vma specific check.
Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwilli…
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1 +
mm/hugetlb.c | 8 ++++++++
mm/mmap.c | 8 +++++---
3 files changed, 14 insertions(+), 3 deletions(-)
diff -puN include/linux/mm.h~mm-hugetlbfs-introduce-split-to-vm_operations_struct include/linux/mm.h
--- a/include/linux/mm.h~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/include/linux/mm.h
@@ -377,6 +377,7 @@ enum page_entry_size {
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
+ int (*split)(struct vm_area_struct * area, unsigned long addr);
int (*mremap)(struct vm_area_struct * area);
int (*fault)(struct vm_fault *vmf);
int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
diff -puN mm/hugetlb.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct mm/hugetlb.c
--- a/mm/hugetlb.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/mm/hugetlb.c
@@ -3125,6 +3125,13 @@ static void hugetlb_vm_op_close(struct v
}
}
+static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ if (addr & ~(huge_page_mask(hstate_vma(vma))))
+ return -EINVAL;
+ return 0;
+}
+
/*
* We cannot handle pagefaults against hugetlb pages at all. They cause
* handle_mm_fault() to try to instantiate regular-sized pages in the
@@ -3141,6 +3148,7 @@ const struct vm_operations_struct hugetl
.fault = hugetlb_vm_op_fault,
.open = hugetlb_vm_op_open,
.close = hugetlb_vm_op_close,
+ .split = hugetlb_vm_op_split,
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff -puN mm/mmap.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct mm/mmap.c
--- a/mm/mmap.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/mm/mmap.c
@@ -2555,9 +2555,11 @@ int __split_vma(struct mm_struct *mm, st
struct vm_area_struct *new;
int err;
- if (is_vm_hugetlb_page(vma) && (addr &
- ~(huge_page_mask(hstate_vma(vma)))))
- return -EINVAL;
+ if (vma->vm_ops && vma->vm_ops->split) {
+ err = vma->vm_ops->split(vma, addr);
+ if (err)
+ return err;
+ }
new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!new)
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: fix device-dax pud write-faults triggered by get_user_pages()
Currently only get_user_pages_fast() can safely handle the writable gup
case due to its use of pud_access_permitted() to check whether the pud
entry is writable. In the gup slow path pud_write() is used instead of
pud_access_permitted() and to date it has been unimplemented, just calls
BUG_ON().
kernel BUG at ./include/linux/hugetlb.h:244!
[..]
RIP: 0010:follow_devmap_pud+0x482/0x490
[..]
Call Trace:
follow_page_mask+0x28c/0x6e0
__get_user_pages+0xe4/0x6c0
get_user_pages_unlocked+0x130/0x1b0
get_user_pages_fast+0x89/0xb0
iov_iter_get_pages_alloc+0x114/0x4a0
nfs_direct_read_schedule_iovec+0xd2/0x350
? nfs_start_io_direct+0x63/0x70
nfs_file_direct_read+0x1e0/0x250
nfs_file_read+0x90/0xc0
For now this just implements a simple check for the _PAGE_RW bit similar
to pmd_write. However, this implies that the gup-slow-path check is
missing the extra checks that the gup-fast-path performs with
pud_access_permitted. Later patches will align all checks to use the
'access_permitted' helper if the architecture provides it. Note that the
generic 'access_permitted' helper fallback is the simple _PAGE_RW check on
architectures that do not define the 'access_permitted' helper(s).
[dan.j.williams(a)intel.com: fix powerpc compile error]
Link: http://lkml.kernel.org/r/151129126165.37405.16031785266675461397.stgit@dwil…
Link: http://lkml.kernel.org/r/151043109938.2842.14834662818213616199.stgit@dwill…
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Stephen Rothwell <sfr(a)canb.auug.org.au>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de> [x86]
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/include/asm/pgtable.h | 6 ++++++
include/asm-generic/pgtable.h | 8 ++++++++
include/linux/hugetlb.h | 8 --------
3 files changed, 14 insertions(+), 8 deletions(-)
diff -puN arch/x86/include/asm/pgtable.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages
+++ a/arch/x86/include/asm/pgtable.h
@@ -1088,6 +1088,12 @@ static inline void pmdp_set_wrprotect(st
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+ return pud_flags(pud) & _PAGE_RW;
+}
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
diff -puN include/asm-generic/pgtable.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages
+++ a/include/asm-generic/pgtable.h
@@ -814,6 +814,14 @@ static inline int pmd_write(pmd_t pmd)
#endif /* __HAVE_ARCH_PMD_WRITE */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#ifndef pud_write
+static inline int pud_write(pud_t pud)
+{
+ BUG();
+ return 0;
+}
+#endif /* pud_write */
+
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
(defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
diff -puN include/linux/hugetlb.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages include/linux/hugetlb.h
--- a/include/linux/hugetlb.h~mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages
+++ a/include/linux/hugetlb.h
@@ -239,14 +239,6 @@ static inline int pgd_write(pgd_t pgd)
}
#endif
-#ifndef pud_write
-static inline int pud_write(pud_t pud)
-{
- BUG();
- return 0;
-}
-#endif
-
#define HUGETLB_ANON_FILE "anon_hugepage"
enum {
_