Matthew,
I've merged in your dump_page() ideas, and also factored things out into a new __dump_tail_page() routine, mainly in order to save a few indentation levels.
Kirill, thanks for your review comments. I've applied them, and I think splitting this up as you recommended really makes it a lot better, and easier to spot problems.
============================================================
Changes since v2:
* Rebased onto linux.git, because the akpm tree for 5.6 has been merged.
* Split the tracking patch into even more patches, as requested.
* Merged Matthew Wilcox's dump_page() changes into mine, as part of the first patch.
* Renamed: page_dma_pinned() --> page_maybe_dma_pinned(), in response to Kirill Shutemov's review.
* Moved a WARN to the top of a routine, and fixed a typo in the commit description of patch #7, also as suggested by Kirill.
============================================================
Changes since v1:
* Split the tracking patch into 6 smaller patches
* Rebased onto today's linux-next/akpm (there weren't any conflicts).
* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported by Nathan Chancellor. (I don't see it in my local builds, probably because they use gcc, but an LLVM test found the mismatch.)
* Fixed a huge page pincount problem (add/subtract vs. increment/decrement), spotted by Jan Kara.
============================================================
There is a reasonable case to be made for merging two of the patches (patches 4 and 5), given that patch 4 provides tracking that has upper limits on the number of pins that can be done with huge pages. Let me know if anyone wants those merged, but unless there is some weird chance of someone grabbing patch 4 and not patch 5, I don't really see the need. Meanwhile, it's easier to review in this form.
Also, patch 3 has been revived. Earlier reviewers asked for it to be merged into the tracking patch (one cannot please everyone, heh), but now it's back out on its own.
This activates tracking of FOLL_PIN pages. This is in support of fixing the get_user_pages()+DMA problem described in [1]-[4].
It is based on today's (Jan 28) linux-next (branch: akpm), commit 280e9cb00b41 ("drivers/media/platform/sti/delta/delta-ipc.c: fix read buffer overflow")
There is a git repo and branch, for convenience in reviewing:
git@github.com:johnhubbard/linux.git track_user_pages_v2_linux-next_akpm_28Jan2020
FOLL_PIN support is (so far) in mmotm and linux-next. However, the patch to use FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test suite failure that involved (I think) page refcount overflows when huge pages were used.
This patch definitively solves that kind of overflow problem, by adding an exact pincount, for compound pages (of order > 1), in the 3rd struct page of a compound page. If available, that form of pincounting is used, instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara for that idea.
Here's the last reviewed version of the tracking patch (v11):
https://lore.kernel.org/r/20191216222537.491123-1-jhubbard@nvidia.com
Jan Kara had provided a reviewed-by tag for that, but I've had to remove it (again) here, due to having changed the patch "a little bit", in order to add the feature described above.
Other interesting changes:
* dump_page(): added one or two new things to report for compound pages: head refcount (for all compound pages), and map_pincount (for compound pages of order > 1).
* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the huge page refcount upper limit problems, and added notes about how it works now. Also added a note about the dump_page() enhancements.
* Added some comments in gup.c and mm.h, to explain that there are two ways to count pinned pages: exact (for compound pages of order > 1) and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages). A rough sketch of that decision appears just below.
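To make the exact-vs-fuzzy decision concrete, here is a rough sketch of the idea. This is not the actual patch code: the helper name is made up, and the exact count is assumed to live in an atomic_t field named hpage_pinned_refcount in the 3rd struct page, per the patch title in the shortlog below.

    static void sketch_add_pins(struct page *head, int refs)
    {
            if (PageCompound(head) && compound_order(head) > 1)
                    /* Exact: a dedicated pin count in the 3rd struct page. */
                    atomic_add(refs, &head[2].hpage_pinned_refcount);
            else
                    /* Fuzzy: overload _refcount by a large bias per pin. */
                    page_ref_add(head, refs * GUP_PIN_COUNTING_BIAS);
    }

The exact form is what removes the huge page refcount overflow problem mentioned above, because it no longer multiplies each pin by a large bias in _refcount.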
============================================================
General notes about the tracking patch:
This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], [4] and in a remarkable number of email threads since about 2017. :)
In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this:
    get_user_pages()    (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages()    (sets FOLL_PIN)
    unpin_user_page()
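As an illustration of such a conversion (the variable names are made up and error handling is omitted; pin_user_pages() and unpin_user_page() are the new APIs):

    /* Before: plain page references (FOLL_GET is implied): */
    nr = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    /* ... DMA to/from the pages ... */
    for (i = 0; i < nr; i++)
            put_page(pages[i]);

    /* After: DMA pins (FOLL_PIN is implied), tracked separately: */
    nr = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    /* ... DMA to/from the pages ... */
    for (i = 0; i < nr; i++)
            unpin_user_page(pages[i]);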
============================================================
Next steps:
* Convert more subsystems from get_user_pages() to pin_user_pages().

* Work with Ira and others to connect this all up with file system leases.
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages() https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
John Hubbard (12):
  mm: dump_page(): better diagnostics for compound pages
  mm/gup: split get_user_pages_remote() into two routines
  mm/gup: pass a flags arg to __gup_device_* functions
  mm: introduce page_ref_sub_return()
  mm/gup: pass gup flags to two more routines
  mm/gup: require FOLL_GET for get_user_pages_fast()
  mm/gup: track FOLL_PIN pages
  mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
  mm: dump_page(): better diagnostics for huge pinned pages
  mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
 Documentation/core-api/pin_user_pages.rst  |  53 +--
 include/linux/mm.h                         | 108 ++++-
 include/linux/mm_types.h                   |   7 +-
 include/linux/mmzone.h                     |   2 +
 include/linux/page_ref.h                   |  10 +
 mm/debug.c                                 |  60 ++-
 mm/gup.c                                   | 459 ++++++++++++++++-----
 mm/gup_benchmark.c                         |  71 +++-
 mm/huge_memory.c                           |  29 +-
 mm/hugetlb.c                               |  44 +-
 mm/page_alloc.c                            |   2 +
 mm/rmap.c                                  |   6 +
 mm/vmstat.c                                |   2 +
 tools/testing/selftests/vm/gup_benchmark.c |  15 +-
 tools/testing/selftests/vm/run_vmtests     |  22 +
 15 files changed, 715 insertions(+), 175 deletions(-)
A compound page collects the refcount in the head page, while leaving the refcount of each tail page at zero. Therefore, when debugging a problem that involves compound pages, it's best to have diagnostics that reflect that situation. However, dump_page() is oblivious to these points.
Change dump_page() as follows:
1) For tail pages, print relevant head page information: refcount, in particular. But only do this if the page is not corrupted so badly that the pointer to the head page is all wrong.
2) Do a separate check to catch any (rare) cases of the tail page's refcount being non-zero, and issue a separate, clear pr_warn() if that ever happens.
Suggested-by: Matthew Wilcox willy@infradead.org
Suggested-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/debug.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/mm/debug.c b/mm/debug.c
index ecccd9f17801..beb1c59d784b 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -42,6 +42,32 @@ const struct trace_print_flags vmaflag_names[] = {
 	{0, NULL}
 };
 
+static void __dump_tail_page(struct page *page, int mapcount)
+{
+	struct page *head = compound_head(page);
+
+	if ((page < head) || (page >= head + MAX_ORDER_NR_PAGES)) {
+		/*
+		 * Page is hopelessly corrupted, so limit any reporting to
+		 * information about the page itself. Do not attempt to look at
+		 * the head page.
+		 */
+		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+			"index:%#lx (corrupted tail page case)\n",
+			page, page_ref_count(page), mapcount, page->mapping,
+			page_to_pgoff(page));
+	} else {
+		pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
+			"index:%#lx compound_mapcount:%d\n",
+			page, page_ref_count(head), mapcount, head->mapping,
+			page_to_pgoff(head), compound_mapcount(page));
+	}
+
+	if (page_ref_count(page) != 0)
+		pr_warn("page:%px PROBLEM: non-zero refcount (==%d) on this "
+			"tail page\n", page, page_ref_count(page));
+}
+
 void __dump_page(struct page *page, const char *reason)
 {
 	struct address_space *mapping;
@@ -75,12 +101,8 @@ void __dump_page(struct page *page, const char *reason)
 	 */
 	mapcount = PageSlab(page) ? 0 : page_mapcount(page);
 
-	if (PageCompound(page))
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
-			"index:%#lx compound_mapcount: %d\n",
-			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page),
-			compound_mapcount(page));
+	if (PageTail(page))
+		__dump_tail_page(page, mapcount);
 	else
 		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
 			page, page_ref_count(page), mapcount,
On Fri, Jan 31, 2020 at 07:40:18PM -0800, John Hubbard wrote:
A compound page collects the refcount in the head page, while leaving the refcount of each tail page at zero. Therefore, when debugging a problem that involves compound pages, it's best to have diagnostics that reflect that situation. However, dump_page() is oblivious to these points.
Change dump_page() as follows:
For tail pages, print relevant head page information: refcount, in particular. But only do this if the page is not corrupted so badly that the pointer to the head page is all wrong.
Do a separate check to catch any (rare) cases of the tail page's refcount being non-zero, and issue a separate, clear pr_warn() if that ever happens.
Suggested-by: Matthew Wilcox willy@infradead.org Suggested-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Few nit-picks below.
 mm/debug.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

+static void __dump_tail_page(struct page *page, int mapcount)
+{
+	struct page *head = compound_head(page);
+
+	if ((page < head) || (page >= head + MAX_ORDER_NR_PAGES)) {

I'm not sure if we want to use compound_nr() here instead of
MAX_ORDER_NR_PAGES. Do you have any reasoning about it?

+	if (page_ref_count(page) != 0)
+		pr_warn("page:%px PROBLEM: non-zero refcount (==%d) on this "
+			"tail page\n", page, page_ref_count(page));

Wrap into {}, please.
On 2/3/20 5:16 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:18PM -0800, John Hubbard wrote:
A compound page collects the refcount in the head page, while leaving the refcount of each tail page at zero. Therefore, when debugging a problem that involves compound pages, it's best to have diagnostics that reflect that situation. However, dump_page() is oblivious to these points.
Change dump_page() as follows:
For tail pages, print relevant head page information: refcount, in particular. But only do this if the page is not corrupted so badly that the pointer to the head page is all wrong.
Do a separate check to catch any (rare) cases of the tail page's refcount being non-zero, and issue a separate, clear pr_warn() if that ever happens.
Suggested-by: Matthew Wilcox willy@infradead.org Suggested-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Thanks for looking through all of these!
Few nit-picks below.
+static void __dump_tail_page(struct page *page, int mapcount)
+{
+	struct page *head = compound_head(page);
+
+	if ((page < head) || (page >= head + MAX_ORDER_NR_PAGES)) {

I'm not sure if we want to use compound_nr() here instead of
MAX_ORDER_NR_PAGES. Do you have any reasoning about it?
Yes: compound_nr(page) reads from the struct page, whereas MAX_ORDER_NR_PAGES is an independent, immutable limit. When checking a struct page for corruption, it's ideal to avoid relying on data within the struct page, as compound_nr() would have to do.
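Spelling that out (illustrative fragment only, not part of the patch; "bad" is just a placeholder variable):

    /* Bound derived from (possibly corrupted) metadata in the page itself: */
    bad = page < head || page >= head + compound_nr(head);

    /* Bound that is a compile-time constant, unaffected by the corruption: */
    bad = page < head || page >= head + MAX_ORDER_NR_PAGES;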
+	if (page_ref_count(page) != 0)
+		pr_warn("page:%px PROBLEM: non-zero refcount (==%d) on this "
+			"tail page\n", page, page_ref_count(page));

Wrap into {}, please.
Fixed, thanks.
thanks,
An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/gup.c | 56 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 23 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index e13f4d211475..228aa7059018 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1557,6 +1557,37 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 }
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
+#ifdef CONFIG_MMU
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
 /*
  * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -1619,7 +1650,6 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-#ifdef CONFIG_MMU
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
@@ -1632,28 +1662,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
-	/*
-	 * Parts of FOLL_LONGTERM behavior are incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. However, this only comes up if locked is set, and there are
-	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
-	 * allow what we can.
-	 */
-	if (gup_flags & FOLL_LONGTERM) {
-		if (WARN_ON_ONCE(locked))
-			return -EINVAL;
-		/*
-		 * This will check the vmas (even if our vmas arg is NULL)
-		 * and return -ENOTSUPP if DAX isn't allowed in this case:
-		 */
-		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
-					     vmas, gup_flags | FOLL_TOUCH |
-					     FOLL_REMOTE);
-	}
-
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
On Fri, Jan 31, 2020 at 07:40:19PM -0800, John Hubbard wrote:
An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
On Fri 31-01-20 19:40:19, John Hubbard wrote:
An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
On 2/3/20 6:20 AM, Jan Kara wrote:
On Fri 31-01-20 19:40:19, John Hubbard wrote:
An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
Thanks for reviewing this series (sort of again), Jan!
thanks,
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions.
Also placate checkpatch.pl by shortening a nearby line.
Reviewed-by: Jan Kara jack@suse.cz
Reviewed-by: Jérôme Glisse jglisse@redhat.com
Reviewed-by: Ira Weiny ira.weiny@intel.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com

Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/gup.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index 228aa7059018..65a40560c8d0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1963,7 +1963,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static int __gup_device_huge(unsigned long pfn, unsigned long addr, - unsigned long end, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { int nr_start = *nr; struct dev_pagemap *pgmap = NULL; @@ -1989,13 +1990,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, }
static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr;
fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0;
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { @@ -2006,13 +2008,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, }
static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { unsigned long fault_pfn; int nr_start = *nr;
fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr)) + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0;
if (unlikely(pud_val(orig) != pud_val(*pudp))) { @@ -2023,14 +2026,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, } #else static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { BUILD_BUG(); return 0; }
static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, - unsigned long end, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { BUILD_BUG(); return 0; @@ -2146,7 +2151,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr); + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags, + pages, nr); }
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); @@ -2167,7 +2173,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, }
static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct page *head, *page; int refs; @@ -2178,7 +2185,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr); + return __gup_device_huge_pud(orig, pudp, addr, end, flags, + pages, nr); }
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
On Fri, Jan 31, 2020 at 07:40:20PM -0800, John Hubbard wrote:
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions.
Also placate checkpatch.pl by shortening a nearby line.
Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Jérôme Glisse jglisse@redhat.com Reviewed-by: Ira Weiny ira.weiny@intel.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Empty line here?
Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
On 2/3/20 5:19 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:20PM -0800, John Hubbard wrote:
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions.
Also placate checkpatch.pl by shortening a nearby line.
Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Jérôme Glisse jglisse@redhat.com Reviewed-by: Ira Weiny ira.weiny@intel.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Empty line here?
Yes. No one knows why. :) Fixed for next posting, thanks!
thanks,
An upcoming patch requires subtracting a large chunk of refcounts from a page, and checking what the resulting refcount is. This is a little different than the usual "check for zero refcount" that many of the page ref functions already do. However, it is similar to a few other routines that (like this one) are generally useful for things such as 1-based refcounting.
Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically, and returns an atomic snapshot of the result.
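For context, the FOLL_PIN tracking patch later in this series uses it roughly like this, in its devmap-managed unpin path (sketch only; GUP_PIN_COUNTING_BIAS is defined in that later patch, and the real caller is __unpin_devmap_managed_user_page()):

    int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);

    /* devmap-managed pages use 1-based refcounting: */
    if (count == 1)
            free_devmap_managed_page(page);
    else if (!count)
            __put_page(page);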
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 include/linux/page_ref.h | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);
+
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
On Fri, Jan 31, 2020 at 07:40:21PM -0800, John Hubbard wrote:
An upcoming patch requires subtracting a large chunk of refcounts from a page, and checking what the resulting refcount is. This is a little different than the usual "check for zero refcount" that many of the page ref functions already do. However, it is similar to a few other routines that (like this one) are generally useful for things such as 1-based refcounting.
Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically, and returns an atomic snapshot of the result.
Signed-off-by: John Hubbard jhubbard@nvidia.com
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);

Shouldn't it be __page_ref_mod_and_return() and relevant tracepoint?

+	return ret;
+}
On 2/3/20 5:23 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:21PM -0800, John Hubbard wrote:
An upcoming patch requires subtracting a large chunk of refcounts from a page, and checking what the resulting refcount is. This is a little different than the usual "check for zero refcount" that many of the page ref functions already do. However, it is similar to a few other routines that (like this one) are generally useful for things such as 1-based refcounting.
Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically, and returns an atomic snapshot of the result.
Signed-off-by: John Hubbard jhubbard@nvidia.com
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);

Shouldn't it be __page_ref_mod_and_return() and relevant tracepoint?
Why yes, it should. I didn't even notice that that more precise function existed, thanks for catching that. I've changed it to this for the next version of the patchset:
static inline int page_ref_sub_return(struct page *page, int nr)
{
	int ret = atomic_sub_return(nr, &page->_refcount);

	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
		__page_ref_mod_and_return(page, -nr, ret);
	return ret;
}
thanks,
In preparation for an upcoming patch, send gup flags args to two more routines: put_compound_head(), and undo_dev_pagemap().
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/gup.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index 65a40560c8d0..83473c2165f4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1870,6 +1870,7 @@ static inline pte_t gup_get_pte(pte_t *ptep) #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, + unsigned int flags, struct page **pages) { while ((*nr) - nr_start) { @@ -1909,7 +1910,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); goto pte_unmap; } } else if (pte_special(pte)) @@ -1974,7 +1975,7 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
pgmap = get_dev_pagemap(pfn, pgmap); if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } SetPageReferenced(page); @@ -2001,7 +2002,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0;
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } return 1; @@ -2019,7 +2020,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return 0;
if (unlikely(pud_val(orig) != pud_val(*pudp))) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } return 1; @@ -2053,7 +2054,7 @@ static int record_subpages(struct page *page, unsigned long addr, return nr; }
-static void put_compound_head(struct page *page, int refs) +static void put_compound_head(struct page *page, int refs, unsigned int flags) { VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); /* @@ -2103,7 +2104,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 0;
if (unlikely(pte_val(pte) != pte_val(*ptep))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; }
@@ -2163,7 +2164,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0;
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; }
@@ -2197,7 +2198,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return 0;
if (unlikely(pud_val(orig) != pud_val(*pudp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; }
@@ -2226,7 +2227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 0;
if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; }
On Fri, Jan 31, 2020 at 07:40:22PM -0800, John Hubbard wrote:
In preparation for an upcoming patch, send gup flags args to two more routines: put_compound_head(), and undo_dev_pagemap().
Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
On Fri 31-01-20 19:40:22, John Hubbard wrote:
In preparation for an upcoming patch, send gup flags args to two more routines: put_compound_head(), and undo_dev_pagemap().
Signed-off-by: John Hubbard jhubbard@nvidia.com
Looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
Internal to mm/gup.c, require that get_user_pages_fast() and __get_user_pages_fast() identify themselves, by setting FOLL_GET. This is required in order to be able to make decisions based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in upcoming patches.
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/gup.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 83473c2165f4..e899d2e6398c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2390,6 +2390,14 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
+	/*
+	 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+	 * because gup fast is always a "pin with a +1 page refcount" request.
+	 */
+	unsigned int gup_flags = FOLL_GET;
+
+	if (write)
+		gup_flags |= FOLL_WRITE;
 
 	start = untagged_addr(start) & PAGE_MASK;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2415,7 +2423,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, gup_flags, pages, &nr);
 		local_irq_restore(flags);
 	}
 
@@ -2454,7 +2462,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE | FOLL_PIN)))
+				       FOLL_FORCE | FOLL_PIN | FOLL_GET)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2521,6 +2529,13 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
+	/*
+	 * The caller may or may not have explicitly set FOLL_GET; either way is
+	 * OK. However, internally (within mm/gup.c), gup fast variants must set
+	 * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+	 * request.
+	 */
+	gup_flags |= FOLL_GET;
 	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
On Fri, Jan 31, 2020 at 07:40:23PM -0800, John Hubbard wrote:
Internal to mm/gup.c, require that get_user_pages_fast() and __get_user_pages_fast() identify themselves, by setting FOLL_GET. This is required in order to be able to make decisions based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in upcoming patches.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
On Fri 31-01-20 19:40:23, John Hubbard wrote:
Internal to mm/gup.c, require that get_user_pages_fast() and __get_user_pages_fast() identify themselves, by setting FOLL_GET. This is required in order to be able to make decisions based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in upcoming patches.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Looks good. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
Add tracking of pages that were pinned via FOLL_PIN. This tracking is implemented via overloading of page->_refcount: pins are added by adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy indication of pinning, and it can have false positives (and that's OK). Please see the pre-existing Documentation/core-api/pin_user_pages.rst for details.
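As a worked example of the fuzzy counting: a page that has been pinned once via FOLL_PIN has its _refcount raised by 1024, so page_maybe_dma_pinned() (added below) reports true for any page whose refcount is >= 1024. A page that merely accumulates 1024 ordinary references would also report true; that is the acceptable false positive mentioned above.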
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN (typically via pin_user_pages*()) are required to ultimately free such pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the "TODO: for 1GB and larger huge pages" section. (That limitation will be removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page is left to later patchsets. There is discussion about this in [1], [2], [3], and [4].
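Purely to illustrate the intended query pattern (this is not part of the series; what a caller actually does with the answer is the subject of those discussions):

    /* Hypothetical future caller, e.g. somewhere in writeback: */
    if (page_maybe_dma_pinned(page)) {
            /*
             * The page is probably pinned for DMA: be conservative, and
             * defer (or otherwise specially handle) the operation.
             */
    }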
This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages(): https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
Suggested-by: Jan Kara jack@suse.cz
Suggested-by: Jérôme Glisse jglisse@redhat.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com

Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 Documentation/core-api/pin_user_pages.rst |   6 +-
 include/linux/mm.h                        |  82 +++++--
 mm/gup.c                                  | 254 +++++++++++++++++-----
 mm/huge_memory.c                          |  29 ++-
 mm/hugetlb.c                              |  38 ++--
 5 files changed, 318 insertions(+), 91 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 1d490155ecd7..9829345428f8 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -173,8 +173,8 @@ CASE 4: Pinning for struct page manipulation only ------------------------------------------------- Here, normal GUP calls are sufficient, so neither flag needs to be set.
-page_dma_pinned(): the whole point of pinning -============================================= +page_maybe_dma_pinned(): the whole point of pinning +===================================================
The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able to query, "is this page DMA-pinned?" That allows code such as page_mkclean() @@ -186,7 +186,7 @@ and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: ::
- static inline bool page_dma_pinned(struct page *page) + static inline bool page_maybe_dma_pinned(struct page *page)
...is a prerequisite to solving the long-running gup+DMA problem.
diff --git a/include/linux/mm.h b/include/linux/mm.h index 73a044ed6981..ca787c606f0e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1001,6 +1001,8 @@ static inline void get_page(struct page *page) page_ref_inc(page); }
+bool __must_check try_grab_page(struct page *page, unsigned int flags); + static inline __must_check bool try_get_page(struct page *page) { page = compound_head(page); @@ -1029,29 +1031,79 @@ static inline void put_page(struct page *page) __put_page(page); }
-/** - * unpin_user_page() - release a gup-pinned page - * @page: pointer to page to be released +/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many pin_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). + * + * With this scheme, pin_user_pages() becomes special: such pages are marked as + * distinct from normal pages. As such, the unpin_user_page() call (and its + * variants) must be used in order to release gup-pinned pages. + * + * Choice of value: + * + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to pin_user_pages() and unpin_user_page() becomes + * simpler, due to the fact that adding an even power of two to the page + * refcount has the effect of using only the upper N bits, for the code that + * counts up using the bias value. This means that the lower bits are left for + * the exclusive use of the original code that increments and decrements by one + * (or at least, by much smaller values than the bias value). * - * Pages that were pinned via pin_user_pages*() must be released via either - * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so - * that eventually such pages can be separately tracked and uniquely handled. In - * particular, interactions with RDMA and filesystems need special handling. + * Of course, once the lower bits overflow into the upper bits (and this is + * OK, because subtraction recovers the original values), then visual inspection + * no longer suffices to directly view the separate counts. However, for normal + * applications that don't have huge page reference counts, this won't be an + * issue. * - * unpin_user_page() and put_page() are not interchangeable, despite this early - * implementation that makes them look the same. unpin_user_page() calls must - * be perfectly matched up with pin*() calls. + * Locking: the lockless algorithm described in page_cache_get_speculative() + * and page_cache_gup_pin_speculative() provides safe operation for + * get_user_pages and page_mkclean and other calls that race to set up page + * table entries. */ -static inline void unpin_user_page(struct page *page) -{ - put_page(page); -} +#define GUP_PIN_COUNTING_BIAS (1U << 10)
+void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); - void unpin_user_pages(struct page **pages, unsigned long npages);
+/** + * page_maybe_dma_pinned() - report if a page is pinned for DMA. + * + * This function checks if a page has been pinned via a call to + * pin_user_pages*(). + * + * For non-huge pages, the return value is partially fuzzy: false is not fuzzy, + * because it means "definitely not pinned for DMA", but true means "probably + * pinned for DMA, but possibly a false positive due to having at least + * GUP_PIN_COUNTING_BIAS worth of normal page references". + * + * False positives are OK, because: a) it's unlikely for a page to get that many + * refcounts, and b) all the callers of this routine are expected to be able to + * deal gracefully with a false positive. + * + * For more information, please see Documentation/vm/pin_user_pages.rst. + * + * @page: pointer to page to be queried. + * @Return: True, if it is likely that the page has been "dma-pinned". + * False, if the page is definitely not dma-pinned. + */ +static inline bool page_maybe_dma_pinned(struct page *page) +{ + /* + * page_ref_count() is signed. If that refcount overflows, then + * page_ref_count() returns a negative value, and callers will avoid + * further incrementing the refcount. + * + * Here, for that overflow case, use the signed bit to count a little + * bit higher via unsigned math, and thus still get an accurate result. + */ + return ((unsigned int)page_ref_count(compound_head(page))) >= + GUP_PIN_COUNTING_BIAS; +} + #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTION_IN_PAGE_FLAGS #endif diff --git a/mm/gup.c b/mm/gup.c index e899d2e6398c..6e8b773c233a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -44,6 +44,135 @@ static inline struct page *try_get_compound_head(struct page *page, int refs) return head; }
+/* + * try_grab_compound_head() - attempt to elevate a page's refcount, by a + * flags-dependent amount. + * + * "grab" names in this file mean, "look at flags to decide whether to use + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount. + * + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the + * same time. (That's true throughout the get_user_pages*() and + * pin_user_pages*() APIs.) Cases: + * + * FOLL_GET: page's refcount will be incremented by 1. + * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS. + * + * Return: head page (with refcount appropriately incremented) for success, or + * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's + * considered failure, and furthermore, a likely bug in the caller, so a warning + * is also emitted. + */ +static __maybe_unused struct page *try_grab_compound_head(struct page *page, + int refs, + unsigned int flags) +{ + if (flags & FOLL_GET) + return try_get_compound_head(page, refs); + else if (flags & FOLL_PIN) { + refs *= GUP_PIN_COUNTING_BIAS; + return try_get_compound_head(page, refs); + } + + WARN_ON_ONCE(1); + return NULL; +} + +/** + * try_grab_page() - elevate a page's refcount by a flag-dependent amount + * + * This might not do anything at all, depending on the flags argument. + * + * "grab" names in this file mean, "look at flags to decide whether to use + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount. + * + * @page: pointer to page to be grabbed + * @flags: gup flags: these are the FOLL_* flag values. + * + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same + * time. Cases: + * + * FOLL_GET: page's refcount will be incremented by 1. + * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS. + * + * Return: true for success, or if no action was required (if neither FOLL_PIN + * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or + * FOLL_PIN was set, but the page could not be grabbed. + */ +bool __must_check try_grab_page(struct page *page, unsigned int flags) +{ + WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN)); + + if (flags & FOLL_GET) + return try_get_page(page); + else if (flags & FOLL_PIN) { + page = compound_head(page); + + if (WARN_ON_ONCE(page_ref_count(page) <= 0)) + return false; + + page_ref_add(page, GUP_PIN_COUNTING_BIAS); + } + + return true; +} + +#ifdef CONFIG_DEV_PAGEMAP_OPS +static bool __unpin_devmap_managed_user_page(struct page *page) +{ + int count; + + if (!page_is_devmap_managed(page)) + return false; + + count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS); + + /* + * devmap page refcounts are 1-based, rather than 0-based: if + * refcount is 1, then the page is free and the refcount is + * stable because nobody holds a reference on the page. + */ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); + + return true; +} +#else +static bool __unpin_devmap_managed_user_page(struct page *page) +{ + return false; +} +#endif /* CONFIG_DEV_PAGEMAP_OPS */ + +/** + * unpin_user_page() - release a dma-pinned page + * @page: pointer to page to be released + * + * Pages that were pinned via pin_user_pages*() must be released via either + * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so + * that such pages can be separately tracked and uniquely handled. In + * particular, interactions with RDMA and filesystems need special handling. 
+ */ +void unpin_user_page(struct page *page) +{ + page = compound_head(page); + + /* + * For devmap managed pages we need to catch refcount transition from + * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the + * page is free and we need to inform the device driver through + * callback. See include/linux/memremap.h and HMM for details. + */ + if (__unpin_devmap_managed_user_page(page)) + return; + + if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS)) + __put_page(page); +} +EXPORT_SYMBOL(unpin_user_page); + /** * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages * @pages: array of pages to be maybe marked dirty, and definitely released. @@ -230,10 +359,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, }
page = vm_normal_page(vma, address, pte); - if (!page && pte_devmap(pte) && (flags & FOLL_GET)) { + if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* - * Only return device mapping pages in the FOLL_GET case since - * they are only valid while holding the pgmap reference. + * Only return device mapping pages in the FOLL_GET or FOLL_PIN + * case since they are only valid while holding the pgmap + * reference. */ *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); if (*pgmap) @@ -271,11 +401,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto retry; }
- if (flags & FOLL_GET) { - if (unlikely(!try_get_page(page))) { - page = ERR_PTR(-ENOMEM); - goto out; - } + /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ + if (unlikely(!try_grab_page(page, flags))) { + page = ERR_PTR(-ENOMEM); + goto out; } if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && @@ -537,7 +666,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma, /* make this handle hugepd */ page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { - BUG_ON(flags & FOLL_GET); + WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN)); return page; }
@@ -1675,6 +1804,15 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, { return 0; } + +static long __get_user_pages_remote(struct task_struct *tsk, + struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked) +{ + return 0; +} #endif /* !CONFIG_MMU */
/* @@ -1877,7 +2015,10 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, struct page *page = pages[--(*nr)];
ClearPageReferenced(page); - put_page(page); + if (flags & FOLL_PIN) + unpin_user_page(page); + else + put_page(page); } }
@@ -1919,7 +2060,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte);
- head = try_get_compound_head(page, 1); + head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap;
@@ -1980,7 +2121,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, } SetPageReferenced(page); pages[*nr] = page; - get_page(page); + if (unlikely(!try_grab_page(page, flags))) { + undo_dev_pagemap(nr, nr_start, flags, pages); + return 0; + } (*nr)++; pfn++; } while (addr += PAGE_SIZE, addr != end); @@ -2056,6 +2200,9 @@ static int record_subpages(struct page *page, unsigned long addr,
static void put_compound_head(struct page *page, int refs, unsigned int flags) { + if (flags & FOLL_PIN) + refs *= GUP_PIN_COUNTING_BIAS; + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); /* * Calling put_page() for each ref is unnecessarily slow. Only the last @@ -2099,7 +2246,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, page = head + ((addr & (sz-1)) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr);
- head = try_get_compound_head(head, refs); + head = try_grab_compound_head(head, refs, flags); if (!head) return 0;
@@ -2159,7 +2306,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr);
- head = try_get_compound_head(pmd_page(orig), refs); + head = try_grab_compound_head(pmd_page(orig), refs, flags); if (!head) return 0;
@@ -2193,7 +2340,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr);
- head = try_get_compound_head(pud_page(orig), refs); + head = try_grab_compound_head(pud_page(orig), refs, flags); if (!head) return 0;
@@ -2222,7 +2369,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr);
- head = try_get_compound_head(pgd_page(orig), refs); + head = try_grab_compound_head(pgd_page(orig), refs, flags); if (!head) return 0;
@@ -2505,11 +2652,11 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
/** * get_user_pages_fast() - pin user pages in memory - * @start: starting user address - * @nr_pages: number of pages from start to pin - * @gup_flags: flags modifying pin behaviour - * @pages: array that receives pointers to the pages pinned. - * Should be at least nr_pages long. + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying pin behaviour + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. * * Attempt to pin user pages in memory without taking mm->mmap_sem. * If not successful, it will fall back to taking the lock and @@ -2543,9 +2690,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast); /** * pin_user_pages_fast() - pin user pages in memory without taking locks * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages_fast(). + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See + * get_user_pages_fast() for documentation on the function arguments, because + * the arguments here are identical. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for further details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2553,21 +2703,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast); int pin_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages_fast(start, nr_pages, gup_flags, pages); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages); } EXPORT_SYMBOL_GPL(pin_user_pages_fast);
/** * pin_user_pages_remote() - pin pages of a remote process (task != current) * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages_remote(). + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See + * get_user_pages_remote() for documentation on the function arguments, because + * the arguments here are identical. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2577,22 +2730,24 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages, - vmas, locked); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + pages, vmas, locked); } EXPORT_SYMBOL(pin_user_pages_remote);
/** * pin_user_pages() - pin user pages in memory for use by other devices * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages(). + * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and + * FOLL_PIN is set. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2601,11 +2756,12 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages(start, nr_pages, gup_flags, pages, vmas); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return __gup_longterm_locked(current, current->mm, start, nr_pages, + pages, vmas, gup_flags); } EXPORT_SYMBOL(pin_user_pages); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b08b199f9a11..580098e115bd 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -958,6 +958,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, */ WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
+ /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + if (flags & FOLL_WRITE && !pmd_write(*pmd)) return NULL;
@@ -973,7 +978,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, * device mapped pages can only be returned if the * caller will manage the page reference count. */ - if (!(flags & FOLL_GET)) + if (!(flags & (FOLL_GET | FOLL_PIN))) return ERR_PTR(-EEXIST);
pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; @@ -981,7 +986,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - get_page(page); + if (!try_grab_page(page, flags)) + page = ERR_PTR(-ENOMEM);
return page; } @@ -1101,6 +1107,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, if (flags & FOLL_WRITE && !pud_write(*pud)) return NULL;
+ /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + if (pud_present(*pud) && pud_devmap(*pud)) /* pass */; else @@ -1112,8 +1123,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, /* * device mapped pages can only be returned if the * caller will manage the page reference count. + * + * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here: */ - if (!(flags & FOLL_GET)) + if (!(flags & (FOLL_GET | FOLL_PIN))) return ERR_PTR(-EEXIST);
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; @@ -1121,7 +1134,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - get_page(page); + if (!try_grab_page(page, flags)) + page = ERR_PTR(-ENOMEM);
return page; } @@ -1497,8 +1511,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + + if (!try_grab_page(page, flags)) + return ERR_PTR(-ENOMEM); + if (flags & FOLL_TOUCH) touch_pmd(vma, addr, pmd, flags); + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { /* * We don't mlock() pte-mapped THPs. This way we can avoid @@ -1535,8 +1554,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, skip_mlock: page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page); - if (flags & FOLL_GET) - get_page(page);
out: return page; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dd8737a94bec..487e998fd38e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4375,19 +4375,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT; page = pte_page(huge_ptep_get(pte));
- /* - * Instead of doing 'try_get_page()' below in the same_page - * loop, just check the count once here. - */ - if (unlikely(page_count(page) <= 0)) { - if (pages) { - spin_unlock(ptl); - remainder = 0; - err = -ENOMEM; - break; - } - } - /* * If subpage information not requested, update counters * and skip the same_page loop below. @@ -4405,7 +4392,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset); - get_page(pages[i]); + if (!try_grab_page(pages[i], flags)) { + spin_unlock(ptl); + remainder = 0; + err = -ENOMEM; + WARN_ON_ONCE(1); + break; + } }
if (vmas) @@ -4965,6 +4958,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, struct page *page = NULL; spinlock_t *ptl; pte_t pte; + + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + retry: ptl = pmd_lockptr(mm, pmd); spin_lock(ptl); @@ -4977,8 +4976,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); - if (flags & FOLL_GET) - get_page(page); + if (unlikely(!try_grab_page(page, flags))) { + WARN_ON_ONCE(1); + page = NULL; + goto out; + } } else { if (is_hugetlb_entry_migration(pte)) { spin_unlock(ptl); @@ -4999,7 +5001,7 @@ struct page * __weak follow_huge_pud(struct mm_struct *mm, unsigned long address, pud_t *pud, int flags) { - if (flags & FOLL_GET) + if (flags & (FOLL_GET | FOLL_PIN)) return NULL;
return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT); @@ -5008,7 +5010,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, struct page * __weak follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags) { - if (flags & FOLL_GET) + if (flags & (FOLL_GET | FOLL_PIN)) return NULL;
return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
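To make the intended calling convention concrete, here is a minimal, hypothetical caller-side sketch (not part of the patch; only pin_user_pages_fast(), unpin_user_page(), and FOLL_WRITE are real names from this series, the rest is invented for illustration):

	/*
	 * Hypothetical driver helper: pin a user buffer for DMA, then release
	 * it. Pages obtained via pin_user_pages*() must be released via
	 * unpin_user_page() or unpin_user_pages*(), never via put_page().
	 */
	static int example_pin_user_buffer(unsigned long user_addr, int nr_pages,
					   struct page **pages)
	{
		int i, pinned;

		pinned = pin_user_pages_fast(user_addr, nr_pages, FOLL_WRITE,
					     pages);
		if (pinned <= 0)
			return pinned ? pinned : -EFAULT;

		/* ... set up and run DMA against pages[0..pinned-1] ... */

		for (i = 0; i < pinned; i++)
			unpin_user_page(pages[i]);

		return 0;
	}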
On Fri, Jan 31, 2020 at 07:40:24PM -0800, John Hubbard wrote:
@@ -4405,7 +4392,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset);
get_page(pages[i]);
if (!try_grab_page(pages[i], flags)) {
spin_unlock(ptl);
remainder = 0;
err = -ENOMEM;
WARN_ON_ONCE(1);
The WARN_ON_ONCE deserves a comment. And I guess you can put it into the 'if' condition.
break;
}}
if (vmas) @@ -4965,6 +4958,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, struct page *page = NULL; spinlock_t *ptl; pte_t pte;
- /* FOLL_GET and FOLL_PIN are mutually exclusive. */
- if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return NULL;
retry: ptl = pmd_lockptr(mm, pmd); spin_lock(ptl); @@ -4977,8 +4976,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
if (flags & FOLL_GET)
get_page(page);
if (unlikely(!try_grab_page(page, flags))) {
WARN_ON_ONCE(1);
Ditto.
page = NULL;
goto out;
} else { if (is_hugetlb_entry_migration(pte)) { spin_unlock(ptl);}
On 2/3/20 5:40 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:24PM -0800, John Hubbard wrote:
@@ -4405,7 +4392,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset);
get_page(pages[i]);
if (!try_grab_page(pages[i], flags)) {
spin_unlock(ptl);
remainder = 0;
err = -ENOMEM;
WARN_ON_ONCE(1);
The WARN_ON_ONCE deserves a comment. And I guess you can put it into the 'if' condition.
OK, I've changed it to the following, which I *think* is an accurate comment, but I'm still a bit new to huge pages:
		if (pages) {
			pages[i] = mem_map_offset(page, pfn_offset);

			/*
			 * try_grab_page() should always succeed here, because:
			 * a) we hold the ptl lock, and b) we've just checked
			 * that the huge page is present in the page tables. If
			 * the huge page is present, then the tail pages must
			 * also be present. The ptl prevents the head page and
			 * tail pages from being rearranged in any way. So this
			 * page must be available at this point, unless the page
			 * refcount overflowed:
			 */
			if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
				spin_unlock(ptl);
				remainder = 0;
				err = -ENOMEM;
				break;
			}
		}
break;
}}
if (vmas) @@ -4965,6 +4958,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, struct page *page = NULL; spinlock_t *ptl; pte_t pte;
- /* FOLL_GET and FOLL_PIN are mutually exclusive. */
- if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return NULL;
retry: ptl = pmd_lockptr(mm, pmd); spin_lock(ptl); @@ -4977,8 +4976,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
if (flags & FOLL_GET)
get_page(page);
if (unlikely(!try_grab_page(page, flags))) {
WARN_ON_ONCE(1);
Ditto.
OK, I've added a comment similar to the one above. Now it looks like this:
	if (pte_present(pte)) {
		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);

		/*
		 * try_grab_page() should always succeed here, because: a) we
		 * hold the pmd (ptl) lock, and b) we've just checked that the
		 * huge pmd (head) page is present in the page tables. The ptl
		 * prevents the head page and tail pages from being rearranged
		 * in any way. So this page must be available at this point,
		 * unless the page refcount overflowed:
		 */
		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
			page = NULL;
			goto out;
		}
thanks,
On Fri 31-01-20 19:40:24, John Hubbard wrote:
Add tracking of pages that were pinned via FOLL_PIN. This tracking is implemented via overloading of page->_refcount: pins are added by adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy indication of pinning, and it can have false positives (and that's OK). Please see the pre-existing Documentation/core-api/pin_user_pages.rst for details.
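(In other words, the fuzzy query boils down to roughly the following, which is only a paraphrase of the page_maybe_dma_pinned() helper that this patch adds, not a new interface:)

	/*
	 * Paraphrase only: "at least GUP_PIN_COUNTING_BIAS worth of refcount"
	 * is read as "probably pinned". Ordinary get_page() references could,
	 * in principle, add up to the same value, hence the false positives.
	 */
	static inline bool paraphrased_page_maybe_dma_pinned(struct page *page)
	{
		return page_ref_count(compound_head(page)) >=
			GUP_PIN_COUNTING_BIAS;
	}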
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN (typically via pin_user_pages*()) are required to ultimately free such pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the "TODO: for 1GB and larger huge pages" section. (That limitation will be removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page is left to later patchsets. There is discussion about this in [1], [2], [3], and [4].
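As a purely illustrative sketch of the kind of check such a later patchset might perform (page_maybe_dma_pinned() is the function added here; the surrounding policy is hypothetical and not part of this patch):

	/* Sketch: a future writeback/page_mkclean-style path could do this. */
	static bool example_safe_to_write_protect(struct page *page)
	{
		/*
		 * A "maybe pinned" page may be under active DMA, so do not
		 * assume its contents are stable; take the conservative path.
		 * A false positive just means taking that path unnecessarily.
		 */
		return !page_maybe_dma_pinned(page);
	}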
This also changes a BUG_ON() to a WARN_ON_ONCE() in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ [4] LWN kernel index: get_user_pages(): https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
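To put numbers on the overflow claim, here is a throwaway userspace model of the arithmetic (not kernel code; it only restates the figures from the TODO text being removed below):

	#include <stdio.h>

	int main(void)
	{
		const long bias       = 1024;           /* GUP_PIN_COUNTING_BIAS      */
		const long tail_pages = 1L << 18;       /* 4KB subpages in a 1GB page */
		const long ref_max    = (1L << 31) - 1; /* page->_refcount is an s32  */

		/* Bias scheme: one full pin of the page costs tail_pages * bias. */
		printf("bias scheme, cost of one full pin: %ld\n", tail_pages * bias);
		printf("full pins before _refcount overflows: %ld\n",
		       ref_max / (tail_pages * bias));  /* prints 7: the "about 8" from the TODO */

		/* Exact scheme: the same pin adds only tail_pages to _refcount,
		 * and the pin itself is counted in hpage_pinned_refcount. */
		printf("exact scheme, cost of one full pin: %ld\n", tail_pages);
		return 0;
	}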
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst | 40 +++++-------
 include/linux/mm.h | 26 ++++++++
 include/linux/mm_types.h | 7 +-
 mm/gup.c | 78 ++++++++++++++++++++---
 mm/hugetlb.c | 6 ++
 mm/page_alloc.c | 2 +
 mm/rmap.c | 6 ++
 7 files changed, 133 insertions(+), 32 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 9829345428f8..3f72b1ea1104 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -52,8 +52,22 @@ Which flags are set by each wrapper
For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct -pages* array, and the function then pin pages by incrementing each by a special -value. For now, that value is +1, just like get_user_pages*().:: +pages* array, and the function then pins pages by incrementing each by a special +value: GUP_PIN_COUNTING_BIAS. + +For huge pages (and in fact, any compound page of more than 2 pages), the +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting +is achieved, by using the 3rd struct page in the compound page. A new struct +page field, hpage_pinned_refcount, has been added in order to support this. + +This approach for compound pages avoids the counting upper limit problems that +are discussed below. Those limitations would have been aggravated severely by +huge pages, because each tail page adds a refcount to the head page. And in +fact, testing revealed that, without a separate hpage_pinned_refcount field, +page overflows were seen in some huge page stress tests. + +This also means that huge pages and compound pages (of order > 1) do not suffer +from the false positives problem that is mentioned below.::
Function -------- @@ -99,27 +113,6 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time.
-TODO: for 1GB and larger huge pages, this is cutting it close. That's because -when pin_user_pages() follows such pages, it increments the head page by "1" -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for -pin_user_pages()) for each tail page. So if you have a 1GB huge page: - -* There are 256K (18 bits) worth of 4 KB tail pages. -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, - 10 bits at a time) -* There are 21 - 18 == 3 bits available to count. Except that there aren't, - because you need to allow for a few normal get_page() calls on the head page, - as well. Fortunately, the approach of using addition, rather than "hard" - bitfields, within page->_refcount, allows for sharing these bits gracefully. - But we're still looking at about 8 references. - -This, however, is a missing feature more than anything else, because it's easily -solved by addressing an obvious inefficiency in the original get_user_pages() -approach of retrieving pages: stop treating all the pages as if they were -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of -this, so some work is required. Once that's in place, this limitation mostly -disappears from view, because there will be ample refcounting range available. - * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. @@ -228,5 +221,6 @@ References * `Some slow progress on get_user_pages() (Apr 2, 2019) https://lwn.net/Articles/784574/`_ * `DMA and get_user_pages() (LPC: Dec 12, 2018) https://lwn.net/Articles/774411/`_ * `The trouble with get_user_pages() (Apr 30, 2018) https://lwn.net/Articles/753027/`_ +* `LWN kernel index: get_user_pages() https://lwn.net/Kernel/Index/#Memory_management-get_user_pages`
John Hubbard, October, 2019 diff --git a/include/linux/mm.h b/include/linux/mm.h index ca787c606f0e..fdcd137b9981 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page) return page[1].compound_order; }
+static inline bool hpage_pincount_available(struct page *page) +{ + /* + * Can the page->hpage_pinned_refcount field be used? That field is in + * the 3rd page of the compound page, so the smallest (2-page) compound + * pages cannot support it. + */ + page = compound_head(page); + return PageCompound(page) && compound_order(page) > 1; +} + +static inline int compound_pincount(struct page *page) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + page = compound_head(page); + return atomic_read(compound_pincount_ptr(page)); +} + static inline void set_compound_order(struct page *page, unsigned int order) { page[1].compound_order = order; @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages); * refcounts, and b) all the callers of this routine are expected to be able to * deal gracefully with a false positive. * + * For huge pages, the result will be exactly correct. That's because we have + * more tracking data available: the 3rd struct page in the compound page is + * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS + * scheme). + * * For more information, please see Documentation/vm/pin_user_pages.rst. * * @page: pointer to page to be queried. @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages); */ static inline bool page_maybe_dma_pinned(struct page *page) { + if (hpage_pincount_available(page)) + return compound_pincount(page) > 0; + /* * page_ref_count() is signed. If that refcount overflows, then * page_ref_count() returns a negative value, and callers will avoid diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e87bb864bdb2..01e9717b8529 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -137,7 +137,7 @@ struct page { }; struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ - unsigned long _compound_pad_2; + atomic_t hpage_pinned_refcount; /* For both global and memcg */ struct list_head deferred_list; }; @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page) return &page[1].compound_mapcount; }
+static inline atomic_t *compound_pincount_ptr(struct page *page) +{ + return &page[2].hpage_pinned_refcount; +} + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/mm/gup.c b/mm/gup.c index 6e8b773c233a..c10d0d051c5b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,22 @@ struct follow_page_context { unsigned int page_mask; };
+static void hpage_pincount_add(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_add(refs, compound_pincount_ptr(page)); +} + +static void hpage_pincount_sub(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_sub(refs, compound_pincount_ptr(page)); +} + /* * Return the compound head page with ref appropriately incremented, * or NULL if that failed. @@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, if (flags & FOLL_GET) return try_get_compound_head(page, refs); else if (flags & FOLL_PIN) { - refs *= GUP_PIN_COUNTING_BIAS; - return try_get_compound_head(page, refs); + /* + * When pinning a compound page of order > 1 (which is what + * hpage_pincount_available() checks for), use an exact count to + * track it, via hpage_pincount_add/_sub(). + * + * However, be sure to *also* increment the normal page refcount + * field at least once, so that the page really is pinned. + */ + if (!hpage_pincount_available(page)) + refs *= GUP_PIN_COUNTING_BIAS; + + page = try_get_compound_head(page, refs); + if (!page) + return NULL; + + if (hpage_pincount_available(page)) + hpage_pincount_add(page, refs); + + return page; }
WARN_ON_ONCE(1); @@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) if (flags & FOLL_GET) return try_get_page(page); else if (flags & FOLL_PIN) { + int refs = 1; + page = compound_head(page);
if (WARN_ON_ONCE(page_ref_count(page) <= 0)) return false;
- page_ref_add(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_add(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + /* + * Similar to try_grab_compound_head(): even if using the + * hpage_pincount_add/_sub() routines, be sure to + * *also* increment the normal page refcount field at least + * once, so that the page really is pinned. + */ + page_ref_add(page, refs); }
return true; @@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) #ifdef CONFIG_DEV_PAGEMAP_OPS static bool __unpin_devmap_managed_user_page(struct page *page) { - int count; + int count, refs = 1;
if (!page_is_devmap_managed(page)) return false;
- count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + count = page_ref_sub_return(page, refs);
/* * devmap page refcounts are 1-based, rather than 0-based: if @@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page) */ void unpin_user_page(struct page *page) { + int refs = 1; + page = compound_head(page);
/* @@ -168,7 +221,12 @@ void unpin_user_page(struct page *page) if (__unpin_devmap_managed_user_page(page)) return;
- if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS)) + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + if (page_ref_sub_and_test(page, refs)) __put_page(page); } EXPORT_SYMBOL(unpin_user_page); @@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
static void put_compound_head(struct page *page, int refs, unsigned int flags) { - if (flags & FOLL_PIN) - refs *= GUP_PIN_COUNTING_BIAS; + if (flags & FOLL_PIN) { + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, refs); + else + refs *= GUP_PIN_COUNTING_BIAS; + }
VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 487e998fd38e..07059d936f7b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page, struct page *p = page + 1;
atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { clear_compound_head(p); set_page_refcounted(p); @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); }
/* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 15e908ad933b..c205b912f108 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); }
#ifdef CONFIG_DEBUG_PAGEALLOC diff --git a/mm/rmap.c b/mm/rmap.c index b3e381919835..e45b9b991e2f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page, VM_BUG_ON_PAGE(!PageTransHuge(page), page); /* increment count (starts at -1) */ atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __inc_node_page_state(page, NR_ANON_THPS); } else { /* Anon THP always mapped first with PMD */ @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page, { BUG_ON(address < vma->vm_start || address >= vma->vm_end); atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __page_set_anon_rmap(page, vma, address, 1); } #endif /* CONFIG_HUGETLB_PAGE */
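To summarize how the two counters move under the new scheme, here is a throwaway userspace model (invented names, not kernel code) of the logic in try_grab_compound_head() and unpin_user_page() for a compound page of order > 1:

	#include <assert.h>

	/* Toy model of a compound head page that has hpage_pincount available. */
	struct toy_head {
		long refcount;        /* models page->_refcount                  */
		long pinned_refcount; /* models page[2].hpage_pinned_refcount    */
	};

	/* FOLL_PIN grab of 'refs' subpages: both counters grow by 'refs'. */
	static void toy_grab(struct toy_head *h, long refs)
	{
		h->refcount += refs;
		h->pinned_refcount += refs;
	}

	/* unpin_user_page() on one subpage: both counters drop by 1. */
	static void toy_unpin(struct toy_head *h)
	{
		h->pinned_refcount -= 1;
		h->refcount -= 1;
	}

	int main(void)
	{
		struct toy_head h = { .refcount = 1, .pinned_refcount = 0 };

		toy_grab(&h, 512);	/* e.g. gup-fast across a whole 2MB THP */
		assert(h.refcount == 513 && h.pinned_refcount == 512);
		/* page_maybe_dma_pinned() is now exact: pinned_refcount > 0. */

		while (h.pinned_refcount)
			toy_unpin(&h);
		assert(h.refcount == 1);
		return 0;
	}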
On Fri, Jan 31, 2020 at 07:40:25PM -0800, John Hubbard wrote:
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there.
Could you update the comment for the HPAGE_PMD_ORDER < 2 check in hugepage_init() to reflect the additional user of the condition?
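(For reference, the check being referred to appears to be the build-time assertion in hugepage_init(); an updated comment might look something like the following sketch, which is illustrative only and not the actual follow-up:)

	/*
	 * We use page->mapping and page->index in the second tail page
	 * as list_head: assuming THP order >= 2.
	 *
	 * The second tail page now also carries hpage_pinned_refcount
	 * (the exact FOLL_PIN count), which has the same order >= 2
	 * requirement.
	 */
	MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);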
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
On Fri 31-01-20 19:40:25, John Hubbard wrote:
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
Suggested-by: Jan Kara jack@suse.cz
Signed-off-by: John Hubbard jhubbard@nvidia.com
The patch looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
 Documentation/core-api/pin_user_pages.rst | 40 +++++------
 include/linux/mm.h                        | 26 ++++++++
 include/linux/mm_types.h                  |  7 +-
 mm/gup.c                                  | 78 ++++++++++++++++++++---
 mm/hugetlb.c                              |  6 ++
 mm/page_alloc.c                           |  2 +
 mm/rmap.c                                 |  6 ++
 7 files changed, 133 insertions(+), 32 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9829345428f8..3f72b1ea1104 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper
 
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
-pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+pages* array, and the function then pins pages by incrementing each by a special
+value: GUP_PIN_COUNTING_BIAS.
+
+For huge pages (and in fact, any compound page of more than 2 pages), the
+GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
+is achieved, by using the 3rd struct page in the compound page. A new struct
+page field, hpage_pinned_refcount, has been added in order to support this.
+
+This approach for compound pages avoids the counting upper limit problems that
+are discussed below. Those limitations would have been aggravated severely by
+huge pages, because each tail page adds a refcount to the head page. And in
+fact, testing revealed that, without a separate hpage_pinned_refcount field,
+page overflows were seen in some huge page stress tests.
+
+This also means that huge pages and compound pages (of order > 1) do not suffer
+from the false positives problem that is mentioned below.::
+
 Function
@@ -99,27 +113,6 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
-TODO: for 1GB and larger huge pages, this is cutting it close. That's because
-when pin_user_pages() follows such pages, it increments the head page by "1"
-(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
-pin_user_pages()) for each tail page. So if you have a 1GB huge page:
-
-* There are 256K (18 bits) worth of 4 KB tail pages.
-* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
-  10 bits at a time)
-* There are 21 - 18 == 3 bits available to count. Except that there aren't,
-  because you need to allow for a few normal get_page() calls on the head page,
-  as well. Fortunately, the approach of using addition, rather than "hard"
-  bitfields, within page->_refcount, allows for sharing these bits gracefully.
-  But we're still looking at about 8 references.
-
-This, however, is a missing feature more than anything else, because it's easily
-solved by addressing an obvious inefficiency in the original get_user_pages()
-approach of retrieving pages: stop treating all the pages as if they were
-PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
-this, so some work is required. Once that's in place, this limitation mostly
-disappears from view, because there will be ample refcounting range available.
-
 Callers must specifically request "dma-pinned tracking of pages". In other
 words, just calling get_user_pages() will not suffice; a new set of functions,
 pin_user_page() and related, must be used.
@@ -228,5 +221,6 @@ References
 
 * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
 * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
 * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
 
 John Hubbard, October, 2019

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca787c606f0e..fdcd137b9981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
+static inline bool hpage_pincount_available(struct page *page)
+{
+	/*
+	 * Can the page->hpage_pinned_refcount field be used? That field is in
+	 * the 3rd page of the compound page, so the smallest (2-page) compound
+	 * pages cannot support it.
+	 */
+	page = compound_head(page);
+	return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	page = compound_head(page);
+	return atomic_read(compound_pincount_ptr(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
@@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * refcounts, and b) all the callers of this routine are expected to be able to
  * deal gracefully with a false positive.
  *
+ * For huge pages, the result will be exactly correct. That's because we have
+ * more tracking data available: the 3rd struct page in the compound page is
+ * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS
+ * scheme).
+ *
  * For more information, please see Documentation/vm/pin_user_pages.rst.
  *
  * @page:	pointer to page to be queried.
@@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  */
 static inline bool page_maybe_dma_pinned(struct page *page)
 {
+	if (hpage_pincount_available(page))
+		return compound_pincount(page) > 0;
+
 	/*
 	 * page_ref_count() is signed. If that refcount overflows, then
 	 * page_ref_count() returns a negative value, and callers will avoid

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e87bb864bdb2..01e9717b8529 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -137,7 +137,7 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
@@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
 	return &page[1].compound_mapcount;
 }
 
+static inline atomic_t *compound_pincount_ptr(struct page *page)
+{
+	return &page[2].hpage_pinned_refcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */

diff --git a/mm/gup.c b/mm/gup.c
index 6e8b773c233a..c10d0d051c5b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,22 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static void hpage_pincount_add(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_add(refs, compound_pincount_ptr(page));
+}
+
+static void hpage_pincount_sub(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_sub(refs, compound_pincount_ptr(page));
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
-		refs *= GUP_PIN_COUNTING_BIAS;
-		return try_get_compound_head(page, refs);
+		/*
+		 * When pinning a compound page of order > 1 (which is what
+		 * hpage_pincount_available() checks for), use an exact count to
+		 * track it, via hpage_pincount_add/_sub().
+		 *
+		 * However, be sure to *also* increment the normal page refcount
+		 * field at least once, so that the page really is pinned.
+		 */
+		if (!hpage_pincount_available(page))
+			refs *= GUP_PIN_COUNTING_BIAS;
+
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, refs);
+
+		return page;
 	}
 
 	WARN_ON_ONCE(1);
@@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		return try_get_page(page);
 	else if (flags & FOLL_PIN) {
+		int refs = 1;
+
 		page = compound_head(page);
 
 		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 			return false;
 
-		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, 1);
+		else
+			refs = GUP_PIN_COUNTING_BIAS;
+
+		/*
+		 * Similar to try_grab_compound_head(): even if using the
+		 * hpage_pincount_add/_sub() routines, be sure to
+		 * *also* increment the normal page refcount field at least
+		 * once, so that the page really is pinned.
+		 */
+		page_ref_add(page, refs);
 	}
 
 	return true;
@@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 static bool __unpin_devmap_managed_user_page(struct page *page)
 {
-	int count;
+	int count, refs = 1;
 
 	if (!page_is_devmap_managed(page))
 		return false;
 
-	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	count = page_ref_sub_return(page, refs);
 
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
@@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  */
 void unpin_user_page(struct page *page)
 {
+	int refs = 1;
+
 	page = compound_head(page);
 
 	/*
@@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
 	if (__unpin_devmap_managed_user_page(page))
 		return;
 
-	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 }
 EXPORT_SYMBOL(unpin_user_page);
@@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
-	if (flags & FOLL_PIN)
-		refs *= GUP_PIN_COUNTING_BIAS;
+	if (flags & FOLL_PIN) {
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
 
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 487e998fd38e..07059d936f7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
 		set_page_refcounted(p);
@@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 /*

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15e908ad933b..c205b912f108 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC

diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..e45b9b991e2f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		if (hpage_pincount_available(page))
+			atomic_set(compound_pincount_ptr(page), 0);
+
 		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
@@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
2.25.0
As part of pin_user_pages() and related API calls, pages are "dma-pinned". For the case of compound pages of order > 1, the per-page accounting of dma pins is accomplished via the 3rd struct page in the compound page. In order to support debugging of any pin_user_pages()-related problems, enhance dump_page() so as to report the pin count in that case.
Documentation/core-api/pin_user_pages.rst is also updated accordingly.
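For example, a hypothetical debugging call site might look like the sketch below (dump_page() is the existing helper declared in linux/mmdebug.h; the reason string is arbitrary):

#include <linux/mm.h>		/* page_maybe_dma_pinned() */
#include <linux/mmdebug.h>	/* dump_page() */

static void report_pin_state(struct page *page)
{
	/* For compound pages of order > 1, the report now also includes the
	 * exact, hpage_pinned_refcount-based pin count:
	 */
	if (page_maybe_dma_pinned(page))
		dump_page(page, "page appears to be dma-pinned");
}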
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 Documentation/core-api/pin_user_pages.rst |  7 +++++
 mm/debug.c                                | 34 +++++++++++++++++------
 2 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 3f72b1ea1104..dd21ea140ef4 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -215,6 +215,13 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
 because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
+Other diagnostics
+=================
+
+dump_page() has been enhanced slightly, to handle these new counting fields, and
+to better report on compound pages in general. Specifically, for compound pages
+with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+
 References
 ==========
diff --git a/mm/debug.c b/mm/debug.c
index beb1c59d784b..db81b11345be 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -57,10 +57,20 @@ static void __dump_tail_page(struct page *page, int mapcount)
 			page, page_ref_count(page), mapcount, page->mapping,
 			page_to_pgoff(page));
 	} else {
-		pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
-			"index:%#lx compound_mapcount:%d\n",
-			page, page_ref_count(head), mapcount, head->mapping,
-			page_to_pgoff(head), compound_mapcount(page));
+		if (hpage_pincount_available(page))
+			pr_warn("page:%px compound refcount:%d mapcount:%d "
+				"mapping:%px index:%#lx compound_mapcount:%d "
+				"compound_pincount:%d\n",
+				page, page_ref_count(head), mapcount,
+				head->mapping, page_to_pgoff(head),
+				compound_mapcount(page),
+				compound_pincount(page));
+		else
+			pr_warn("page:%px compound refcount:%d mapcount:%d "
+				"mapping:%px index:%#lx compound_mapcount:%d\n",
+				page, page_ref_count(head), mapcount,
+				head->mapping, page_to_pgoff(head),
+				compound_mapcount(page));
 	}
 
 	if (page_ref_count(page) != 0)
@@ -103,10 +113,18 @@ void __dump_page(struct page *page, const char *reason)
 
 	if (PageTail(page))
 		__dump_tail_page(page, mapcount);
-	else
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
-			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page));
+	else {
+		if (hpage_pincount_available(page))
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+				"index:%#lx compound pincount: %d\n",
+				page, page_ref_count(page), mapcount,
+				page->mapping, page_to_pgoff(page),
+				compound_pincount(page));
+		else
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+				"index:%#lx\n", page, page_ref_count(page),
+				mapcount, page->mapping, page_to_pgoff(page));
+	}
 	if (PageKsm(page))
 		type = "ksm ";
 	else if (PageAnon(page))
On Fri, Jan 31, 2020 at 07:40:26PM -0800, John Hubbard wrote:
As part of pin_user_pages() and related API calls, pages are "dma-pinned". For the case of compound pages of order > 1, the per-page accounting of dma pins is accomplished via the 3rd struct page in the compound page. In order to support debugging of any pin_user_pages()- related problems, enhance dump_page() so as to report the pin count in that case.
Documentation/core-api/pin_user_pages.rst is also updated accordingly.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
On Fri 31-01-20 19:40:26, John Hubbard wrote:
As part of pin_user_pages() and related API calls, pages are "dma-pinned". For the case of compound pages of order > 1, the per-page accounting of dma pins is accomplished via the 3rd struct page in the compound page. In order to support debugging of any pin_user_pages()- related problems, enhance dump_page() so as to report the pin count in that case.
Documentation/core-api/pin_user_pages.rst is also updated accordingly.
Signed-off-by: John Hubbard jhubbard@nvidia.com
Looks good to me. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
Documentation/core-api/pin_user_pages.rst | 7 +++++ mm/debug.c | 34 +++++++++++++++++------ 2 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 3f72b1ea1104..dd21ea140ef4 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -215,6 +215,13 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is because there is a noticeable performance drop in unpin_user_page(), when they are activated. +Other diagnostics +=================
+dump_page() has been enhanced slightly, to handle these new counting fields, and +to better report on compound pages in general. Specifically, for compound pages +with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
References
diff --git a/mm/debug.c b/mm/debug.c
index beb1c59d784b..db81b11345be 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -57,10 +57,20 @@ static void __dump_tail_page(struct page *page, int mapcount)
 			page, page_ref_count(page), mapcount, page->mapping,
 			page_to_pgoff(page));
 	} else {
-		pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
-			"index:%#lx compound_mapcount:%d\n",
-			page, page_ref_count(head), mapcount, head->mapping,
-			page_to_pgoff(head), compound_mapcount(page));
+		if (hpage_pincount_available(page))
+			pr_warn("page:%px compound refcount:%d mapcount:%d "
+				"mapping:%px index:%#lx compound_mapcount:%d "
+				"compound_pincount:%d\n",
+				page, page_ref_count(head), mapcount,
+				head->mapping, page_to_pgoff(head),
+				compound_mapcount(page),
+				compound_pincount(page));
+		else
+			pr_warn("page:%px compound refcount:%d mapcount:%d "
+				"mapping:%px index:%#lx compound_mapcount:%d\n",
+				page, page_ref_count(head), mapcount,
+				head->mapping, page_to_pgoff(head),
+				compound_mapcount(page));
 	}
 
 	if (page_ref_count(page) != 0)
@@ -103,10 +113,18 @@ void __dump_page(struct page *page, const char *reason)
 
 	if (PageTail(page))
 		__dump_tail_page(page, mapcount);
-	else
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
-			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page));
+	else {
+		if (hpage_pincount_available(page))
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+				"index:%#lx compound pincount: %d\n",
+				page, page_ref_count(page), mapcount,
+				page->mapping, page_to_pgoff(page),
+				compound_pincount(page));
+		else
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+				"index:%#lx\n", page, page_ref_count(page),
+				mapcount, page->mapping, page_to_pgoff(page));
+	}
 	if (PageKsm(page))
 		type = "ksm ";
 	else if (PageAnon(page))
-- 2.25.0
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via unpin_user_pages*(), we need some visibility into whether all of this is working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_requested
nr_foll_pin_returned
These are documented in Documentation/core-api/pin_user_pages.rst. They represent the number of pages (since boot time) that have been pinned ("nr_foll_pin_requested") and unpinned ("nr_foll_pin_returned"), via pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages pinned, the above two fields will normally be equal to each other.
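For reference, a quick way to read the two fields from userspace once the patch is applied (plain shell; the field names are the vmstat_text[] entries added below):

# With no long-lived FOLL_PIN users active, the two values should track
# each other closely:
grep -E 'nr_foll_pin_(requested|returned)' /proc/vmstat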
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 include/linux/mmzone.h |  2 ++
 mm/gup.c               | 21 +++++++++++++++++++++
 mm/vmstat.c            |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c2bc309d1634..01d690586206 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -243,6 +243,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_REQUESTED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RETURNED,	/* pages returned via unpin_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };

diff --git a/mm/gup.c b/mm/gup.c
index c10d0d051c5b..9fe61d15fc0e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,19 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+#ifdef CONFIG_DEBUG_VM
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+	mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+}
+#endif
+
 static void hpage_pincount_add(struct page *page, int refs)
 {
 	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
@@ -86,6 +99,8 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
+		int orig_refs = refs;
+
 		/*
 		 * When pinning a compound page of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
@@ -104,6 +119,7 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 		if (hpage_pincount_available(page))
 			hpage_pincount_add(page, refs);
 
+		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, orig_refs);
 		return page;
 	}
@@ -158,6 +174,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 		 * once, so that the page really is pinned.
 		 */
 		page_ref_add(page, refs);
+
+		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
 	}
 
 	return true;
@@ -178,6 +196,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
 
 	count = page_ref_sub_return(page, refs);
+	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
 
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
 	 * refcount is 1, then the page is free and the refcount is
@@ -228,6 +247,8 @@ void unpin_user_page(struct page *page)
 
 	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
 }
 EXPORT_SYMBOL(unpin_user_page);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..b56808bae1b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_requested",
+	"nr_foll_pin_returned",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c
index c10d0d051c5b..9fe61d15fc0e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,19 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+	mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+}
+#endif
+
 static void hpage_pincount_add(struct page *page, int refs)
 {
 	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
On 2/3/20 5:53 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c index c10d0d051c5b..9fe61d15fc0e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,19 @@ struct follow_page_context { unsigned int page_mask; }; +#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
Early on, gup_benchmark showed a really significant slowdown from using these counters. And I don't doubt that it's still the case.
I'll re-measure and add a short summary and a few numbers to the patch commit description, and to the v4 cover letter.
thanks,
On Mon, Feb 03, 2020 at 01:04:04PM -0800, John Hubbard wrote:
On 2/3/20 5:53 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c index c10d0d051c5b..9fe61d15fc0e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,19 @@ struct follow_page_context { unsigned int page_mask; }; +#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
Early on, gup_benchmark showed a really significant slowdown from using these counters. And I don't doubt that it's still the case.
I'll re-measure and add a short summary and a few numbers to the patch commit description, and to the v4 cover letter.
Looks like you'll show zeros for these counters if debug is off. It can be confusing to the user. I think these counters should go away if you don't count them.
On 2/3/20 1:30 PM, Kirill A. Shutemov wrote:
On Mon, Feb 03, 2020 at 01:04:04PM -0800, John Hubbard wrote:
On 2/3/20 5:53 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c index c10d0d051c5b..9fe61d15fc0e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,19 @@ struct follow_page_context { unsigned int page_mask; }; +#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
Early on, gup_benchmark showed a really significant slowdown from using these counters. And I don't doubt that it's still the case.
I'll re-measure and add a short summary and a few numbers to the patch commit description, and to the v4 cover letter.
Looks like you'll show zeros for these counters if debug is off. It can be confusing to the user. I think these counters should go away if you don't count them.
OK, that's a good point. (And in fact, the counters==0 situation already led me astray briefly while debugging with Leon R, even. heh.) I'll remove them entirely for the !CONFIG_DEBUG_VM case.
thanks,
On 2/3/20 1:34 PM, John Hubbard wrote:
On 2/3/20 1:30 PM, Kirill A. Shutemov wrote:
On Mon, Feb 03, 2020 at 01:04:04PM -0800, John Hubbard wrote:
On 2/3/20 5:53 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c index c10d0d051c5b..9fe61d15fc0e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,19 @@ struct follow_page_context { unsigned int page_mask; }; +#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
Early on, gup_benchmark showed a really significant slowdown from using these counters. And I don't doubt that it's still the case.
I'll re-measure and add a short summary and a few numbers to the patch commit description, and to the v4 cover letter.
Looks like you'll show zeros for these counters if debug is off. It can be confusing to the user. I think these counters should go away if you don't count them.
OK, that's a good point. (And in fact, the counters==0 situation already led me astray briefly while debugging with Leon R, even. heh.) I'll remove them entirely for the !CONFIG_DEBUG_VM case.
On second thought, let me do some more careful performance testing. I don't recall now if I was just removing every possible perf slowdown item, when I made this decision. It could be that the perf is not affected, and I could just leave this feature enabled at all times, which would be nicer.
And after all, these counters were designed for pretty hot-path items. I'll report back with results...
thanks,
On 2/3/20 3:16 PM, John Hubbard wrote:
On 2/3/20 1:34 PM, John Hubbard wrote:
On 2/3/20 1:30 PM, Kirill A. Shutemov wrote:
On Mon, Feb 03, 2020 at 01:04:04PM -0800, John Hubbard wrote:
On 2/3/20 5:53 AM, Kirill A. Shutemov wrote:
On Fri, Jan 31, 2020 at 07:40:27PM -0800, John Hubbard wrote:
diff --git a/mm/gup.c b/mm/gup.c index c10d0d051c5b..9fe61d15fc0e 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,19 @@ struct follow_page_context { unsigned int page_mask; }; +#ifdef CONFIG_DEBUG_VM
Why under CONFIG_DEBUG_VM? There's nothing about this in the cover letter.
Early on, gup_benchmark showed a really significant slowdown from using these counters. And I don't doubt that it's still the case.
I'll re-measure and add a short summary and a few numbers to the patch commit description, and to the v4 cover letter.
Looks like you'll show zeros for these counters if debug is off. It can be confusing to the user. I think these counters should go away if you don't count them.
OK, that's a good point. (And in fact, the counters==0 situation already led me astray briefly while debugging with Leon R, even. heh.) I'll remove them entirely for the !CONFIG_DEBUG_VM case.
On second thought, let me do some more careful performance testing. I don't recall now if I was just removing every possible perf slowdown item, when I made this decision. It could be that the perf is not affected, and I could just leave this feature enabled at all times, which would be nicer.
And after all, these counters were designed for pretty hot-path items. I'll report back with results...
Sure enough, any perf effects are hidden in the noise. I'll just remove the CONFIG_DEBUG_VM checks. Glad you asked about this!
thanks,
Up until now, gup_benchmark supported testing of the following kernel functions:
* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)
Add test coverage for the new corresponding pin_*() functions:
* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages(): via the '-b' command line option
Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast().
Also, for the commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new page_maybe_dma_pinned() routine. Those commands are:
PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK      : calls pin_user_pages()
In between the calls to pin_*() and unpin_user_pages(), check each page: if page_maybe_dma_pinned() returns false, then WARN and return.
Do this outside of the benchmark timestamps, so that it doesn't affect reported times.
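As a usage sketch (hypothetical invocations; this assumes a kernel built with CONFIG_GUP_BENCHMARK and the selftest binary built under tools/testing/selftests/vm):

# FOLL_PIN paths, with the page_maybe_dma_pinned() verification described above:
./gup_benchmark -a	# pin_user_pages_fast()
./gup_benchmark -b	# pin_user_pages()

# FOLL_GET baseline, for comparison:
./gup_benchmark -u	# get_user_pages_fast(), same as the default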
Reviewed-by: Ira Weiny ira.weiny@intel.com
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 mm/gup_benchmark.c                         | 71 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
 2 files changed, 80 insertions(+), 6 deletions(-)
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 8dba38e79a9f..447628d0131f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +21,48 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(unsigned int cmd, struct page **pages,
+			   unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		unpin_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	int i;
+	struct page *page;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+			if (WARN(!page_maybe_dma_pinned(page),
+				 "pages[%d] is NOT dma-pinned\n", i)) {
+
+				dump_page(page, "gup_benchmark failure");
+				break;
+			}
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 				 struct gup_benchmark *gup)
 {
@@ -66,6 +110,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
+		case PIN_FAST_BENCHMARK:
+			nr = pin_user_pages_fast(addr, nr, gup->flags,
+						 pages + i);
+			break;
+		case PIN_BENCHMARK:
+			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+					    NULL);
+			break;
 		default:
 			kvfree(pages);
 			ret = -EINVAL;
@@ -78,15 +130,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 	}
 	end_time = ktime_get();
 
+	/* Shifting the meaning of nr_pages: now it is actual number pinned: */
+	nr_pages = i;
+
 	gup->get_delta_usec = ktime_us_delta(end_time, start_time);
 	gup->size = addr - gup->addr;
 
+	/*
+	 * Take an un-benchmark-timed moment to verify DMA pinned
+	 * state: print a warning if any non-dma-pinned pages are found:
+	 */
+	verify_dma_pinned(cmd, pages, nr_pages);
+
 	start_time = ktime_get();
-	for (i = 0; i < nr_pages; i++) {
-		if (!pages[i])
-			break;
-		put_page(pages[i]);
-	}
+
+	put_back_pages(cmd, pages, nr_pages);
+
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
@@ -105,6 +164,8 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 	case GUP_FAST_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;

diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 389327e9b30a..43b4dfe161a2 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,10 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
 
+/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
+
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
@@ -40,8 +44,14 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) {
 		switch (opt) {
+		case 'a':
+			cmd = PIN_FAST_BENCHMARK;
+			break;
+		case 'b':
+			cmd = PIN_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -63,6 +73,9 @@ int main(int argc, char **argv)
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
+		case 'u':
+			cmd = GUP_FAST_BENCHMARK;
+			break;
 		case 'w':
 			write = 1;
 			break;
On Fri, Jan 31, 2020 at 07:40:28PM -0800, John Hubbard wrote:
Up until now, gup_benchmark supported testing of the following kernel functions:
- get_user_pages(): via the '-U' command line option
- get_user_pages_longterm(): via the '-L' command line option
- get_user_pages_fast(): as the default (no options required)
Add test coverage for the new corresponding pin_*() functions:
- pin_user_pages_fast(): via the '-a' command line option
- pin_user_pages(): via the '-b' command line option
Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast().
Also, for the commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new is_dma_pinned() routine. Those commands are:
PIN_FAST_BENCHMARK : calls pin_user_pages_fast() PIN_BENCHMARK : calls pin_user_pages()
In between the calls to pin_*() and unpin_user_pages(), check each page: if page_maybe_dma_pinned() returns false, then WARN and return.
Do this outside of the benchmark timestamps, so that it doesn't affect reported times.
Reviewed-by: Ira Weiny ira.weiny@intel.com
Signed-off-by: John Hubbard jhubbard@nvidia.com

 mm/gup_benchmark.c                         | 71 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
 2 files changed, 80 insertions(+), 6 deletions(-)
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 8dba38e79a9f..447628d0131f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +21,48 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(unsigned int cmd, struct page **pages,
+			   unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
'i' is 'int' and 'nr_pages' is 'unsigned long'. There's space for trouble :P
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		unpin_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	int i;
+	struct page *page;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
Ditto.
+			page = pages[i];
+			if (WARN(!page_maybe_dma_pinned(page),
+				 "pages[%d] is NOT dma-pinned\n", i)) {
+
+				dump_page(page, "gup_benchmark failure");
+				break;
+			}
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 				 struct gup_benchmark *gup)
 {
On 2/3/20 5:58 AM, Kirill A. Shutemov wrote: ...
@@ -19,6 +21,48 @@ struct gup_benchmark { __u64 expansion[10]; /* For future use */ }; +static void put_back_pages(unsigned int cmd, struct page **pages,
unsigned long nr_pages)
+{
- int i;
- switch (cmd) {
- case GUP_FAST_BENCHMARK:
- case GUP_LONGTERM_BENCHMARK:
- case GUP_BENCHMARK:
for (i = 0; i < nr_pages; i++)
'i' is 'int' and 'nr_pages' is 'unsigned long'. There's space for trouble :P
Yes, I've changed it to "unsigned int", thanks.
put_page(pages[i]);
break;
- case PIN_FAST_BENCHMARK:
- case PIN_BENCHMARK:
unpin_user_pages(pages, nr_pages);
break;
- }
+}
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
unsigned long nr_pages)
+{
- int i;
- struct page *page;
- switch (cmd) {
- case PIN_FAST_BENCHMARK:
- case PIN_BENCHMARK:
for (i = 0; i < nr_pages; i++) {
Ditto.
Fixed here also.
page = pages[i];
if (WARN(!page_maybe_dma_pinned(page),
"pages[%d] is NOT dma-pinned\n", i)) {
...and changed to "pages[%u]", to match.
thanks,
On Mon, Feb 03, 2020 at 01:17:40PM -0800, John Hubbard wrote:
On 2/3/20 5:58 AM, Kirill A. Shutemov wrote: ...
@@ -19,6 +21,48 @@ struct gup_benchmark { __u64 expansion[10]; /* For future use */ }; +static void put_back_pages(unsigned int cmd, struct page **pages,
unsigned long nr_pages)
+{
- int i;
- switch (cmd) {
- case GUP_FAST_BENCHMARK:
- case GUP_LONGTERM_BENCHMARK:
- case GUP_BENCHMARK:
for (i = 0; i < nr_pages; i++)
'i' is 'int' and 'nr_pages' is 'unsigned long'. There's space for trouble :P
Yes, I've changed it to "unsigned int", thanks.
I'm confused. If nr_pages is more than UINT_MAX, this is an endless loop. Hm?
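A minimal sketch of the wrap-around being pointed out here (hypothetical values, not code from the patch):

unsigned int i;				/* 32-bit loop counter */
unsigned long nr_pages = 5UL << 30;	/* > UINT_MAX, possible on 64-bit */

for (i = 0; i < nr_pages; i++)		/* i wraps to 0 after UINT_MAX, so  */
	;				/* the condition never becomes false */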
On 2/3/20 1:55 PM, Kirill A. Shutemov wrote:
On Mon, Feb 03, 2020 at 01:17:40PM -0800, John Hubbard wrote:
On 2/3/20 5:58 AM, Kirill A. Shutemov wrote: ...
@@ -19,6 +21,48 @@ struct gup_benchmark { __u64 expansion[10]; /* For future use */ }; +static void put_back_pages(unsigned int cmd, struct page **pages,
unsigned long nr_pages)
+{
- int i;
- switch (cmd) {
- case GUP_FAST_BENCHMARK:
- case GUP_LONGTERM_BENCHMARK:
- case GUP_BENCHMARK:
for (i = 0; i < nr_pages; i++)
'i' is 'int' and 'nr_pages' is 'unsigned long'. There's space for trouble :P
Yes, I've changed it to "unsigned int", thanks.
I'm confused. If nr_pages is more than UINT_MAX, this is endless loop. Hm?
Oh, I've been afflicted with 64-bit tunnel vision. OK, make that "unsigned long" and "%ul". yikes. :)
thanks,
It's good to have basic unit test coverage of the new FOLL_PIN behavior. Fortunately, the gup_benchmark unit test is extremely fast (a few milliseconds), so adding it to the run_vmtests suite is going to cause no noticeable change in running time.
So, add two new invocations to run_vmtests:
1) Run gup_benchmark with normal get_user_pages().
2) Run gup_benchmark with pin_user_pages(). This is much like the first call, except that it sets FOLL_PIN.
Running these two in quick succession also provides a visual comparison of the running times, which is convenient.
The new invocations are fairly early in the run_vmtests script, because with test suites, it's usually preferable to put the shorter, faster tests first, all other things being equal.
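One plausible way to exercise this locally (paths per the usual kselftest layout; the vm tests generally want root and some hugepages configured):

make -C tools/testing/selftests/vm
cd tools/testing/selftests/vm && sudo ./run_vmtests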
Reviewed-by: Ira Weiny ira.weiny@intel.com
Signed-off-by: John Hubbard jhubbard@nvidia.com
---
 tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
linux-kselftest-mirror@lists.linaro.org