Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

4 Nov 2019

On 11/4/19 10:52 AM, Jerome Glisse wrote:
...
On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
...
Add tracking of pages that were pinned via FOLL_PIN.
As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via put_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:
bool page_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
This also has a couple of trivial, non-functional change fixes to
try_get_compound_head(). That function got moved to the top of the
file.
Maybe split that as a separate trivial patch.
Will do.
...
...
This includes the following fix from Ira Weiny:
DAX requires detection of a page crossing to a ref count of 1.  Fix this
for GUP pages by introducing put_devmap_managed_user_page() which
accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
Please do the put_devmap_managed_page() changes in a separate
patch, it would be a lot easier to follow, also on that front
see comments below.
Oh! OK. It makes sense when you say it out loud. :)
...
...
...
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_dec_return(page);
+
+		__put_devmap_managed_page(page, count);
+	}
+
+	return is_devmap;
+}
I think __put_devmap_managed_page() should be renamed
to free_devmap_managed_page() and the count != 1
case moved to this inline function, i.e.:
static inline bool put_devmap_managed_page(struct page *page)
{
	bool is_devmap = page_is_devmap_managed(page);

	if (is_devmap) {
		int count = page_ref_dec_return(page);

		/*
		 * If refcount is 1 then page is freed and refcount is stable as nobody
		 * holds a reference on the page.
		 */
		if (count == 1)
			free_devmap_managed_page(page, count);
		else if (!count)
			__put_page(page);
	}

	return is_devmap;
}
Thanks, that does look cleaner and easier to read.
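For readers following along, the control flow suggested above can be modeled in user space. This is only a sketch under stated assumptions: struct fake_page, the model_* names, and the two counters are hypothetical stand-ins for struct page, page_ref_dec_return(), free_devmap_managed_page() and __put_page(); none of it is the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what this sketch needs. */
struct fake_page {
	int refcount;        /* models the page reference count */
	bool devmap_managed; /* models page_is_devmap_managed() */
	int freed_to_device; /* times the devmap free path ran */
	int freed_to_system; /* times the ordinary free path ran */
};

/* Models page_ref_dec_return(): drop one reference, return the new count. */
static int model_ref_dec_return(struct fake_page *page)
{
	assert(page->refcount > 0);
	return --page->refcount;
}

/*
 * Models the suggested put_devmap_managed_page(): the count == 1 and
 * count == 0 cases are both decided here, in the inline helper.
 */
static bool model_put_devmap_managed_page(struct fake_page *page)
{
	if (!page->devmap_managed)
		return false;

	int count = model_ref_dec_return(page);

	if (count == 1)
		page->freed_to_device++; /* stands in for free_devmap_managed_page() */
	else if (count == 0)
		page->freed_to_system++; /* stands in for __put_page() */

	return true;
}
```

In this model a page that is not devmap managed is left untouched and the helper returns false, so the caller falls back to its normal put path.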
...
...



#else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool put_devmap_managed_page(struct page *page)
 {
@@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
   return true;
 }
 
+__must_check bool user_page_ref_inc(struct page *page);



What about having it as an inline here as it is pretty small.
You mean move it to a static inline function in mm.h? It's worse than it 
looks, though: *everything* that it calls is also a static function, local
to gup.c. So I'd have to expose both try_get_compound_head() and
__update_proc_vmstat(). And that also means calling mod_node_page_state() from
mm.h, and it goes south right about there. :)
...
...
...
+/**
+ * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
+ * or pin_longterm_pages*()
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */

Maybe add a small comment about wrap around :)
I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().
But it's true that the documentation is a little light here...What did you have 
in mind?
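To make the "likely pinned" versus "definitely not pinned" wording and the wrap-around question concrete, here is a small user-space model. It is a sketch, not the patch: the bias value of 1024, struct fake_page, and the exact overflow check are assumptions made for illustration, and the real checks in user_page_ref_inc() may differ.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_PIN_BIAS 1024 /* assumed value, for illustration only */

/* Hypothetical stand-in for struct page. */
struct fake_page {
	int refcount;
};

/* Models a pin: add the bias, but refuse if that would overflow. */
static bool model_user_page_ref_inc(struct fake_page *page)
{
	if (page->refcount <= 0 || page->refcount > INT_MAX - MODEL_PIN_BIAS)
		return false;
	page->refcount += MODEL_PIN_BIAS;
	return true;
}

/* Models the matching release of one pin. */
static void model_put_user_page(struct fake_page *page)
{
	assert(page->refcount > MODEL_PIN_BIAS);
	page->refcount -= MODEL_PIN_BIAS;
}

/*
 * True means "likely pinned": a page with very many ordinary references
 * also crosses the threshold. False means "definitely not pinned".
 */
static bool model_page_dma_pinned(const struct fake_page *page)
{
	return page->refcount >= MODEL_PIN_BIAS;
}
```

The false positive is visible in the model: a page holding 1024 plain references reports as pinned even though nobody pinned it, which is why the return value is documented as "likely".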
...
[...]
...
@@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+
+		if (flags & FOLL_PIN) {
+			if (unlikely(!user_page_ref_inc(page))) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				return 0;
+			}

Maybe add a comment about a case that should never happen, i.e.
user_page_ref_inc() failing after the second iteration of the
loop, as it would be broken and a bug to call undo_dev_pagemap()
after the first iteration of that loop.
Also I believe that this should never happen, because if the first
iteration succeeds then __page_cache_add_speculative() will
succeed for all the iterations.
Note that the pgmap case above follows that too, i.e. the call to
get_dev_pagemap() can only fail on the first iteration of the loop;
well, I assume you can never have a huge device page that spans
different pgmaps, i.e. different devices (which is a reasonable
assumption). So maybe this code needs fixing, i.e.:
pgmap = get_dev_pagemap(pfn, pgmap);
if (unlikely(!pgmap))
	return 0;



OK, yes that does make sense. And I think a comment is adequate,
no need to check for bugs during every tail page iteration. So how 
about this, as a preliminary patch:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..a4a81e125832 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                unsigned long end, struct page **pages, int *nr)
 {
-       int nr_start = *nr;
-       struct dev_pagemap *pgmap = NULL;
+       /*
+        * Huge pages should never cross dev_pagemap boundaries. Therefore, use
+        * this same pgmap for the entire huge page.
+        */
+       struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
+
+       if (unlikely(!pgmap))
+               return 0;
        do {
                struct page *page = pfn_to_page(pfn);
-               pgmap = get_dev_pagemap(pfn, pgmap);
-               if (unlikely(!pgmap)) {
-                       undo_dev_pagemap(nr, nr_start, pages);
-                       return 0;
-               }
                SetPageReferenced(page);
                pages[*nr] = page;
                get_page(page);
...
...

+		} else
+			get_page(page);
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);

[...]
...
@@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
   unsigned long addr, len, end;
   int nr = 0, ret = 0;

-	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))

Maybe add a comment to explain, something like:
/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
 *
 * Note that get_user_pages_fast() imply FOLL_GET flag by default but
 * callers can over-ride this default to pin case by setting FOLL_PIN.
 */
Good idea. Here's the draft now:
/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
 *
 * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
 * callers can override this default by setting FOLL_PIN instead of
 * FOLL_GET.
 */
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
        return -EINVAL;
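As a quick illustration of what that mask test accepts and rejects, here is a user-space sketch. The MODEL_FOLL_* bit values are illustrative assumptions (the real definitions live in include/linux/mm.h), and model_effective_flags() is a hypothetical helper that only mirrors the wording of the comment, not the actual call path.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bit values only; assumptions for this sketch. */
#define MODEL_FOLL_WRITE    0x01u
#define MODEL_FOLL_GET      0x04u
#define MODEL_FOLL_LONGTERM 0x10000u
#define MODEL_FOLL_PIN      0x40000u

/* Reject any flag outside FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN. */
static int model_check_gup_fast_flags(unsigned int gup_flags)
{
	const unsigned int allowed =
		MODEL_FOLL_WRITE | MODEL_FOLL_LONGTERM | MODEL_FOLL_PIN;

	if (gup_flags & ~allowed)
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical helper: FOLL_GET is implied unless the caller asked
 * for FOLL_PIN instead.
 */
static unsigned int model_effective_flags(unsigned int gup_flags)
{
	assert(model_check_gup_fast_flags(gup_flags) == 0);
	if (!(gup_flags & MODEL_FOLL_PIN))
		gup_flags |= MODEL_FOLL_GET;
	return gup_flags;
}
```

Note that, with this mask, a caller passing FOLL_GET explicitly is rejected as well, since FOLL_GET is not in the allowed set.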
...
...
return -EINVAL;

start = untagged_addr(start) & PAGE_MASK;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..66bf4c8b88f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
[...]
...
@@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
   if (!*pgmap)
   	return ERR_PTR(-EFAULT);
   page = pfn_to_page(pfn);

-	get_page(page);
+
+	if (flags & FOLL_GET)
+		get_page(page);
+	else if (flags & FOLL_PIN)
+		if (unlikely(!user_page_ref_inc(page)))
+			page = ERR_PTR(-ENOMEM);



While I agree that user_page_ref_inc() (i.e. page_cache_add_speculative())
should never fail here, as we are holding the pmd lock and thus no one
can unmap the pmd and free the page it points to, I believe you should
return -EFAULT like for the pgmap and not -ENOMEM, as the pgmap should
not fail either for the same reason. Thus it would be better to have
a consistent error. Maybe also add a comment explaining that it should
not fail here.
OK. I'll take a pass through and fix up the remaining points about these
sorts of cases below, as well, in v3. Those all make sense.
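For illustration, the consistent-error suggestion can be modeled in user space as follows. This is a sketch under assumptions: the model_err_ptr()/model_is_err() helpers imitate the kernel's ERR_PTR()/IS_ERR() convention, and model_follow_devmap() with its have_pgmap and pin_ok parameters is hypothetical, standing in for the two "should never fail" steps.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* User-space imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Hypothetical stand-in for struct page. */
struct fake_page {
	int refcount;
};

/* Both unexpected failures report the same error, -EFAULT. */
static void *model_follow_devmap(struct fake_page *page, bool have_pgmap,
				 bool pin_ok)
{
	assert(page != NULL);

	if (!have_pgmap)
		return model_err_ptr(-EFAULT);

	/* Not expected to fail while the pmd lock is held. */
	if (!pin_ok)
		return model_err_ptr(-EFAULT);

	page->refcount++;
	return page;
}
```

With a single error code, a caller only needs one check to handle either unexpected failure.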
...
...
return page;
 }
[...]
...
@@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
    * device mapped pages can only be returned if the
    * caller will manage the page reference count.
    */

-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);

Maybe add a comment that FOLL_GET or FOLL_PIN must be set.
...
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
   if (!*pgmap)
   	return ERR_PTR(-EFAULT);
   page = pfn_to_page(pfn);

-	get_page(page);
+
+	if (flags & FOLL_GET)
+		get_page(page);
+	else if (flags & FOLL_PIN)
+		if (unlikely(!user_page_ref_inc(page)))
+			page = ERR_PTR(-ENOMEM);



Same as for follow_devmap_pmd() see above.
...
return page;
 }
@@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 skip_mlock:
   page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
   VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

if (flags & FOLL_GET)
  get_page(page);
else if (flags & FOLL_PIN)
if (unlikely(!user_page_ref_inc(page)))


	page = NULL;



This should not fail either, as we are holding the pmd lock; maybe add
a comment. Dunno if we want a WARN() or something to catch this
degenerate case, or dump the page.
...
out:
   return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b45a95363a84..da335b1cd798 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
   	if (pages) {
   		pages[i] = mem_map_offset(page, pfn_offset);

-			get_page(pages[i]);
+
+			if (flags & FOLL_GET)
+				get_page(pages[i]);
+			else if (flags & FOLL_PIN)
+				if (unlikely(!user_page_ref_inc(pages[i]))) {
+					spin_unlock(ptl);
+					remainder = 0;
+					err = -ENOMEM;
+					WARN_ON_ONCE(1);
+					break;
+				}
 		}

user_page_ref_inc() should not fail here either because we hold the
ptl, so the WARN_ON_ONCE() is right, but maybe add a comment.
...
if (vmas)
[...]
...
@@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
   pte = huge_ptep_get((pte_t *)pmd);
   if (pte_present(pte)) {
   	page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);

if (flags & FOLL_GET)
  	get_page(page);
else if (flags & FOLL_PIN)


	if (unlikely(!user_page_ref_inc(page))) {


		page = NULL;


		goto out;


	}



This should not fail either (again holding pmd lock), dunno if we want
a warn or something to catch this degenerate case.
...
} else {
   	if (is_hugetlb_entry_migration(pte)) {
   		spin_unlock(ptl);
[...]
Those are all good points, working on them now.
thanks,
-- 
John Hubbard
NVIDIA

    
