[PATCH STABLE 4.4 0/8] page refcount overflow backports


Vlastimil Babka

8 Nov 2019

9:38 a.m.

Hi,

this series backports the CVE-2019-11487 fixes (page refcount overflow) to 4.4 stable. It differs from Ajay's series [1] in the following:

- gup.c variants of fast gup for x86 and s390 are fixed too. I've not fixed sparc, mips, sh. It's unlikely the known overflow scenario based on FUSE, which needs 140GB of RAM, is a problem for those architectures, and I don't feel confident enough to patch them. I've sent the same fixup for 4.9 [3]
- there are some differences in backport adaptations, hopefully not important. My version is taken from our 4.4 based kernel, which was just simpler for me than adding the missing parts to Ajay's version
- The last patch fixes another problem in the fast gup implementation on x86, that I've previously posted and got merged to 4.9 stable [2].

[1] https://lore.kernel.org/linux-mm/1570581863-12090-1-git-send-email-akaher@vm...
[2] https://lore.kernel.org/linux-mm/20190802160614.8089-1-vbabka@suse.cz/
[3] https://lore.kernel.org/linux-mm/9c130fa4-e52d-f8bd-c450-42341c7ab441@suse.c...

Linus Torvalds (3):
  mm: make page ref count overflow check tighter and more explicit
  mm: add 'try_get_page()' helper function
  mm: prevent get_user_pages() from overflowing page refcount

Matthew Wilcox (1):
  fs: prevent page refcount overflow in pipe_buf_get

Miklos Szeredi (1):
  pipe: add pipe_buf_get() helper

Punit Agrawal (1):
  mm, gup: ensure real head page is ref-counted when using hugepages

Vlastimil Babka (1):
  x86, mm, gup: prevent get_page() race with munmap in paravirt guest

Will Deacon (1):
  mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages

-- 2.23.0


Vlastimil Babka

8 Nov

9:38 a.m.

New subject: [PATCH STABLE 4.4 1/8] mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages

From: Will Deacon <will.deacon@arm.com>

commit a3e328556d41bb61c55f9dfcc62d6a826ea97b85 upstream.

When operating on hugepages with DEBUG_VM enabled, the GUP code checks the compound head for each tail page prior to calling page_cache_add_speculative. This is broken, because on the fast-GUP path (where we don't hold any page table locks) we can be racing with a concurrent invocation of split_huge_page_to_list.

split_huge_page_to_list deals with this race by using page_ref_freeze to freeze the page and force concurrent GUPs to fail whilst the component pages are modified. This modification includes clearing the compound_head field for the tail pages, so checking this prior to a successful call to page_cache_add_speculative can lead to false positives: In fact, page_cache_add_speculative *already* has this check once the page refcount has been successfully updated, so we can simply remove the broken calls to VM_BUG_ON_PAGE.

Link: http://lkml.kernel.org/r/20170522133604.11392-2-punit.agrawal@arm.com
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2cd3b31e3666..6f9088cb8ebe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1134,7 +1134,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1181,7 +1180,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
@@ -1224,7 +1222,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;

-- 2.23.0

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 2/8] mm, gup: ensure real head page is ref-counted when using hugepages

From: Punit Agrawal <punit.agrawal@arm.com>

commit d63206ee32b6e64b0e12d46e5d6004afd9913713 upstream.

When speculatively taking references to a hugepage using page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the page returned by pmd_page() is the head page. Although normally true, this assumption doesn't hold when the hugepage comprises of successive page table entries such as when using contiguous bit on arm64 at PTE or PMD levels.

This can be addressed by ensuring that the page passed to page_cache_add_speculative() is the real head or by de-referencing the head page within the function.

We take the first approach to keep the usage pattern aligned with page_cache_get_speculative() where users already pass the appropriate page, i.e., the de-referenced head.

Apply the same logic to fix gup_huge_[pud|pgd]() as well.

[punit.agrawal@arm.com: fix arm64 ltp failure]
Link: http://lkml.kernel.org/r/20170619170145.25577-5-punit.agrawal@arm.com
Link: http://lkml.kernel.org/r/20170522133604.11392-3-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6f9088cb8ebe..71e9d0093a35 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1130,8 +1130,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;

 	refs = 0;
-	head = pmd_page(orig);
-	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1140,6 +1139,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

+	head = compound_head(pmd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1176,8 +1176,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;

 	refs = 0;
-	head = pud_page(orig);
-	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1186,6 +1185,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

+	head = compound_head(pud_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;
@@ -1218,8 +1218,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;

 	refs = 0;
-	head = pgd_page(orig);
-	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
+	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	tail = page;
 	do {
 		pages[*nr] = page;
@@ -1228,6 +1227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

+	head = compound_head(pgd_page(orig));
 	if (!page_cache_add_speculative(head, refs)) {
 		*nr -= refs;
 		return 0;

-- 2.23.0
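
As a rough illustration of why patch 2/8 resolves the real head before taking references, here is a userspace sketch. It is not the kernel code: `struct mock_page`, `mock_compound_head()` and `mock_add_refs()` are hypothetical stand-ins, and the model only assumes that every non-head page records a pointer to its head.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page: a head pointer and a refcount. */
struct mock_page {
	struct mock_page *head;	/* NULL for a head page or an ordinary page */
	int refcount;
};

/* Return the head of the compound page that `page` belongs to. */
static struct mock_page *mock_compound_head(struct mock_page *page)
{
	return page->head ? page->head : page;
}

/*
 * Add `refs` references on behalf of a page table entry. The entry may
 * point into the middle of a compound page (as with contiguous entries),
 * so the count is always taken on the resolved head, never on
 * `entry_page` itself.
 */
static void mock_add_refs(struct mock_page *entry_page, int refs)
{
	mock_compound_head(entry_page)->refcount += refs;
}
```

In this model, calling `mock_add_refs()` on any constituent page raises only the head's count, which is the property the backported hunks restore for `gup_huge_pmd()`, `gup_huge_pud()` and `gup_huge_pgd()`.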

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 3/8] mm: make page ref count overflow check tighter and more explicit

From: Linus Torvalds <torvalds@linux-foundation.org>

commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upstream.

[ 4.4 backport: page_ref_count() doesn't exist, introduce it to reduce churn. Change also two similar checks in mm/internal.h ]

We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count.

That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON).

Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory.

Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 11 ++++++++++-
 mm/internal.h      |  5 +++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ed653ba47c46..997edfcb0a30 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -488,6 +488,15 @@ static inline void get_huge_page_tail(struct page *page)

 extern bool __get_page_tail(struct page *page);

+static inline int page_ref_count(struct page *page)
+{
+	return atomic_read(&page->_count);
+}
+
+/* 127: arbitrary random number, small enough to assemble well */
+#define page_ref_zero_or_close_to_overflow(page) \
+	((unsigned int) page_ref_count(page) + 127u <= 127u)
+
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageTail(page)))
@@ -497,7 +506,7 @@ static inline void get_page(struct page *page)
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
 	atomic_inc(&page->_count);
 }

diff --git a/mm/internal.h b/mm/internal.h
index f63f4393d633..a6639c72780a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,7 +81,8 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
+	VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(compound_head(page)),
+			page);
 	if (get_page_head)
 		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
@@ -106,7 +107,7 @@ static inline void get_page_foll(struct page *page)
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
 		 */
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+		VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
 		atomic_inc(&page->_count);
 	}
 }

-- 2.23.0
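
The unsigned comparison in `page_ref_zero_or_close_to_overflow()` can be tried out in userspace. The sketch below copies only the arithmetic from patch 3/8; the function name `ref_zero_or_close_to_overflow()` and the plain `int` argument (standing in for the value returned by `page_ref_count()`) are assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * Userspace model of the check: adding 127 in unsigned arithmetic maps the
 * signed range [-127, 0] onto [0, 127], so one comparison covers both
 * "zero" and "slightly negative". Every other value, including large
 * negative counts that overflowed well past INT_MAX, lands above 127.
 */
static bool ref_zero_or_close_to_overflow(int count)
{
	return (unsigned int)count + 127u <= 127u;
}
```

Under this model the check fires for counts from -127 through 0 and stays quiet for any positive count and for counts of -128 or below, which matches the commit message: small underflows still trigger the VM_BUG_ON, while overflows that have strayed far into negative territory are left for the sign-based helpers to handle.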

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 4/8] mm: add 'try_get_page()' helper function

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upstream.

[ 4.4 backport: get_page() is more complicated due to special handling of tail pages via __get_page_tail(). But in all cases, eventually the compound head page's refcount is incremented. So try_get_page() just checks compound head's refcount for overflow and then simply calls get_page(). ]

This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it).

Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count.

The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened.

NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count).

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 997edfcb0a30..78358aeb7732 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -510,6 +510,21 @@ static inline void get_page(struct page *page)
 	atomic_inc(&page->_count);
 }

+static inline __must_check bool try_get_page(struct page *page)
+{
+	struct page *head = compound_head(page);
+
+	/*
+	 * get_page() increases always head page's refcount, either directly or
+	 * via __get_page_tail() for tail page, so we check that
+	 */
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return false;
+
+	get_page(page);
+	return true;
+}
+
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);

-- 2.23.0
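
The contract of `try_get_page()` described in patch 4/8 can be modelled in userspace with C11 atomics. This is a minimal sketch under stated assumptions, not the kernel helper: `struct mock_page` and `mock_try_get_page()` are hypothetical names, compound pages are ignored, and the WARN_ON_ONCE() notification is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct mock_page {
	atomic_int count;
};

/*
 * Take a reference only if the count looks safe. Zero means the caller
 * did not hold a reference to begin with; a negative value means the
 * counter has already climbed past INT_MAX. The read and the increment
 * are separate steps, so the check is racy, but the refused range is half
 * of the counter's range and a few racing increments cannot cross it.
 */
static bool mock_try_get_page(struct mock_page *page)
{
	if (atomic_load(&page->count) <= 0)
		return false;
	atomic_fetch_add(&page->count, 1);
	return true;
}
```

A caller would treat a `false` return the same way the backport does at its call sites: drop whatever it was doing with the page and report an error instead of proceeding with an untracked reference.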

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.

[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers). In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head() Also patch arch-specific variants of gup.c for x86 and s390, leaving mips, sh, sparc alone ]

If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page.

Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/s390/mm/gup.c |  6 ++++--
 arch/x86/mm/gup.c  |  9 ++++++++-
 mm/gup.c           | 39 +++++++++++++++++++++++++++++++--------
 mm/huge_memory.c   |  2 +-
 mm/hugetlb.c       | 18 ++++++++++++++++--
 mm/internal.h      | 12 +++++++++---
 6 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 7ad41be8b373..bdaa5f7b652c 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -37,7 +37,8 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			return 0;
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		if (!page_cache_get_speculative(page))
+		if (unlikely(WARN_ON_ONCE(page_ref_count(page) < 0)
+		    || !page_cache_get_speculative(page)))
 			return 0;
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(page);
@@ -76,7 +77,8 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

-	if (!page_cache_add_speculative(head, refs)) {
+	if (unlikely(WARN_ON_ONCE(page_ref_count(head) < 0)
+	    || !page_cache_add_speculative(head, refs))) {
 		*nr -= refs;
 		return 0;
 	}
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 7d2542ad346a..6612d532e42e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -95,7 +95,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		get_page(page);
+		if (unlikely(!try_get_page(page))) {
+			pte_unmap(ptep);
+			return 0;
+		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -132,6 +135,8 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,

 	refs = 0;
 	head = pmd_page(pmd);
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return 0;
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
@@ -208,6 +213,8 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,

 	refs = 0;
 	head = pud_page(pud);
+	if (WARN_ON_ONCE(page_ref_count(head) <= 0))
+		return 0;
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
diff --git a/mm/gup.c b/mm/gup.c
index 71e9d0093a35..fc8e2dca99fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -127,7 +127,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}

 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		if (!get_page_foll(page, true)) {
+			page = ERR_PTR(-ENOMEM);
+			goto out;
+		}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -289,7 +292,10 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(*pte);
 	}
-	get_page(*page);
+	if (unlikely(!try_get_page(*page))) {
+		ret = -ENOMEM;
+		goto unmap;
+	}
 out:
 	ret = 0;
 unmap:
@@ -1053,6 +1059,20 @@ struct page *get_dump_page(unsigned long addr)
  */
 #ifdef CONFIG_HAVE_GENERIC_RCU_GUP

+/*
+ * Return the compund head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+	struct page *head = compound_head(page);
+	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+		return NULL;
+	if (unlikely(!page_cache_add_speculative(head, refs)))
+		return NULL;
+	return head;
+}
+
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 int write, struct page **pages, int *nr)
@@ -1083,6 +1103,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);

+		if (WARN_ON_ONCE(page_ref_count(page) < 0))
+			goto pte_unmap;
+
 		if (!page_cache_get_speculative(page))
 			goto pte_unmap;

@@ -1139,8 +1162,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

-	head = compound_head(pmd_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pmd_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
@@ -1185,8 +1208,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

-	head = compound_head(pud_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pud_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
@@ -1227,8 +1250,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);

-	head = compound_head(pgd_page(orig));
-	if (!page_cache_add_speculative(head, refs)) {
+	head = try_get_compound_head(pgd_page(orig), refs);
+	if (!head) {
 		*nr -= refs;
 		return 0;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 465786cd6490..6087277981a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page_foll(page, false);

 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd932e7a25dd..b4a8a18fa3a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3886,6 +3886,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
+	int err = -EFAULT;

 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -3957,10 +3958,23 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,

 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
+
+		/*
+		 * Instead of doing 'try_get_page()' below in the same_page
+		 * loop, just check the count once here.
+		 */
+		if (unlikely(page_count(page) <= 0)) {
+			if (pages) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+		}
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page_foll(pages[i]);
+			get_page_foll(pages[i], false);
 		}

 		if (vmas)
@@ -3983,7 +3997,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	*nr_pages = remainder;
 	*position = vaddr;

-	return i ? i : -EFAULT;
+	return i ? i : err;
 }

 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
diff --git a/mm/internal.h b/mm/internal.h
index a6639c72780a..b52041969d06 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,23 +93,29 @@ static inline void __get_page_tail_foll(struct page *page,
  * follow_page() and it must be called while holding the proper PT
  * lock while the pte (or pmd_trans_huge) is still mapping the page.
  */
-static inline void get_page_foll(struct page *page)
+static inline bool get_page_foll(struct page *page, bool check)
 {
-	if (unlikely(PageTail(page)))
+	if (unlikely(PageTail(page))) {
 		/*
 		 * This is safe only because
 		 * __split_huge_page_refcount() can't run under
 		 * get_page_foll() because we hold the proper PT lock.
 		 */
+		if (check && WARN_ON_ONCE(
+				page_ref_count(compound_head(page)) <= 0))
+			return false;
 		__get_page_tail_foll(page, true);
-	else {
+	} else {
 		/*
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
 		 */
 		VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
+		if (check && WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
 		atomic_inc(&page->_count);
 	}
+	return true;
 }

 extern unsigned long highest_memmap_pfn;

-- 2.23.0
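
The wraparound that patch 5/8 guards against can be shown with plain 32-bit arithmetic. The sketch below is a userspace model, not kernel code: `unchecked_add()`, `looks_negative()` and `checked_add()` are hypothetical names, and the counter is held in a `uint32_t` so that the modulo-2^32 behaviour is well defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* What an unchecked increment does to a 32-bit counter: it wraps mod 2^32. */
static uint32_t unchecked_add(uint32_t count, uint32_t refs)
{
	return count + refs;
}

/* True when the counter, read as a signed 32-bit value, would be negative. */
static bool looks_negative(uint32_t count)
{
	return (count & UINT32_C(0x80000000)) != 0;
}

/*
 * Checked variant in the spirit of the `page_ref_count(...) < 0` tests in
 * the patch: once the counter has entered the upper half of its range, no
 * further references are handed out, so it can never travel the remaining
 * two billion steps back around to zero.
 */
static bool checked_add(uint32_t *count, uint32_t refs)
{
	if (looks_negative(*count))
		return false;
	*count = unchecked_add(*count, refs);
	return true;
}
```

In the unchecked model, a counter holding one reference reaches zero again after 2^32 - 1 further increments, which is the point at which the page would be freed while still referenced. In the checked model, an increment from 0x7FFFFFFF still succeeds, but every later attempt is refused.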

Ajay Kaher

3 Dec

12:25 p.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 08/11/19, 3:08 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...

From: Linus Torvalds torvalds@linux-foundation.org

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream. [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmwa...

+ Code will be in sync as we have try_get_page()
+ No need to add extra argument to try_get_page()
+ No need to modify the callers of try_get_page()

...

In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head()

Technically it's fine. If you want to keep the code of stable versions in sync with the latest versions, this could be done as follows (without any modification to the upstream patch for gup_pte_range()):

Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here: https://lore.kernel.org/stable/1570581863-12090-4-git-send-email-akaher@vmwa...

...

Also patch arch-specific variants of gup.c for x86 and s390, leaving mips, sh, sparc alone ]

...

 arch/s390/mm/gup.c |  6 ++++--
 arch/x86/mm/gup.c  |  9 ++++++++-
 mm/gup.c           | 39 +++++++++++++++++++++++++++++++--------
 mm/huge_memory.c   |  2 +-
 mm/hugetlb.c       | 18 ++++++++++++++++--
 mm/internal.h      | 12 +++++++++---
 6 files changed, 69 insertions(+), 17 deletions(-)

 #ifdef __HAVE_ARCH_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 int write, struct page **pages, int *nr)
@@ -1083,6 +1103,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);

+		if (WARN_ON_ONCE(page_ref_count(page) < 0))
+			goto pte_unmap;
+
 		if (!page_cache_get_speculative(page))
 			goto pte_unmap;

...

diff --git a/mm/internal.h b/mm/internal.h
index a6639c72780a..b52041969d06 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,23 +93,29 @@ static inline void __get_page_tail_foll(struct page *page,
  * follow_page() and it must be called while holding the proper PT
  * lock while the pte (or pmd_trans_huge) is still mapping the page.
  */
-static inline void get_page_foll(struct page *page)
+static inline bool get_page_foll(struct page *page, bool check)
 {
-	if (unlikely(PageTail(page)))
+	if (unlikely(PageTail(page))) {
 		/*
 		 * This is safe only because
 		 * __split_huge_page_refcount() can't run under
 		 * get_page_foll() because we hold the proper PT lock.
 		 */
+		if (check && WARN_ON_ONCE(
+				page_ref_count(compound_head(page)) <= 0))
+			return false;
 		__get_page_tail_foll(page, true);
-	else {
+	} else {
 		/*
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
 		 */
 		VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
+		if (check && WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
 		atomic_inc(&page->_count);
 	}
+	return true;
 }

Vlastimil Babka

12:57 p.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 12/3/19 1:25 PM, Ajay Kaher wrote:

...

On 08/11/19, 3:08 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...
From: Linus Torvalds torvalds@linux-foundation.org

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream. [ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://lore.kernel.org/stable/1570581863-12090-3-git-send-email-akaher@vmwa...

Code will be in sync as we have try_get_page()

No need to add extra argument to try_get_page()

No need to modify the callers of try_get_page()

...
In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head()
Technically it's fine. If you want to keep the code of stable versions in sync with latest versions then this could be done in following ways (without any modification in upstream patch for gup_pte_range()):

Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here: https://lore.kernel.org/stable/1570581863-12090-4-git-send-email-akaher@vmwa...

Yup, I have considered that, and deliberately didn't add that commit 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton") as it's part of a large THP refcount rework. In 4.4 we don't expect to GUP tail pages so I wanted to keep it that way - minimally, the compound_head() operation is an unnecessary added cost, although it would also work.

Ajay Kaher

6 Dec

4:15 a.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 03/12/19, 6:28 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...

...
...
[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

Code will be in sync as we have try_get_page()

No need to add extra argument to try_get_page()

No need to modify the callers of try_get_page()

Any reason for not using try_get_page_foll()?

...

...
...
In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head()
Technically it's fine. If you want to keep the code of stable versions in sync with latest versions then this could be done in following ways (without any modification in upstream patch for gup_pte_range()):

Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

...

Yup, I have considered that, and deliberately didn't add that commit 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton") as it's part of a large THP refcount rework. In 4.4 we don't expect to GUP tail pages so I wanted to keep it that way - minimally, the compound_head() operation is a unnecessary added cost, although it would also work.

Vlastimil Babka

2:32 p.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 12/6/19 5:15 AM, Ajay Kaher wrote:

...

On 03/12/19, 6:28 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...
...
...
[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

Code will be in sync as we have try_get_page()

No need to add extra argument to try_get_page()

No need to modify the callers of try_get_page()

Any reason for not using try_get_page_foll().

Ah, sorry, I missed that previously. It's certainly possible to do it that way, I just didn't care so strongly to rewrite the existing SLES patch. It's a stable backport for a rather old LTS, not a codebase for further development.

...

...
...
...
In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head()
Technically it's fine. If you want to keep the code of stable versions in sync with latest versions then this could be done in following ways (without any modification in upstream patch for gup_pte_range()):

Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...
...
Yup, I have considered that, and deliberately didn't add that commit 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton") as it's part of a large THP refcount rework. In 4.4 we don't expect to GUP tail pages so I wanted to keep it that way - minimally, the compound_head() operation is a unnecessary added cost, although it would also work.

Ajay Kaher

9 Dec

8:54 a.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 06/12/19, 8:02 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...

On 12/6/19 5:15 AM, Ajay Kaher wrote:

...
On 03/12/19, 6:28 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...
...
...
[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

Code will be in sync as we have try_get_page()

No need to add extra argument to try_get_page()

No need to modify the callers of try_get_page()

Any reason for not using try_get_page_foll().

Ah, sorry, I missed that previously. It's certainly possible to do it that way, I just didn't care so strongly to rewrite the existing SLES patch. It's a stable backport for a rather old LTS, not a codebase for further development.

Thanks for your response.

I would appreciate it if you could include try_get_page_foll() and resend this patch series.

Greg may also require an Acked-by from me, so if it's fine with you, you can add it, or I will add it once you post the series again.

Let me know if there is anything else I can do here.

...

...
...
...
...
In gup_pte_range(), we don't expect tail pages, so just check page ref count instead of try_get_compound_head()
Technically it's fine. If you want to keep the code of stable versions in sync with the latest versions, this could be done in the following way (without any modification to the upstream patch for gup_pte_range()):

Apply 7aef4172c7957d7e65fc172be4c99becaef855d4 before applying 8fde12ca79aff9b5ba951fce1a2641901b8d8e64, as done here: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...
...
Yup, I have considered that, and deliberately didn't add that commit 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton") as it's part of a large THP refcount rework. In 4.4 we don't expect to GUP tail pages so I wanted to keep it that way - minimally, the compound_head() operation is an unnecessary added cost, although it would also work.

Thanks for above explanation.

Vlastimil Babka

9:10 a.m.

New subject: [PATCH STABLE 4.4 5/8] mm: prevent get_user_pages() from overflowing page refcount

On 12/9/19 9:54 AM, Ajay Kaher wrote:

...

On 06/12/19, 8:02 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...
On 12/6/19 5:15 AM, Ajay Kaher wrote:

...
On 03/12/19, 6:28 PM, "Vlastimil Babka" vbabka@suse.cz wrote:

...
...
...
[ 4.4 backport: there's get_page_foll(), so add try_get_page()-like checks in there, enabled by a new parameter, which is false where upstream patch doesn't replace get_page() with try_get_page() (the THP and hugetlb callers).

Could we have try_get_page_foll(), as in: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

Code will be in sync as we have try_get_page()

No need to add extra argument to try_get_page()

No need to modify the callers of try_get_page()

Any reason for not using try_get_page_foll()?

Ah, sorry, I missed that previously. It's certainly possible to do it that way, I just didn't feel strongly enough about it to rewrite the existing SLES patch. It's a stable backport for a rather old LTS, not a codebase for further development.

Thanks for your response.

I would appreciate it if you could include try_get_page_foll() and resend this patch series.

I won't have time for that now, but I don't mind if you do that, or resend your version with the missing x86 and s390 gup.c parts and preferably without 7aef4172c795.
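For readers following the discussion, the two API shapes can be contrasted in a small userspace model. This is only an illustrative sketch: `model_page`, `model_get_foll()`, `model_get_foll_unchecked()` and `model_try_get_foll()` are hypothetical names, the refcount is a plain int, and none of these functions reproduces the actual 4.4 get_page_foll() logic. The `<= 0` test follows the upstream try_get_page() check, where an overflowed count shows up as a non-positive value.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not a kernel type. */
struct model_page {
	int refcount;
};

/* Shape 1: a single function whose overflow check is enabled by a parameter. */
static bool model_get_foll(struct model_page *page, bool check)
{
	if (check && page->refcount <= 0)
		return false;	/* refuse: count is zero or has wrapped negative */
	page->refcount++;
	return true;
}

/* Shape 2: the unchecked function keeps its signature... */
static void model_get_foll_unchecked(struct model_page *page)
{
	page->refcount++;
}

/* ...and a separate checked variant is added next to it. */
static bool model_try_get_foll(struct model_page *page)
{
	if (page->refcount <= 0)
		return false;
	page->refcount++;
	return true;
}
```

With the first shape every existing caller has to pass the new argument; with the second, only the callers that need the check change.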

Vlastimil Babka

8 Nov

9:38 a.m.

New subject: [PATCH STABLE 4.4 6/8] pipe: add pipe_buf_get() helper

From: Miklos Szeredi mszeredi@redhat.com

commit 7bf2d1df80822ec056363627e2014990f068f7aa upstream.

Signed-off-by: Miklos Szeredi mszeredi@redhat.com
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Signed-off-by: Vlastimil Babka vbabka@suse.cz
---
 fs/fuse/dev.c             |  2 +-
 fs/splice.c               |  4 ++--
 include/linux/pipe_fs_i.h | 11 +++++++++++
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index f5d2d2340b44..36a5df92eb9c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2052,7 +2052,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
 			pipe->nrbufs--;
 		} else {
-			ibuf->ops->get(pipe, ibuf);
+			pipe_buf_get(pipe, ibuf);
 			*obuf = *ibuf;
 			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
diff --git a/fs/splice.c b/fs/splice.c
index 8398974e1538..fde126369966 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 * Get a reference to this pipe buffer,
 			 * so we can copy the contents over.
 			 */
-			ibuf->ops->get(ipipe, ibuf);
+			pipe_buf_get(ipipe, ibuf);
 			*obuf = *ibuf;
 
 			/*
@@ -1948,7 +1948,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		 * Get a reference to this pipe buffer,
 		 * so we can copy the contents over.
 		 */
-		ibuf->ops->get(ipipe, ibuf);
+		pipe_buf_get(ipipe, ibuf);
 
 		obuf = opipe->bufs + nbuf;
 		*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 24f5470d3944..10876f3cb3da 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -115,6 +115,17 @@ struct pipe_buf_operations {
 	void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
 };
 
+/**
+ * pipe_buf_get - get a reference to a pipe_buffer
+ * @pipe:	the pipe that the buffer belongs to
+ * @buf:	the buffer to get a reference to
+ */
+static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+				struct pipe_buffer *buf)
+{
+	buf->ops->get(pipe, buf);
+}
+
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
 #define PIPE_SIZE		PAGE_SIZE

-- 2.23.0

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 7/8] fs: prevent page refcount overflow in pipe_buf_get

From: Matthew Wilcox willy@infradead.org

commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb upstream.

Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure.

Reported-by: Jann Horn jannh@google.com
Signed-off-by: Matthew Wilcox willy@infradead.org
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Vlastimil Babka vbabka@suse.cz
---
 fs/fuse/dev.c             | 12 ++++++------
 fs/pipe.c                 |  4 ++--
 fs/splice.c               | 12 ++++++++++--
 include/linux/pipe_fs_i.h | 10 ++++++----
 kernel/trace/trace.c      |  6 +++++-
 5 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 36a5df92eb9c..16891f5364af 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2031,10 +2031,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 		rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len;
 
 	ret = -EINVAL;
-	if (rem < len) {
-		pipe_unlock(pipe);
-		goto out;
-	}
+	if (rem < len)
+		goto out_free;
 
 	rem = len;
 	while (rem) {
@@ -2052,7 +2050,9 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
 			pipe->nrbufs--;
 		} else {
-			pipe_buf_get(pipe, ibuf);
+			if (!pipe_buf_get(pipe, ibuf))
+				goto out_free;
+
 			*obuf = *ibuf;
 			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
@@ -2075,13 +2075,13 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 	ret = fuse_dev_do_write(fud, &cs, len);
 
 	pipe_lock(pipe);
+out_free:
 	for (idx = 0; idx < nbuf; idx++) {
 		struct pipe_buffer *buf = &bufs[idx];
 		buf->ops->release(pipe, buf);
 	}
 	pipe_unlock(pipe);
 
-out:
 	kfree(bufs);
 	return ret;
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 1e7263bb837a..6534470a6c19 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -178,9 +178,9 @@ EXPORT_SYMBOL(generic_pipe_buf_steal);
  * in the tee() system call, when we duplicate the buffers in one
  * pipe into another.
  */
-void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
 {
-	page_cache_get(buf->page);
+	return try_get_page(buf->page);
 }
 EXPORT_SYMBOL(generic_pipe_buf_get);
 
diff --git a/fs/splice.c b/fs/splice.c
index fde126369966..57ccc583a172 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1876,7 +1876,11 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 * Get a reference to this pipe buffer,
 			 * so we can copy the contents over.
 			 */
-			pipe_buf_get(ipipe, ibuf);
+			if (!pipe_buf_get(ipipe, ibuf)) {
+				if (ret == 0)
+					ret = -EFAULT;
+				break;
+			}
 			*obuf = *ibuf;
 
 			/*
@@ -1948,7 +1952,11 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		 * Get a reference to this pipe buffer,
 		 * so we can copy the contents over.
 		 */
-		pipe_buf_get(ipipe, ibuf);
+		if (!pipe_buf_get(ipipe, ibuf)) {
+			if (ret == 0)
+				ret = -EFAULT;
+			break;
+		}
 
 		obuf = opipe->bufs + nbuf;
 		*obuf = *ibuf;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 10876f3cb3da..0b28b65c12fb 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -112,18 +112,20 @@ struct pipe_buf_operations {
 	/*
 	 * Get a reference to the pipe buffer.
 	 */
-	void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
+	bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
 };
 
 /**
  * pipe_buf_get - get a reference to a pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
  * @buf:	the buffer to get a reference to
+ *
+ * Return: %true if the reference was successfully obtained.
  */
-static inline void pipe_buf_get(struct pipe_inode_info *pipe,
+static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	buf->ops->get(pipe, buf);
+	return buf->ops->get(pipe, buf);
 }
 
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
@@ -148,7 +150,7 @@ struct pipe_inode_info *alloc_pipe_info(void);
 void free_pipe_info(struct pipe_inode_info *);
 
 /* Generic pipe buffer ops functions */
-void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
+bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c6e4e3e7f685..32cc4ea93ad6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5748,12 +5748,16 @@ static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
 	buf->private = 0;
 }
 
-static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
+static bool buffer_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
 	struct buffer_ref *ref = (struct buffer_ref *)buf->private;
 
+	if (ref->ref > INT_MAX/2)
+		return false;
+
 	ref->ref++;
+	return true;
 }
 
 /* Pipe buffer operations for a buffer.  */

-- 2.23.0
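The pattern this patch introduces can be modeled in userspace. The sketch below is not kernel code: `model_buf`, `model_ops`, `model_counted_get()`, `model_buf_get()` and `model_dup_all()` are hypothetical names, and a plain int stands in for the reference count. It only illustrates the idea that the get operation reports failure instead of incrementing past a threshold, and that each caller has to check the result.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a pipe buffer; not a kernel type. */
struct model_buf;

struct model_ops {
	/* Returns true only if a reference was actually taken. */
	bool (*get)(struct model_buf *buf);
};

struct model_buf {
	const struct model_ops *ops;
	int ref;
};

/* Same guard as buffer_pipe_buf_get() in the patch: refuse once the
 * count is above INT_MAX/2, long before it could wrap. */
static bool model_counted_get(struct model_buf *buf)
{
	if (buf->ref > INT_MAX / 2)
		return false;
	buf->ref++;
	return true;
}

static const struct model_ops model_counted_ops = { .get = model_counted_get };

/* Thin wrapper in the style of pipe_buf_get(): forwards the result. */
static bool model_buf_get(struct model_buf *buf)
{
	return buf->ops->get(buf);
}

/* Caller in the style of link_pipe(): stop at the first failed get.
 * Returns the number of references taken, or -1 if the very first get
 * failed (the patch uses -EFAULT there). */
static int model_dup_all(struct model_buf *bufs, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		if (!model_buf_get(&bufs[i])) {
			if (ret == 0)
				ret = -1;
			break;
		}
		ret++;
	}
	return ret;
}
```

As in the patch, a failure after partial progress keeps the progress already made and only reports an error when nothing was duplicated.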

Vlastimil Babka

9:38 a.m.

New subject: [PATCH STABLE 4.4 8/8] x86, mm, gup: prevent get_page() race with munmap in paravirt guest

The x86 version of get_user_pages_fast() relies on disabled interrupts to synchronize gup_pte_range() between gup_get_pte(ptep); and get_page() against a parallel munmap. The munmap side nulls the pte, then flushes TLBs, then releases the page. As the TLB flush is done synchronously via IPI, disabling interrupts blocks the page release, and get_page(), which assumes an existing reference on the page, is thus safe. However, when the TLB flush is done by a hypercall, e.g. in a Xen PV guest, disabled interrupts do not block it, and get_page() can succeed on a page that was already freed or even reused.

We have recently seen this happen with our 4.4 and 4.12 based kernels, with userspace (java) that exits a thread, where mm_release() performs a futex_wake() on tsk->clear_child_tid, and another thread in parallel unmaps the page that tsk->clear_child_tid points to. The spurious get_page() succeeds, but futex code immediately releases the page again, while it's already on a freelist. Symptoms include a bad page state warning, general protection faults accessing a poisoned list prev/next pointer in the freelist, or free page pcplists of two cpus joined together in a single list. Oscar has also reproduced this scenario, with a patch inserting delays before the get_page() to make the race window larger.

Fix this by removing the dependency on TLB flush interrupts the same way as the generic get_user_pages_fast() code by using page_cache_add_speculative() and revalidating the PTE contents after pinning the page. Mainline is safe since 4.13 where the x86 gup code was removed in favor of the common code. Accessing the page table itself safely also relies on disabled interrupts and TLB flush IPIs that don't happen with hypercalls, which was acknowledged in commit 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)"). That commit with follow-ups should also be backported for full safety, although our reproducer didn't hit a problem without that backport.

Reproduced-by: Oscar Salvador osalvador@suse.de
Signed-off-by: Vlastimil Babka vbabka@suse.cz
Cc: Thomas Gleixner tglx@linutronix.de
Cc: Ingo Molnar mingo@redhat.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Juergen Gross jgross@suse.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Vitaly Kuznetsov vkuznets@redhat.com
Cc: Linus Torvalds torvalds@linux-foundation.org
Cc: Borislav Petkov bp@alien8.de
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Andy Lutomirski luto@kernel.org

Signed-off-by: Vlastimil Babka vbabka@suse.cz
---
 arch/x86/mm/gup.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 6612d532e42e..6379a4883c0a 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -9,6 +9,7 @@
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
 #include <linux/swap.h>
+#include <linux/pagemap.h>
 
 #include <asm/pgtable.h>
 
@@ -95,10 +96,23 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
-		if (unlikely(!try_get_page(page))) {
+
+		if (WARN_ON_ONCE(page_ref_count(page) < 0)) {
+			pte_unmap(ptep);
+			return 0;
+		}
+
+		if (!page_cache_get_speculative(page)) {
 			pte_unmap(ptep);
 			return 0;
 		}
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			pte_unmap(ptep);
+			return 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;

-- 2.23.0
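The "pin speculatively, then revalidate" sequence in this patch can be sketched with C11 atomics in userspace. Everything here is a hypothetical model rather than kernel code: `model_page`, `model_pte`, `model_get_speculative()`, `model_put()` and `model_gup_one()` are made-up names, a zero refcount stands for a freed page, and a zero PTE value stands for a cleared entry. In the kernel the page is derived from the PTE snapshot; the model passes it separately to stay short.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; not kernel types. */
struct model_page {
	atomic_int refcount;	/* 0 models a page that was already freed */
};

struct model_pte {
	_Atomic uintptr_t val;	/* 0 models a cleared PTE */
};

/* Take a reference only if the count is still positive, which is the
 * idea behind page_cache_get_speculative(). */
static bool model_get_speculative(struct model_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
		/* The failed compare-exchange reloaded 'old'; retry. */
	}
	return false;
}

static void model_put(struct model_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/* Lockless pin: snapshot the PTE, pin the page, then confirm that the
 * PTE still holds the snapshot value. If it changed, drop the pin. */
static bool model_gup_one(struct model_pte *ptep, struct model_page *page)
{
	uintptr_t snap = atomic_load(&ptep->val);

	if (snap == 0)
		return false;
	if (!model_get_speculative(page))
		return false;
	if (snap != atomic_load(&ptep->val)) {
		model_put(page);
		return false;
	}
	return true;
}
```

The recheck is what removes the dependency on interrupt-driven TLB flushes: a concurrent unmap that clears the entry after the snapshot is detected, and the speculative reference is given back.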

linux-stable-mirror@lists.linaro.org

participants (2)

Ajay Kaher
Vlastimil Babka