On 2019/4/20 4:44, Mike Kravetz wrote:
> Continuing discussion about commit 58b6e5e8f1ad ("hugetlbfs: fix memory
> leak for resv_map") brought up the issue that inode->i_mapping may not
> point to the address space embedded within the inode at inode eviction
> time. The hugetlbfs truncate routine handles this by explicitly using
> inode->i_data. However, code cleaning up the resv_map will still use
> the address space pointed to by inode->i_mapping. Luckily, private_data
> is NULL for address spaces in all such cases today but, there is no
> guarantee this will continue.
>
> Change all hugetlbfs code getting a resv_map pointer to explicitly get
> it from the address space embedded within the inode. In addition, add
> more comments in the code to indicate why this is being done.
>
> Reported-by: Yufen Yu <yuyufen(a)huawei.com>
> Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
> ---
> fs/hugetlbfs/inode.c | 11 +++++++++--
> mm/hugetlb.c | 19 ++++++++++++++++++-
> 2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 9285dd4f4b1c..cbc649cd1722 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -499,8 +499,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
>  	struct resv_map *resv_map;
>  
>  	remove_inode_hugepages(inode, 0, LLONG_MAX);
> -	resv_map = (struct resv_map *)inode->i_mapping->private_data;
> -	/* root inode doesn't have the resv_map, so we should check it */
> +
> +	/*
> +	 * Get the resv_map from the address space embedded in the inode.
> +	 * This is the address space which points to any resv_map allocated
> +	 * at inode creation time.  If this is a device special inode,
> +	 * i_mapping may not point to the original address space.
> +	 */
> +	resv_map = (struct resv_map *)(&inode->i_data)->private_data;
> +	/* Only regular and link inodes have associated reserve maps */
>  	if (resv_map)
>  		resv_map_release(&resv_map->refs);
>  	clear_inode(inode);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6cdc7b2d9100..b30e97b0ef37 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -740,7 +740,15 @@ void resv_map_release(struct kref *ref)
>  
>  static inline struct resv_map *inode_resv_map(struct inode *inode)
>  {
> -	return inode->i_mapping->private_data;
> +	/*
> +	 * At inode evict time, i_mapping may not point to the original
> +	 * address space within the inode.  This original address space
> +	 * contains the pointer to the resv_map.  So, always use the
> +	 * address space embedded within the inode.
> +	 * The VERY common case is inode->i_mapping == &inode->i_data, but
> +	 * this may not be true for device special inodes.
> +	 */
> +	return (struct resv_map *)(&inode->i_data)->private_data;
>  }
>  
>  static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
> @@ -4477,6 +4485,11 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	 * called to make the mapping read-write. Assume !vma is a shm mapping
>  	 */
>  	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> +		/*
> +		 * resv_map can not be NULL as hugetlb_reserve_pages is only
> +		 * called for inodes for which resv_maps were created (see
> +		 * hugetlbfs_get_inode).
> +		 */
>  		resv_map = inode_resv_map(inode);
>  
>  		chg = region_chg(resv_map, from, to);
> @@ -4568,6 +4581,10 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  	long gbl_reserve;
>  
> +	/*
> +	 * Since this routine can be called in the evict inode path for all
> +	 * hugetlbfs inodes, resv_map could be NULL.
> +	 */
>  	if (resv_map) {
>  		chg = region_del(resv_map, start, end);
>  		/*
Has this patch been applied?
I think it would be better to add a Fixes tag, like:
Fixes: 58b6e5e8f1ad ("hugetlbfs: fix memory leak for resv_map")
Since commit 58b6e5e8f1ad has been merged into stable, this patch is
also needed there.
https://www.spinics.net/lists/stable/msg298740.html
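
To make the failure mode in the changelog above concrete, here is a
minimal userspace model of it; struct address_space and struct inode
below are cut-down, hypothetical stand-ins, not the kernel's real
definitions.  Once i_mapping is redirected (as can happen for a device
special inode), the resv_map stored at inode creation time is only
reachable through the embedded i_data:

#include <assert.h>
#include <stdio.h>

struct address_space {
	void *private_data;
};

struct inode {
	struct address_space *i_mapping;	/* may be redirected */
	struct address_space i_data;		/* embedded, always this inode's own */
};

int main(void)
{
	struct address_space elsewhere = { .private_data = NULL };
	struct inode inode;
	int resv_map_stub;			/* stand-in for struct resv_map */

	/* at inode creation time, the resv_map goes in the embedded mapping */
	inode.i_mapping = &inode.i_data;
	inode.i_data.private_data = &resv_map_stub;

	/* a device special inode may later point i_mapping elsewhere */
	inode.i_mapping = &elsewhere;

	/* through i_mapping the resv_map is no longer visible ... */
	assert(inode.i_mapping->private_data == NULL);
	/* ... but the embedded address space still holds it */
	assert(inode.i_data.private_data == &resv_map_stub);

	printf("evict path must read (&inode->i_data)->private_data\n");
	return 0;
}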
The patch titled
Subject: mm/hugetlb.c: don't put_page in lock of hugetlb_lock
has been added to the -mm tree. Its filename is
mm-hugetlb-dont-put_page-in-lock-of-hugetlb_lock.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-dont-put_page-in-lock-o…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-dont-put_page-in-lock-o…
------------------------------------------------------
From: Kai Shen <shenkai8(a)huawei.com>
Subject: mm/hugetlb.c: don't put_page in lock of hugetlb_lock
Spinlock recursion happened when running the LTP hugetlb tests:
#!/bin/bash
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
The destructor returned by get_compound_page_dtor() in
__put_compound_page() may be free_huge_page(), which takes
hugetlb_lock, so don't call put_page() while holding hugetlb_lock
(a userspace model of this deadlock pattern follows the patch below).
BUG: spinlock recursion on CPU#0, hugemmap05/1079
lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, .owner_cpu: 0
Call trace:
dump_backtrace+0x0/0x198
show_stack+0x24/0x30
dump_stack+0xa4/0xcc
spin_dump+0x84/0xa8
do_raw_spin_lock+0xd0/0x108
_raw_spin_lock+0x20/0x30
free_huge_page+0x9c/0x260
__put_compound_page+0x44/0x50
__put_page+0x2c/0x60
alloc_surplus_huge_page.constprop.19+0xf0/0x140
hugetlb_acct_memory+0x104/0x378
hugetlb_reserve_pages+0xe0/0x250
hugetlbfs_file_mmap+0xc0/0x140
mmap_region+0x3e8/0x5b0
do_mmap+0x280/0x460
vm_mmap_pgoff+0xf4/0x128
ksys_mmap_pgoff+0xb4/0x258
__arm64_sys_mmap+0x34/0x48
el0_svc_common+0x78/0x130
el0_svc_handler+0x38/0x78
el0_svc+0x8/0xc
Link: http://lkml.kernel.org/r/b8ade452-2d6b-0372-32c2-703644032b47@huawei.com
Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
Signed-off-by: Kai Shen <shenkai8(a)huawei.com>
Signed-off-by: Feilong Lin <linfeilong(a)huawei.com>
Reported-by: Wang Wang <wangwang2(a)huawei.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-dont-put_page-in-lock-of-hugetlb_lock
+++ a/mm/hugetlb.c
@@ -1574,8 +1574,9 @@ static struct page *alloc_surplus_huge_p
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetPageHugeTemporary(page);
+		spin_unlock(&hugetlb_lock);
 		put_page(page);
-		page = NULL;
+		return NULL;
 	} else {
 		h->surplus_huge_pages++;
 		h->surplus_huge_pages_node[page_to_nid(page)]++;
_
Patches currently in -mm which might be from shenkai8(a)huawei.com are
mm-hugetlb-dont-put_page-in-lock-of-hugetlb_lock.patch
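
For illustration, here is a minimal userspace sketch of the deadlock
pattern the patch fixes, with a pthread_mutex standing in for
hugetlb_lock and release_object() for the free_huge_page() destructor
(hypothetical names, not kernel code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;		/* stands in for hugetlb_lock */

/* plays the role of free_huge_page(): the destructor takes the lock */
static void release_object(void)
{
	pthread_mutex_lock(&lock);
	/* ... return the object to the free list ... */
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* ERRORCHECK turns the self-deadlock into EDEADLK instead of a hang */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&lock, &attr);

	pthread_mutex_lock(&lock);
	/* Broken pattern: calling release_object() here is like calling
	 * put_page() under hugetlb_lock; ERRORCHECK would report EDEADLK. */

	/* Fixed pattern, as in the patch: drop the lock, then release. */
	pthread_mutex_unlock(&lock);
	release_object();

	printf("destructor may retake the lock: unlock before put_page()\n");
	return 0;
}

Compile with "cc -pthread"; the ERRORCHECK mutex type makes the broken
ordering fail fast with EDEADLK rather than hang, which is why the
sketch only performs the fixed ordering.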
A few new fields were added to mmu_gather to make TLB flushing smarter
for huge pages by telling which level of the page tables has changed.

__tlb_reset_range() is used to reset all this page table state to
"unchanged"; it is called by the TLB flush path when mappings in the
same range are changed in parallel under a non-exclusive lock (i.e.
read mmap_sem).  Before commit dd2283f2605e ("mm: mmap: zap pages with
read mmap_sem in munmap"), MADV_DONTNEED was the only operation that
could zap pages in parallel, and it does not remove page tables.  But
the aforementioned commit allows munmap() to run under read mmap_sem
and free page tables.  This causes a bug [1], reported by Jan Stancek,
because __tlb_reset_range() may pass the wrong page table state to the
architecture-specific TLB flush operations.

So, removing __tlb_reset_range() sounds sane (a userspace model of the
problem follows the patch below).  This may cause more TLB flushing
for MADV_DONTNEED, but MADV_DONTNEED should not be called very often,
hence the impact should be negligible.

The original proposed fix came from Jan Stancek, who did most of the
debugging; I just wrapped everything up together.
[1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux…
Reported-by: Jan Stancek <jstancek(a)redhat.com>
Tested-by: Jan Stancek <jstancek(a)redhat.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
---
mm/mmu_gather.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-exclusive lock and also free
+	 * page tables.  Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
 	tlb_flush_mmu(tlb);
--
1.8.3.1
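
To make the race concrete, here is a hedged userspace model; struct
gather, flush_nested_old() and flush_nested_fixed() are hypothetical
stand-ins for mmu_gather, the old __tlb_reset_range()-based path, and
the fixed path that only calls __tlb_adjust_range():

#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-in for struct mmu_gather's range/state fields */
struct gather {
	unsigned long start, end;
	bool freed_tables;	/* page tables were freed in this range */
};

/* old behaviour: wipe the state, then widen the range */
static void flush_nested_old(struct gather *g, unsigned long s, unsigned long e)
{
	g->freed_tables = false;	/* lost: arch flush thinks no tables freed */
	g->start = s;
	g->end = e;
}

/* fixed behaviour: only widen the range, keep freed_tables intact */
static void flush_nested_fixed(struct gather *g, unsigned long s, unsigned long e)
{
	if (s < g->start)
		g->start = s;
	if (e > g->end)
		g->end = e;
}

int main(void)
{
	/* a parallel munmap() freed page tables in this range */
	struct gather g = { .start = 0x1000, .end = 0x2000, .freed_tables = true };

	flush_nested_old(&g, 0x1000, 0x3000);
	printf("old:   freed_tables=%d (wrong: tables were freed)\n",
	       g.freed_tables);

	g.freed_tables = true;
	flush_nested_fixed(&g, 0x1000, 0x3000);
	printf("fixed: freed_tables=%d (arch flush sees freed tables)\n",
	       g.freed_tables);
	return 0;
}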