The patch titled
Subject: mm: page_mapped: don't assume compound page is huge or THP
has been removed from the -mm tree. Its filename was
mm-page_mapped-dont-assume-compound-page-is-huge-or-thp.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Jan Stancek <jstancek(a)redhat.com>
Subject: mm: page_mapped: don't assume compound page is huge or THP
LTP proc01 testcase has been observed to rarely trigger crashes
on arm64:
page_mapped+0x78/0xb4
stable_page_flags+0x27c/0x338
kpageflags_read+0xfc/0x164
proc_reg_read+0x7c/0xb8
__vfs_read+0x58/0x178
vfs_read+0x90/0x14c
SyS_read+0x60/0xc0
Issue is that page_mapped() assumes that if compound page is not huge,
then it must be THP. But if this is 'normal' compound page
(COMPOUND_PAGE_DTOR), then following loop can keep running (for
HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
mapped and triggers a panic:
for (i = 0; i < hpage_nr_pages(page); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
}
I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
with a custom kernel module [1] which:
- allocates compound page (PAGEC) of order 1
- allocates 2 normal pages (COPY), which are initialized to 0xff
(to satisfy _mapcount >= 0)
- 2 PAGEC page structs are copied to address of first COPY page
- second page of COPY is marked as not present
- call to page_mapped(COPY) now triggers fault on access to 2nd
COPY page at offset 0x30 (_mapcount)
[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_cras…
Fix the loop to iterate for "1 << compound_order" pages.
Kirrill said "IIRC, sound subsystem can producuce custom mapped compound
pages".
Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.154357715…
Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
Debugged-by: Laszlo Ersek <lersek(a)redhat.com>
Suggested-by: "Kirill A. Shutemov" <kirill(a)shutemov.name>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/util.c~mm-page_mapped-dont-assume-compound-page-is-huge-or-thp
+++ a/mm/util.c
@@ -478,7 +478,7 @@ bool page_mapped(struct page *page)
return true;
if (PageHuge(page))
return false;
- for (i = 0; i < hpage_nr_pages(page); i++) {
+ for (i = 0; i < (1 << compound_order(page)); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
}
_
Patches currently in -mm which might be from jstancek(a)redhat.com are
The patch titled
Subject: mm, memcg: fix reclaim deadlock with writeback
has been removed from the -mm tree. Its filename was
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff
He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
: bit the page which tasks1 has locked.
More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim and the memcg reclaim for legacy controller is waiting on
the writeback but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially ABBA deadlock:
lock_page(A)
SetPageWriteback(A)
unlock_page(A)
lock_page(B)
lock_page(B)
pte_alloc_pne
shrink_page_list
wait_on_page_writeback(A)
SetPageWriteback(B)
unlock_page(B)
# flush A, B to clear the writeback
This accumulating of more pages to flush is used by several filesystems to
generate a more optimal IO patterns.
Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO throttling
available for the controller. There is no easy way around that
unfortunately. Therefore fix this specific issue by pre-allocating the
page table outside of the page lock. We have that handy infrastructure
for that already so simply reuse the fault-around pattern which already
does this.
There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.
[akpm(a)linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm(a)linux-foundation.org: coding-style fixes]
[mhocko(a)kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
--- a/mm/memory.c~mm-memcg-fix-reclaim-deadlock-with-writeback
+++ a/mm/memory.c
@@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_f
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret;
+ /*
+ * Preallocate pte before we take page_lock because this might lead to
+ * deadlocks for memcg reclaim which waits for pages under writeback:
+ * lock_page(A)
+ * SetPageWriteback(A)
+ * unlock_page(A)
+ * lock_page(B)
+ * lock_page(B)
+ * pte_alloc_pne
+ * shrink_page_list
+ * wait_on_page_writeback(A)
+ * SetPageWriteback(B)
+ * unlock_page(B)
+ * # flush A, B to clear the writeback
+ */
+ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+ vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
+ if (!vmf->prealloc_pte)
+ return VM_FAULT_OOM;
+ smp_wmb(); /* See comment in __pte_alloc() */
+ }
+
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
_
Patches currently in -mm which might be from mhocko(a)suse.com are
The patch titled
Subject: mm/usercopy.c: no check page span for stack objects
has been removed from the -mm tree. Its filename was
usercopy-no-check-page-span-for-stack-objects.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Qian Cai <cai(a)lca.pw>
Subject: mm/usercopy.c: no check page span for stack objects
It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y,
usercopy: Kernel memory overwrite attempt detected to spans multiple
pages (offset 0, size 23)!
kernel BUG at mm/usercopy.c:102!
For example,
print_worker_info
char name[WQ_NAME_LEN] = { };
char desc[WORKER_DESC_LEN] = { };
probe_kernel_read(name, wq->name, sizeof(name) - 1);
probe_kernel_read(desc, worker->desc, sizeof(desc) - 1);
__copy_from_user_inatomic
check_object_size
check_heap_object
check_page_span
This is because on-stack variables could cross PAGE_SIZE boundary, and
failed this check,
if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
((unsigned long)end & (unsigned long)PAGE_MASK)))
ptr = FFFF889007D7EFF8
end = FFFF889007D7F00E
Hence, fix it by checking if it is a stack object first.
[keescook(a)chromium.org: improve comments after reorder]
Link: http://lkml.kernel.org/r/20190103165151.GA32845@beast
Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw
Signed-off-by: Qian Cai <cai(a)lca.pw>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Kees Cook <keescook(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/usercopy.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/usercopy.c~usercopy-no-check-page-span-for-stack-objects
+++ a/mm/usercopy.c
@@ -247,7 +247,8 @@ static DEFINE_STATIC_KEY_FALSE_RO(bypass
/*
* Validates that the given object is:
* - not bogus address
- * - known-safe heap or stack object
+ * - fully contained by stack (or stack frame, when available)
+ * - fully within SLAB object (or object whitelist area, when available)
* - not in kernel text
*/
void __check_object_size(const void *ptr, unsigned long n, bool to_user)
@@ -262,9 +263,6 @@ void __check_object_size(const void *ptr
/* Check for invalid addresses. */
check_bogus_address((const unsigned long)ptr, n, to_user);
- /* Check for bad heap object. */
- check_heap_object(ptr, n, to_user);
-
/* Check for bad stack object. */
switch (check_stack_object(ptr, n)) {
case NOT_STACK:
@@ -282,6 +280,9 @@ void __check_object_size(const void *ptr
usercopy_abort("process stack", NULL, to_user, 0, n);
}
+ /* Check for bad heap object. */
+ check_heap_object(ptr, n, to_user);
+
/* Check for object in kernel to avoid text exposure. */
check_kernel_text_object((const unsigned long)ptr, n, to_user);
}
_
Patches currently in -mm which might be from cai(a)lca.pw are
mm-page_owner-fix-for-deferred-struct-page-init.patch
page_poison-plays-nicely-with-kasan.patch
signal-allow-the-null-signal-in-rt_sigqueueinfo.patch
The patch titled
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
has been removed from the -mm tree. Its filename was
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Christoph Lameter <cl(a)linux.com>
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL. We must do the same check in
__alloc_alien_cache to avoid NULL pointer dereferences on allocation
failures.
Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e9897490…
Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache")
Fixes: c8522a3a5832b ("Slab: introduce alloc_alien")
Signed-off-by: Christoph Lameter <cl(a)linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4(a)syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/slab.c~slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed
+++ a/mm/slab.c
@@ -666,8 +666,10 @@ static struct alien_cache *__alloc_alien
struct alien_cache *alc = NULL;
alc = kmalloc_node(memsize, gfp, node);
- init_arraycache(&alc->ac, entries, batch);
- spin_lock_init(&alc->lock);
+ if (alc) {
+ init_arraycache(&alc->ac, entries, batch);
+ spin_lock_init(&alc->lock);
+ }
return alc;
}
_
Patches currently in -mm which might be from cl(a)linux.com are
The patch titled
Subject: fork, memcg: fix cached_stacks case
has been removed from the -mm tree. Its filename was
fork-memcg-fix-cached_stacks-case.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: fork, memcg: fix cached_stacks case
5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge
fail") fixes a crash caused due to failed memcg charge of the kernel
stack. However the fix misses the cached_stacks case which this patch
fixes. So, the same crash can happen if the memcg charge of a cached
stack is failed.
Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Rik van Riel <riel(a)surriel.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/fork.c~fork-memcg-fix-cached_stacks-case
+++ a/kernel/fork.c
@@ -217,6 +217,7 @@ static unsigned long *alloc_thread_stack
memset(s->addr, 0, THREAD_SIZE);
tsk->stack_vm_area = s;
+ tsk->stack = s->addr;
return s->addr;
}
_
Patches currently in -mm which might be from shakeelb(a)google.com are
memcg-localize-memcg_kmem_enabled-check.patch
memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work.patch