__io_compat_recvmsg_copy_hdr() with REQ_F_BUFFER_SELECT reads out iov
len but never assigns it to iov/fast_iov, leaving sr->len with garbage.
Hopefully, following io_buffer_select() truncates it to the selected
buffer size, but the value is still may be under what was specified.
Cc: <stable(a)vger.kernel.org> # 5.7
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
---
fs/io_uring.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1023f7b44cea..a2a7c65a77aa 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4499,7 +4499,8 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req,
return -EFAULT;
if (clen < 0)
return -EINVAL;
- sr->len = iomsg->iov[0].iov_len;
+ sr->len = clen;
+ iomsg->iov[0].iov_len = clen;
iomsg->iov = NULL;
} else {
ret = __import_iovec(READ, (struct iovec __user *)uiov, len,
--
2.24.0
The patch titled
Subject: proc: use untagged_addr() for pagemap_read addresses
has been added to the -mm tree. Its filename is
proc-use-untagged_addr-for-pagemap_read-addresses.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/proc-use-untagged_addr-for-pagema…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/proc-use-untagged_addr-for-pagema…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Miles Chen <miles.chen(a)mediatek.com>
Subject: proc: use untagged_addr() for pagemap_read addresses
When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag. To fix it, we
should untag the usespace pointers in pagemap_read().
I tested with 5.10-rc4 and the issue remains.
Explanation from Catalin in [1]:
: Arguably, that's a user-space bug since tagged file offsets were never
: supported. In this case it's not even a tag at bit 56 as per the arm64
: tagged address ABI but rather down to bit 47. You could say that the
: problem is caused by the C library (malloc()) or whoever created the
: tagged vaddr and passed it to this function. It's not a kernel regression
: as we've never supported it.
:
: Now, pagemap is a special case where the offset is usually not generated
: as a classic file offset but rather derived by shifting a user virtual
: address. I guess we can make a concession for pagemap (only) and allow
: such offset with the tag at bit (56 - PAGE_SHIFT + 3).
My test code is baed on [2]:
A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
=== userspace program ===
uint64 OsLayer::VirtualToPhysical(void *vaddr) {
uint64 frame, paddr, pfnmask, pagemask;
int pagesize = sysconf(_SC_PAGESIZE);
off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
int fd = open(kPagemapPath, O_RDONLY);
...
if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
int err = errno;
string errtxt = ErrorString(err);
if (fd >= 0)
close(fd);
return 0;
}
...
}
=== kernel fs/proc/task_mmu.c ===
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
...
src = *ppos;
svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
end_vaddr = mm->task_size;
/* watch out for wraparound */
// svpfn == 0xb400007662f54
// (mm->task_size >> PAGE) == 0x8000000
if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
start_vaddr = end_vaddr;
ret = 0;
while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
int len;
unsigned long end;
...
}
...
}
[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
Link: https://lkml.kernel.org/r/20201127050738.14440-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen(a)mediatek.com>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Cc: Andrey Konovalov <andreyknvl(a)google.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Cc: Song Bao Hua (Barry Song) <song.bao.hua(a)hisilicon.com>
Cc: <stable(a)vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/fs/proc/task_mmu.c~proc-use-untagged_addr-for-pagemap_read-addresses
+++ a/fs/proc/task_mmu.c
@@ -1599,11 +1599,15 @@ static ssize_t pagemap_read(struct file
src = *ppos;
svpfn = src / PM_ENTRY_BYTES;
- start_vaddr = svpfn << PAGE_SHIFT;
end_vaddr = mm->task_size;
/* watch out for wraparound */
- if (svpfn > mm->task_size >> PAGE_SHIFT)
+ start_vaddr = end_vaddr;
+ if (svpfn < (ULONG_MAX >> PAGE_SHIFT))
+ start_vaddr = untagged_addr(svpfn << PAGE_SHIFT);
+
+ /* Ensure the address is inside the task */
+ if (start_vaddr > mm->task_size)
start_vaddr = end_vaddr;
/*
_
Patches currently in -mm which might be from miles.chen(a)mediatek.com are
proc-use-untagged_addr-for-pagemap_read-addresses.patch
The patch titled
Subject: mm: list_lru: set shrinker map bit when child nr_items is not zero
has been added to the -mm tree. Its filename is
mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-list_lru-set-shrinker-map-bit-…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-list_lru-set-shrinker-map-bit-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Yang Shi <shy828301(a)gmail.com>
Subject: mm: list_lru: set shrinker map bit when child nr_items is not zero
When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache. The vmcore shows there are over 14M negative dentry objects on lru,
but tracing result shows they were even not scanned at all. The further
investigation shows the memcg's vfs shrinker_map bit is not set. So the
reclaimer or dropping cache just skip calling vfs shrinker. So we have
to reboot the hosts to get the memory back.
I didn't manage to come up with a reproducer in test environment, and the
problem can't be reproduced after rebooting. But it seems there is race
between shrinker map bit clear and reparenting by code inspection. The
hypothesis is elaborated as below.
The memcg hierarchy on our production environment looks like:
root
/ \
system user
The main workloads are running under user slice's children, and it creates
and removes memcg frequently. So reparenting happens very often under user
slice, but no task is under user slice directly.
So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:
CPU A CPU B
reparent
dst->nr_items == 0
shrinker:
total_objects == 0
add src->nr_items to dst
set_bit
return SHRINK_EMPTY
clear_bit
child memcg offline
replace child's kmemcg_id with
parent's (in memcg_offline_kmem())
list_lru_del() between shrinker runs
see parent's kmemcg_id
dec dst->nr_items
reparent again
dst->nr_items may go negative
due to concurrent list_lru_del()
The second run of shrinker:
read nr_items without any
synchronization, so it may
see intermediate negative
nr_items then total_objects
may return 0 coincidently
keep the bit cleared
dst->nr_items != 0
skip set_bit
add scr->nr_item to dst
After this point dst->nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore. And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either. That bit is kept cleared forever.
How does list_lru_del() race with reparenting? It is because
reparenting replaces children's kmemcg_id to parent's without protecting
from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
actually deleting items from child's lru, but dec'ing parent's nr_items,
so the parent's nr_items may go negative as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.
Since it is impossible that dst->nr_items goes negative and
src->nr_items goes zero at the same time, so it seems we could set the
shrinker map bit iff src->nr_items != 0. We could synchronize
list_lru_count_one() and reparenting with nlru->lock, but it seems
checking src->nr_items in reparenting is the simplest and avoids lock
contention.
Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
Suggested-by: Roman Gushchin <guro(a)fb.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.19]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/list_lru.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/mm/list_lru.c~mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero
+++ a/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(st
struct list_lru_node *nlru = &lru->node[nid];
int dst_idx = dst_memcg->kmemcg_id;
struct list_lru_one *src, *dst;
- bool set;
/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(st
dst = list_lru_from_memcg_idx(nlru, dst_idx);
list_splice_init(&src->list, &dst->list);
- set = (!dst->nr_items && src->nr_items);
- dst->nr_items += src->nr_items;
- if (set)
+
+ if (src->nr_items) {
+ dst->nr_items += src->nr_items;
memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
- src->nr_items = 0;
+ src->nr_items = 0;
+ }
spin_unlock_irq(&nlru->lock);
}
_
Patches currently in -mm which might be from shy828301(a)gmail.com are
mm-list_lru-set-shrinker-map-bit-when-child-nr_items-is-not-zero.patch
mm-truncate_complete_page-is-not-existed-anymore.patch
mm-migrate-simplify-the-logic-for-handling-permanent-failure.patch
mm-migrate-skip-shared-exec-thp-for-numa-balancing.patch
mm-migrate-clean-up-migrate_prep_local.patch
mm-migrate-return-enosys-if-thp-migration-is-unsupported.patch