The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e914d8f00391520ecc4495dd0ca0124538ab7119 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan(a)kernel.org>
Date: Thu, 14 Apr 2022 19:13:46 -0700
Subject: [PATCH] mm: fix unexpected zeroed page mapping with zram swap
When two processes clone under CLONE_VM, a user process can be corrupted by
unexpectedly seeing a zeroed page.
CPU A                          CPU B

do_swap_page                   do_swap_page
SWP_SYNCHRONOUS_IO path        SWP_SYNCHRONOUS_IO path
swap_readpage valid data
  swap_slot_free_notify
    delete zram entry
                               swap_readpage zeroed(invalid) data
                               pte_lock
                               map the *zero data* to userspace
                               pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read zeroed data
The swap_slot_free_notify path is bogus for the CLONE_VM case: since copy_mm
does not increase the swap slot's refcount, the callback cannot tell whether
it is safe to discard the data from the backing device. In this case, the
only lock that could be relied on to synchronize swap slot freeing is the
page table lock. Thus, this patch gets rid of the swap_slot_free_notify
function. With this patch, CPU A will see the correct data.
CPU A                          CPU B

do_swap_page                   do_swap_page
SWP_SYNCHRONOUS_IO path        SWP_SYNCHRONOUS_IO path
swap_readpage original data
pte_lock
map the original data
swap_free
  swap_range_free
    bd_disk->fops->swap_slot_free_notify
                               swap_readpage read zeroed data
pte_unlock
                               pte_lock
                               if (!pte_same)
                                 goto out_nomap;
                               pte_unlock
                               return
                               on next refault will see mapped data by CPU B
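For reference, the pte_same() recheck that both diagrams rely on follows this
pattern in do_swap_page() (a simplified sketch, not the exact upstream code):

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (unlikely(!pte_same(*pte, orig_pte))) {
		/* Lost the race: another thread already serviced this
		 * fault, so back out and let the refault see its PTE. */
		goto out_nomap;
	}
	/* ... install the new PTE while still holding ptl ... */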
The concern with this patch is increased memory consumption, since it can
keep wasted memory in compressed form in zram as well as in uncompressed
form in the address space. However, in most zram setups there is no
readahead and do_swap_page is followed by swap_free, so the compressed copy
in zram is freed quickly.
Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan(a)cloudflare.com>
Tested-by: Ivan Babrou <ivan(a)cloudflare.com>
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Cc: Nitin Gupta <ngupta(a)vflare.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_io.c b/mm/page_io.c
index b417f000b49e..89fbf3cae30f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -51,54 +51,6 @@ void end_swap_bio_write(struct bio *bio)
bio_put(bio);
}
-static void swap_slot_free_notify(struct page *page)
-{
- struct swap_info_struct *sis;
- struct gendisk *disk;
- swp_entry_t entry;
-
- /*
- * There is no guarantee that the page is in swap cache - the software
- * suspend code (at least) uses end_swap_bio_read() against a non-
- * swapcache page. So we must check PG_swapcache before proceeding with
- * this optimization.
- */
- if (unlikely(!PageSwapCache(page)))
- return;
-
- sis = page_swap_info(page);
- if (data_race(!(sis->flags & SWP_BLKDEV)))
- return;
-
- /*
- * The swap subsystem performs lazy swap slot freeing,
- * expecting that the page will be swapped out again.
- * So we can avoid an unnecessary write if the page
- * isn't redirtied.
- * This is good for real swap storage because we can
- * reduce unnecessary I/O and enhance wear-leveling
- * if an SSD is used as the as swap device.
- * But if in-memory swap device (eg zram) is used,
- * this causes a duplicated copy between uncompressed
- * data in VM-owned memory and compressed data in
- * zram-owned memory. So let's free zram-owned memory
- * and make the VM-owned decompressed page *dirty*,
- * so the page should be swapped out somewhere again if
- * we again wish to reclaim it.
- */
- disk = sis->bdev->bd_disk;
- entry.val = page_private(page);
- if (disk->fops->swap_slot_free_notify && __swap_count(entry) == 1) {
- unsigned long offset;
-
- offset = swp_offset(entry);
-
- SetPageDirty(page);
- disk->fops->swap_slot_free_notify(sis->bdev,
- offset);
- }
-}
-
static void end_swap_bio_read(struct bio *bio)
{
struct page *page = bio_first_page_all(bio);
@@ -114,7 +66,6 @@ static void end_swap_bio_read(struct bio *bio)
}
SetPageUptodate(page);
- swap_slot_free_notify(page);
out:
unlock_page(page);
WRITE_ONCE(bio->bi_private, NULL);
@@ -394,11 +345,6 @@ int swap_readpage(struct page *page, bool synchronous)
if (sis->flags & SWP_SYNCHRONOUS_IO) {
ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
if (!ret) {
- if (trylock_page(page)) {
- swap_slot_free_notify(page);
- unlock_page(page);
- }
-
count_vm_event(PSWPIN);
goto out;
}
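After the patch, the SWP_SYNCHRONOUS_IO fast path in swap_readpage() reduces
to a plain synchronous read with no notify callback (reconstructed from the
hunk above; the error fallback is elided):

	if (sis->flags & SWP_SYNCHRONOUS_IO) {
		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
		if (!ret) {
			count_vm_event(PSWPIN);
			goto out;
		}
		/* on failure, fall through to the regular bio path */
	}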
This is the start of the stable review cycle for the 4.19.241 release.
There are 12 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 01 May 2022 10:40:41 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.241-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.241-rc1
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
lightnvm: disable the subsystem
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link"
Masami Hiramatsu <mhiramat(a)kernel.org>
ia64: kprobes: Fix to pass correct trampoline address to the handler
Masami Hiramatsu <mhiramat(a)kernel.org>
Revert "ia64: kprobes: Use generic kretprobe trampoline handler"
Masami Hiramatsu <mhiramat(a)kernel.org>
Revert "ia64: kprobes: Fix to pass correct trampoline address to the handler"
Michael Ellerman <mpe(a)ellerman.id.au>
powerpc/64s: Unmerge EX_LR and EX_DAR
Nicholas Piggin <npiggin(a)gmail.com>
powerpc/64/interrupt: Temporarily save PPR on stack to fix register corruption due to SLB miss
Eric Dumazet <edumazet(a)google.com>
net/sched: cls_u32: fix netns refcount changes in u32_change()
Lin Ma <linma(a)zju.edu.cn>
hamradio: remove needs_free_netdev to avoid UAF
Lin Ma <linma(a)zju.edu.cn>
hamradio: defer 6pack kfree after unregister_netdev
Willy Tarreau <w(a)1wt.eu>
floppy: disable FDRAWCMD by default
Dafna Hirschfeld <dafna3(a)gmail.com>
media: vicodec: upon release, call m2m release before freeing ctrl handler
-------------
Diffstat:
 Makefile                                           |  4 +-
 arch/ia64/kernel/kprobes.c                         | 78 +++++++++++++++++++++-
 arch/powerpc/include/asm/exception-64s.h           | 37 +++++-----
 drivers/block/Kconfig                              | 16 +++++
 drivers/block/floppy.c                             | 43 +++++++++---
 drivers/lightnvm/Kconfig                           |  2 +-
 drivers/media/platform/vicodec/vicodec-core.c      |  6 +-
 drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.c |  8 +++
 drivers/net/ethernet/stmicro/stmmac/altr_tse_pcs.h |  4 --
 .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c    | 13 ++--
 drivers/net/hamradio/6pack.c                       |  5 +-
 net/sched/cls_u32.c                                | 18 +++--
 12 files changed, 181 insertions(+), 53 deletions(-)
There are 3 places where the cpu and node masks of the top cpuset can
be initialized, listed in the order they are executed:
1) start_kernel -> cpuset_init()
2) start_kernel -> cgroup_init() -> cpuset_bind()
3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
The first cpuset_init() call just sets all the bits in the masks.
The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
default v2 values. The third cpuset_init_smp() call sets them back to
v1 values.
For systems set up with cgroup v2, cpuset_bind() is called only once, so the
v1 values written later by cpuset_init_smp() are never corrected. As a
result, cpu and memory node hot add may fail to update the cpu and node
masks of the top cpuset to include the newly added cpu or node in a
cgroup v2 environment.
For systems with a cgroup v1 setup, cpuset_bind() is called again by
rebind_subsystems() when the v1 cpuset filesystem is mounted, as shown
in the dmesg log below from an instrumented kernel.
[ 2.609781] cpuset_bind() called - v2 = 1
[ 3.079473] cpuset_init_smp() called
[ 7.103710] cpuset_bind() called - v2 = 0
smp_init() is called after the first two init functions, so we don't
have a complete list of active cpus and memory nodes until later, in
cpuset_init_smp(), which is the right time to set up effective_cpus
and effective_mems.
To fix this cgroup v2 mask setup problem, the potentially incorrect
cpus_allowed and mems_allowed settings in cpuset_init_smp() are removed.
For cgroup v2 systems, the initial cpuset_bind() call will set the masks
correctly. For cgroup v1 systems, the second call to cpuset_bind()
will do the right setup.
cc: stable(a)vger.kernel.org
Signed-off-by: Waiman Long <longman(a)redhat.com>
Tested-by: Feng Tang <feng.tang(a)intel.com>
Reviewed-by: Michal Koutný <mkoutny(a)suse.com>
---
kernel/cgroup/cpuset.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..71a418858a5e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3390,8 +3390,11 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
*/
void __init cpuset_init_smp(void)
{
- cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- top_cpuset.mems_allowed = node_states[N_MEMORY];
+ /*
+ * cpus_allowed/mems_allowed set to v2 values in the initial
+ * cpuset_bind() call will be reset to v1 values in another
+ * cpuset_bind() call when v1 cpuset is mounted.
+ */
top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
--
2.27.0
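For clarity, this is cpuset_init_smp() as it reads after the patch,
reconstructed from the hunk above (the rest of the function is unchanged
and not shown):

	void __init cpuset_init_smp(void)
	{
		/*
		 * cpus_allowed/mems_allowed set to v2 values in the initial
		 * cpuset_bind() call will be reset to v1 values in another
		 * cpuset_bind() call when v1 cpuset is mounted.
		 */
		top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;

		cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
		/* ... */
	}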
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
We observed an issue with the NXP 5.15 LTS kernel where dma_alloc_coherent()
may sometimes fail when multiple processes try to allocate CMA memory at
the same time.
This issue is easily reproduced on an MX6Q SDB board with the latest
linux-next kernel by writing a test module that creates 16 or 32 threads
allocating random sizes of CMA memory in parallel in the background.
Alternatively, simply enabling CONFIG_CMA_DEBUG shows endless CMA
allocation retries during boot:
[ 1.452124] cma: cma_alloc(): memory range at (ptrval) is busy,retrying
....
(thousands of retries)
The root cause of this issue is that since commit a4efc174b382
("mm/cma.c: remove redundant cma_mutex lock"), CMA supports concurrent
memory allocation.
It's possible that the memory range process A tries to allocate has already
been isolated by process B's allocation during memory migration.
The problem is that the range isolated during one allocation by
start_isolate_page_range() can be much bigger than the size actually being
allocated, because the range is aligned to MAX_ORDER_NR_PAGES.
Taking an ARMv7 platform with 1G of memory as an example, when
MAX_ORDER_NR_PAGES is big (e.g. 32M with max_order 14) and the CMA area is
relatively small (e.g. 128M), there are only 4 MAX_ORDER slots. It is then
very easy for all CMA memory to be isolated by other processes by the time
one process tries to allocate memory with dma_alloc_coherent().
Since the current CMA code scans the whole available CMA memory only once,
dma_alloc_coherent() can easily fail due to contention with other
processes.
This patchset introduces a retry mechanism that rescans the CMA bitmap on an
-EBUSY error, in case the target pageblock was only temporarily isolated
by another allocation and is released later.
It also improves CMA allocation performance by trying the next
MAX_ORDER_NR_PAGES range during retries rather than looping within the
same isolated range in small steps, which wastes CPU cycles; see the
sketch below.
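A rough sketch of the retry idea inside cma_alloc()'s scan loop (simplified
and illustrative only -- variable names follow the existing cma_alloc()
code, but this is not the literal patch):

	for (;;) {
		/* find and reserve a free range (bitmap_no) in the CMA bitmap ... */
		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp_mask);
		if (ret == 0)
			break;			/* success */
		cma_clear_bitmap(cma, pfn, count);
		if (ret != -EBUSY)
			break;			/* hard failure, give up */
		/*
		 * -EBUSY: the pages may only be temporarily isolated by a
		 * concurrent cma_alloc().  Jump to the next
		 * MAX_ORDER_NR_PAGES-aligned slot rather than stepping
		 * through the same isolated range, and allow one rescan of
		 * the whole bitmap before failing for good.
		 */
		start = ALIGN(bitmap_no + 1,
			      MAX_ORDER_NR_PAGES >> cma->order_per_bit);
	}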
The following test is based on linux-next: next-20211213.
Without the fix, allocations easily fail:
# insmod cma_alloc.ko pnum=16
[ 274.322369] CMA alloc test enter: thread number: 16
[ 274.329948] cpu: 0, pid: 692, index 4 pages 144
[ 274.330143] cpu: 1, pid: 694, index 2 pages 44
[ 274.330359] cpu: 2, pid: 695, index 7 pages 757
[ 274.330760] cpu: 2, pid: 696, index 4 pages 144
[ 274.330974] cpu: 2, pid: 697, index 6 pages 512
[ 274.331223] cpu: 2, pid: 698, index 6 pages 512
[ 274.331499] cpu: 2, pid: 699, index 2 pages 44
[ 274.332228] cpu: 2, pid: 700, index 0 pages 7
[ 274.337421] cpu: 0, pid: 701, index 1 pages 38
[ 274.337618] cpu: 2, pid: 702, index 0 pages 7
[ 274.344669] cpu: 1, pid: 703, index 0 pages 7
[ 274.344807] cpu: 3, pid: 704, index 6 pages 512
[ 274.348269] cpu: 2, pid: 705, index 5 pages 148
[ 274.349490] cma: cma_alloc: reserved: alloc failed, req-size: 38 pages, ret: -16
[ 274.366292] cpu: 1, pid: 706, index 4 pages 144
[ 274.366562] cpu: 0, pid: 707, index 3 pages 128
[ 274.367356] cma: cma_alloc: reserved: alloc failed, req-size: 128 pages, ret: -16
[ 274.367370] cpu: 0, pid: 707, index 3 pages 128 failed
[ 274.371148] cma: cma_alloc: reserved: alloc failed, req-size: 148 pages, ret: -16
[ 274.375348] cma: cma_alloc: reserved: alloc failed, req-size: 144 pages, ret: -16
[ 274.384256] cpu: 2, pid: 708, index 0 pages 7
....
With the fix, 32 threads allocating in parallel pass an overnight
stress test:
root@imx6qpdlsolox:~# insmod cma_alloc.ko pnum=32
[ 112.976809] cma_alloc: loading out-of-tree module taints kernel.
[ 112.984128] CMA alloc test enter: thread number: 32
[ 112.989748] cpu: 2, pid: 707, index 6 pages 512
[ 112.994342] cpu: 1, pid: 708, index 6 pages 512
[ 112.995162] cpu: 0, pid: 709, index 3 pages 128
[ 112.995867] cpu: 2, pid: 710, index 0 pages 7
[ 112.995910] cpu: 3, pid: 711, index 2 pages 44
[ 112.996005] cpu: 3, pid: 712, index 7 pages 757
[ 112.996098] cpu: 3, pid: 713, index 7 pages 757
...
[41877.368163] cpu: 1, pid: 737, index 2 pages 44
[41877.369388] cpu: 1, pid: 736, index 3 pages 128
[41878.486516] cpu: 0, pid: 737, index 2 pages 44
[41878.486515] cpu: 2, pid: 739, index 4 pages 144
[41878.486622] cpu: 1, pid: 736, index 3 pages 128
[41878.486948] cpu: 2, pid: 735, index 7 pages 757
[41878.487279] cpu: 2, pid: 738, index 4 pages 144
[41879.526603] cpu: 1, pid: 739, index 3 pages 128
[41879.606491] cpu: 2, pid: 737, index 3 pages 128
[41879.606550] cpu: 0, pid: 736, index 0 pages 7
[41879.612271] cpu: 2, pid: 738, index 4 pages 144
...
v1:
https://patchwork.kernel.org/project/linux-mm/cover/20211215080242.3034856-…
v2:
https://patchwork.kernel.org/project/linux-mm/cover/20220112131552.3329380-…
Dong Aisheng (2):
mm: cma: fix allocation may fail sometimes
mm: cma: try next MAX_ORDER_NR_PAGES during retry
mm/cma.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
--
2.25.1