The patch titled
Subject: mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-swap-fix-swap_info_struct-race-between-swapoff-and-get_swap_pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Subject: mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
Date: Tue, 4 Apr 2023 23:47:16 +0800
The si->lock must be held when deleting the si from the available list.
Otherwise, another thread can re-add the si to the available list, which
can lead to memory corruption. The only place we have found where this
happens is in the swapoff path. This case can be described as below:
core 0 core 1
swapoff
del_from_avail_list(si) waiting
try lock si->lock acquire swap_avail_lock
and re-add si into
swap_avail_head
acquire si->lock but missing si already being added again, and continuing
to clear SWP_WRITEOK, etc.
It can be easily found that a massive warning messages can be triggered
inside get_swap_pages() by some special cases, for example, we call
madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile,
run much swapon-swapoff operations (e.g. stress-ng-swap).
However, in the worst case, panic can be caused by the above scene. In
swapoff(), the memory used by si could be kept in swap_info[] after
turning off a swap. This means memory corruption will not be caused
immediately until allocated and reset for a new swap in the swapon path.
A panic message caused: (with CONFIG_PLIST_DEBUG enabled)
------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
plist_check_prev_next_node+0x50/0x70
plist_check_head+0x80/0xf0
plist_add+0x28/0x140
add_to_avail_list+0x9c/0xf0
_enable_swap_info+0x78/0xb4
__do_sys_swapon+0x918/0xa10
__arm64_sys_swapon+0x20/0x30
el0_svc_common+0x8c/0x220
do_el0_svc+0x2c/0x90
el0_svc+0x1c/0x30
el0_sync_handler+0xa8/0xb0
el0_sync+0x148/0x180
irq event stamp: 2082270
Now, si->lock locked before calling 'del_from_avail_list()' to make sure
other thread see the si had been deleted and SWP_WRITEOK cleared together,
will not reinsert again.
This problem exists in versions after stable 5.10.y.
Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba…
Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node")
Tested-by: Yongchen Yin <wb-yyc939293(a)alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Aaron Lu <aaron.lu(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/swapfile.c~mm-swap-fix-swap_info_struct-race-between-swapoff-and-get_swap_pages
+++ a/mm/swapfile.c
@@ -679,6 +679,7 @@ static void __del_from_avail_list(struct
{
int nid;
+ assert_spin_locked(&p->lock);
for_each_node(nid)
plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
}
@@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __us
spin_unlock(&swap_lock);
goto out_dput;
}
- del_from_avail_list(p);
spin_lock(&p->lock);
+ del_from_avail_list(p);
if (p->prio < 0) {
struct swap_info_struct *si = p;
int nid;
_
Patches currently in -mm which might be from rongwei.wang(a)linux.alibaba.com are
mm-swap-fix-swap_info_struct-race-between-swapoff-and-get_swap_pages.patch
The patch below does not apply to the 6.2-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.2.y
git checkout FETCH_HEAD
git cherry-pick -x f7b58a69fad9d2c4c90cab0247811155dd0d48e7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023040332-dodgy-chamomile-54f5@gregkh' --subject-prefix 'PATCH 6.2.y' HEAD^..
Possible dependencies:
f7b58a69fad9 ("dm: fix improper splitting for abnormal bios")
86a3238c7b9b ("dm: change "unsigned" to "unsigned int"")
7533afa1d27b ("dm: send just one event on resize, not two")
5cd6d1d53a1f ("dm integrity: Remove bi_sector that's only used by commented debug code")
22c40e134c4c ("dm cache: Add some documentation to dm-cache-background-tracker.h")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7b58a69fad9d2c4c90cab0247811155dd0d48e7 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer(a)kernel.org>
Date: Thu, 30 Mar 2023 14:56:38 -0400
Subject: [PATCH] dm: fix improper splitting for abnormal bios
"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a2548b2 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().
It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such the resulting tio
could address more area of a device than it should.
For example, when discarding an entire DM striped device with the
following DM table:
vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048
Before this fix:
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
After this fix;
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
Fixes: 7dd06a2548b2 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable(a)vger.kernel.org
Reported-by: Orange Kao <orange(a)aiven.io>
Signed-off-by: Mike Snitzer <snitzer(a)kernel.org>
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2d0f934ba6e6..e67a2757c53e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1467,7 +1467,8 @@ static void setup_split_accounting(struct clone_info *ci, unsigned int len)
}
static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
- struct dm_target *ti, unsigned int num_bios)
+ struct dm_target *ti, unsigned int num_bios,
+ unsigned *len)
{
struct bio *bio;
int try;
@@ -1478,7 +1479,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
if (try)
mutex_lock(&ci->io->md->table_devices_lock);
for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
- bio = alloc_tio(ci, ti, bio_nr, NULL,
+ bio = alloc_tio(ci, ti, bio_nr, len,
try ? GFP_NOIO : GFP_NOWAIT);
if (!bio)
break;
@@ -1514,7 +1515,7 @@ static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
break;
default:
/* dm_accept_partial_bio() is not supported with shared tio->len_ptr */
- alloc_multiple_bios(&blist, ci, ti, num_bios);
+ alloc_multiple_bios(&blist, ci, ti, num_bios, len);
while ((clone = bio_list_pop(&blist))) {
dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Reiserfs sets a security xattr at inode creation time in two stages: first,
it calls reiserfs_security_init() to obtain the xattr from active LSMs;
then, it calls reiserfs_security_write() to actually write that xattr.
Unfortunately, it seems there is a wrong expectation that LSMs provide the
full xattr name in the form 'security.<suffix>'. However, LSMs always
provided just the suffix, causing reiserfs to not write the xattr at all
(if the suffix is shorter than the prefix), or to write an xattr with the
wrong name.
Add a temporary buffer in reiserfs_security_write(), and write to it the
full xattr name, before passing it to reiserfs_xattr_set_handle().
Also replace the name length check with a check that the full xattr name is
not larger than XATTR_NAME_MAX.
Cc: stable(a)vger.kernel.org # v2.6.x
Fixes: 57fe60df6241 ("reiserfs: add atomic addition of selinux attributes during inode creation")
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
---
fs/reiserfs/xattr_security.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/reiserfs/xattr_security.c b/fs/reiserfs/xattr_security.c
index 6bffdf9a4fd..6e0a099dd78 100644
--- a/fs/reiserfs/xattr_security.c
+++ b/fs/reiserfs/xattr_security.c
@@ -95,11 +95,15 @@ int reiserfs_security_write(struct reiserfs_transaction_handle *th,
struct inode *inode,
struct reiserfs_security_handle *sec)
{
+ char xattr_name[XATTR_NAME_MAX + 1] = XATTR_SECURITY_PREFIX;
int error;
- if (strlen(sec->name) < sizeof(XATTR_SECURITY_PREFIX))
+
+ if (XATTR_SECURITY_PREFIX_LEN + strlen(sec->name) > XATTR_NAME_MAX)
return -EINVAL;
- error = reiserfs_xattr_set_handle(th, inode, sec->name, sec->value,
+ strlcat(xattr_name, sec->name, sizeof(xattr_name));
+
+ error = reiserfs_xattr_set_handle(th, inode, xattr_name, sec->value,
sec->length, XATTR_CREATE);
if (error == -ENODATA || error == -EOPNOTSUPP)
error = 0;
--
2.25.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f7b58a69fad9d2c4c90cab0247811155dd0d48e7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023040334-attire-drone-2c8b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
f7b58a69fad9 ("dm: fix improper splitting for abnormal bios")
86a3238c7b9b ("dm: change "unsigned" to "unsigned int"")
7533afa1d27b ("dm: send just one event on resize, not two")
5cd6d1d53a1f ("dm integrity: Remove bi_sector that's only used by commented debug code")
22c40e134c4c ("dm cache: Add some documentation to dm-cache-background-tracker.h")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f7b58a69fad9d2c4c90cab0247811155dd0d48e7 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer(a)kernel.org>
Date: Thu, 30 Mar 2023 14:56:38 -0400
Subject: [PATCH] dm: fix improper splitting for abnormal bios
"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a2548b2 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().
It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such the resulting tio
could address more area of a device than it should.
For example, when discarding an entire DM striped device with the
following DM table:
vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048
Before this fix:
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
After this fix;
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
Fixes: 7dd06a2548b2 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable(a)vger.kernel.org
Reported-by: Orange Kao <orange(a)aiven.io>
Signed-off-by: Mike Snitzer <snitzer(a)kernel.org>
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2d0f934ba6e6..e67a2757c53e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1467,7 +1467,8 @@ static void setup_split_accounting(struct clone_info *ci, unsigned int len)
}
static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
- struct dm_target *ti, unsigned int num_bios)
+ struct dm_target *ti, unsigned int num_bios,
+ unsigned *len)
{
struct bio *bio;
int try;
@@ -1478,7 +1479,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
if (try)
mutex_lock(&ci->io->md->table_devices_lock);
for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
- bio = alloc_tio(ci, ti, bio_nr, NULL,
+ bio = alloc_tio(ci, ti, bio_nr, len,
try ? GFP_NOIO : GFP_NOWAIT);
if (!bio)
break;
@@ -1514,7 +1515,7 @@ static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
break;
default:
/* dm_accept_partial_bio() is not supported with shared tio->len_ptr */
- alloc_multiple_bios(&blist, ci, ti, num_bios);
+ alloc_multiple_bios(&blist, ci, ti, num_bios, len);
while ((clone = bio_list_pop(&blist))) {
dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
Currently, with VHE, KVM enables the EL0 event counting for the
guest on vcpu_load() or KVM enables it as a part of the PMU
register emulation process, when needed. However, in the migration
case (with VHE), the same handling is lacking, as vPMU register
values that were restored by userspace haven't been propagated yet
(the PMU events haven't been created) at the vcpu load-time on the
first KVM_RUN (kvm_vcpu_pmu_restore_guest() called from vcpu_load()
on the first KVM_RUN won't do anything as events_{guest,host} of
kvm_pmu_events are still zero).
So, with VHE, enable the guest's EL0 event counting on the first
KVM_RUN (after the migration) when needed. More specifically,
have kvm_pmu_handle_pmcr() call kvm_vcpu_pmu_restore_guest()
so that kvm_pmu_handle_pmcr() on the first KVM_RUN can take
care of it.
Fixes: d0c94c49792c ("KVM: arm64: Restore PMU configuration on first run")
Cc: stable(a)vger.kernel.org
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Signed-off-by: Reiji Watanabe <reijiw(a)google.com>
---
v2:
- Added more explanation to the commit message [Marc]
- Added Marc's r-b tag (Thank you!)
v1: https://lore.kernel.org/all/20230328034725.2051499-1-reijiw@google.com/
---
arch/arm64/kvm/pmu-emul.c | 1 +
arch/arm64/kvm/sys_regs.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 24908400e190..74e0d2b153b5 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -557,6 +557,7 @@ void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
for_each_set_bit(i, &mask, 32)
kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, i), 0, true);
}
+ kvm_vcpu_pmu_restore_guest(vcpu);
}
static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 53749d3a0996..425e1e9adae7 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -794,7 +794,6 @@ static bool access_pmcr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
if (!kvm_supports_32bit_el0())
val |= ARMV8_PMU_PMCR_LC;
kvm_pmu_handle_pmcr(vcpu, val);
- kvm_vcpu_pmu_restore_guest(vcpu);
} else {
/* PMCR.P & PMCR.C are RAZ */
val = __vcpu_sys_reg(vcpu, PMCR_EL0)
base-commit: 197b6b60ae7bc51dd0814953c562833143b292aa
--
2.40.0.348.gf938b09366-goog