From: Xiubo Li <xiubli(a)redhat.com>
The osd code has remove cursor initilizing code and this will make
the sparse read state into a infinite loop. We should initialize
the cursor just before each sparse-read in messnger v2.
Cc: stable(a)vger.kernel.org
URL: https://tracker.ceph.com/issues/64607
Fixes: 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket")
Reported-by: Luis Henriques <lhenriques(a)suse.de>
Signed-off-by: Xiubo Li <xiubli(a)redhat.com>
---
V2:
- Just removed the unnecessary 'sparse_read_total' check.
net/ceph/messenger_v2.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index a0ca5414b333..ab3ab130a911 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -2034,6 +2034,9 @@ static int prepare_sparse_read_data(struct ceph_connection *con)
if (!con_secure(con))
con->in_data_crc = -1;
+ ceph_msg_data_cursor_init(&con->v2.in_cursor, con->in_msg,
+ con->in_msg->sparse_read_total);
+
reset_in_kvecs(con);
con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_CONT;
con->v2.data_len_remain = data_len(msg);
--
2.43.0
This is the start of the stable review cycle for the 5.15.151 release.
There are 83 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 07 Mar 2024 11:31:11 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.151-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.151-rc2
Davide Caratti <dcaratti(a)redhat.com>
mptcp: fix double-free on socket dismantle
Gal Pressman <gal(a)nvidia.com>
Revert "tls: rx: move counting TlsDecryptErrors for sync"
Jakub Kicinski <kuba(a)kernel.org>
net: tls: fix async vs NIC crypto offload
Martynas Pumputis <m(a)lambda.lt>
bpf: Derive source IP addr via bpf_*_fib_lookup()
Louis DeLosSantos <louis.delos.devel(a)gmail.com>
bpf: Add table ID to bpf_fib_lookup BPF helper
Martin KaFai Lau <martin.lau(a)kernel.org>
bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "interconnect: Teach lockdep about icc_bw_lock order"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "interconnect: Fix locking for runpm vs reclaim"
Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
gpio: fix resource unwinding order in error path
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
gpiolib: Fix the error path order in gpiochip_add_data_with_key()
Arturas Moskvinas <arturas.moskvinas(a)gmail.com>
gpio: 74x164: Enable output pins after registers are reset
Kuniyuki Iwashima <kuniyu(a)amazon.com>
af_unix: Drop oob_skb ref before purging queue in GC.
Max Krummenacher <max.krummenacher(a)toradex.com>
Revert "drm/bridge: lt8912b: Register and attach our DSI device at probe"
Oscar Salvador <osalvador(a)suse.de>
fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
Baokun Li <libaokun1(a)huawei.com>
cachefiles: fix memory leak in cachefiles_add_cache()
Paolo Abeni <pabeni(a)redhat.com>
mptcp: fix possible deadlock in subflow diag
Paolo Abeni <pabeni(a)redhat.com>
mptcp: push at DSS boundaries
Geliang Tang <tanggeliang(a)kylinos.cn>
mptcp: add needs_id for netlink appending addr
Jean Sacren <sakiwit(a)gmail.com>
mptcp: clean up harmless false expressions
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
selftests: mptcp: add missing kconfig for NF Filter in v6
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
selftests: mptcp: add missing kconfig for NF Filter
Paolo Abeni <pabeni(a)redhat.com>
mptcp: rename timer related helper to less confusing names
Paolo Abeni <pabeni(a)redhat.com>
mptcp: process pending subflow error on close
Paolo Abeni <pabeni(a)redhat.com>
mptcp: move __mptcp_error_report in protocol.c
Paolo Bonzini <pbonzini(a)redhat.com>
x86/cpu/intel: Detect TME keyid bits before setting MTRR mask registers
Bjorn Andersson <quic_bjorande(a)quicinc.com>
pmdomain: qcom: rpmhpd: Fix enabled_corner aggregation
Elad Nachman <enachman(a)marvell.com>
mmc: sdhci-xenon: fix PHY init clock stability
Elad Nachman <enachman(a)marvell.com>
mmc: sdhci-xenon: add timeout for PHY init complete
Ivan Semenov <ivan(a)semenov.dev>
mmc: core: Fix eMMC initialization with 1-bit bus connection
Curtis Klein <curtis.klein(a)hpe.com>
dmaengine: fsl-qdma: init irq after reg initialization
Tadeusz Struk <tstruk(a)gigaio.com>
dmaengine: ptdma: use consistent DMA masks
Peng Ma <peng.ma(a)nxp.com>
dmaengine: fsl-qdma: fix SoC may hang on 16 byte unaligned read
David Sterba <dsterba(a)suse.com>
btrfs: dev-replace: properly validate device names
Johannes Berg <johannes.berg(a)intel.com>
wifi: nl80211: reject iftype change with mesh ID change
Alexander Ofitserov <oficerovas(a)altlinux.org>
gtp: fix use-after-free and null-ptr-deref in gtp_newlink()
Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
ALSA: firewire-lib: fix to check cycle continuity
Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
tomoyo: fix UAF write bug in tomoyo_write_control()
Dimitris Vlachos <dvlachos(a)ics.forth.gr>
riscv: Sparse-Memory/vmemmap out-of-bounds fix
David Howells <dhowells(a)redhat.com>
afs: Fix endless loop in directory parsing
Jiri Slaby (SUSE) <jirislaby(a)kernel.org>
fbcon: always restore the old font data in fbcon_do_set_font()
Takashi Iwai <tiwai(a)suse.de>
ALSA: Drop leftover snd-rtctimer stuff from Makefile
Hans de Goede <hdegoede(a)redhat.com>
power: supply: bq27xxx-i2c: Do not free non existing IRQ
Arnd Bergmann <arnd(a)arndb.de>
efi/capsule-loader: fix incorrect allocation size
Sabrina Dubroca <sd(a)queasysnail.net>
tls: decrement decrypt_pending if no async completion will be called
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: use async as an in-out argument
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: assume crypto always calls our callback
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: move counting TlsDecryptErrors for sync
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: don't track the async count
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: factor out writing ContentType to cmsg
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: wrap decryption arguments in a structure
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: don't report text length from the bowels of decrypt
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: drop unnecessary arguments from tls_setup_from_iter()
Jakub Kicinski <kuba(a)kernel.org>
tls: hw: rx: use return value of tls_device_decrypted() to carry status
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: refactor decrypt_skb_update()
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: don't issue wake ups when data is decrypted
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: don't store the decryption status in socket context
Jakub Kicinski <kuba(a)kernel.org>
tls: rx: don't store the record type in socket context
Oleksij Rempel <linux(a)rempel-privat.de>
igb: extend PTP timestamp adjustments to i211
Lin Ma <linma(a)zju.edu.cn>
rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
Florian Westphal <fw(a)strlen.de>
netfilter: bridge: confirm multicast packets before passing them up the stack
Florian Westphal <fw(a)strlen.de>
netfilter: let reset rules clean out conntrack entries
Florian Westphal <fw(a)strlen.de>
netfilter: make function op structures const
Florian Westphal <fw(a)strlen.de>
netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
Florian Westphal <fw(a)strlen.de>
netfilter: nfnetlink_queue: silence bogus compiler warning
Ignat Korchagin <ignat(a)cloudflare.com>
netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
Kai-Heng Feng <kai.heng.feng(a)canonical.com>
Bluetooth: Enforce validation on max value of connection interval
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: hci_event: Fix handling of HCI_EV_IO_CAPA_REQUEST
Zijun Hu <quic_zijuhu(a)quicinc.com>
Bluetooth: hci_event: Fix wrongly recorded wakeup BD_ADDR
Ying Hsu <yinghsu(a)chromium.org>
Bluetooth: Avoid potential use-after-free in hci_error_reset
Jakub Raczynski <j.raczynski(a)samsung.com>
stmmac: Clear variable when destroying workqueue
Justin Iurman <justin.iurman(a)uliege.be>
uapi: in6: replace temporary label with rfc9486
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
net: usb: dm9601: fix wrong return value in dm9601_mdio_read
Jakub Kicinski <kuba(a)kernel.org>
veth: try harder when allocating queue memory
Vasily Averin <vvs(a)openvz.org>
net: enable memcg accounting for veth queues
Oleksij Rempel <linux(a)rempel-privat.de>
lan78xx: enable auto speed configuration for LAN7850 if no EEPROM is detected
Eric Dumazet <edumazet(a)google.com>
ipv6: fix potential "struct net" leak in inet6_rtm_getaddr()
Jakub Kicinski <kuba(a)kernel.org>
net: veth: clear GRO when clearing XDP even when down
Doug Smythies <dsmythies(a)telus.net>
cpufreq: intel_pstate: fix pstate limits enforcement for adjust_perf call back
Yunjian Wang <wangyunjian(a)huawei.com>
tun: Fix xdp_rxq_info's queue_index when detaching
Florian Westphal <fw(a)strlen.de>
net: ip_tunnel: prevent perpetual headroom growth
Ryosuke Yasuoka <ryasuoka(a)redhat.com>
netlink: Fix kernel-infoleak-after-free in __skb_datagram_iter
Han Xu <han.xu(a)nxp.com>
mtd: spinand: gigadevice: Fix the get ecc status issue
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nf_tables: disallow timeout for anonymous sets
-------------
Diffstat:
Makefile | 4 +-
arch/riscv/include/asm/pgtable.h | 2 +-
arch/x86/kernel/cpu/intel.c | 178 ++++++------
drivers/cpufreq/intel_pstate.c | 3 +
drivers/dma/fsl-qdma.c | 25 +-
drivers/dma/ptdma/ptdma-dmaengine.c | 2 -
drivers/firmware/efi/capsule-loader.c | 2 +-
drivers/gpio/gpio-74x164.c | 4 +-
drivers/gpio/gpiolib.c | 12 +-
drivers/gpu/drm/bridge/lontium-lt8912b.c | 11 +-
drivers/interconnect/core.c | 18 +-
drivers/mmc/core/mmc.c | 2 +
drivers/mmc/host/sdhci-xenon-phy.c | 48 +++-
drivers/mtd/nand/spi/gigadevice.c | 6 +-
drivers/net/ethernet/intel/igb/igb_ptp.c | 5 +-
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 +-
drivers/net/gtp.c | 12 +-
drivers/net/tun.c | 1 +
drivers/net/usb/dm9601.c | 2 +-
drivers/net/usb/lan78xx.c | 3 +-
drivers/net/veth.c | 40 +--
drivers/power/supply/bq27xxx_battery_i2c.c | 4 +-
drivers/soc/qcom/rpmhpd.c | 7 +-
drivers/video/fbdev/core/fbcon.c | 8 +-
fs/afs/dir.c | 4 +-
fs/btrfs/dev-replace.c | 24 +-
fs/cachefiles/bind.c | 3 +
fs/hugetlbfs/inode.c | 6 +-
include/linux/netfilter.h | 14 +-
include/net/ipv6_stubs.h | 5 +
include/net/netfilter/nf_conntrack.h | 8 +
include/net/strparser.h | 4 +
include/net/tls.h | 11 +-
include/uapi/linux/bpf.h | 37 ++-
include/uapi/linux/in6.h | 2 +-
net/bluetooth/hci_core.c | 7 +-
net/bluetooth/hci_event.c | 13 +-
net/bluetooth/l2cap_core.c | 8 +-
net/bridge/br_netfilter_hooks.c | 96 +++++++
net/bridge/netfilter/nf_conntrack_bridge.c | 30 ++
net/core/filter.c | 67 ++++-
net/core/rtnetlink.c | 11 +-
net/ipv4/ip_tunnel.c | 28 +-
net/ipv4/netfilter/nf_reject_ipv4.c | 1 +
net/ipv6/addrconf.c | 7 +-
net/ipv6/af_inet6.c | 1 +
net/ipv6/netfilter/nf_reject_ipv6.c | 1 +
net/mptcp/diag.c | 3 +
net/mptcp/pm_netlink.c | 30 +-
net/mptcp/protocol.c | 123 +++++++--
net/mptcp/subflow.c | 36 ---
net/netfilter/core.c | 45 +--
net/netfilter/nf_conntrack_core.c | 21 +-
net/netfilter/nf_conntrack_netlink.c | 4 +-
net/netfilter/nf_conntrack_proto_tcp.c | 35 +++
net/netfilter/nf_nat_core.c | 2 +-
net/netfilter/nf_tables_api.c | 7 +
net/netfilter/nfnetlink_queue.c | 10 +-
net/netfilter/nft_compat.c | 20 ++
net/netlink/af_netlink.c | 2 +-
net/tls/tls_device.c | 6 +-
net/tls/tls_sw.c | 316 ++++++++++------------
net/unix/garbage.c | 22 +-
net/wireless/nl80211.c | 2 +
security/tomoyo/common.c | 3 +-
sound/core/Makefile | 1 -
sound/firewire/amdtp-stream.c | 2 +-
tools/include/uapi/linux/bpf.h | 37 ++-
tools/testing/selftests/net/mptcp/config | 2 +
69 files changed, 991 insertions(+), 529 deletions(-)
From: Dominique Martinet <dominique.martinet(a)atmark-techno.com>
Commit e7794c14fd73 ("mmc: rpmb: fixes pause retune on all RPMB
partitions.") added a mask check for 'part_type', but the mask used was
wrong leading to the code intended for rpmb also being executed for GP3.
On some MMCs (but not all) this would make gp3 partition inaccessible:
armadillo:~# head -c 1 < /dev/mmcblk2gp3
head: standard input: I/O error
armadillo:~# dmesg -c
[ 422.976583] mmc2: running CQE recovery
[ 423.058182] mmc2: running CQE recovery
[ 423.137607] mmc2: running CQE recovery
[ 423.137802] blk_update_request: I/O error, dev mmcblk2gp3, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[ 423.237125] mmc2: running CQE recovery
[ 423.318206] mmc2: running CQE recovery
[ 423.397680] mmc2: running CQE recovery
[ 423.397837] blk_update_request: I/O error, dev mmcblk2gp3, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 423.408287] Buffer I/O error on dev mmcblk2gp3, logical block 0, async page read
the part_type values of interest here are defined as follow:
main 0
boot0 1
boot1 2
rpmb 3
gp0 4
gp1 5
gp2 6
gp3 7
so mask with EXT_CSD_PART_CONFIG_ACC_MASK (7) to correctly identify rpmb
Fixes: e7794c14fd73 ("mmc: rpmb: fixes pause retune on all RPMB partitions.")
Cc: stable(a)vger.kernel.org
Cc: Jorge Ramirez-Ortiz <jorge(a)foundries.io>
Signed-off-by: Dominique Martinet <dominique.martinet(a)atmark-techno.com>
---
A couple of notes:
- this doesn't fail on all eMMCs, I can still access gp3 on some models
but it seems to fail reliably with micron's "G1M15L"
- I've encountered this on the 5.10 backport (in 5.10.208), so that'll
need to be backported everywhere the fix was taken...
Thanks!
---
drivers/mmc/core/block.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 32d49100dff5..86efa6084696 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -874,10 +874,11 @@ static const struct block_device_operations mmc_bdops = {
static int mmc_blk_part_switch_pre(struct mmc_card *card,
unsigned int part_type)
{
- const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_RPMB;
+ const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_MASK;
+ const unsigned int rpmb = EXT_CSD_PART_CONFIG_ACC_RPMB;
int ret = 0;
- if ((part_type & mask) == mask) {
+ if ((part_type & mask) == rpmb) {
if (card->ext_csd.cmdq_en) {
ret = mmc_cmdq_disable(card);
if (ret)
@@ -892,10 +893,11 @@ static int mmc_blk_part_switch_pre(struct mmc_card *card,
static int mmc_blk_part_switch_post(struct mmc_card *card,
unsigned int part_type)
{
- const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_RPMB;
+ const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_MASK;
+ const unsigned int rpmb = EXT_CSD_PART_CONFIG_ACC_RPMB;
int ret = 0;
- if ((part_type & mask) == mask) {
+ if ((part_type & mask) == rpmb) {
mmc_retune_unpause(card->host);
if (card->reenable_cmdq && !card->ext_csd.cmdq_en)
ret = mmc_cmdq_enable(card);
---
base-commit: 5847c9777c303a792202c609bd761dceb60f4eed
change-id: 20240306-mmc-partswitch-c3a50b5084ae
Best regards,
--
Dominique Martinet | Asmadeus
In the following sequence:
1) of_platform_depopulate()
2) of_overlay_remove()
During the step 1, devices are destroyed and devlinks are removed.
During the step 2, OF nodes are destroyed but
__of_changeset_entry_destroy() can raise warnings related to missing
of_node_put():
ERROR: memory leak, expected refcount 1 instead of 2 ...
Indeed, during the devlink removals performed at step 1, the removal
itself releasing the device (and the attached of_node) is done by a job
queued in a workqueue and so, it is done asynchronously with respect to
function calls.
When the warning is present, of_node_put() will be called but wrongly
too late from the workqueue job.
In order to be sure that any ongoing devlink removals are done before
the of_node destruction, synchronize the of_changeset_destroy() with the
devlink removals.
Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
Cc: stable(a)vger.kernel.org
Signed-off-by: Herve Codina <herve.codina(a)bootlin.com>
---
drivers/of/dynamic.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 3bf27052832f..169e2a9ae22f 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -9,6 +9,7 @@
#define pr_fmt(fmt) "OF: " fmt
+#include <linux/device.h>
#include <linux/of.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
@@ -667,6 +668,12 @@ void of_changeset_destroy(struct of_changeset *ocs)
{
struct of_changeset_entry *ce, *cen;
+ /*
+ * Wait for any ongoing device link removals before destroying some of
+ * nodes.
+ */
+ device_link_wait_removal();
+
list_for_each_entry_safe_reverse(ce, cen, &ocs->entries, node)
__of_changeset_entry_destroy(ce);
}
--
2.43.0
The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
introduces a workqueue to release the consumer and supplier devices used
in the devlink.
In the job queued, devices are release and in turn, when all the
references to these devices are dropped, the release function of the
device itself is called.
Nothing is present to provide some synchronisation with this workqueue
in order to ensure that all ongoing releasing operations are done and
so, some other operations can be started safely.
For instance, in the following sequence:
1) of_platform_depopulate()
2) of_overlay_remove()
During the step 1, devices are released and related devlinks are removed
(jobs pushed in the workqueue).
During the step 2, OF nodes are destroyed but, without any
synchronisation with devlink removal jobs, of_overlay_remove() can raise
warnings related to missing of_node_put():
ERROR: memory leak, expected refcount 1 instead of 2
Indeed, the missing of_node_put() call is going to be done, too late,
from the workqueue job execution.
Introduce device_link_wait_removal() to offer a way to synchronize
operations waiting for the end of devlink removals (i.e. end of
workqueue jobs).
Also, as a flushing operation is done on the workqueue, the workqueue
used is moved from a system-wide workqueue to a local one.
Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
Cc: stable(a)vger.kernel.org
Signed-off-by: Herve Codina <herve.codina(a)bootlin.com>
---
drivers/base/core.c | 26 +++++++++++++++++++++++---
include/linux/device.h | 1 +
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index d5f4e4aac09b..48b28c59c592 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -44,6 +44,7 @@ static bool fw_devlink_is_permissive(void);
static void __fw_devlink_link_to_consumers(struct device *dev);
static bool fw_devlink_drv_reg_done;
static bool fw_devlink_best_effort;
+static struct workqueue_struct *device_link_wq;
/**
* __fwnode_link_add - Create a link between two fwnode_handles.
@@ -532,12 +533,26 @@ static void devlink_dev_release(struct device *dev)
/*
* It may take a while to complete this work because of the SRCU
* synchronization in device_link_release_fn() and if the consumer or
- * supplier devices get deleted when it runs, so put it into the "long"
- * workqueue.
+ * supplier devices get deleted when it runs, so put it into the
+ * dedicated workqueue.
*/
- queue_work(system_long_wq, &link->rm_work);
+ queue_work(device_link_wq, &link->rm_work);
}
+/**
+ * device_link_wait_removal - Wait for ongoing devlink removal jobs to terminate
+ */
+void device_link_wait_removal(void)
+{
+ /*
+ * devlink removal jobs are queued in the dedicated work queue.
+ * To be sure that all removal jobs are terminated, ensure that any
+ * scheduled work has run to completion.
+ */
+ flush_workqueue(device_link_wq);
+}
+EXPORT_SYMBOL_GPL(device_link_wait_removal);
+
static struct class devlink_class = {
.name = "devlink",
.dev_groups = devlink_groups,
@@ -4099,9 +4114,14 @@ int __init devices_init(void)
sysfs_dev_char_kobj = kobject_create_and_add("char", dev_kobj);
if (!sysfs_dev_char_kobj)
goto char_kobj_err;
+ device_link_wq = alloc_workqueue("device_link_wq", 0, 0);
+ if (!device_link_wq)
+ goto wq_err;
return 0;
+ wq_err:
+ kobject_put(sysfs_dev_char_kobj);
char_kobj_err:
kobject_put(sysfs_dev_block_kobj);
block_kobj_err:
diff --git a/include/linux/device.h b/include/linux/device.h
index 1795121dee9a..d7d8305a72e8 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1249,6 +1249,7 @@ void device_link_del(struct device_link *link);
void device_link_remove(void *consumer, struct device *supplier);
void device_links_supplier_sync_state_pause(void);
void device_links_supplier_sync_state_resume(void);
+void device_link_wait_removal(void);
/* Create alias, so I can be autoloaded. */
#define MODULE_ALIAS_CHARDEV(major,minor) \
--
2.43.0
The quilt patch titled
Subject: mm: swap: fix race between free_swap_and_cache() and swapoff()
has been removed from the -mm tree. Its filename was
mm-swap-fix-race-between-free_swap_and_cache-and-swapoff.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: swap: fix race between free_swap_and_cache() and swapoff()
Date: Wed, 6 Mar 2024 14:03:56 +0000
There was previously a theoretical window where swapoff() could run and
teardown a swap_info_struct while a call to free_swap_and_cache() was
running in another thread. This could cause, amongst other bad
possibilities, swap_page_trans_huge_swapped() (called by
free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a
test case. But there has been agreement based on code review that this is
possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall
swapoff(). There was an extra check in _swap_info_get() to confirm that
the swap entry was not free. This isn't present in get_swap_device()
because it doesn't make sense in general due to the race between getting
the reference and swapoff. So I've added an equivalent check directly in
free_swap_and_cache().
Details of how to provoke one possible issue (thanks to David Hildenbrand
for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in
"count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn
si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
still references by swap entries.
Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
__try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff to succeed after process 2 reclaimed the swap cache
but before process1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Link: https://lkml.kernel.org/r/20240306140356.3974886-1-ryan.roberts@arm.com
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redha…
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: "Huang, Ying" <ying.huang(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/mm/swapfile.c~mm-swap-fix-race-between-free_swap_and_cache-and-swapoff
+++ a/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_l
* with get_swap_device() and put_swap_device(), unless the swap
* functions call get/put_swap_device() by themselves.
*
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
+ *
* Check whether swap entry is valid in the swap device. If so,
* return pointer to swap_info_struct, and keep the swap entry valid
* via preventing the swap device from being swapoff, until
@@ -1609,13 +1614,19 @@ int free_swap_and_cache(swp_entry_t entr
if (non_swap_entry(entry))
return 1;
- p = _swap_info_get(entry);
+ p = get_swap_device(entry);
if (p) {
+ if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
+ put_swap_device(p);
+ return 0;
+ }
+
count = __swap_entry_free(p, entry);
if (count == SWAP_HAS_CACHE &&
!swap_page_trans_huge_swapped(p, entry))
__try_to_reclaim_swap(p, swp_offset(entry),
TTRS_UNMAPPED | TTRS_FULL);
+ put_swap_device(p);
}
return p != NULL;
}
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
The patch titled
Subject: mm: swap: fix race between free_swap_and_cache() and swapoff()
has been added to the -mm mm-unstable branch. Its filename is
mm-swap-fix-race-between-free_swap_and_cache-and-swapoff.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: swap: fix race between free_swap_and_cache() and swapoff()
Date: Wed, 6 Mar 2024 14:03:56 +0000
There was previously a theoretical window where swapoff() could run and
teardown a swap_info_struct while a call to free_swap_and_cache() was
running in another thread. This could cause, amongst other bad
possibilities, swap_page_trans_huge_swapped() (called by
free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a
test case. But there has been agreement based on code review that this is
possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall
swapoff(). There was an extra check in _swap_info_get() to confirm that
the swap entry was not free. This isn't present in get_swap_device()
because it doesn't make sense in general due to the race between getting
the reference and swapoff. So I've added an equivalent check directly in
free_swap_and_cache().
Details of how to provoke one possible issue (thanks to David Hildenbrand
for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in
"count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn
si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
still references by swap entries.
Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
__try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff to succeed after process 2 reclaimed the swap cache
but before process1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Link: https://lkml.kernel.org/r/20240306140356.3974886-1-ryan.roberts@arm.com
Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redha…
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: "Huang, Ying" <ying.huang(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/mm/swapfile.c~mm-swap-fix-race-between-free_swap_and_cache-and-swapoff
+++ a/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_l
* with get_swap_device() and put_swap_device(), unless the swap
* functions call get/put_swap_device() by themselves.
*
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
+ *
* Check whether swap entry is valid in the swap device. If so,
* return pointer to swap_info_struct, and keep the swap entry valid
* via preventing the swap device from being swapoff, until
@@ -1609,13 +1614,19 @@ int free_swap_and_cache(swp_entry_t entr
if (non_swap_entry(entry))
return 1;
- p = _swap_info_get(entry);
+ p = get_swap_device(entry);
if (p) {
+ if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
+ put_swap_device(p);
+ return 0;
+ }
+
count = __swap_entry_free(p, entry);
if (count == SWAP_HAS_CACHE &&
!swap_page_trans_huge_swapped(p, entry))
__try_to_reclaim_swap(p, swp_offset(entry),
TTRS_UNMAPPED | TTRS_FULL);
+ put_swap_device(p);
}
return p != NULL;
}
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
mm-swap-fix-race-between-free_swap_and_cache-and-swapoff.patch