From: Chuck Lever <chuck.lever(a)oracle.com>
Following up on
https://lore.kernel.org/linux-nfs/d4b235df-4ee5-4824-9d48-e3b3c1f1f4d1@orac…
Here is a backport series targeting origin/linux-5.10.y that closes
the information leak described in the above thread.
Review comments welcome.
Chuck Lever (6):
NFSD: Refactor nfsd_reply_cache_free_locked()
NFSD: Rename nfsd_reply_cache_alloc()
NFSD: Replace nfsd_prune_bucket()
NFSD: Refactor the duplicate reply cache shrinker
NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
NFSD: Fix frame size warning in svc_export_parse()
Jeff Layton (2):
nfsd: move reply cache initialization into nfsd startup
nfsd: move init of percpu reply_cache_stats counters back to
nfsd_init_net
Josef Bacik (10):
sunrpc: don't change ->sv_stats if it doesn't exist
nfsd: stop setting ->pg_stats for unused stats
sunrpc: pass in the sv_stats struct through svc_create_pooled
sunrpc: remove ->pg_stats from svc_program
sunrpc: use the struct net as the svc proc private
nfsd: rename NFSD_NET_* to NFSD_STATS_*
nfsd: expose /proc/net/sunrpc/nfsd in net namespaces
nfsd: make all of the nfsd stats per-network namespace
nfsd: remove nfsd_stats, make th_cnt a global counter
nfsd: make svc_stat per-network namespace instead of global
NeilBrown (1):
NFSD: simplify error paths in nfsd_svc()
fs/lockd/svc.c | 3 -
fs/nfs/callback.c | 3 -
fs/nfsd/export.c | 32 ++++--
fs/nfsd/export.h | 4 +-
fs/nfsd/netns.h | 25 ++++-
fs/nfsd/nfs4proc.c | 6 +-
fs/nfsd/nfscache.c | 202 ++++++++++++++++++++++---------------
fs/nfsd/nfsctl.c | 24 ++---
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsfh.c | 3 +-
fs/nfsd/nfssvc.c | 38 ++++---
fs/nfsd/stats.c | 52 ++++------
fs/nfsd/stats.h | 83 ++++++---------
fs/nfsd/trace.h | 22 ++++
fs/nfsd/vfs.c | 6 +-
include/linux/sunrpc/svc.h | 5 +-
net/sunrpc/stats.c | 2 +-
net/sunrpc/svc.c | 36 ++++---
18 files changed, 306 insertions(+), 241 deletions(-)
--
2.45.1
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x 562755501d44cfbbe82703a62cb41502bd067bd1
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090812-ample-stowaway-5c06@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
562755501d44 ("ALSA: hda/realtek: extend quirks for Clevo V5[46]0")
03c5c350e38d ("ALSA: hda/realtek: Add support for new HP G12 laptops")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 562755501d44cfbbe82703a62cb41502bd067bd1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Marek=20Marczykowski-G=C3=B3recki?=
<marmarek(a)invisiblethingslab.com>
Date: Tue, 3 Sep 2024 14:49:31 +0200
Subject: [PATCH] ALSA: hda/realtek: extend quirks for Clevo V5[46]0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The mic in those laptops suffers too high gain resulting in mostly (fan
or else) noise being recorded. In addition to the existing fixup about
mic detection, apply also limiting its boost. While at it, extend the
quirk to also V5[46]0TNE models, which have the same issue.
Signed-off-by: Marek Marczykowski-Górecki <marmarek(a)invisiblethingslab.com>
Cc: <stable(a)vger.kernel.org>
Link: https://patch.msgid.link/20240903124939.6213-1-marmarek@invisiblethingslab.…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
index ff62702a8226..fd7711d69823 100644
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -7638,6 +7638,7 @@ enum {
ALC287_FIXUP_LENOVO_14ARP8_LEGION_IAH7,
ALC287_FIXUP_LENOVO_SSID_17AA3820,
ALCXXX_FIXUP_CS35LXX,
+ ALC245_FIXUP_CLEVO_NOISY_MIC,
};
/* A special fixup for Lenovo C940 and Yoga Duet 7;
@@ -9977,6 +9978,12 @@ static const struct hda_fixup alc269_fixups[] = {
.type = HDA_FIXUP_FUNC,
.v.func = cs35lxx_autodet_fixup,
},
+ [ALC245_FIXUP_CLEVO_NOISY_MIC] = {
+ .type = HDA_FIXUP_FUNC,
+ .v.func = alc269_fixup_limit_int_mic_boost,
+ .chained = true,
+ .chain_id = ALC256_FIXUP_SYSTEM76_MIC_NO_PRESENCE,
+ },
};
static const struct snd_pci_quirk alc269_fixup_tbl[] = {
@@ -10626,7 +10633,8 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = {
SND_PCI_QUIRK(0x1558, 0xa600, "Clevo NL50NU", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
SND_PCI_QUIRK(0x1558, 0xa650, "Clevo NP[567]0SN[CD]", ALC256_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
SND_PCI_QUIRK(0x1558, 0xa671, "Clevo NP70SN[CDE]", ALC256_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
- SND_PCI_QUIRK(0x1558, 0xa763, "Clevo V54x_6x_TU", ALC256_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
+ SND_PCI_QUIRK(0x1558, 0xa741, "Clevo V54x_6x_TNE", ALC245_FIXUP_CLEVO_NOISY_MIC),
+ SND_PCI_QUIRK(0x1558, 0xa763, "Clevo V54x_6x_TU", ALC245_FIXUP_CLEVO_NOISY_MIC),
SND_PCI_QUIRK(0x1558, 0xb018, "Clevo NP50D[BE]", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
SND_PCI_QUIRK(0x1558, 0xb019, "Clevo NH77D[BE]Q", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
SND_PCI_QUIRK(0x1558, 0xb022, "Clevo NH77D[DC][QW]", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE),
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 65444581a4aecf0e96b4691bb20fc75c602f5863
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090809-wrongly-repulsive-5a71@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
65444581a4ae ("drm/amd/display: Determine IPS mode by ASIC and PMFW versions")
234e94555800 ("drm/amd/display: Enable copying of bounding box data from VBIOS DMUB")
afca033f10d3 ("drm/amd/display: Add periodic detection for IPS")
05c5ffaac770 ("drm/amd/display: gpuvm handling in DML21")
9ba971b25316 ("drm/amd/display: Re-enable IPS2 for static screen")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
14813934b629 ("drm/amd/display: Allow RCG for Static Screen + LVP for DCN35")
e779f4587f61 ("drm/amd/display: Add handling for DC power mode")
cc263c3a0c9f ("drm/amd/display: remove context->dml2 dependency from DML21 wrapper")
d62d5551dd61 ("drm/amd/display: Backup and restore only on full updates")
2d5bb791e24f ("drm/amd/display: Implement update_planes_and_stream_v3 sequence")
4f5b8d78ca43 ("drm/amd/display: Init DPPCLK from SMU on dcn32")
2728e9c7c842 ("drm/amd/display: add DC changes for DCN351")
d2dea1f14038 ("drm/amd/display: Generalize new minimal transition path")
0701117efd1e ("Revert "drm/amd/display: For FPO and SubVP/DRR configs program vmin/max sel"")
a9b1a4f684b3 ("drm/amd/display: Add more checks for exiting idle in DC")
13b3d6bdbeb4 ("drm/amd/display: add debugfs disallow edp psr")
dcbf438d4834 ("drm/amd/display: Unify optimize_required flags and VRR adjustments")
1630c6ded587 ("drm/amd/display: "Enable IPS by default"")
8457bddc266c ("drm/amd/display: Revert "Rework DC Z10 restore"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 65444581a4aecf0e96b4691bb20fc75c602f5863 Mon Sep 17 00:00:00 2001
From: Leo Li <sunpeng.li(a)amd.com>
Date: Tue, 27 Aug 2024 11:29:53 -0400
Subject: [PATCH] drm/amd/display: Determine IPS mode by ASIC and PMFW versions
[Why]
DCN IPS interoperates with other system idle power features, such as
Zstates.
On DCN35, there is a known issue where system Z8 + DCN IPS2 causes a
hard hang. We observe this on systems where the SBIOS allows Z8.
Though there is a SBIOS fix, there's no guarantee that users will get it
any time soon, or even install it. A workaround is needed to prevent
this from rearing its head in the wild.
[How]
For DCN35, check the pmfw version to determine whether the SBIOS has the
fix. If not, set IPS1+RCG as the deepest possible state in all cases
except for s0ix and display off (DPMS). Otherwise, enable all IPS
Signed-off-by: Leo Li <sunpeng.li(a)amd.com>
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 28d43d0895896f84c038d906d244e0a95eb243ec)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 983a977632ff..e6cea5b9bdb3 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1752,6 +1752,30 @@ static struct dml2_soc_bb *dm_dmub_get_vbios_bounding_box(struct amdgpu_device *
return bb;
}
+static enum dmub_ips_disable_type dm_get_default_ips_mode(
+ struct amdgpu_device *adev)
+{
+ /*
+ * On DCN35 systems with Z8 enabled, it's possible for IPS2 + Z8 to
+ * cause a hard hang. A fix exists for newer PMFW.
+ *
+ * As a workaround, for non-fixed PMFW, force IPS1+RCG as the deepest
+ * IPS state in all cases, except for s0ix and all displays off (DPMS),
+ * where IPS2 is allowed.
+ *
+ * When checking pmfw version, use the major and minor only.
+ */
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) == IP_VERSION(3, 5, 0) &&
+ (adev->pm.fw_version & 0x00FFFF00) < 0x005D6300)
+ return DMUB_IPS_RCG_IN_ACTIVE_IPS2_IN_OFF;
+
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) >= IP_VERSION(3, 5, 0))
+ return DMUB_IPS_ENABLE;
+
+ /* ASICs older than DCN35 do not have IPSs */
+ return DMUB_IPS_DISABLE_ALL;
+}
+
static int amdgpu_dm_init(struct amdgpu_device *adev)
{
struct dc_init_data init_data;
@@ -1863,7 +1887,7 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
if (amdgpu_dc_debug_mask & DC_DISABLE_IPS)
init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
else
- init_data.flags.disable_ips = DMUB_IPS_ENABLE;
+ init_data.flags.disable_ips = dm_get_default_ips_mode(adev);
init_data.flags.disable_ips_in_vpb = 0;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 65444581a4aecf0e96b4691bb20fc75c602f5863
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090805-visa-hankering-f46e@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
65444581a4ae ("drm/amd/display: Determine IPS mode by ASIC and PMFW versions")
234e94555800 ("drm/amd/display: Enable copying of bounding box data from VBIOS DMUB")
afca033f10d3 ("drm/amd/display: Add periodic detection for IPS")
05c5ffaac770 ("drm/amd/display: gpuvm handling in DML21")
9ba971b25316 ("drm/amd/display: Re-enable IPS2 for static screen")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
14813934b629 ("drm/amd/display: Allow RCG for Static Screen + LVP for DCN35")
e779f4587f61 ("drm/amd/display: Add handling for DC power mode")
cc263c3a0c9f ("drm/amd/display: remove context->dml2 dependency from DML21 wrapper")
d62d5551dd61 ("drm/amd/display: Backup and restore only on full updates")
2d5bb791e24f ("drm/amd/display: Implement update_planes_and_stream_v3 sequence")
4f5b8d78ca43 ("drm/amd/display: Init DPPCLK from SMU on dcn32")
2728e9c7c842 ("drm/amd/display: add DC changes for DCN351")
d2dea1f14038 ("drm/amd/display: Generalize new minimal transition path")
0701117efd1e ("Revert "drm/amd/display: For FPO and SubVP/DRR configs program vmin/max sel"")
a9b1a4f684b3 ("drm/amd/display: Add more checks for exiting idle in DC")
13b3d6bdbeb4 ("drm/amd/display: add debugfs disallow edp psr")
dcbf438d4834 ("drm/amd/display: Unify optimize_required flags and VRR adjustments")
1630c6ded587 ("drm/amd/display: "Enable IPS by default"")
8457bddc266c ("drm/amd/display: Revert "Rework DC Z10 restore"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 65444581a4aecf0e96b4691bb20fc75c602f5863 Mon Sep 17 00:00:00 2001
From: Leo Li <sunpeng.li(a)amd.com>
Date: Tue, 27 Aug 2024 11:29:53 -0400
Subject: [PATCH] drm/amd/display: Determine IPS mode by ASIC and PMFW versions
[Why]
DCN IPS interoperates with other system idle power features, such as
Zstates.
On DCN35, there is a known issue where system Z8 + DCN IPS2 causes a
hard hang. We observe this on systems where the SBIOS allows Z8.
Though there is a SBIOS fix, there's no guarantee that users will get it
any time soon, or even install it. A workaround is needed to prevent
this from rearing its head in the wild.
[How]
For DCN35, check the pmfw version to determine whether the SBIOS has the
fix. If not, set IPS1+RCG as the deepest possible state in all cases
except for s0ix and display off (DPMS). Otherwise, enable all IPS
Signed-off-by: Leo Li <sunpeng.li(a)amd.com>
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 28d43d0895896f84c038d906d244e0a95eb243ec)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 983a977632ff..e6cea5b9bdb3 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1752,6 +1752,30 @@ static struct dml2_soc_bb *dm_dmub_get_vbios_bounding_box(struct amdgpu_device *
return bb;
}
+static enum dmub_ips_disable_type dm_get_default_ips_mode(
+ struct amdgpu_device *adev)
+{
+ /*
+ * On DCN35 systems with Z8 enabled, it's possible for IPS2 + Z8 to
+ * cause a hard hang. A fix exists for newer PMFW.
+ *
+ * As a workaround, for non-fixed PMFW, force IPS1+RCG as the deepest
+ * IPS state in all cases, except for s0ix and all displays off (DPMS),
+ * where IPS2 is allowed.
+ *
+ * When checking pmfw version, use the major and minor only.
+ */
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) == IP_VERSION(3, 5, 0) &&
+ (adev->pm.fw_version & 0x00FFFF00) < 0x005D6300)
+ return DMUB_IPS_RCG_IN_ACTIVE_IPS2_IN_OFF;
+
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) >= IP_VERSION(3, 5, 0))
+ return DMUB_IPS_ENABLE;
+
+ /* ASICs older than DCN35 do not have IPSs */
+ return DMUB_IPS_DISABLE_ALL;
+}
+
static int amdgpu_dm_init(struct amdgpu_device *adev)
{
struct dc_init_data init_data;
@@ -1863,7 +1887,7 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
if (amdgpu_dc_debug_mask & DC_DISABLE_IPS)
init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
else
- init_data.flags.disable_ips = DMUB_IPS_ENABLE;
+ init_data.flags.disable_ips = dm_get_default_ips_mode(adev);
init_data.flags.disable_ips_in_vpb = 0;
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x 65444581a4aecf0e96b4691bb20fc75c602f5863
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090802-spill-spooky-94c0@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
65444581a4ae ("drm/amd/display: Determine IPS mode by ASIC and PMFW versions")
234e94555800 ("drm/amd/display: Enable copying of bounding box data from VBIOS DMUB")
afca033f10d3 ("drm/amd/display: Add periodic detection for IPS")
05c5ffaac770 ("drm/amd/display: gpuvm handling in DML21")
9ba971b25316 ("drm/amd/display: Re-enable IPS2 for static screen")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 65444581a4aecf0e96b4691bb20fc75c602f5863 Mon Sep 17 00:00:00 2001
From: Leo Li <sunpeng.li(a)amd.com>
Date: Tue, 27 Aug 2024 11:29:53 -0400
Subject: [PATCH] drm/amd/display: Determine IPS mode by ASIC and PMFW versions
[Why]
DCN IPS interoperates with other system idle power features, such as
Zstates.
On DCN35, there is a known issue where system Z8 + DCN IPS2 causes a
hard hang. We observe this on systems where the SBIOS allows Z8.
Though there is a SBIOS fix, there's no guarantee that users will get it
any time soon, or even install it. A workaround is needed to prevent
this from rearing its head in the wild.
[How]
For DCN35, check the pmfw version to determine whether the SBIOS has the
fix. If not, set IPS1+RCG as the deepest possible state in all cases
except for s0ix and display off (DPMS). Otherwise, enable all IPS
Signed-off-by: Leo Li <sunpeng.li(a)amd.com>
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 28d43d0895896f84c038d906d244e0a95eb243ec)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 983a977632ff..e6cea5b9bdb3 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1752,6 +1752,30 @@ static struct dml2_soc_bb *dm_dmub_get_vbios_bounding_box(struct amdgpu_device *
return bb;
}
+static enum dmub_ips_disable_type dm_get_default_ips_mode(
+ struct amdgpu_device *adev)
+{
+ /*
+ * On DCN35 systems with Z8 enabled, it's possible for IPS2 + Z8 to
+ * cause a hard hang. A fix exists for newer PMFW.
+ *
+ * As a workaround, for non-fixed PMFW, force IPS1+RCG as the deepest
+ * IPS state in all cases, except for s0ix and all displays off (DPMS),
+ * where IPS2 is allowed.
+ *
+ * When checking pmfw version, use the major and minor only.
+ */
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) == IP_VERSION(3, 5, 0) &&
+ (adev->pm.fw_version & 0x00FFFF00) < 0x005D6300)
+ return DMUB_IPS_RCG_IN_ACTIVE_IPS2_IN_OFF;
+
+ if (amdgpu_ip_version(adev, DCE_HWIP, 0) >= IP_VERSION(3, 5, 0))
+ return DMUB_IPS_ENABLE;
+
+ /* ASICs older than DCN35 do not have IPSs */
+ return DMUB_IPS_DISABLE_ALL;
+}
+
static int amdgpu_dm_init(struct amdgpu_device *adev)
{
struct dc_init_data init_data;
@@ -1863,7 +1887,7 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
if (amdgpu_dc_debug_mask & DC_DISABLE_IPS)
init_data.flags.disable_ips = DMUB_IPS_DISABLE_ALL;
else
- init_data.flags.disable_ips = DMUB_IPS_ENABLE;
+ init_data.flags.disable_ips = dm_get_default_ips_mode(adev);
init_data.flags.disable_ips_in_vpb = 0;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 38e3285dbd07db44487bbaca8c383a5d7f3c11f3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090844-speech-subzero-1de7@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
38e3285dbd07 ("drm/amd/display: Block timing sync for different signals in PMO")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 38e3285dbd07db44487bbaca8c383a5d7f3c11f3 Mon Sep 17 00:00:00 2001
From: Dillon Varone <dillon.varone(a)amd.com>
Date: Thu, 22 Aug 2024 17:52:57 -0400
Subject: [PATCH] drm/amd/display: Block timing sync for different signals in
PMO
PMO assumes that like timings can be synchronized, but DC only allows
this if the signal types match.
Reviewed-by: Austin Zheng <austin.zheng(a)amd.com>
Signed-off-by: Dillon Varone <dillon.varone(a)amd.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 29d3d6af43135de7bec677f334292ca8dab53d67)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
index 603036df68ba..6547cc2c2a77 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
@@ -811,7 +811,8 @@ static void build_synchronized_timing_groups(
for (j = i + 1; j < display_config->display_config.num_streams; j++) {
if (memcmp(master_timing,
&display_config->display_config.stream_descriptors[j].timing,
- sizeof(struct dml2_timing_cfg)) == 0) {
+ sizeof(struct dml2_timing_cfg)) == 0 &&
+ display_config->display_config.stream_descriptors[i].output.output_encoder == display_config->display_config.stream_descriptors[j].output.output_encoder) {
set_bit_in_bitfield(&pmo->scratch.pmo_dcn4.synchronized_timing_group_masks[timing_group_idx], j);
set_bit_in_bitfield(&stream_mapped_mask, j);
}
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 38e3285dbd07db44487bbaca8c383a5d7f3c11f3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090842-uncounted-lustrous-8af4@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
38e3285dbd07 ("drm/amd/display: Block timing sync for different signals in PMO")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 38e3285dbd07db44487bbaca8c383a5d7f3c11f3 Mon Sep 17 00:00:00 2001
From: Dillon Varone <dillon.varone(a)amd.com>
Date: Thu, 22 Aug 2024 17:52:57 -0400
Subject: [PATCH] drm/amd/display: Block timing sync for different signals in
PMO
PMO assumes that like timings can be synchronized, but DC only allows
this if the signal types match.
Reviewed-by: Austin Zheng <austin.zheng(a)amd.com>
Signed-off-by: Dillon Varone <dillon.varone(a)amd.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 29d3d6af43135de7bec677f334292ca8dab53d67)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
index 603036df68ba..6547cc2c2a77 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
@@ -811,7 +811,8 @@ static void build_synchronized_timing_groups(
for (j = i + 1; j < display_config->display_config.num_streams; j++) {
if (memcmp(master_timing,
&display_config->display_config.stream_descriptors[j].timing,
- sizeof(struct dml2_timing_cfg)) == 0) {
+ sizeof(struct dml2_timing_cfg)) == 0 &&
+ display_config->display_config.stream_descriptors[i].output.output_encoder == display_config->display_config.stream_descriptors[j].output.output_encoder) {
set_bit_in_bitfield(&pmo->scratch.pmo_dcn4.synchronized_timing_group_masks[timing_group_idx], j);
set_bit_in_bitfield(&stream_mapped_mask, j);
}
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x 38e3285dbd07db44487bbaca8c383a5d7f3c11f3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090839-curry-shallot-c638@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
38e3285dbd07 ("drm/amd/display: Block timing sync for different signals in PMO")
70839da63605 ("drm/amd/display: Add new DCN401 sources")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 38e3285dbd07db44487bbaca8c383a5d7f3c11f3 Mon Sep 17 00:00:00 2001
From: Dillon Varone <dillon.varone(a)amd.com>
Date: Thu, 22 Aug 2024 17:52:57 -0400
Subject: [PATCH] drm/amd/display: Block timing sync for different signals in
PMO
PMO assumes that like timings can be synchronized, but DC only allows
this if the signal types match.
Reviewed-by: Austin Zheng <austin.zheng(a)amd.com>
Signed-off-by: Dillon Varone <dillon.varone(a)amd.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 29d3d6af43135de7bec677f334292ca8dab53d67)
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
index 603036df68ba..6547cc2c2a77 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_pmo/dml2_pmo_dcn4_fams2.c
@@ -811,7 +811,8 @@ static void build_synchronized_timing_groups(
for (j = i + 1; j < display_config->display_config.num_streams; j++) {
if (memcmp(master_timing,
&display_config->display_config.stream_descriptors[j].timing,
- sizeof(struct dml2_timing_cfg)) == 0) {
+ sizeof(struct dml2_timing_cfg)) == 0 &&
+ display_config->display_config.stream_descriptors[i].output.output_encoder == display_config->display_config.stream_descriptors[j].output.output_encoder) {
set_bit_in_bitfield(&pmo->scratch.pmo_dcn4.synchronized_timing_group_masks[timing_group_idx], j);
set_bit_in_bitfield(&stream_mapped_mask, j);
}
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x e8705632435ae2f2253b65d3786da389982e8813
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090848-sharpness-hunk-c88f@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
e8705632435a ("drm/i915: Fix readout degamma_lut mismatch on ilk/snb")
da8c3cdb016c ("drm/i915: Rename bigjoiner master/slave to bigjoiner primary/secondary")
fb4943574f92 ("drm/i915: Rename all bigjoiner to joiner")
578ff98403ce ("drm/i915: Allow bigjoiner for MST")
3607b30836ae ("drm/i915: Handle joined pipes inside hsw_crtc_enable()")
e16bcbb01186 ("drm/i915: Handle joined pipes inside hsw_crtc_disable()")
2b8ad19d3ed6 ("drm/i915: Introduce intel_crtc_joined_pipe_mask()")
e43b4f7980f8 ("drm/i915: Pass connector to intel_dp_need_bigjoiner()")
5a1527ed8b43 ("drm/i915/mst: Check intel_dp_joiner_needs_dsc()")
aa099402f98b ("drm/i915: Extract intel_dp_joiner_needs_dsc()")
c0b8afc3a777 ("drm/i915: s/intel_dp_can_bigjoiner()/intel_dp_has_bigjoiner()/")
e02ef5553d9b ("drm/i915: Update pipes in reverse order for bigjoiner")
3a5e09d82f97 ("drm/i915: Fix intel_modeset_pipe_config_late() for bigjoiner")
f9d5e51db656 ("drm/i915/vrr: Disable VRR when using bigjoiner")
ef79820db723 ("drm/i915: Disable live M/N updates when using bigjoiner")
b37e1347b991 ("drm/i915: Disable port sync when bigjoiner is used")
372fa0c79d3f ("drm/i915/psr: Disable PSR when bigjoiner is used")
7a3f171c8f6a ("drm/i915: Extract glk_need_scaler_clock_gating_wa()")
c922a47913f9 ("drm/i915: Clean up glk_pipe_scaler_clock_gating_wa()")
e9fa99dd47a4 ("drm/i915: Shuffle DP .mode_valid() checks")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e8705632435ae2f2253b65d3786da389982e8813 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <ville.syrjala(a)linux.intel.com>
Date: Wed, 10 Jul 2024 15:41:37 +0300
Subject: [PATCH] drm/i915: Fix readout degamma_lut mismatch on ilk/snb
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On ilk/snb the pipe may be configured to place the LUT before or
after the CSC depending on various factors, but as there is only
one LUT (no split mode like on IVB+) we only advertise a gamma_lut
and no degamma_lut in the uapi to avoid confusing userspace.
This can cause a problem during readout if the VBIOS/GOP enabled
the LUT in the pre CSC configuration. The current code blindly
assigns the results of the readout to the degamma_lut, which will
cause a failure during the next atomic_check() as we aren't expecting
anything to be in degamma_lut since it's not visible to userspace.
Fix the problem by assigning whatever LUT we read out from the
hardware into gamma_lut.
Cc: stable(a)vger.kernel.org
Fixes: d2559299d339 ("drm/i915: Make ilk_read_luts() capable of degamma readout")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11608
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240710124137.16773-1-ville.…
Reviewed-by: Uma Shankar <uma.shankar(a)intel.com>
(cherry picked from commit 33eca84db6e31091cef63584158ab64704f78462)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
diff --git a/drivers/gpu/drm/i915/display/intel_modeset_setup.c b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
index 7602cb30ebf1..e1213f3d93cc 100644
--- a/drivers/gpu/drm/i915/display/intel_modeset_setup.c
+++ b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
@@ -326,6 +326,8 @@ static void intel_modeset_update_connector_atomic_state(struct drm_i915_private
static void intel_crtc_copy_hw_to_uapi_state(struct intel_crtc_state *crtc_state)
{
+ struct drm_i915_private *i915 = to_i915(crtc_state->uapi.crtc->dev);
+
if (intel_crtc_is_joiner_secondary(crtc_state))
return;
@@ -337,11 +339,30 @@ static void intel_crtc_copy_hw_to_uapi_state(struct intel_crtc_state *crtc_state
crtc_state->uapi.adjusted_mode = crtc_state->hw.adjusted_mode;
crtc_state->uapi.scaling_filter = crtc_state->hw.scaling_filter;
- /* assume 1:1 mapping */
- drm_property_replace_blob(&crtc_state->hw.degamma_lut,
- crtc_state->pre_csc_lut);
- drm_property_replace_blob(&crtc_state->hw.gamma_lut,
- crtc_state->post_csc_lut);
+ if (DISPLAY_INFO(i915)->color.degamma_lut_size) {
+ /* assume 1:1 mapping */
+ drm_property_replace_blob(&crtc_state->hw.degamma_lut,
+ crtc_state->pre_csc_lut);
+ drm_property_replace_blob(&crtc_state->hw.gamma_lut,
+ crtc_state->post_csc_lut);
+ } else {
+ /*
+ * ilk/snb hw may be configured for either pre_csc_lut
+ * or post_csc_lut, but we don't advertise degamma_lut as
+ * being available in the uapi since there is only one
+ * hardware LUT. Always assign the result of the readout
+ * to gamma_lut as that is the only valid source of LUTs
+ * in the uapi.
+ */
+ drm_WARN_ON(&i915->drm, crtc_state->post_csc_lut &&
+ crtc_state->pre_csc_lut);
+
+ drm_property_replace_blob(&crtc_state->hw.degamma_lut,
+ NULL);
+ drm_property_replace_blob(&crtc_state->hw.gamma_lut,
+ crtc_state->post_csc_lut ?:
+ crtc_state->pre_csc_lut);
+ }
drm_property_replace_blob(&crtc_state->uapi.degamma_lut,
crtc_state->hw.degamma_lut);
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x e8705632435ae2f2253b65d3786da389982e8813
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090844-result-caucasian-a9e5@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
e8705632435a ("drm/i915: Fix readout degamma_lut mismatch on ilk/snb")
da8c3cdb016c ("drm/i915: Rename bigjoiner master/slave to bigjoiner primary/secondary")
fb4943574f92 ("drm/i915: Rename all bigjoiner to joiner")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e8705632435ae2f2253b65d3786da389982e8813 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <ville.syrjala(a)linux.intel.com>
Date: Wed, 10 Jul 2024 15:41:37 +0300
Subject: [PATCH] drm/i915: Fix readout degamma_lut mismatch on ilk/snb
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On ilk/snb the pipe may be configured to place the LUT before or
after the CSC depending on various factors, but as there is only
one LUT (no split mode like on IVB+) we only advertise a gamma_lut
and no degamma_lut in the uapi to avoid confusing userspace.
This can cause a problem during readout if the VBIOS/GOP enabled
the LUT in the pre CSC configuration. The current code blindly
assigns the results of the readout to the degamma_lut, which will
cause a failure during the next atomic_check() as we aren't expecting
anything to be in degamma_lut since it's not visible to userspace.
Fix the problem by assigning whatever LUT we read out from the
hardware into gamma_lut.
Cc: stable(a)vger.kernel.org
Fixes: d2559299d339 ("drm/i915: Make ilk_read_luts() capable of degamma readout")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11608
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240710124137.16773-1-ville.…
Reviewed-by: Uma Shankar <uma.shankar(a)intel.com>
(cherry picked from commit 33eca84db6e31091cef63584158ab64704f78462)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
diff --git a/drivers/gpu/drm/i915/display/intel_modeset_setup.c b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
index 7602cb30ebf1..e1213f3d93cc 100644
--- a/drivers/gpu/drm/i915/display/intel_modeset_setup.c
+++ b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
@@ -326,6 +326,8 @@ static void intel_modeset_update_connector_atomic_state(struct drm_i915_private
static void intel_crtc_copy_hw_to_uapi_state(struct intel_crtc_state *crtc_state)
{
+ struct drm_i915_private *i915 = to_i915(crtc_state->uapi.crtc->dev);
+
if (intel_crtc_is_joiner_secondary(crtc_state))
return;
@@ -337,11 +339,30 @@ static void intel_crtc_copy_hw_to_uapi_state(struct intel_crtc_state *crtc_state
crtc_state->uapi.adjusted_mode = crtc_state->hw.adjusted_mode;
crtc_state->uapi.scaling_filter = crtc_state->hw.scaling_filter;
- /* assume 1:1 mapping */
- drm_property_replace_blob(&crtc_state->hw.degamma_lut,
- crtc_state->pre_csc_lut);
- drm_property_replace_blob(&crtc_state->hw.gamma_lut,
- crtc_state->post_csc_lut);
+ if (DISPLAY_INFO(i915)->color.degamma_lut_size) {
+ /* assume 1:1 mapping */
+ drm_property_replace_blob(&crtc_state->hw.degamma_lut,
+ crtc_state->pre_csc_lut);
+ drm_property_replace_blob(&crtc_state->hw.gamma_lut,
+ crtc_state->post_csc_lut);
+ } else {
+ /*
+ * ilk/snb hw may be configured for either pre_csc_lut
+ * or post_csc_lut, but we don't advertise degamma_lut as
+ * being available in the uapi since there is only one
+ * hardware LUT. Always assign the result of the readout
+ * to gamma_lut as that is the only valid source of LUTs
+ * in the uapi.
+ */
+ drm_WARN_ON(&i915->drm, crtc_state->post_csc_lut &&
+ crtc_state->pre_csc_lut);
+
+ drm_property_replace_blob(&crtc_state->hw.degamma_lut,
+ NULL);
+ drm_property_replace_blob(&crtc_state->hw.gamma_lut,
+ crtc_state->post_csc_lut ?:
+ crtc_state->pre_csc_lut);
+ }
drm_property_replace_blob(&crtc_state->uapi.degamma_lut,
crtc_state->hw.degamma_lut);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 72a6e22c604c95ddb3b10b5d3bb85b6ff4dbc34f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090838-thus-fiftieth-f4f7@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
72a6e22c604c ("fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAF")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 72a6e22c604c95ddb3b10b5d3bb85b6ff4dbc34f Mon Sep 17 00:00:00 2001
From: Baokun Li <libaokun1(a)huawei.com>
Date: Mon, 26 Aug 2024 19:20:56 +0800
Subject: [PATCH] fscache: delete fscache_cookie_lru_timer when fscache exits
to avoid UAF
The fscache_cookie_lru_timer is initialized when the fscache module
is inserted, but is not deleted when the fscache module is removed.
If timer_reduce() is called before removing the fscache module,
the fscache_cookie_lru_timer will be added to the timer list of
the current cpu. Afterwards, a use-after-free will be triggered
in the softIRQ after removing the fscache module, as follows:
==================================================================
BUG: unable to handle page fault for address: fffffbfff803c9e9
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 21ffea067 P4D 21ffea067 PUD 21ffe6067 PMD 110a7c067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 6.11.0-rc3 #855
Tainted: [W]=WARN
RIP: 0010:__run_timer_base.part.0+0x254/0x8a0
Call Trace:
<IRQ>
tmigr_handle_remote_up+0x627/0x810
__walk_groups.isra.0+0x47/0x140
tmigr_handle_remote+0x1fa/0x2f0
handle_softirqs+0x180/0x590
irq_exit_rcu+0x84/0xb0
sysvec_apic_timer_interrupt+0x6e/0x90
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
default_idle_call+0x38/0x60
do_idle+0x2b5/0x300
cpu_startup_entry+0x54/0x60
start_secondary+0x20d/0x280
common_startup_64+0x13e/0x148
</TASK>
Modules linked in: [last unloaded: netfs]
==================================================================
Therefore delete fscache_cookie_lru_timer when removing the fscahe module.
Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning")
Cc: stable(a)kernel.org
Signed-off-by: Baokun Li <libaokun1(a)huawei.com>
Link: https://lore.kernel.org/r/20240826112056.2458299-1-libaokun@huaweicloud.com
Acked-by: David Howells <dhowells(a)redhat.com>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
diff --git a/fs/netfs/fscache_main.c b/fs/netfs/fscache_main.c
index 42e98bb523e3..49849005eb7c 100644
--- a/fs/netfs/fscache_main.c
+++ b/fs/netfs/fscache_main.c
@@ -103,6 +103,7 @@ void __exit fscache_exit(void)
kmem_cache_destroy(fscache_cookie_jar);
fscache_proc_cleanup();
+ timer_shutdown_sync(&fscache_cookie_lru_timer);
destroy_workqueue(fscache_wq);
pr_notice("FS-Cache unloaded\n");
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x cd9253c23aedd61eb5ff11f37a36247cd46faf86
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090853-untagged-gravy-ccd3@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
cd9253c23aed ("btrfs: fix race between direct IO write and fsync when using same fd")
939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
9aa29a20b700 ("btrfs: move the direct IO code into its own file")
04ef7631bfa5 ("btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()")
9fec848b3a33 ("btrfs: cleanup duplicated parameters related to create_io_em()")
e9ea31fb5c1f ("btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent")
cdc627e65c7e ("btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args")
c77a8c61002e ("btrfs: remove extent_map::block_start member")
e28b851ed9b2 ("btrfs: remove extent_map::block_len member")
4aa7b5d1784f ("btrfs: remove extent_map::orig_start member")
3f255ece2f1e ("btrfs: introduce extra sanity checks for extent maps")
3d2ac9922465 ("btrfs: introduce new members for extent_map")
87a6962f73b1 ("btrfs: export the expected file extent through can_nocow_extent()")
e8fe524da027 ("btrfs: rename extent_map::orig_block_len to disk_num_bytes")
8996f61ab9ff ("btrfs: move fiemap code into its own file")
56b7169f691c ("btrfs: use a btrfs_inode local variable at btrfs_sync_file()")
e641e323abb3 ("btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()")
cef2daba4268 ("btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()")
4e660ca3a98d ("btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree")
7f5830bc964d ("btrfs: rename rb_root member of extent_map_tree from map to root")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd9253c23aedd61eb5ff11f37a36247cd46faf86 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 29 Aug 2024 18:25:49 +0100
Subject: [PATCH] btrfs: fix race between direct IO write and fsync when using
same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi(a)web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 75fa563e4cac..c8568b1a61c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -459,7 +459,6 @@ struct btrfs_file_private {
void *filldir_buf;
u64 last_index;
struct extent_state *llseek_cached_state;
- bool fsync_skip_inode_lock;
};
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 67adbe9d294a..364bce34f034 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -864,13 +864,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (IS_ERR_OR_NULL(dio)) {
ret = PTR_ERR_OR_ZERO(dio);
} else {
- struct btrfs_file_private stack_private = { 0 };
- struct btrfs_file_private *private;
- const bool have_private = (file->private_data != NULL);
-
- if (!have_private)
- file->private_data = &stack_private;
-
/*
* If we have a synchronous write, we must make sure the fsync
* triggered by the iomap_dio_complete() call below doesn't
@@ -879,13 +872,10 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
* partial writes due to the input buffer (or parts of it) not
* being already faulted in.
*/
- private = file->private_data;
- private->fsync_skip_inode_lock = true;
+ ASSERT(current->journal_info == NULL);
+ current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB;
ret = iomap_dio_complete(dio);
- private->fsync_skip_inode_lock = false;
-
- if (!have_private)
- file->private_data = NULL;
+ current->journal_info = NULL;
}
/* No increment (+=) because iomap returns a cumulative value. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9914419f3b7d..2aeb8116549c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1603,7 +1603,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
*/
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct btrfs_file_private *private = file->private_data;
struct dentry *dentry = file_dentry(file);
struct btrfs_inode *inode = BTRFS_I(d_inode(dentry));
struct btrfs_root *root = inode->root;
@@ -1613,7 +1612,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
int ret = 0, err;
u64 len;
bool full_sync;
- const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
+ bool skip_ilock = false;
+
+ if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) {
+ skip_ilock = true;
+ current->journal_info = NULL;
+ lockdep_assert_held(&inode->vfs_inode.i_rwsem);
+ }
trace_btrfs_sync_file(file, datasync);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 98c03ddc760b..dd9ce9b9f69e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -27,6 +27,12 @@ struct btrfs_root_item;
struct btrfs_root;
struct btrfs_path;
+/*
+ * Signal that a direct IO write is in progress, to avoid deadlock for sync
+ * direct IO writes when fsync is called during the direct IO write path.
+ */
+#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1)
+
/* Radix-tree tag for roots that are part of the trasaction. */
#define BTRFS_ROOT_TRANS_TAG 0
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x cd9253c23aedd61eb5ff11f37a36247cd46faf86
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090850-naturist-deafness-b924@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
cd9253c23aed ("btrfs: fix race between direct IO write and fsync when using same fd")
939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
9aa29a20b700 ("btrfs: move the direct IO code into its own file")
04ef7631bfa5 ("btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()")
9fec848b3a33 ("btrfs: cleanup duplicated parameters related to create_io_em()")
e9ea31fb5c1f ("btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent")
cdc627e65c7e ("btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args")
c77a8c61002e ("btrfs: remove extent_map::block_start member")
e28b851ed9b2 ("btrfs: remove extent_map::block_len member")
4aa7b5d1784f ("btrfs: remove extent_map::orig_start member")
3f255ece2f1e ("btrfs: introduce extra sanity checks for extent maps")
3d2ac9922465 ("btrfs: introduce new members for extent_map")
87a6962f73b1 ("btrfs: export the expected file extent through can_nocow_extent()")
e8fe524da027 ("btrfs: rename extent_map::orig_block_len to disk_num_bytes")
8996f61ab9ff ("btrfs: move fiemap code into its own file")
56b7169f691c ("btrfs: use a btrfs_inode local variable at btrfs_sync_file()")
e641e323abb3 ("btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()")
cef2daba4268 ("btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()")
4e660ca3a98d ("btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree")
7f5830bc964d ("btrfs: rename rb_root member of extent_map_tree from map to root")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd9253c23aedd61eb5ff11f37a36247cd46faf86 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 29 Aug 2024 18:25:49 +0100
Subject: [PATCH] btrfs: fix race between direct IO write and fsync when using
same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi(a)web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 75fa563e4cac..c8568b1a61c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -459,7 +459,6 @@ struct btrfs_file_private {
void *filldir_buf;
u64 last_index;
struct extent_state *llseek_cached_state;
- bool fsync_skip_inode_lock;
};
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 67adbe9d294a..364bce34f034 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -864,13 +864,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (IS_ERR_OR_NULL(dio)) {
ret = PTR_ERR_OR_ZERO(dio);
} else {
- struct btrfs_file_private stack_private = { 0 };
- struct btrfs_file_private *private;
- const bool have_private = (file->private_data != NULL);
-
- if (!have_private)
- file->private_data = &stack_private;
-
/*
* If we have a synchronous write, we must make sure the fsync
* triggered by the iomap_dio_complete() call below doesn't
@@ -879,13 +872,10 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
* partial writes due to the input buffer (or parts of it) not
* being already faulted in.
*/
- private = file->private_data;
- private->fsync_skip_inode_lock = true;
+ ASSERT(current->journal_info == NULL);
+ current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB;
ret = iomap_dio_complete(dio);
- private->fsync_skip_inode_lock = false;
-
- if (!have_private)
- file->private_data = NULL;
+ current->journal_info = NULL;
}
/* No increment (+=) because iomap returns a cumulative value. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9914419f3b7d..2aeb8116549c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1603,7 +1603,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
*/
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct btrfs_file_private *private = file->private_data;
struct dentry *dentry = file_dentry(file);
struct btrfs_inode *inode = BTRFS_I(d_inode(dentry));
struct btrfs_root *root = inode->root;
@@ -1613,7 +1612,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
int ret = 0, err;
u64 len;
bool full_sync;
- const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
+ bool skip_ilock = false;
+
+ if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) {
+ skip_ilock = true;
+ current->journal_info = NULL;
+ lockdep_assert_held(&inode->vfs_inode.i_rwsem);
+ }
trace_btrfs_sync_file(file, datasync);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 98c03ddc760b..dd9ce9b9f69e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -27,6 +27,12 @@ struct btrfs_root_item;
struct btrfs_root;
struct btrfs_path;
+/*
+ * Signal that a direct IO write is in progress, to avoid deadlock for sync
+ * direct IO writes when fsync is called during the direct IO write path.
+ */
+#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1)
+
/* Radix-tree tag for roots that are part of the trasaction. */
#define BTRFS_ROOT_TRANS_TAG 0
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x cd9253c23aedd61eb5ff11f37a36247cd46faf86
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090847-flashing-dimmed-2c7f@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
cd9253c23aed ("btrfs: fix race between direct IO write and fsync when using same fd")
939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
9aa29a20b700 ("btrfs: move the direct IO code into its own file")
04ef7631bfa5 ("btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()")
9fec848b3a33 ("btrfs: cleanup duplicated parameters related to create_io_em()")
e9ea31fb5c1f ("btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent")
cdc627e65c7e ("btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args")
c77a8c61002e ("btrfs: remove extent_map::block_start member")
e28b851ed9b2 ("btrfs: remove extent_map::block_len member")
4aa7b5d1784f ("btrfs: remove extent_map::orig_start member")
3f255ece2f1e ("btrfs: introduce extra sanity checks for extent maps")
3d2ac9922465 ("btrfs: introduce new members for extent_map")
87a6962f73b1 ("btrfs: export the expected file extent through can_nocow_extent()")
e8fe524da027 ("btrfs: rename extent_map::orig_block_len to disk_num_bytes")
8996f61ab9ff ("btrfs: move fiemap code into its own file")
56b7169f691c ("btrfs: use a btrfs_inode local variable at btrfs_sync_file()")
e641e323abb3 ("btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()")
cef2daba4268 ("btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()")
4e660ca3a98d ("btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree")
7f5830bc964d ("btrfs: rename rb_root member of extent_map_tree from map to root")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd9253c23aedd61eb5ff11f37a36247cd46faf86 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 29 Aug 2024 18:25:49 +0100
Subject: [PATCH] btrfs: fix race between direct IO write and fsync when using
same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi(a)web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 75fa563e4cac..c8568b1a61c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -459,7 +459,6 @@ struct btrfs_file_private {
void *filldir_buf;
u64 last_index;
struct extent_state *llseek_cached_state;
- bool fsync_skip_inode_lock;
};
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 67adbe9d294a..364bce34f034 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -864,13 +864,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (IS_ERR_OR_NULL(dio)) {
ret = PTR_ERR_OR_ZERO(dio);
} else {
- struct btrfs_file_private stack_private = { 0 };
- struct btrfs_file_private *private;
- const bool have_private = (file->private_data != NULL);
-
- if (!have_private)
- file->private_data = &stack_private;
-
/*
* If we have a synchronous write, we must make sure the fsync
* triggered by the iomap_dio_complete() call below doesn't
@@ -879,13 +872,10 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
* partial writes due to the input buffer (or parts of it) not
* being already faulted in.
*/
- private = file->private_data;
- private->fsync_skip_inode_lock = true;
+ ASSERT(current->journal_info == NULL);
+ current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB;
ret = iomap_dio_complete(dio);
- private->fsync_skip_inode_lock = false;
-
- if (!have_private)
- file->private_data = NULL;
+ current->journal_info = NULL;
}
/* No increment (+=) because iomap returns a cumulative value. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9914419f3b7d..2aeb8116549c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1603,7 +1603,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
*/
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct btrfs_file_private *private = file->private_data;
struct dentry *dentry = file_dentry(file);
struct btrfs_inode *inode = BTRFS_I(d_inode(dentry));
struct btrfs_root *root = inode->root;
@@ -1613,7 +1612,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
int ret = 0, err;
u64 len;
bool full_sync;
- const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
+ bool skip_ilock = false;
+
+ if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) {
+ skip_ilock = true;
+ current->journal_info = NULL;
+ lockdep_assert_held(&inode->vfs_inode.i_rwsem);
+ }
trace_btrfs_sync_file(file, datasync);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 98c03ddc760b..dd9ce9b9f69e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -27,6 +27,12 @@ struct btrfs_root_item;
struct btrfs_root;
struct btrfs_path;
+/*
+ * Signal that a direct IO write is in progress, to avoid deadlock for sync
+ * direct IO writes when fsync is called during the direct IO write path.
+ */
+#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1)
+
/* Radix-tree tag for roots that are part of the trasaction. */
#define BTRFS_ROOT_TRANS_TAG 0
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x cd9253c23aedd61eb5ff11f37a36247cd46faf86
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090844-flattered-badass-d13c@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
cd9253c23aed ("btrfs: fix race between direct IO write and fsync when using same fd")
939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
9aa29a20b700 ("btrfs: move the direct IO code into its own file")
04ef7631bfa5 ("btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()")
9fec848b3a33 ("btrfs: cleanup duplicated parameters related to create_io_em()")
e9ea31fb5c1f ("btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent")
cdc627e65c7e ("btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args")
c77a8c61002e ("btrfs: remove extent_map::block_start member")
e28b851ed9b2 ("btrfs: remove extent_map::block_len member")
4aa7b5d1784f ("btrfs: remove extent_map::orig_start member")
3f255ece2f1e ("btrfs: introduce extra sanity checks for extent maps")
3d2ac9922465 ("btrfs: introduce new members for extent_map")
87a6962f73b1 ("btrfs: export the expected file extent through can_nocow_extent()")
e8fe524da027 ("btrfs: rename extent_map::orig_block_len to disk_num_bytes")
8996f61ab9ff ("btrfs: move fiemap code into its own file")
56b7169f691c ("btrfs: use a btrfs_inode local variable at btrfs_sync_file()")
e641e323abb3 ("btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()")
cef2daba4268 ("btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()")
4e660ca3a98d ("btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree")
7f5830bc964d ("btrfs: rename rb_root member of extent_map_tree from map to root")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd9253c23aedd61eb5ff11f37a36247cd46faf86 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 29 Aug 2024 18:25:49 +0100
Subject: [PATCH] btrfs: fix race between direct IO write and fsync when using
same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi(a)web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 75fa563e4cac..c8568b1a61c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -459,7 +459,6 @@ struct btrfs_file_private {
void *filldir_buf;
u64 last_index;
struct extent_state *llseek_cached_state;
- bool fsync_skip_inode_lock;
};
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 67adbe9d294a..364bce34f034 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -864,13 +864,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (IS_ERR_OR_NULL(dio)) {
ret = PTR_ERR_OR_ZERO(dio);
} else {
- struct btrfs_file_private stack_private = { 0 };
- struct btrfs_file_private *private;
- const bool have_private = (file->private_data != NULL);
-
- if (!have_private)
- file->private_data = &stack_private;
-
/*
* If we have a synchronous write, we must make sure the fsync
* triggered by the iomap_dio_complete() call below doesn't
@@ -879,13 +872,10 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
* partial writes due to the input buffer (or parts of it) not
* being already faulted in.
*/
- private = file->private_data;
- private->fsync_skip_inode_lock = true;
+ ASSERT(current->journal_info == NULL);
+ current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB;
ret = iomap_dio_complete(dio);
- private->fsync_skip_inode_lock = false;
-
- if (!have_private)
- file->private_data = NULL;
+ current->journal_info = NULL;
}
/* No increment (+=) because iomap returns a cumulative value. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9914419f3b7d..2aeb8116549c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1603,7 +1603,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
*/
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct btrfs_file_private *private = file->private_data;
struct dentry *dentry = file_dentry(file);
struct btrfs_inode *inode = BTRFS_I(d_inode(dentry));
struct btrfs_root *root = inode->root;
@@ -1613,7 +1612,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
int ret = 0, err;
u64 len;
bool full_sync;
- const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false);
+ bool skip_ilock = false;
+
+ if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) {
+ skip_ilock = true;
+ current->journal_info = NULL;
+ lockdep_assert_held(&inode->vfs_inode.i_rwsem);
+ }
trace_btrfs_sync_file(file, datasync);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 98c03ddc760b..dd9ce9b9f69e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -27,6 +27,12 @@ struct btrfs_root_item;
struct btrfs_root;
struct btrfs_path;
+/*
+ * Signal that a direct IO write is in progress, to avoid deadlock for sync
+ * direct IO writes when fsync is called during the direct IO write path.
+ */
+#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1)
+
/* Radix-tree tag for roots that are part of the trasaction. */
#define BTRFS_ROOT_TRANS_TAG 0
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090830-reactive-jokester-6061@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
0ecc5be200c8 ("x86/apic: Make x2apic_disable() work correctly")
720a22fd6c1c ("x86/apic: Don't access the APIC when disabling x2APIC")
5a88f354dcd8 ("x86/apic: Split register_apic_address()")
d10a904435fa ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
49062454a3eb ("x86/apic: Rename disable_apic")
bea629d57d00 ("x86/apic: Save the APIC virtual base address")
3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
94a855111ed9 ("Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0ecc5be200c84e67114f3640064ba2bae3ba2f5a Mon Sep 17 00:00:00 2001
From: Yuntao Wang <yuntao.wang(a)linux.dev>
Date: Tue, 13 Aug 2024 09:48:27 +0800
Subject: [PATCH] x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2a37a3..373638691cd4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090830-kangaroo-hassle-e959@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
0ecc5be200c8 ("x86/apic: Make x2apic_disable() work correctly")
720a22fd6c1c ("x86/apic: Don't access the APIC when disabling x2APIC")
5a88f354dcd8 ("x86/apic: Split register_apic_address()")
d10a904435fa ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
49062454a3eb ("x86/apic: Rename disable_apic")
bea629d57d00 ("x86/apic: Save the APIC virtual base address")
3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
94a855111ed9 ("Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0ecc5be200c84e67114f3640064ba2bae3ba2f5a Mon Sep 17 00:00:00 2001
From: Yuntao Wang <yuntao.wang(a)linux.dev>
Date: Tue, 13 Aug 2024 09:48:27 +0800
Subject: [PATCH] x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2a37a3..373638691cd4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090829-clench-kinfolk-03d9@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
0ecc5be200c8 ("x86/apic: Make x2apic_disable() work correctly")
720a22fd6c1c ("x86/apic: Don't access the APIC when disabling x2APIC")
5a88f354dcd8 ("x86/apic: Split register_apic_address()")
d10a904435fa ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
49062454a3eb ("x86/apic: Rename disable_apic")
bea629d57d00 ("x86/apic: Save the APIC virtual base address")
3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
94a855111ed9 ("Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0ecc5be200c84e67114f3640064ba2bae3ba2f5a Mon Sep 17 00:00:00 2001
From: Yuntao Wang <yuntao.wang(a)linux.dev>
Date: Tue, 13 Aug 2024 09:48:27 +0800
Subject: [PATCH] x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2a37a3..373638691cd4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090828-tiny-boggle-b7b4@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
0ecc5be200c8 ("x86/apic: Make x2apic_disable() work correctly")
720a22fd6c1c ("x86/apic: Don't access the APIC when disabling x2APIC")
5a88f354dcd8 ("x86/apic: Split register_apic_address()")
d10a904435fa ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
49062454a3eb ("x86/apic: Rename disable_apic")
bea629d57d00 ("x86/apic: Save the APIC virtual base address")
3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
94a855111ed9 ("Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0ecc5be200c84e67114f3640064ba2bae3ba2f5a Mon Sep 17 00:00:00 2001
From: Yuntao Wang <yuntao.wang(a)linux.dev>
Date: Tue, 13 Aug 2024 09:48:27 +0800
Subject: [PATCH] x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2a37a3..373638691cd4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 0ecc5be200c84e67114f3640064ba2bae3ba2f5a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090827-recount-humbly-e075@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
0ecc5be200c8 ("x86/apic: Make x2apic_disable() work correctly")
720a22fd6c1c ("x86/apic: Don't access the APIC when disabling x2APIC")
5a88f354dcd8 ("x86/apic: Split register_apic_address()")
d10a904435fa ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites")
49062454a3eb ("x86/apic: Rename disable_apic")
bea629d57d00 ("x86/apic: Save the APIC virtual base address")
3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit")
94a855111ed9 ("Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0ecc5be200c84e67114f3640064ba2bae3ba2f5a Mon Sep 17 00:00:00 2001
From: Yuntao Wang <yuntao.wang(a)linux.dev>
Date: Tue, 13 Aug 2024 09:48:27 +0800
Subject: [PATCH] x86/apic: Make x2apic_disable() work correctly
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <yuntao.wang(a)linux.dev>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66fd4b2a37a3..373638691cd4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1775,12 +1775,9 @@ static __init void apic_set_fixmap(bool read_apic);
static __init void x2apic_disable(void)
{
- u32 x2apic_id, state = x2apic_state;
+ u32 x2apic_id;
- x2apic_mode = 0;
- x2apic_state = X2APIC_DISABLED;
-
- if (state != X2APIC_ON)
+ if (x2apic_state < X2APIC_ON)
return;
x2apic_id = read_apic_id();
@@ -1793,6 +1790,10 @@ static __init void x2apic_disable(void)
}
__x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
/*
* Don't reread the APIC ID as it was already done from
* check_x2apic() and the APIC driver still is a x2APIC variant,
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 2848ff28d180bd63a95da8e5dcbcdd76c1beeb7b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090810-amply-schedule-a7d0@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
2848ff28d180 ("x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported")
c33f0a81a2cf ("x86/fpu: Add fpu_state_config::legacy_features")
d72c87018d00 ("x86/fpu/xstate: Move remaining xfeature helpers to core")
eda32f4f93b4 ("x86/fpu: Rework restore_regs_from_fpstate()")
daddee247319 ("x86/fpu: Mop up xfeatures_mask_uabi()")
1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
2bd264bce238 ("x86/fpu: Move xstate size to fpu_*_cfg")
cd9ae7617449 ("x86/fpu/xstate: Cleanup size calculations")
617473acdfe4 ("x86/fpu: Cleanup fpu__init_system_xstate_size_legacy()")
578971f4e228 ("x86/fpu: Provide struct fpu_config")
5509cc78080d ("x86/fpu/signal: Use fpstate for size and features")
ad6ede407aae ("x86/fpu: Use fpstate in fpu_copy_kvm_uabi_to_fpstate()")
be31dfdfd75b ("x86/fpu: Use fpstate::size")
248452ce21ae ("x86/fpu: Add size and mask information to fpstate")
2dd8eedc80b1 ("x86/process: Move arch_thread_struct_whitelist() out of line")
c20942ce5128 ("x86/fpu/core: Convert to fpstate")
7e049e8b7459 ("x86/fpu/signal: Convert to fpstate")
087df48c298c ("x86/fpu: Replace KVMs xstate component clearing")
18b3fa1ad15f ("x86/fpu: Convert restore_fpregs_from_fpstate() to struct fpstate")
f83ac56acdad ("x86/fpu: Convert fpstate_init() to struct fpstate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2848ff28d180bd63a95da8e5dcbcdd76c1beeb7b Mon Sep 17 00:00:00 2001
From: Mitchell Levy <levymitchell0(a)gmail.com>
Date: Mon, 12 Aug 2024 13:44:12 -0700
Subject: [PATCH] x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
There are two distinct CPU features related to the use of XSAVES and LBR:
whether LBR is itself supported and whether XSAVES supports LBR. The LBR
subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the
XSTATE subsystem does not.
The LBR bit is only removed from xfeatures_mask_independent when LBR is not
supported by the CPU, but there is no validation of XSTATE support.
If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault,
leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled
with a warning and the boot continues.
Consequently the next XRSTORS which tries to restore supervisor state fails
with #GP because the RFBM has zero for all supervisor features, which does
not match the XCOMP_BV field.
As XFEATURE_MASK_FPSTATE includes supervisor features setting up the FPU
causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails
due to the same problem resulting in recursive #GPs until the kernel runs
out of stack space and double faults.
Prevent this by storing the supported independent features in
fpu_kernel_cfg during XSTATE initialization and use that cached value for
retrieving the independent feature bits to be written into IA32_XSS.
[ tglx: Massaged change log ]
Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Mitchell Levy <levymitchell0(a)gmail.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240812-xsave-lbr-fix-v3-1-95bac1bf62f4@gmail.…
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb17f31b06d2..de16862bf230 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -591,6 +591,13 @@ struct fpu_state_config {
* even without XSAVE support, i.e. legacy features FP + SSE
*/
u64 legacy_features;
+ /*
+ * @independent_features:
+ *
+ * Features that are supported by XSAVES, but not managed as part of
+ * the FPU core, such as LBR
+ */
+ u64 independent_features;
};
/* FPU state configuration information */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c5a026fee5e0..1339f8328db5 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -788,6 +788,9 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
goto out_disable;
}
+ fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features &
+ XFEATURE_MASK_INDEPENDENT;
+
/*
* Clear XSAVE features that are disabled in the normal CPUID.
*/
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 2ee0b9c53dcc..afb404cd2059 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -62,9 +62,9 @@ static inline u64 xfeatures_mask_supervisor(void)
static inline u64 xfeatures_mask_independent(void)
{
if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
- return XFEATURE_MASK_INDEPENDENT & ~XFEATURE_MASK_LBR;
+ return fpu_kernel_cfg.independent_features & ~XFEATURE_MASK_LBR;
- return XFEATURE_MASK_INDEPENDENT;
+ return fpu_kernel_cfg.independent_features;
}
/* XSAVE/XRSTOR wrapper functions */
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090859-delicious-serpent-3635@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
71c186efc1b2 ("userfaultfd: fix checks for huge PMDs")
dab6e717429e ("mm: Rename pmd_read_atomic()")
024d232ae4fc ("mm: Fix pmd_read_atomic()")
fbfdec9989e6 ("x86/mm/pae: Make pmd_t similar to pte_t")
bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
ec1c86b25f4b ("mm: multi-gen LRU: groundwork")
f1e1a7be4718 ("mm/vmscan.c: refactor shrink_node()")
d3629af59f41 ("mm/vmscan: make the annotations of refaults code at the right place")
e9c2dbc8bf71 ("mm/vmscan: define macros for refaults in struct lruvec")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 13 Aug 2024 22:25:21 +0200
Subject: [PATCH] userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
detecting PMDs that don't point to page tables.
On older kernels (before 6.5), you'd just have to win a single fairly
wide race to hit this.
I've tested this on 6.1 stable by racing migration (with a mdelay()
patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic other thread
============ ============
<zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
<pagefault creates transhuge zeropage>
__pte_alloc [no-op]
<zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog…
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog…
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
}
dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is mapped as THP don't
- * override it and just be strict.
- */
- if (unlikely(pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
if (unlikely(pmd_none(dst_pmdval)) &&
unlikely(__pte_alloc(dst_mm, dst_pmd))) {
err = -ENOMEM;
break;
}
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+ pmd_devmap(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_bad(dst_pmdval))) {
err = -EFAULT;
break;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090857-wieldable-pedicure-19b9@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
71c186efc1b2 ("userfaultfd: fix checks for huge PMDs")
dab6e717429e ("mm: Rename pmd_read_atomic()")
024d232ae4fc ("mm: Fix pmd_read_atomic()")
fbfdec9989e6 ("x86/mm/pae: Make pmd_t similar to pte_t")
bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
ec1c86b25f4b ("mm: multi-gen LRU: groundwork")
f1e1a7be4718 ("mm/vmscan.c: refactor shrink_node()")
d3629af59f41 ("mm/vmscan: make the annotations of refaults code at the right place")
e9c2dbc8bf71 ("mm/vmscan: define macros for refaults in struct lruvec")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 13 Aug 2024 22:25:21 +0200
Subject: [PATCH] userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
detecting PMDs that don't point to page tables.
On older kernels (before 6.5), you'd just have to win a single fairly
wide race to hit this.
I've tested this on 6.1 stable by racing migration (with a mdelay()
patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic other thread
============ ============
<zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
<pagefault creates transhuge zeropage>
__pte_alloc [no-op]
<zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog…
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog…
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
}
dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is mapped as THP don't
- * override it and just be strict.
- */
- if (unlikely(pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
if (unlikely(pmd_none(dst_pmdval)) &&
unlikely(__pte_alloc(dst_mm, dst_pmd))) {
err = -ENOMEM;
break;
}
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+ pmd_devmap(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_bad(dst_pmdval))) {
err = -EFAULT;
break;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090855-kiln-shrank-fea8@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
71c186efc1b2 ("userfaultfd: fix checks for huge PMDs")
dab6e717429e ("mm: Rename pmd_read_atomic()")
024d232ae4fc ("mm: Fix pmd_read_atomic()")
fbfdec9989e6 ("x86/mm/pae: Make pmd_t similar to pte_t")
bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
ec1c86b25f4b ("mm: multi-gen LRU: groundwork")
f1e1a7be4718 ("mm/vmscan.c: refactor shrink_node()")
d3629af59f41 ("mm/vmscan: make the annotations of refaults code at the right place")
e9c2dbc8bf71 ("mm/vmscan: define macros for refaults in struct lruvec")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 13 Aug 2024 22:25:21 +0200
Subject: [PATCH] userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
detecting PMDs that don't point to page tables.
On older kernels (before 6.5), you'd just have to win a single fairly
wide race to hit this.
I've tested this on 6.1 stable by racing migration (with a mdelay()
patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic other thread
============ ============
<zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
<pagefault creates transhuge zeropage>
__pte_alloc [no-op]
<zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog…
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog…
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
}
dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is mapped as THP don't
- * override it and just be strict.
- */
- if (unlikely(pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
if (unlikely(pmd_none(dst_pmdval)) &&
unlikely(__pte_alloc(dst_mm, dst_pmd))) {
err = -ENOMEM;
break;
}
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+ pmd_devmap(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_bad(dst_pmdval))) {
err = -EFAULT;
break;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090853-unworldly-aerobics-75a0@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
71c186efc1b2 ("userfaultfd: fix checks for huge PMDs")
dab6e717429e ("mm: Rename pmd_read_atomic()")
024d232ae4fc ("mm: Fix pmd_read_atomic()")
fbfdec9989e6 ("x86/mm/pae: Make pmd_t similar to pte_t")
bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
ec1c86b25f4b ("mm: multi-gen LRU: groundwork")
f1e1a7be4718 ("mm/vmscan.c: refactor shrink_node()")
d3629af59f41 ("mm/vmscan: make the annotations of refaults code at the right place")
e9c2dbc8bf71 ("mm/vmscan: define macros for refaults in struct lruvec")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 13 Aug 2024 22:25:21 +0200
Subject: [PATCH] userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
detecting PMDs that don't point to page tables.
On older kernels (before 6.5), you'd just have to win a single fairly
wide race to hit this.
I've tested this on 6.1 stable by racing migration (with a mdelay()
patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic other thread
============ ============
<zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
<pagefault creates transhuge zeropage>
__pte_alloc [no-op]
<zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog…
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog…
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
}
dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is mapped as THP don't
- * override it and just be strict.
- */
- if (unlikely(pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
if (unlikely(pmd_none(dst_pmdval)) &&
unlikely(__pte_alloc(dst_mm, dst_pmd))) {
err = -ENOMEM;
break;
}
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+ pmd_devmap(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_bad(dst_pmdval))) {
err = -EFAULT;
break;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090851-swinging-pug-fba8@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
71c186efc1b2 ("userfaultfd: fix checks for huge PMDs")
dab6e717429e ("mm: Rename pmd_read_atomic()")
024d232ae4fc ("mm: Fix pmd_read_atomic()")
fbfdec9989e6 ("x86/mm/pae: Make pmd_t similar to pte_t")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 13 Aug 2024 22:25:21 +0200
Subject: [PATCH] userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:
1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
the right two race windows) - I've tested this in a kernel build with
some extra mdelay() calls. See the commit message for a description
of the race scenario.
On older kernels (before 6.5), I think the same bug can even
theoretically lead to accessing transhuge page contents as a page table
if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
detecting PMDs that don't point to page tables.
On older kernels (before 6.5), you'd just have to win a single fairly
wide race to hit this.
I've tested this on 6.1 stable by racing migration (with a mdelay()
patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
to yank page tables out from under us (though I haven't tested that),
so I think the BUG_ON() checks in mfill_atomic() are just wrong.
I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
mfill_atomic other thread
============ ============
<zap PMD>
pmdp_get_lockless() [reads none pmd]
<bail if trans_huge>
<if none:>
<pagefault creates transhuge zeropage>
__pte_alloc [no-op]
<zap PMD>
<bail if pmd_trans_huge(*dst_pmd)>
BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.
As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@goog…
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@goog…
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Pavel Emelyanov <xemul(a)virtuozzo.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
}
dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is mapped as THP don't
- * override it and just be strict.
- */
- if (unlikely(pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
if (unlikely(pmd_none(dst_pmdval)) &&
unlikely(__pte_alloc(dst_mm, dst_pmd))) {
err = -ENOMEM;
break;
}
- /* If an huge pmd materialized from under us fail */
- if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+ pmd_devmap(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_bad(dst_pmdval))) {
err = -EFAULT;
break;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 49aa8a1f4d6800721c7971ed383078257f12e8f9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090835-matchbox-untreated-40ad@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
49aa8a1f4d68 ("tracing: Avoid possible softlockup in tracing_iter_reset()")
bc1a72afdc4a ("ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 49aa8a1f4d6800721c7971ed383078257f12e8f9 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Date: Tue, 27 Aug 2024 20:46:54 +0800
Subject: [PATCH] tracing: Avoid possible softlockup in tracing_iter_reset()
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).
Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
Cc: stable(a)vger.kernel.org
Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ebe7ce2f5f4a..edf6bc817aa1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3958,6 +3958,8 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
break;
entries++;
ring_buffer_iter_advance(buf_iter);
+ /* This could be a big loop */
+ cond_resched();
}
per_cpu_ptr(iter->array_buffer->data, cpu)->skipped_entries = entries;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 49aa8a1f4d6800721c7971ed383078257f12e8f9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090834-cheddar-crust-24f3@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
49aa8a1f4d68 ("tracing: Avoid possible softlockup in tracing_iter_reset()")
bc1a72afdc4a ("ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 49aa8a1f4d6800721c7971ed383078257f12e8f9 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Date: Tue, 27 Aug 2024 20:46:54 +0800
Subject: [PATCH] tracing: Avoid possible softlockup in tracing_iter_reset()
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).
Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
Cc: stable(a)vger.kernel.org
Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ebe7ce2f5f4a..edf6bc817aa1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3958,6 +3958,8 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
break;
entries++;
ring_buffer_iter_advance(buf_iter);
+ /* This could be a big loop */
+ cond_resched();
}
per_cpu_ptr(iter->array_buffer->data, cpu)->skipped_entries = entries;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x bfe0857c20c663fcc1592fa4e3a61ca12b07dac9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090830-imaging-symphonic-8783@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
bfe0857c20c6 ("Revert "mm: skip CMA pages when they are not available"")
97144ce008f9 ("mm/vmscan: use folio_migratetype() instead of get_pageblock_migratetype()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bfe0857c20c663fcc1592fa4e3a61ca12b07dac9 Mon Sep 17 00:00:00 2001
From: Usama Arif <usamaarif642(a)gmail.com>
Date: Wed, 21 Aug 2024 20:26:07 +0100
Subject: [PATCH] Revert "mm: skip CMA pages when they are not available"
This reverts commit 5da226dbfce3 ("mm: skip CMA pages when they are not
available") and b7108d66318a ("Multi-gen LRU: skip CMA pages when they are
not eligible").
lruvec->lru_lock is highly contended and is held when calling
isolate_lru_folios. If the lru has a large number of CMA folios
consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
isolate_lru_folios can hold the lock for a very long time while it skips
those. For FIO workload, ~150million order=0 folios were skipped to
isolate a few ZONE_DMA folios [1]. This can cause lockups [1] and high
memory pressure for extended periods of time [2].
Remove skipping CMA for MGLRU as well, as it was introduced in sort_folio
for the same resaon as 5da226dbfce3a2f44978c2c7cf88166e69a6788b.
[1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevG…
[2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/
[usamaarif642(a)gmail.com: also revert b7108d66318a, per Johannes]
Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
Link: https://lkml.kernel.org/r/357ac325-4c61-497a-92a3-bdbd230d5ec9@gmail.com
Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
Fixes: 5da226dbfce3 ("mm: skip CMA pages when they are not available")
Signed-off-by: Usama Arif <usamaarif642(a)gmail.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Bharata B Rao <bharata(a)amd.com>
Cc: Breno Leitao <leitao(a)debian.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: Zhaoyang Huang <huangzhaoyang(a)gmail.com>
Cc: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cfa839284b92..bd489c1af228 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1604,25 +1604,6 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}
-#ifdef CONFIG_CMA
-/*
- * It is waste of effort to scan and reclaim CMA pages if it is not available
- * for current allocation context. Kswapd can not be enrolled as it can not
- * distinguish this scenario by using sc->gfp_mask = GFP_KERNEL
- */
-static bool skip_cma(struct folio *folio, struct scan_control *sc)
-{
- return !current_is_kswapd() &&
- gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE &&
- folio_migratetype(folio) == MIGRATE_CMA;
-}
-#else
-static bool skip_cma(struct folio *folio, struct scan_control *sc)
-{
- return false;
-}
-#endif
-
/*
* Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
*
@@ -1669,8 +1650,7 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
nr_pages = folio_nr_pages(folio);
total_scan += nr_pages;
- if (folio_zonenum(folio) > sc->reclaim_idx ||
- skip_cma(folio, sc)) {
+ if (folio_zonenum(folio) > sc->reclaim_idx) {
nr_skipped[folio_zonenum(folio)] += nr_pages;
move_to = &folios_skipped;
goto move;
@@ -4320,7 +4300,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* ineligible */
- if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f806de88d8f7f8191afd0fd9b94db4cd058e7d4f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090851-coroner-surname-cd39@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
f806de88d8f7 ("maple_tree: remove rcu_read_lock() from mt_validate()")
9a40d45c1f2c ("maple_tree: remove mas_searchable()")
067311d33e65 ("maple_tree: separate ma_state node from status")
271f61a8b41d ("maple_tree: clean up inlines for some functions")
bf857ddd21d0 ("maple_tree: move debug check to __mas_set_range()")
f7a590189539 ("maple_tree: make mas_erase() more robust")
a2587a7e8d37 ("maple_tree: add test for mtree_dup()")
fd32e4e9b764 ("maple_tree: introduce interfaces __mt_dup() and mtree_dup()")
4f2267b58a22 ("maple_tree: add mt_free_one() and mt_attr() helpers")
a8091f039c1e ("maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states")
5c590804b6b0 ("maple_tree: add mas_is_active() to detect in-tree walks")
530f745c7620 ("maple_tree: replace data before marking dead in split and spanning store")
4ffc2ee2cf01 ("maple_tree: introduce mas_tree_parent() definition")
1238f6a226dc ("maple_tree: introduce mas_put_in_tree()")
72bcf4aa86ec ("maple_tree: reorder replacement of nodes to avoid live lock")
fec29364348f ("maple_tree: reduce resets during store setup")
da0892547b10 ("maple_tree: re-introduce entry to mas_preallocate() arguments")
53bee98d004f ("mm: remove re-walk from mmap_region()")
c1297987cc2a ("maple_tree: introduce __mas_set_range()")
a489539e33c2 ("maple_tree: update mt_validate()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f806de88d8f7f8191afd0fd9b94db4cd058e7d4f Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)Oracle.com>
Date: Tue, 20 Aug 2024 13:54:17 -0400
Subject: [PATCH] maple_tree: remove rcu_read_lock() from mt_validate()
The write lock should be held when validating the tree to avoid updates
racing with checks. Holding the rcu read lock during a large tree
validation may also cause a prolonged rcu read window and "rcu_preempt
detected stalls" warnings.
Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reported-by: syzbot+036af2f0c7338a33b0cd(a)syzkaller.appspotmail.com
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index aa3a5df15b8e..6df3a8b95808 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -7566,14 +7566,14 @@ static void mt_validate_nulls(struct maple_tree *mt)
* 2. The gap is correctly set in the parents
*/
void mt_validate(struct maple_tree *mt)
+ __must_hold(mas->tree->ma_lock)
{
unsigned char end;
MA_STATE(mas, mt, 0, 0);
- rcu_read_lock();
mas_start(&mas);
if (!mas_is_active(&mas))
- goto done;
+ return;
while (!mte_is_leaf(mas.node))
mas_descend(&mas);
@@ -7594,9 +7594,6 @@ void mt_validate(struct maple_tree *mt)
mas_dfs_postorder(&mas, ULONG_MAX);
}
mt_validate_nulls(mt);
-done:
- rcu_read_unlock();
-
}
EXPORT_SYMBOL_GPL(mt_validate);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f806de88d8f7f8191afd0fd9b94db4cd058e7d4f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090850-skirt-unsaved-fc27@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
f806de88d8f7 ("maple_tree: remove rcu_read_lock() from mt_validate()")
9a40d45c1f2c ("maple_tree: remove mas_searchable()")
067311d33e65 ("maple_tree: separate ma_state node from status")
271f61a8b41d ("maple_tree: clean up inlines for some functions")
bf857ddd21d0 ("maple_tree: move debug check to __mas_set_range()")
f7a590189539 ("maple_tree: make mas_erase() more robust")
a2587a7e8d37 ("maple_tree: add test for mtree_dup()")
fd32e4e9b764 ("maple_tree: introduce interfaces __mt_dup() and mtree_dup()")
4f2267b58a22 ("maple_tree: add mt_free_one() and mt_attr() helpers")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f806de88d8f7f8191afd0fd9b94db4cd058e7d4f Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)Oracle.com>
Date: Tue, 20 Aug 2024 13:54:17 -0400
Subject: [PATCH] maple_tree: remove rcu_read_lock() from mt_validate()
The write lock should be held when validating the tree to avoid updates
racing with checks. Holding the rcu read lock during a large tree
validation may also cause a prolonged rcu read window and "rcu_preempt
detected stalls" warnings.
Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reported-by: syzbot+036af2f0c7338a33b0cd(a)syzkaller.appspotmail.com
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index aa3a5df15b8e..6df3a8b95808 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -7566,14 +7566,14 @@ static void mt_validate_nulls(struct maple_tree *mt)
* 2. The gap is correctly set in the parents
*/
void mt_validate(struct maple_tree *mt)
+ __must_hold(mas->tree->ma_lock)
{
unsigned char end;
MA_STATE(mas, mt, 0, 0);
- rcu_read_lock();
mas_start(&mas);
if (!mas_is_active(&mas))
- goto done;
+ return;
while (!mte_is_leaf(mas.node))
mas_descend(&mas);
@@ -7594,9 +7594,6 @@ void mt_validate(struct maple_tree *mt)
mas_dfs_postorder(&mas, ULONG_MAX);
}
mt_validate_nulls(mt);
-done:
- rcu_read_unlock();
-
}
EXPORT_SYMBOL_GPL(mt_validate);
Hi, Greg
Could you please help to cherry-pick the following commit to the 5.4.y branch?
f1edb498bd9f ("clk: hi6220: use CLK_OF_DECLARE_DRIVER")
It's been there since the 5.10 kernel, and this along with the reset
controller patch
are needed for Hikey devices to work with 5.4.y kernels, otherwise it
will get stuck
during the boot.
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
linaro-android(a)lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-android
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 3002240d16494d798add0575e8ba1f284258ab34
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090804-leverage-floral-9b6b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
3002240d1649 ("fuse: fix memory leak in fuse_create_open")
15d937d7ca8c ("fuse: add request extension")
153524053bbb ("fuse: allow non-extending parallel direct writes on the same file")
4f8d37020e1f ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3002240d16494d798add0575e8ba1f284258ab34 Mon Sep 17 00:00:00 2001
From: yangyun <yangyun50(a)huawei.com>
Date: Fri, 23 Aug 2024 16:51:46 +0800
Subject: [PATCH] fuse: fix memory leak in fuse_create_open
The memory of struct fuse_file is allocated but not freed
when get_create_ext return error.
Fixes: 3e2b6fdbdc9a ("fuse: send security context of inode on file")
Cc: stable(a)vger.kernel.org # v5.17
Signed-off-by: yangyun <yangyun50(a)huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2b0d4781f394..8e96df9fd76c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -670,7 +670,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
err = get_create_ext(&args, dir, entry, mode);
if (err)
- goto out_put_forget_req;
+ goto out_free_ff;
err = fuse_simple_request(fm, &args);
free_ext_value(&args);
The patch below does not apply to the 6.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.10.y
git checkout FETCH_HEAD
git cherry-pick -x 9ec87c5957ea9bf68d36f5e098605b585b2571e4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090809-portion-riptide-f7d9@gregkh' --subject-prefix 'PATCH 6.10.y' HEAD^..
Possible dependencies:
9ec87c5957ea ("OPP: Fix support for required OPPs for multiple PM domains")
0d865221c8b1 ("OPP: Drop a redundant in-parameter to _set_opp_level()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9ec87c5957ea9bf68d36f5e098605b585b2571e4 Mon Sep 17 00:00:00 2001
From: Ulf Hansson <ulf.hansson(a)linaro.org>
Date: Fri, 23 Aug 2024 00:45:38 +0200
Subject: [PATCH] OPP: Fix support for required OPPs for multiple PM domains
It has turned out that having _set_required_opps() to recursively call
dev_pm_opp_set_opp() to set the required OPPs, doesn't really work as well
as we expected.
More precisely, at each recursive call to dev_pm_opp_set_opp() we are
changing an OPP for a required_dev that belongs to a required-OPP table.
The problem with this, is that we may have several devices sharing the same
required-OPP table, which leads to an incorrect behaviour in regards to
aggregating the per device votes.
To fix the problem for a required-OPP table belonging to a PM domain, which
is the only existing usecase for now, let's simply replace the call to
dev_pm_opp_set_opp() in _set_required_opps() by a call to _set_opp_level().
Moving forward we may potentially need to add support for other types of
required-OPP tables. In this case, the aggregation needs to be thought of.
Fixes: e37440e7e2c2 ("OPP: Call dev_pm_opp_set_opp() for required OPPs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Acked-by: Viresh Kumar <viresh.kumar(a)linaro.org>
Link: https://lore.kernel.org/r/20240822224547.385095-2-ulf.hansson@linaro.org
diff --git a/drivers/opp/core.c b/drivers/opp/core.c
index 5f4598246a87..494f8860220d 100644
--- a/drivers/opp/core.c
+++ b/drivers/opp/core.c
@@ -1061,6 +1061,27 @@ static int _set_opp_bw(const struct opp_table *opp_table,
return 0;
}
+static int _set_opp_level(struct device *dev, struct dev_pm_opp *opp)
+{
+ unsigned int level = 0;
+ int ret = 0;
+
+ if (opp) {
+ if (opp->level == OPP_LEVEL_UNSET)
+ return 0;
+
+ level = opp->level;
+ }
+
+ /* Request a new performance state through the device's PM domain. */
+ ret = dev_pm_domain_set_performance_state(dev, level);
+ if (ret)
+ dev_err(dev, "Failed to set performance state %u (%d)\n", level,
+ ret);
+
+ return ret;
+}
+
/* This is only called for PM domain for now */
static int _set_required_opps(struct device *dev, struct opp_table *opp_table,
struct dev_pm_opp *opp, bool up)
@@ -1091,7 +1112,7 @@ static int _set_required_opps(struct device *dev, struct opp_table *opp_table,
if (devs[index]) {
required_opp = opp ? opp->required_opps[index] : NULL;
- ret = dev_pm_opp_set_opp(devs[index], required_opp);
+ ret = _set_opp_level(devs[index], required_opp);
if (ret)
return ret;
}
@@ -1102,27 +1123,6 @@ static int _set_required_opps(struct device *dev, struct opp_table *opp_table,
return 0;
}
-static int _set_opp_level(struct device *dev, struct dev_pm_opp *opp)
-{
- unsigned int level = 0;
- int ret = 0;
-
- if (opp) {
- if (opp->level == OPP_LEVEL_UNSET)
- return 0;
-
- level = opp->level;
- }
-
- /* Request a new performance state through the device's PM domain. */
- ret = dev_pm_domain_set_performance_state(dev, level);
- if (ret)
- dev_err(dev, "Failed to set performance state %u (%d)\n", level,
- ret);
-
- return ret;
-}
-
static void _find_current_opp(struct device *dev, struct opp_table *opp_table)
{
struct dev_pm_opp *opp = ERR_PTR(-ENODEV);
@@ -2457,18 +2457,6 @@ static int _opp_attach_genpd(struct opp_table *opp_table, struct device *dev,
}
}
- /*
- * Add the virtual genpd device as a user of the OPP table, so
- * we can call dev_pm_opp_set_opp() on it directly.
- *
- * This will be automatically removed when the OPP table is
- * removed, don't need to handle that here.
- */
- if (!_add_opp_dev(virt_dev, opp_table->required_opp_tables[index])) {
- ret = -ENOMEM;
- goto err;
- }
-
opp_table->required_devs[index] = virt_dev;
index++;
name++;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x aea62c744a9ae2a8247c54ec42138405216414da
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090853-flogging-deepen-eabe@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
aea62c744a9a ("mmc: cqhci: Fix checking of CQHCI_HALT state")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From aea62c744a9ae2a8247c54ec42138405216414da Mon Sep 17 00:00:00 2001
From: Seunghwan Baek <sh8267.baek(a)samsung.com>
Date: Thu, 29 Aug 2024 15:18:22 +0900
Subject: [PATCH] mmc: cqhci: Fix checking of CQHCI_HALT state
To check if mmc cqe is in halt state, need to check set/clear of CQHCI_HALT
bit. At this time, we need to check with &, not &&.
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Cc: stable(a)vger.kernel.org
Signed-off-by: Seunghwan Baek <sh8267.baek(a)samsung.com>
Reviewed-by: Ritesh Harjani <ritesh.list(a)gmail.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20240829061823.3718-2-sh8267.baek@samsung.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index c14d7251d0bb..a02da26a1efd 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -617,7 +617,7 @@ static int cqhci_request(struct mmc_host *mmc, struct mmc_request *mrq)
cqhci_writel(cq_host, 0, CQHCI_CTL);
mmc->cqe_on = true;
pr_debug("%s: cqhci: CQE on\n", mmc_hostname(mmc));
- if (cqhci_readl(cq_host, CQHCI_CTL) && CQHCI_HALT) {
+ if (cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT) {
pr_err("%s: cqhci: CQE failed to exit halt state\n",
mmc_hostname(mmc));
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x aea62c744a9ae2a8247c54ec42138405216414da
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090853-unweave-borrowing-7c2a@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
aea62c744a9a ("mmc: cqhci: Fix checking of CQHCI_HALT state")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From aea62c744a9ae2a8247c54ec42138405216414da Mon Sep 17 00:00:00 2001
From: Seunghwan Baek <sh8267.baek(a)samsung.com>
Date: Thu, 29 Aug 2024 15:18:22 +0900
Subject: [PATCH] mmc: cqhci: Fix checking of CQHCI_HALT state
To check if mmc cqe is in halt state, need to check set/clear of CQHCI_HALT
bit. At this time, we need to check with &, not &&.
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Cc: stable(a)vger.kernel.org
Signed-off-by: Seunghwan Baek <sh8267.baek(a)samsung.com>
Reviewed-by: Ritesh Harjani <ritesh.list(a)gmail.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20240829061823.3718-2-sh8267.baek@samsung.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index c14d7251d0bb..a02da26a1efd 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -617,7 +617,7 @@ static int cqhci_request(struct mmc_host *mmc, struct mmc_request *mrq)
cqhci_writel(cq_host, 0, CQHCI_CTL);
mmc->cqe_on = true;
pr_debug("%s: cqhci: CQE on\n", mmc_hostname(mmc));
- if (cqhci_readl(cq_host, CQHCI_CTL) && CQHCI_HALT) {
+ if (cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT) {
pr_err("%s: cqhci: CQE failed to exit halt state\n",
mmc_hostname(mmc));
}
This is a note to let you know that I've just added the patch titled
iio: magnetometer: ak8975: Fix reading for ak099xx sensors
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 129464e86c7445a858b790ac2d28d35f58256bbe Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Barnab=C3=A1s=20Cz=C3=A9m=C3=A1n?=
<barnabas.czeman(a)mainlining.org>
Date: Mon, 19 Aug 2024 00:29:40 +0200
Subject: iio: magnetometer: ak8975: Fix reading for ak099xx sensors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Move ST2 reading with overflow handling after measurement data
reading.
ST2 register read have to be read after read measurment data,
because it means end of the reading and realease the lock on the data.
Remove ST2 read skip on interrupt based waiting because ST2 required to
be read out at and of the axis read.
Fixes: 57e73a423b1e ("iio: ak8975: add ak09911 and ak09912 support")
Signed-off-by: Barnabás Czémán <barnabas.czeman(a)mainlining.org>
Link: https://patch.msgid.link/20240819-ak09918-v4-2-f0734d14cfb9@mainlining.org
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/magnetometer/ak8975.c | 32 +++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/iio/magnetometer/ak8975.c b/drivers/iio/magnetometer/ak8975.c
index 67d5d1f2402f..1d5f79b7005b 100644
--- a/drivers/iio/magnetometer/ak8975.c
+++ b/drivers/iio/magnetometer/ak8975.c
@@ -696,22 +696,8 @@ static int ak8975_start_read_axis(struct ak8975_data *data,
if (ret < 0)
return ret;
- /* This will be executed only for non-interrupt based waiting case */
- if (ret & data->def->ctrl_masks[ST1_DRDY]) {
- ret = i2c_smbus_read_byte_data(client,
- data->def->ctrl_regs[ST2]);
- if (ret < 0) {
- dev_err(&client->dev, "Error in reading ST2\n");
- return ret;
- }
- if (ret & (data->def->ctrl_masks[ST2_DERR] |
- data->def->ctrl_masks[ST2_HOFL])) {
- dev_err(&client->dev, "ST2 status error 0x%x\n", ret);
- return -EINVAL;
- }
- }
-
- return 0;
+ /* Return with zero if the data is ready. */
+ return !data->def->ctrl_regs[ST1_DRDY];
}
/* Retrieve raw flux value for one of the x, y, or z axis. */
@@ -738,6 +724,20 @@ static int ak8975_read_axis(struct iio_dev *indio_dev, int index, int *val)
if (ret < 0)
goto exit;
+ /* Read out ST2 for release lock on measurment data. */
+ ret = i2c_smbus_read_byte_data(client, data->def->ctrl_regs[ST2]);
+ if (ret < 0) {
+ dev_err(&client->dev, "Error in reading ST2\n");
+ goto exit;
+ }
+
+ if (ret & (data->def->ctrl_masks[ST2_DERR] |
+ data->def->ctrl_masks[ST2_HOFL])) {
+ dev_err(&client->dev, "ST2 status error 0x%x\n", ret);
+ ret = -EINVAL;
+ goto exit;
+ }
+
mutex_unlock(&data->lock);
pm_runtime_mark_last_busy(&data->client->dev);
--
2.46.0
This is a note to let you know that I've just added the patch titled
iio: magnetometer: ak8975: Fix reading for ak099xx sensors
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the char-misc-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From 129464e86c7445a858b790ac2d28d35f58256bbe Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Barnab=C3=A1s=20Cz=C3=A9m=C3=A1n?=
<barnabas.czeman(a)mainlining.org>
Date: Mon, 19 Aug 2024 00:29:40 +0200
Subject: iio: magnetometer: ak8975: Fix reading for ak099xx sensors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Move ST2 reading with overflow handling after measurement data
reading.
ST2 register read have to be read after read measurment data,
because it means end of the reading and realease the lock on the data.
Remove ST2 read skip on interrupt based waiting because ST2 required to
be read out at and of the axis read.
Fixes: 57e73a423b1e ("iio: ak8975: add ak09911 and ak09912 support")
Signed-off-by: Barnabás Czémán <barnabas.czeman(a)mainlining.org>
Link: https://patch.msgid.link/20240819-ak09918-v4-2-f0734d14cfb9@mainlining.org
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/magnetometer/ak8975.c | 32 +++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/iio/magnetometer/ak8975.c b/drivers/iio/magnetometer/ak8975.c
index 67d5d1f2402f..1d5f79b7005b 100644
--- a/drivers/iio/magnetometer/ak8975.c
+++ b/drivers/iio/magnetometer/ak8975.c
@@ -696,22 +696,8 @@ static int ak8975_start_read_axis(struct ak8975_data *data,
if (ret < 0)
return ret;
- /* This will be executed only for non-interrupt based waiting case */
- if (ret & data->def->ctrl_masks[ST1_DRDY]) {
- ret = i2c_smbus_read_byte_data(client,
- data->def->ctrl_regs[ST2]);
- if (ret < 0) {
- dev_err(&client->dev, "Error in reading ST2\n");
- return ret;
- }
- if (ret & (data->def->ctrl_masks[ST2_DERR] |
- data->def->ctrl_masks[ST2_HOFL])) {
- dev_err(&client->dev, "ST2 status error 0x%x\n", ret);
- return -EINVAL;
- }
- }
-
- return 0;
+ /* Return with zero if the data is ready. */
+ return !data->def->ctrl_regs[ST1_DRDY];
}
/* Retrieve raw flux value for one of the x, y, or z axis. */
@@ -738,6 +724,20 @@ static int ak8975_read_axis(struct iio_dev *indio_dev, int index, int *val)
if (ret < 0)
goto exit;
+ /* Read out ST2 for release lock on measurment data. */
+ ret = i2c_smbus_read_byte_data(client, data->def->ctrl_regs[ST2]);
+ if (ret < 0) {
+ dev_err(&client->dev, "Error in reading ST2\n");
+ goto exit;
+ }
+
+ if (ret & (data->def->ctrl_masks[ST2_DERR] |
+ data->def->ctrl_masks[ST2_HOFL])) {
+ dev_err(&client->dev, "ST2 status error 0x%x\n", ret);
+ ret = -EINVAL;
+ goto exit;
+ }
+
mutex_unlock(&data->lock);
pm_runtime_mark_last_busy(&data->client->dev);
--
2.46.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 4bcdd831d9d01e0fb64faea50732b59b2ee88da1
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090850-handrail-battering-849f@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
4bcdd831d9d0 ("KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4bcdd831d9d01e0fb64faea50732b59b2ee88da1 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 23 Jul 2024 16:20:55 -0700
Subject: [PATCH] KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly
leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX
reads guest memory.
Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN
via sync_regs(), which already holds SRCU. I.e. trying to precisely use
kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause
problems. Acquiring SRCU isn't all that expensive, so for simplicity,
grab it unconditionally for KVM_SET_VCPU_EVENTS.
=============================
WARNING: suspicious RCU usage
6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted
-----------------------------
include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by repro/1071:
#0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm]
stack backtrace:
CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x13f/0x1a0
kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm]
kvm_vcpu_read_guest+0x3e/0x90 [kvm]
nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel]
load_vmcs12_host_state+0x432/0xb40 [kvm_intel]
vmx_leave_nested+0x30/0x40 [kvm_intel]
kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm]
kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm]
? mark_held_locks+0x49/0x70
? kvm_vcpu_ioctl+0x7d/0x970 [kvm]
? kvm_vcpu_ioctl+0x497/0x970 [kvm]
kvm_vcpu_ioctl+0x497/0x970 [kvm]
? lock_acquire+0xba/0x2d0
? find_held_lock+0x2b/0x80
? do_user_addr_fault+0x40c/0x6f0
? lock_release+0xb7/0x270
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x6c/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7ff11eb1b539
</TASK>
Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 70219e406987..2c7327ef0f0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6040,7 +6040,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (copy_from_user(&events, argp, sizeof(struct kvm_vcpu_events)))
break;
+ kvm_vcpu_srcu_read_lock(vcpu);
r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
+ kvm_vcpu_srcu_read_unlock(vcpu);
break;
}
case KVM_GET_DEBUGREGS: {
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 4bcdd831d9d01e0fb64faea50732b59b2ee88da1
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090849-facsimile-hamstring-a776@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
4bcdd831d9d0 ("KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4bcdd831d9d01e0fb64faea50732b59b2ee88da1 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 23 Jul 2024 16:20:55 -0700
Subject: [PATCH] KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly
leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX
reads guest memory.
Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN
via sync_regs(), which already holds SRCU. I.e. trying to precisely use
kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause
problems. Acquiring SRCU isn't all that expensive, so for simplicity,
grab it unconditionally for KVM_SET_VCPU_EVENTS.
=============================
WARNING: suspicious RCU usage
6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted
-----------------------------
include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by repro/1071:
#0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm]
stack backtrace:
CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x13f/0x1a0
kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm]
kvm_vcpu_read_guest+0x3e/0x90 [kvm]
nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel]
load_vmcs12_host_state+0x432/0xb40 [kvm_intel]
vmx_leave_nested+0x30/0x40 [kvm_intel]
kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm]
kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm]
? mark_held_locks+0x49/0x70
? kvm_vcpu_ioctl+0x7d/0x970 [kvm]
? kvm_vcpu_ioctl+0x497/0x970 [kvm]
kvm_vcpu_ioctl+0x497/0x970 [kvm]
? lock_acquire+0xba/0x2d0
? find_held_lock+0x2b/0x80
? do_user_addr_fault+0x40c/0x6f0
? lock_release+0xb7/0x270
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x6c/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7ff11eb1b539
</TASK>
Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 70219e406987..2c7327ef0f0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6040,7 +6040,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (copy_from_user(&events, argp, sizeof(struct kvm_vcpu_events)))
break;
+ kvm_vcpu_srcu_read_lock(vcpu);
r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
+ kvm_vcpu_srcu_read_unlock(vcpu);
break;
}
case KVM_GET_DEBUGREGS: {
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x d33d26036a0274b472299d7dcdaa5fb34329f91b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090849-android-shield-d1fd@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
d33d26036a02 ("rtmutex: Drop rt_mutex::wait_lock before scheduling")
add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
1c143c4b65da ("locking/rtmutex: Provide the spin/rwlock core lock function")
e17ba59b7e8e ("locking/rtmutex: Guard regular sleeping locks specific functions")
7980aa397cc0 ("locking/rtmutex: Use rt_mutex_wake_q_head")
c014ef69b3ac ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
42254105dfe8 ("locking/rwsem: Add rtmutex based R/W semaphore implementation")
ebbdc41e90ff ("locking/rtmutex: Provide rt_mutex_slowlock_locked()")
830e6acc8a1c ("locking/rtmutex: Split out the inner parts of 'struct rtmutex'")
531ae4b06a73 ("locking/rtmutex: Split API from implementation")
785159301bed ("locking/rtmutex: Convert macros to inlines")
b41cda037655 ("locking/rtmutex: Set proper wait context for lockdep")
2f064a59a11f ("sched: Change task_struct::state")
d6c23bb3a2ad ("sched: Add get_current_state()")
b03fbd4ff24c ("sched: Introduce task_is_running()")
a9e906b71f96 ("Merge branch 'sched/urgent' into sched/core, to pick up fixes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d33d26036a0274b472299d7dcdaa5fb34329f91b Mon Sep 17 00:00:00 2001
From: Roland Xu <mu001999(a)outlook.com>
Date: Thu, 15 Aug 2024 10:58:13 +0800
Subject: [PATCH] rtmutex: Drop rt_mutex::wait_lock before scheduling
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held. In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.
Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.
[ tglx: Moved unlock before the WARN(), removed the pointless comment,
massaged changelog, added Fixes tag ]
Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu <mu001999(a)outlook.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300M…
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..fba1229f1de6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1644,6 +1644,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
}
static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
+ struct rt_mutex_base *lock,
struct rt_mutex_waiter *w)
{
/*
@@ -1656,10 +1657,10 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
if (build_ww_mutex() && w->ww_ctx)
return;
- /*
- * Yell loudly and stop the task right here.
- */
+ raw_spin_unlock_irq(&lock->wait_lock);
+
WARN(1, "rtmutex deadlock detected\n");
+
while (1) {
set_current_state(TASK_INTERRUPTIBLE);
rt_mutex_schedule();
@@ -1713,7 +1714,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
} else {
__set_current_state(TASK_RUNNING);
remove_waiter(lock, waiter);
- rt_mutex_handle_deadlock(ret, chwalk, waiter);
+ rt_mutex_handle_deadlock(ret, chwalk, lock, waiter);
}
/*
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x d33d26036a0274b472299d7dcdaa5fb34329f91b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090848-dragster-ahoy-c47e@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
d33d26036a02 ("rtmutex: Drop rt_mutex::wait_lock before scheduling")
add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
1c143c4b65da ("locking/rtmutex: Provide the spin/rwlock core lock function")
e17ba59b7e8e ("locking/rtmutex: Guard regular sleeping locks specific functions")
7980aa397cc0 ("locking/rtmutex: Use rt_mutex_wake_q_head")
c014ef69b3ac ("locking/rtmutex: Add wake_state to rt_mutex_waiter")
42254105dfe8 ("locking/rwsem: Add rtmutex based R/W semaphore implementation")
ebbdc41e90ff ("locking/rtmutex: Provide rt_mutex_slowlock_locked()")
830e6acc8a1c ("locking/rtmutex: Split out the inner parts of 'struct rtmutex'")
531ae4b06a73 ("locking/rtmutex: Split API from implementation")
785159301bed ("locking/rtmutex: Convert macros to inlines")
b41cda037655 ("locking/rtmutex: Set proper wait context for lockdep")
2f064a59a11f ("sched: Change task_struct::state")
d6c23bb3a2ad ("sched: Add get_current_state()")
b03fbd4ff24c ("sched: Introduce task_is_running()")
a9e906b71f96 ("Merge branch 'sched/urgent' into sched/core, to pick up fixes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d33d26036a0274b472299d7dcdaa5fb34329f91b Mon Sep 17 00:00:00 2001
From: Roland Xu <mu001999(a)outlook.com>
Date: Thu, 15 Aug 2024 10:58:13 +0800
Subject: [PATCH] rtmutex: Drop rt_mutex::wait_lock before scheduling
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held. In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.
Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.
[ tglx: Moved unlock before the WARN(), removed the pointless comment,
massaged changelog, added Fixes tag ]
Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu <mu001999(a)outlook.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300M…
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..fba1229f1de6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1644,6 +1644,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
}
static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
+ struct rt_mutex_base *lock,
struct rt_mutex_waiter *w)
{
/*
@@ -1656,10 +1657,10 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
if (build_ww_mutex() && w->ww_ctx)
return;
- /*
- * Yell loudly and stop the task right here.
- */
+ raw_spin_unlock_irq(&lock->wait_lock);
+
WARN(1, "rtmutex deadlock detected\n");
+
while (1) {
set_current_state(TASK_INTERRUPTIBLE);
rt_mutex_schedule();
@@ -1713,7 +1714,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
} else {
__set_current_state(TASK_RUNNING);
remove_waiter(lock, waiter);
- rt_mutex_handle_deadlock(ret, chwalk, waiter);
+ rt_mutex_handle_deadlock(ret, chwalk, lock, waiter);
}
/*
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x ea72ce5da22806d5713f3ffb39a6d5ae73841f93
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090830-cough-rewrite-bcc9@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space")
1a167ddd3c56 ("x86: kmsan: pgtable: reduce vmalloc space")
14b80582c43e ("resource: Introduce alloc_free_mem_region()")
27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
895749455f60 ("mm: simplify freeing of devmap managed pages")
75e55d8a107e ("mm: move free_devmap_managed_page to memremap.c")
730ff52194cd ("mm: remove pointless includes from <linux/hmm.h>")
f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Wed, 14 Aug 2024 00:29:36 +0200
Subject: [PATCH] x86/kaslr: Expose and use the end of the physical memory
address space
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8(a)gmail.com>
Reported-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-By: Max Ramanouski <max8rr8(a)gmail.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Kees Cook <kees(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index af4302d79b59..f3d257c45225 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -17,6 +17,7 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long physmem_end;
static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..a98e53491a4e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -140,6 +140,10 @@ extern unsigned int ptrs_per_p4d;
# define VMEMMAP_START __VMEMMAP_BASE_L4
#endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#ifdef CONFIG_RANDOMIZE_MEMORY
+# define PHYSMEM_END physmem_end
+#endif
+
/*
* End of the region for which vmalloc page tables are pre-allocated.
* For non-KMSAN builds, this is the same as VMALLOC_END.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d8dbeac8b206..ff253648706f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -958,8 +958,12 @@ static void update_end_of_memory_vars(u64 start, u64 size)
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params)
{
+ unsigned long end = ((start_pfn + nr_pages) << PAGE_SHIFT) - 1;
int ret;
+ if (WARN_ON_ONCE(end > PHYSMEM_END))
+ return -ERANGE;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 37db264866b6..230f1dee4f09 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -47,13 +47,24 @@ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
*/
static __initdata struct kaslr_memory_region {
unsigned long *base;
+ unsigned long *end;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 0 },
- { &vmalloc_base, 0 },
- { &vmemmap_base, 0 },
+ {
+ .base = &page_offset_base,
+ .end = &physmem_end,
+ },
+ {
+ .base = &vmalloc_base,
+ },
+ {
+ .base = &vmemmap_base,
+ },
};
+/* The end of the possible address space for physical memory */
+unsigned long physmem_end __ro_after_init;
+
/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
@@ -82,6 +93,8 @@ void __init kernel_randomize_memory(void)
BUILD_BUG_ON(vaddr_end != CPU_ENTRY_AREA_BASE);
BUILD_BUG_ON(vaddr_end > __START_KERNEL_map);
+ /* Preset the end of the possible address space for physical memory */
+ physmem_end = ((1ULL << MAX_PHYSMEM_BITS) - 1);
if (!kaslr_memory_enabled())
return;
@@ -128,11 +141,18 @@ void __init kernel_randomize_memory(void)
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
- /*
- * Jump the region and add a minimum padding based on
- * randomization alignment.
- */
+ /* Calculate the end of the region */
vaddr += get_padding(&kaslr_regions[i]);
+ /*
+ * KASLR trims the maximum possible size of the
+ * direct-map. Update the physmem_end boundary.
+ * No rounding required as the region starts
+ * PUD aligned and size is in units of TB.
+ */
+ if (kaslr_regions[i].end)
+ *kaslr_regions[i].end = __pa_nodebug(vaddr - 1);
+
+ /* Add a minimum padding based on randomization alignment. */
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4b238a20b76..b3864156eaa4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -97,6 +97,10 @@ extern const int mmap_rnd_compat_bits_max;
extern int mmap_rnd_compat_bits __read_mostly;
#endif
+#ifndef PHYSMEM_END
+# define PHYSMEM_END ((1ULL << MAX_PHYSMEM_BITS) - 1)
+#endif
+
#include <asm/page.h>
#include <asm/processor.h>
diff --git a/kernel/resource.c b/kernel/resource.c
index 14777afb0a99..a83040fde236 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1826,8 +1826,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
if (flags & GFR_DESCENDING) {
resource_size_t end;
- end = min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ end = min_t(resource_size_t, base->end, PHYSMEM_END);
return end - size + 1;
}
@@ -1844,8 +1843,7 @@ static bool gfr_continue(struct resource *base, resource_size_t addr,
* @size did not wrap 0.
*/
return addr > addr - size &&
- addr <= min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ addr <= min_t(resource_size_t, base->end, PHYSMEM_END);
}
static resource_size_t gfr_next(resource_size_t addr, resource_size_t size,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 66267c26ca1b..951878ab627a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1681,7 +1681,7 @@ struct range __weak arch_get_mappable_range(void)
struct range mhp_get_pluggable_range(bool need_mapping)
{
- const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
+ const u64 max_phys = PHYSMEM_END;
struct range mhp_range;
if (need_mapping) {
diff --git a/mm/sparse.c b/mm/sparse.c
index e4b830091d13..0c3bff882033 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -129,7 +129,7 @@ static inline int sparse_early_nid(struct mem_section *section)
static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
- unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+ unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
/*
* Sanity checks - do not allow an architecture to pass
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x ea72ce5da22806d5713f3ffb39a6d5ae73841f93
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090829-swizzle-angelfish-9047@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space")
1a167ddd3c56 ("x86: kmsan: pgtable: reduce vmalloc space")
14b80582c43e ("resource: Introduce alloc_free_mem_region()")
27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
895749455f60 ("mm: simplify freeing of devmap managed pages")
75e55d8a107e ("mm: move free_devmap_managed_page to memremap.c")
730ff52194cd ("mm: remove pointless includes from <linux/hmm.h>")
f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Wed, 14 Aug 2024 00:29:36 +0200
Subject: [PATCH] x86/kaslr: Expose and use the end of the physical memory
address space
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8(a)gmail.com>
Reported-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-By: Max Ramanouski <max8rr8(a)gmail.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Kees Cook <kees(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index af4302d79b59..f3d257c45225 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -17,6 +17,7 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long physmem_end;
static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..a98e53491a4e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -140,6 +140,10 @@ extern unsigned int ptrs_per_p4d;
# define VMEMMAP_START __VMEMMAP_BASE_L4
#endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#ifdef CONFIG_RANDOMIZE_MEMORY
+# define PHYSMEM_END physmem_end
+#endif
+
/*
* End of the region for which vmalloc page tables are pre-allocated.
* For non-KMSAN builds, this is the same as VMALLOC_END.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d8dbeac8b206..ff253648706f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -958,8 +958,12 @@ static void update_end_of_memory_vars(u64 start, u64 size)
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params)
{
+ unsigned long end = ((start_pfn + nr_pages) << PAGE_SHIFT) - 1;
int ret;
+ if (WARN_ON_ONCE(end > PHYSMEM_END))
+ return -ERANGE;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 37db264866b6..230f1dee4f09 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -47,13 +47,24 @@ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
*/
static __initdata struct kaslr_memory_region {
unsigned long *base;
+ unsigned long *end;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 0 },
- { &vmalloc_base, 0 },
- { &vmemmap_base, 0 },
+ {
+ .base = &page_offset_base,
+ .end = &physmem_end,
+ },
+ {
+ .base = &vmalloc_base,
+ },
+ {
+ .base = &vmemmap_base,
+ },
};
+/* The end of the possible address space for physical memory */
+unsigned long physmem_end __ro_after_init;
+
/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
@@ -82,6 +93,8 @@ void __init kernel_randomize_memory(void)
BUILD_BUG_ON(vaddr_end != CPU_ENTRY_AREA_BASE);
BUILD_BUG_ON(vaddr_end > __START_KERNEL_map);
+ /* Preset the end of the possible address space for physical memory */
+ physmem_end = ((1ULL << MAX_PHYSMEM_BITS) - 1);
if (!kaslr_memory_enabled())
return;
@@ -128,11 +141,18 @@ void __init kernel_randomize_memory(void)
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
- /*
- * Jump the region and add a minimum padding based on
- * randomization alignment.
- */
+ /* Calculate the end of the region */
vaddr += get_padding(&kaslr_regions[i]);
+ /*
+ * KASLR trims the maximum possible size of the
+ * direct-map. Update the physmem_end boundary.
+ * No rounding required as the region starts
+ * PUD aligned and size is in units of TB.
+ */
+ if (kaslr_regions[i].end)
+ *kaslr_regions[i].end = __pa_nodebug(vaddr - 1);
+
+ /* Add a minimum padding based on randomization alignment. */
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4b238a20b76..b3864156eaa4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -97,6 +97,10 @@ extern const int mmap_rnd_compat_bits_max;
extern int mmap_rnd_compat_bits __read_mostly;
#endif
+#ifndef PHYSMEM_END
+# define PHYSMEM_END ((1ULL << MAX_PHYSMEM_BITS) - 1)
+#endif
+
#include <asm/page.h>
#include <asm/processor.h>
diff --git a/kernel/resource.c b/kernel/resource.c
index 14777afb0a99..a83040fde236 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1826,8 +1826,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
if (flags & GFR_DESCENDING) {
resource_size_t end;
- end = min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ end = min_t(resource_size_t, base->end, PHYSMEM_END);
return end - size + 1;
}
@@ -1844,8 +1843,7 @@ static bool gfr_continue(struct resource *base, resource_size_t addr,
* @size did not wrap 0.
*/
return addr > addr - size &&
- addr <= min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ addr <= min_t(resource_size_t, base->end, PHYSMEM_END);
}
static resource_size_t gfr_next(resource_size_t addr, resource_size_t size,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 66267c26ca1b..951878ab627a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1681,7 +1681,7 @@ struct range __weak arch_get_mappable_range(void)
struct range mhp_get_pluggable_range(bool need_mapping)
{
- const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
+ const u64 max_phys = PHYSMEM_END;
struct range mhp_range;
if (need_mapping) {
diff --git a/mm/sparse.c b/mm/sparse.c
index e4b830091d13..0c3bff882033 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -129,7 +129,7 @@ static inline int sparse_early_nid(struct mem_section *section)
static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
- unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+ unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
/*
* Sanity checks - do not allow an architecture to pass
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x ea72ce5da22806d5713f3ffb39a6d5ae73841f93
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090827-corporate-raider-afc5@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space")
1a167ddd3c56 ("x86: kmsan: pgtable: reduce vmalloc space")
14b80582c43e ("resource: Introduce alloc_free_mem_region()")
27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
895749455f60 ("mm: simplify freeing of devmap managed pages")
75e55d8a107e ("mm: move free_devmap_managed_page to memremap.c")
730ff52194cd ("mm: remove pointless includes from <linux/hmm.h>")
f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Wed, 14 Aug 2024 00:29:36 +0200
Subject: [PATCH] x86/kaslr: Expose and use the end of the physical memory
address space
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8(a)gmail.com>
Reported-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-By: Max Ramanouski <max8rr8(a)gmail.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Kees Cook <kees(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index af4302d79b59..f3d257c45225 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -17,6 +17,7 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long physmem_end;
static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..a98e53491a4e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -140,6 +140,10 @@ extern unsigned int ptrs_per_p4d;
# define VMEMMAP_START __VMEMMAP_BASE_L4
#endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#ifdef CONFIG_RANDOMIZE_MEMORY
+# define PHYSMEM_END physmem_end
+#endif
+
/*
* End of the region for which vmalloc page tables are pre-allocated.
* For non-KMSAN builds, this is the same as VMALLOC_END.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d8dbeac8b206..ff253648706f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -958,8 +958,12 @@ static void update_end_of_memory_vars(u64 start, u64 size)
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params)
{
+ unsigned long end = ((start_pfn + nr_pages) << PAGE_SHIFT) - 1;
int ret;
+ if (WARN_ON_ONCE(end > PHYSMEM_END))
+ return -ERANGE;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 37db264866b6..230f1dee4f09 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -47,13 +47,24 @@ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
*/
static __initdata struct kaslr_memory_region {
unsigned long *base;
+ unsigned long *end;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 0 },
- { &vmalloc_base, 0 },
- { &vmemmap_base, 0 },
+ {
+ .base = &page_offset_base,
+ .end = &physmem_end,
+ },
+ {
+ .base = &vmalloc_base,
+ },
+ {
+ .base = &vmemmap_base,
+ },
};
+/* The end of the possible address space for physical memory */
+unsigned long physmem_end __ro_after_init;
+
/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
@@ -82,6 +93,8 @@ void __init kernel_randomize_memory(void)
BUILD_BUG_ON(vaddr_end != CPU_ENTRY_AREA_BASE);
BUILD_BUG_ON(vaddr_end > __START_KERNEL_map);
+ /* Preset the end of the possible address space for physical memory */
+ physmem_end = ((1ULL << MAX_PHYSMEM_BITS) - 1);
if (!kaslr_memory_enabled())
return;
@@ -128,11 +141,18 @@ void __init kernel_randomize_memory(void)
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
- /*
- * Jump the region and add a minimum padding based on
- * randomization alignment.
- */
+ /* Calculate the end of the region */
vaddr += get_padding(&kaslr_regions[i]);
+ /*
+ * KASLR trims the maximum possible size of the
+ * direct-map. Update the physmem_end boundary.
+ * No rounding required as the region starts
+ * PUD aligned and size is in units of TB.
+ */
+ if (kaslr_regions[i].end)
+ *kaslr_regions[i].end = __pa_nodebug(vaddr - 1);
+
+ /* Add a minimum padding based on randomization alignment. */
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4b238a20b76..b3864156eaa4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -97,6 +97,10 @@ extern const int mmap_rnd_compat_bits_max;
extern int mmap_rnd_compat_bits __read_mostly;
#endif
+#ifndef PHYSMEM_END
+# define PHYSMEM_END ((1ULL << MAX_PHYSMEM_BITS) - 1)
+#endif
+
#include <asm/page.h>
#include <asm/processor.h>
diff --git a/kernel/resource.c b/kernel/resource.c
index 14777afb0a99..a83040fde236 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1826,8 +1826,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
if (flags & GFR_DESCENDING) {
resource_size_t end;
- end = min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ end = min_t(resource_size_t, base->end, PHYSMEM_END);
return end - size + 1;
}
@@ -1844,8 +1843,7 @@ static bool gfr_continue(struct resource *base, resource_size_t addr,
* @size did not wrap 0.
*/
return addr > addr - size &&
- addr <= min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ addr <= min_t(resource_size_t, base->end, PHYSMEM_END);
}
static resource_size_t gfr_next(resource_size_t addr, resource_size_t size,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 66267c26ca1b..951878ab627a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1681,7 +1681,7 @@ struct range __weak arch_get_mappable_range(void)
struct range mhp_get_pluggable_range(bool need_mapping)
{
- const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
+ const u64 max_phys = PHYSMEM_END;
struct range mhp_range;
if (need_mapping) {
diff --git a/mm/sparse.c b/mm/sparse.c
index e4b830091d13..0c3bff882033 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -129,7 +129,7 @@ static inline int sparse_early_nid(struct mem_section *section)
static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
- unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+ unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
/*
* Sanity checks - do not allow an architecture to pass
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x ea72ce5da22806d5713f3ffb39a6d5ae73841f93
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090826-expulsion-unkempt-11e3@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space")
1a167ddd3c56 ("x86: kmsan: pgtable: reduce vmalloc space")
14b80582c43e ("resource: Introduce alloc_free_mem_region()")
27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
dc90f0846df4 ("mm: don't include <linux/memremap.h> in <linux/mm.h>")
895749455f60 ("mm: simplify freeing of devmap managed pages")
75e55d8a107e ("mm: move free_devmap_managed_page to memremap.c")
730ff52194cd ("mm: remove pointless includes from <linux/hmm.h>")
f56caedaf94f ("Merge branch 'akpm' (patches from Andrew)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ea72ce5da22806d5713f3ffb39a6d5ae73841f93 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Wed, 14 Aug 2024 00:29:36 +0200
Subject: [PATCH] x86/kaslr: Expose and use the end of the physical memory
address space
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8(a)gmail.com>
Reported-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-By: Max Ramanouski <max8rr8(a)gmail.com>
Tested-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Reviewed-by: Kees Cook <kees(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index af4302d79b59..f3d257c45225 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -17,6 +17,7 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long physmem_end;
static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..a98e53491a4e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -140,6 +140,10 @@ extern unsigned int ptrs_per_p4d;
# define VMEMMAP_START __VMEMMAP_BASE_L4
#endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
+#ifdef CONFIG_RANDOMIZE_MEMORY
+# define PHYSMEM_END physmem_end
+#endif
+
/*
* End of the region for which vmalloc page tables are pre-allocated.
* For non-KMSAN builds, this is the same as VMALLOC_END.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d8dbeac8b206..ff253648706f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -958,8 +958,12 @@ static void update_end_of_memory_vars(u64 start, u64 size)
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params)
{
+ unsigned long end = ((start_pfn + nr_pages) << PAGE_SHIFT) - 1;
int ret;
+ if (WARN_ON_ONCE(end > PHYSMEM_END))
+ return -ERANGE;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 37db264866b6..230f1dee4f09 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -47,13 +47,24 @@ static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;
*/
static __initdata struct kaslr_memory_region {
unsigned long *base;
+ unsigned long *end;
unsigned long size_tb;
} kaslr_regions[] = {
- { &page_offset_base, 0 },
- { &vmalloc_base, 0 },
- { &vmemmap_base, 0 },
+ {
+ .base = &page_offset_base,
+ .end = &physmem_end,
+ },
+ {
+ .base = &vmalloc_base,
+ },
+ {
+ .base = &vmemmap_base,
+ },
};
+/* The end of the possible address space for physical memory */
+unsigned long physmem_end __ro_after_init;
+
/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
@@ -82,6 +93,8 @@ void __init kernel_randomize_memory(void)
BUILD_BUG_ON(vaddr_end != CPU_ENTRY_AREA_BASE);
BUILD_BUG_ON(vaddr_end > __START_KERNEL_map);
+ /* Preset the end of the possible address space for physical memory */
+ physmem_end = ((1ULL << MAX_PHYSMEM_BITS) - 1);
if (!kaslr_memory_enabled())
return;
@@ -128,11 +141,18 @@ void __init kernel_randomize_memory(void)
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
- /*
- * Jump the region and add a minimum padding based on
- * randomization alignment.
- */
+ /* Calculate the end of the region */
vaddr += get_padding(&kaslr_regions[i]);
+ /*
+ * KASLR trims the maximum possible size of the
+ * direct-map. Update the physmem_end boundary.
+ * No rounding required as the region starts
+ * PUD aligned and size is in units of TB.
+ */
+ if (kaslr_regions[i].end)
+ *kaslr_regions[i].end = __pa_nodebug(vaddr - 1);
+
+ /* Add a minimum padding based on randomization alignment. */
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4b238a20b76..b3864156eaa4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -97,6 +97,10 @@ extern const int mmap_rnd_compat_bits_max;
extern int mmap_rnd_compat_bits __read_mostly;
#endif
+#ifndef PHYSMEM_END
+# define PHYSMEM_END ((1ULL << MAX_PHYSMEM_BITS) - 1)
+#endif
+
#include <asm/page.h>
#include <asm/processor.h>
diff --git a/kernel/resource.c b/kernel/resource.c
index 14777afb0a99..a83040fde236 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1826,8 +1826,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
if (flags & GFR_DESCENDING) {
resource_size_t end;
- end = min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ end = min_t(resource_size_t, base->end, PHYSMEM_END);
return end - size + 1;
}
@@ -1844,8 +1843,7 @@ static bool gfr_continue(struct resource *base, resource_size_t addr,
* @size did not wrap 0.
*/
return addr > addr - size &&
- addr <= min_t(resource_size_t, base->end,
- (1ULL << MAX_PHYSMEM_BITS) - 1);
+ addr <= min_t(resource_size_t, base->end, PHYSMEM_END);
}
static resource_size_t gfr_next(resource_size_t addr, resource_size_t size,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 66267c26ca1b..951878ab627a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1681,7 +1681,7 @@ struct range __weak arch_get_mappable_range(void)
struct range mhp_get_pluggable_range(bool need_mapping)
{
- const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
+ const u64 max_phys = PHYSMEM_END;
struct range mhp_range;
if (need_mapping) {
diff --git a/mm/sparse.c b/mm/sparse.c
index e4b830091d13..0c3bff882033 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -129,7 +129,7 @@ static inline int sparse_early_nid(struct mem_section *section)
static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
unsigned long *end_pfn)
{
- unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+ unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
/*
* Sanity checks - do not allow an architecture to pass
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x c5af2c90ba5629f0424a8d315f75fb8d91713c3c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090844-cohesive-abrasion-ef6b@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
c5af2c90ba56 ("irqchip/gic-v2m: Fix refcount leak in gicv2m_of_init()")
90b4c5558615 ("irqchip/gic-v2m: Add support for Amazon Graviton variant of GICv3+GICv2m")
737be74710f3 ("irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg()")
d38a71c54525 ("irqchip/gic-v3-its: Change initialization ordering for LPIs")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c5af2c90ba5629f0424a8d315f75fb8d91713c3c Mon Sep 17 00:00:00 2001
From: Ma Ke <make24(a)iscas.ac.cn>
Date: Tue, 20 Aug 2024 17:28:43 +0800
Subject: [PATCH] irqchip/gic-v2m: Fix refcount leak in gicv2m_of_init()
gicv2m_of_init() fails to perform an of_node_put() when
of_address_to_resource() fails, leading to a refcount leak.
Address this by moving the error handling path outside of the loop and
making it common to all failure modes.
Fixes: 4266ab1a8ff5 ("irqchip/gic-v2m: Refactor to prepare for ACPI support")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240820092843.1219933-1-make24@iscas.ac.cn
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index 51af63c046ed..be35c5349986 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -407,12 +407,12 @@ static int __init gicv2m_of_init(struct fwnode_handle *parent_handle,
ret = gicv2m_init_one(&child->fwnode, spi_start, nr_spis,
&res, 0);
- if (ret) {
- of_node_put(child);
+ if (ret)
break;
- }
}
+ if (ret && child)
+ of_node_put(child);
if (!ret)
ret = gicv2m_allocate_domains(parent);
if (ret)
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090841-erasure-dice-630c@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
25dfc9e357af ("perf/x86/intel: Limit the period on Haswell")
28f0f3c44b5c ("perf/x86: Change x86_pmu::limit_period signature")
706460a96fc6 ("perf/x86/amd/core: Add generic branch record interfaces")
b40d0156f560 ("perf/x86/amd/brs: Move feature-specific functions")
3c27b0c6ea48 ("perf/x86/amd: Fix AMD BRS period adjustment")
9622e67e3980 ("perf/x86/amd/core: Add PerfMonV2 counter control")
21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support")
cc37e520a236 ("perf/x86/amd: Make Zen3 branch sampling opt-in")
ba2fe7500845 ("perf/x86/amd: Add AMD branch sampling period adjustment")
ada543459cab ("perf/x86/amd: Add AMD Fam19h Branch Sampling support")
369461ce8fb6 ("x86: perf: Move RDPMC event flag to a common definition")
05485745ad48 ("perf/amd/uncore: Allow the driver to be built as a module")
5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
f83d2f91d259 ("perf/x86/intel: Add Alder Lake Hybrid support")
58ae30c29a37 ("perf/x86/intel: Add attr_update for Hybrid PMUs")
d9977c43bff8 ("perf/x86: Register hybrid PMUs")
e11c1a7eb302 ("perf/x86: Factor out x86_pmu_show_pmu_cap")
b98567298bad ("perf/x86: Remove temporary pmu assignment in event_init")
34d5b61f29ee ("perf/x86/intel: Factor out intel_pmu_check_extra_regs")
bc14fe1beeec ("perf/x86/intel: Factor out intel_pmu_check_event_constraints")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang(a)linux.intel.com>
Date: Mon, 19 Aug 2024 11:30:04 -0700
Subject: [PATCH] perf/x86/intel: Limit the period on Haswell
Running the ltp test cve-2015-3290 concurrently reports the following
warnings.
perfevents: irq loop stuck!
WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
intel_pmu_handle_irq+0x285/0x370
Call Trace:
<NMI>
? __warn+0xa4/0x220
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? report_bug+0x3e/0xa0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x50
? asm_exc_invalid_op+0x1a/0x20
? irq_work_claim+0x1e/0x40
? intel_pmu_handle_irq+0x285/0x370
perf_event_nmi_handler+0x3d/0x60
nmi_handle+0x104/0x330
Thanks to Thomas Gleixner's analysis, the issue is caused by the low
initial period (1) of the frequency estimation algorithm, which triggers
the defects of the HW, specifically erratum HSW11 and HSW143. (For the
details, please refer https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
The HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
event, but the initial period in the freq mode is 1. The erratum is the
same as the BDM11, which has been supported in the kernel. A minimum
period of 128 is enforced as well on HSW.
HSW143 is regarding that the fixed counter 1 may overcount 32 with the
Hyper-Threading is enabled. However, based on the test, the hardware
has more issues than it tells. Besides the fixed counter 1, the message
'interrupt took too long' can be observed on any counter which was armed
with a period < 32 and two events expired in the same NMI. A minimum
period of 32 is enforced for the rest of the events.
The recommended workaround code of the HSW143 is not implemented.
Because it only addresses the issue for the fixed counter. It brings
extra overhead through extra MSR writing. No related overcounting issue
has been reported so far.
Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support")
Reported-by: Li Huafei <lihuafei1(a)huawei.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.…
Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..9e519d8a810a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4589,6 +4589,25 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
return HYBRID_INTEL_CORE;
}
+static inline bool erratum_hsw11(struct perf_event *event)
+{
+ return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+ X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+/*
+ * The HSW11 requires a period larger than 100 which is the same as the BDM11.
+ * A minimum period of 128 is enforced as well for the INST_RETIRED.ALL.
+ *
+ * The message 'interrupt took too long' can be observed on any counter which
+ * was armed with a period < 32 and two events expired in the same NMI.
+ * A minimum period of 32 is enforced for the rest of the events.
+ */
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+ *left = max(*left, erratum_hsw11(event) ? 128 : 32);
+}
+
/*
* Broadwell:
*
@@ -4606,8 +4625,7 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
*/
static void bdw_limit_period(struct perf_event *event, s64 *left)
{
- if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
- X86_CONFIG(.event=0xc0, .umask=0x01)) {
+ if (erratum_hsw11(event)) {
if (*left < 128)
*left = 128;
*left &= ~0x3fULL;
@@ -6766,6 +6784,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
hsw_format_attr : nhm_format_attr;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090839-issuing-jeep-f477@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
25dfc9e357af ("perf/x86/intel: Limit the period on Haswell")
28f0f3c44b5c ("perf/x86: Change x86_pmu::limit_period signature")
706460a96fc6 ("perf/x86/amd/core: Add generic branch record interfaces")
b40d0156f560 ("perf/x86/amd/brs: Move feature-specific functions")
3c27b0c6ea48 ("perf/x86/amd: Fix AMD BRS period adjustment")
9622e67e3980 ("perf/x86/amd/core: Add PerfMonV2 counter control")
21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support")
cc37e520a236 ("perf/x86/amd: Make Zen3 branch sampling opt-in")
ba2fe7500845 ("perf/x86/amd: Add AMD branch sampling period adjustment")
ada543459cab ("perf/x86/amd: Add AMD Fam19h Branch Sampling support")
369461ce8fb6 ("x86: perf: Move RDPMC event flag to a common definition")
05485745ad48 ("perf/amd/uncore: Allow the driver to be built as a module")
5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
f83d2f91d259 ("perf/x86/intel: Add Alder Lake Hybrid support")
58ae30c29a37 ("perf/x86/intel: Add attr_update for Hybrid PMUs")
d9977c43bff8 ("perf/x86: Register hybrid PMUs")
e11c1a7eb302 ("perf/x86: Factor out x86_pmu_show_pmu_cap")
b98567298bad ("perf/x86: Remove temporary pmu assignment in event_init")
34d5b61f29ee ("perf/x86/intel: Factor out intel_pmu_check_extra_regs")
bc14fe1beeec ("perf/x86/intel: Factor out intel_pmu_check_event_constraints")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang(a)linux.intel.com>
Date: Mon, 19 Aug 2024 11:30:04 -0700
Subject: [PATCH] perf/x86/intel: Limit the period on Haswell
Running the ltp test cve-2015-3290 concurrently reports the following
warnings.
perfevents: irq loop stuck!
WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
intel_pmu_handle_irq+0x285/0x370
Call Trace:
<NMI>
? __warn+0xa4/0x220
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? report_bug+0x3e/0xa0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x50
? asm_exc_invalid_op+0x1a/0x20
? irq_work_claim+0x1e/0x40
? intel_pmu_handle_irq+0x285/0x370
perf_event_nmi_handler+0x3d/0x60
nmi_handle+0x104/0x330
Thanks to Thomas Gleixner's analysis, the issue is caused by the low
initial period (1) of the frequency estimation algorithm, which triggers
the defects of the HW, specifically erratum HSW11 and HSW143. (For the
details, please refer https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
The HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
event, but the initial period in the freq mode is 1. The erratum is the
same as the BDM11, which has been supported in the kernel. A minimum
period of 128 is enforced as well on HSW.
HSW143 is regarding that the fixed counter 1 may overcount 32 with the
Hyper-Threading is enabled. However, based on the test, the hardware
has more issues than it tells. Besides the fixed counter 1, the message
'interrupt took too long' can be observed on any counter which was armed
with a period < 32 and two events expired in the same NMI. A minimum
period of 32 is enforced for the rest of the events.
The recommended workaround code of the HSW143 is not implemented.
Because it only addresses the issue for the fixed counter. It brings
extra overhead through extra MSR writing. No related overcounting issue
has been reported so far.
Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support")
Reported-by: Li Huafei <lihuafei1(a)huawei.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.…
Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..9e519d8a810a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4589,6 +4589,25 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
return HYBRID_INTEL_CORE;
}
+static inline bool erratum_hsw11(struct perf_event *event)
+{
+ return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+ X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+/*
+ * The HSW11 requires a period larger than 100 which is the same as the BDM11.
+ * A minimum period of 128 is enforced as well for the INST_RETIRED.ALL.
+ *
+ * The message 'interrupt took too long' can be observed on any counter which
+ * was armed with a period < 32 and two events expired in the same NMI.
+ * A minimum period of 32 is enforced for the rest of the events.
+ */
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+ *left = max(*left, erratum_hsw11(event) ? 128 : 32);
+}
+
/*
* Broadwell:
*
@@ -4606,8 +4625,7 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
*/
static void bdw_limit_period(struct perf_event *event, s64 *left)
{
- if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
- X86_CONFIG(.event=0xc0, .umask=0x01)) {
+ if (erratum_hsw11(event)) {
if (*left < 128)
*left = 128;
*left &= ~0x3fULL;
@@ -6766,6 +6784,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
hsw_format_attr : nhm_format_attr;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090838-facility-embargo-19d0@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
25dfc9e357af ("perf/x86/intel: Limit the period on Haswell")
28f0f3c44b5c ("perf/x86: Change x86_pmu::limit_period signature")
706460a96fc6 ("perf/x86/amd/core: Add generic branch record interfaces")
b40d0156f560 ("perf/x86/amd/brs: Move feature-specific functions")
3c27b0c6ea48 ("perf/x86/amd: Fix AMD BRS period adjustment")
9622e67e3980 ("perf/x86/amd/core: Add PerfMonV2 counter control")
21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support")
cc37e520a236 ("perf/x86/amd: Make Zen3 branch sampling opt-in")
ba2fe7500845 ("perf/x86/amd: Add AMD branch sampling period adjustment")
ada543459cab ("perf/x86/amd: Add AMD Fam19h Branch Sampling support")
369461ce8fb6 ("x86: perf: Move RDPMC event flag to a common definition")
05485745ad48 ("perf/amd/uncore: Allow the driver to be built as a module")
5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
f83d2f91d259 ("perf/x86/intel: Add Alder Lake Hybrid support")
58ae30c29a37 ("perf/x86/intel: Add attr_update for Hybrid PMUs")
d9977c43bff8 ("perf/x86: Register hybrid PMUs")
e11c1a7eb302 ("perf/x86: Factor out x86_pmu_show_pmu_cap")
b98567298bad ("perf/x86: Remove temporary pmu assignment in event_init")
34d5b61f29ee ("perf/x86/intel: Factor out intel_pmu_check_extra_regs")
bc14fe1beeec ("perf/x86/intel: Factor out intel_pmu_check_event_constraints")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang(a)linux.intel.com>
Date: Mon, 19 Aug 2024 11:30:04 -0700
Subject: [PATCH] perf/x86/intel: Limit the period on Haswell
Running the ltp test cve-2015-3290 concurrently reports the following
warnings.
perfevents: irq loop stuck!
WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
intel_pmu_handle_irq+0x285/0x370
Call Trace:
<NMI>
? __warn+0xa4/0x220
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? report_bug+0x3e/0xa0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x50
? asm_exc_invalid_op+0x1a/0x20
? irq_work_claim+0x1e/0x40
? intel_pmu_handle_irq+0x285/0x370
perf_event_nmi_handler+0x3d/0x60
nmi_handle+0x104/0x330
Thanks to Thomas Gleixner's analysis, the issue is caused by the low
initial period (1) of the frequency estimation algorithm, which triggers
the defects of the HW, specifically erratum HSW11 and HSW143. (For the
details, please refer https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
The HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
event, but the initial period in the freq mode is 1. The erratum is the
same as the BDM11, which has been supported in the kernel. A minimum
period of 128 is enforced as well on HSW.
HSW143 is regarding that the fixed counter 1 may overcount 32 with the
Hyper-Threading is enabled. However, based on the test, the hardware
has more issues than it tells. Besides the fixed counter 1, the message
'interrupt took too long' can be observed on any counter which was armed
with a period < 32 and two events expired in the same NMI. A minimum
period of 32 is enforced for the rest of the events.
The recommended workaround code of the HSW143 is not implemented.
Because it only addresses the issue for the fixed counter. It brings
extra overhead through extra MSR writing. No related overcounting issue
has been reported so far.
Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support")
Reported-by: Li Huafei <lihuafei1(a)huawei.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.…
Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..9e519d8a810a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4589,6 +4589,25 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
return HYBRID_INTEL_CORE;
}
+static inline bool erratum_hsw11(struct perf_event *event)
+{
+ return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+ X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+/*
+ * The HSW11 requires a period larger than 100 which is the same as the BDM11.
+ * A minimum period of 128 is enforced as well for the INST_RETIRED.ALL.
+ *
+ * The message 'interrupt took too long' can be observed on any counter which
+ * was armed with a period < 32 and two events expired in the same NMI.
+ * A minimum period of 32 is enforced for the rest of the events.
+ */
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+ *left = max(*left, erratum_hsw11(event) ? 128 : 32);
+}
+
/*
* Broadwell:
*
@@ -4606,8 +4625,7 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
*/
static void bdw_limit_period(struct perf_event *event, s64 *left)
{
- if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
- X86_CONFIG(.event=0xc0, .umask=0x01)) {
+ if (erratum_hsw11(event)) {
if (*left < 128)
*left = 128;
*left &= ~0x3fULL;
@@ -6766,6 +6784,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
hsw_format_attr : nhm_format_attr;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090837-twisty-deodorant-ee35@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
25dfc9e357af ("perf/x86/intel: Limit the period on Haswell")
28f0f3c44b5c ("perf/x86: Change x86_pmu::limit_period signature")
706460a96fc6 ("perf/x86/amd/core: Add generic branch record interfaces")
b40d0156f560 ("perf/x86/amd/brs: Move feature-specific functions")
3c27b0c6ea48 ("perf/x86/amd: Fix AMD BRS period adjustment")
9622e67e3980 ("perf/x86/amd/core: Add PerfMonV2 counter control")
21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support")
cc37e520a236 ("perf/x86/amd: Make Zen3 branch sampling opt-in")
ba2fe7500845 ("perf/x86/amd: Add AMD branch sampling period adjustment")
ada543459cab ("perf/x86/amd: Add AMD Fam19h Branch Sampling support")
369461ce8fb6 ("x86: perf: Move RDPMC event flag to a common definition")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang(a)linux.intel.com>
Date: Mon, 19 Aug 2024 11:30:04 -0700
Subject: [PATCH] perf/x86/intel: Limit the period on Haswell
Running the ltp test cve-2015-3290 concurrently reports the following
warnings.
perfevents: irq loop stuck!
WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
intel_pmu_handle_irq+0x285/0x370
Call Trace:
<NMI>
? __warn+0xa4/0x220
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? __report_bug+0x123/0x130
? intel_pmu_handle_irq+0x285/0x370
? report_bug+0x3e/0xa0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x50
? asm_exc_invalid_op+0x1a/0x20
? irq_work_claim+0x1e/0x40
? intel_pmu_handle_irq+0x285/0x370
perf_event_nmi_handler+0x3d/0x60
nmi_handle+0x104/0x330
Thanks to Thomas Gleixner's analysis, the issue is caused by the low
initial period (1) of the frequency estimation algorithm, which triggers
the defects of the HW, specifically erratum HSW11 and HSW143. (For the
details, please refer https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
The HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
event, but the initial period in the freq mode is 1. The erratum is the
same as the BDM11, which has been supported in the kernel. A minimum
period of 128 is enforced as well on HSW.
HSW143 is regarding that the fixed counter 1 may overcount 32 with the
Hyper-Threading is enabled. However, based on the test, the hardware
has more issues than it tells. Besides the fixed counter 1, the message
'interrupt took too long' can be observed on any counter which was armed
with a period < 32 and two events expired in the same NMI. A minimum
period of 32 is enforced for the rest of the events.
The recommended workaround code of the HSW143 is not implemented.
Because it only addresses the issue for the fixed counter. It brings
extra overhead through extra MSR writing. No related overcounting issue
has been reported so far.
Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support")
Reported-by: Li Huafei <lihuafei1(a)huawei.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.…
Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..9e519d8a810a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4589,6 +4589,25 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
return HYBRID_INTEL_CORE;
}
+static inline bool erratum_hsw11(struct perf_event *event)
+{
+ return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+ X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+/*
+ * The HSW11 requires a period larger than 100 which is the same as the BDM11.
+ * A minimum period of 128 is enforced as well for the INST_RETIRED.ALL.
+ *
+ * The message 'interrupt took too long' can be observed on any counter which
+ * was armed with a period < 32 and two events expired in the same NMI.
+ * A minimum period of 32 is enforced for the rest of the events.
+ */
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+ *left = max(*left, erratum_hsw11(event) ? 128 : 32);
+}
+
/*
* Broadwell:
*
@@ -4606,8 +4625,7 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
*/
static void bdw_limit_period(struct perf_event *event, s64 *left)
{
- if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
- X86_CONFIG(.event=0xc0, .umask=0x01)) {
+ if (erratum_hsw11(event)) {
if (*left < 128)
*left = 128;
*left &= ~0x3fULL;
@@ -6766,6 +6784,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
+ x86_pmu.limit_period = hsw_limit_period;
x86_pmu.lbr_double_abort = true;
extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
hsw_format_attr : nhm_format_attr;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 844436e045ac2ab7895d8b281cb784a24de1d14d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090848-botch-appointee-7cb2@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
844436e045ac ("ksmbd: Unlock on in ksmbd_tcp_set_interfaces()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 844436e045ac2ab7895d8b281cb784a24de1d14d Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dan.carpenter(a)linaro.org>
Date: Thu, 29 Aug 2024 22:22:35 +0300
Subject: [PATCH] ksmbd: Unlock on in ksmbd_tcp_set_interfaces()
Unlock before returning an error code if this allocation fails.
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Acked-by: Namjae Jeon <linkinjeon(a)kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/smb/server/transport_tcp.c b/fs/smb/server/transport_tcp.c
index a84788396daa..aaed9e293b2e 100644
--- a/fs/smb/server/transport_tcp.c
+++ b/fs/smb/server/transport_tcp.c
@@ -624,8 +624,10 @@ int ksmbd_tcp_set_interfaces(char *ifc_list, int ifc_list_sz)
for_each_netdev(&init_net, netdev) {
if (netif_is_bridge_port(netdev))
continue;
- if (!alloc_iface(kstrdup(netdev->name, GFP_KERNEL)))
+ if (!alloc_iface(kstrdup(netdev->name, GFP_KERNEL))) {
+ rtnl_unlock();
return -ENOMEM;
+ }
}
rtnl_unlock();
bind_additional_ifaces = 1;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 78c5a6f1f630172b19af4912e755e1da93ef0ab5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024090838-reentry-fender-aec1@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
78c5a6f1f630 ("ksmbd: unset the binding mark of a reused connection")
38c8a9a52082 ("smb: move client and server files to common directory fs/smb")
f5c779b7ddbd ("ksmbd: fix racy issue from session setup and logoff")
1d9c4172110e ("ksmbd: Implements sess->ksmbd_chann_list as xarray")
62c487b53a7f ("ksmbd: limit pdu length size according to connection status")
abdb1742a312 ("cifs: get rid of mount options string parsing")
9fd29a5bae6e ("cifs: use fs_context for automounts")
5dd8ce24667a ("cifs: missing directory in MAINTAINERS file")
332019e23a51 ("Merge tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 78c5a6f1f630172b19af4912e755e1da93ef0ab5 Mon Sep 17 00:00:00 2001
From: Namjae Jeon <linkinjeon(a)kernel.org>
Date: Tue, 27 Aug 2024 21:44:41 +0900
Subject: [PATCH] ksmbd: unset the binding mark of a reused connection
Steve French reported null pointer dereference error from sha256 lib.
cifs.ko can send session setup requests on reused connection.
If reused connection is used for binding session, conn->binding can
still remain true and generate_preauth_hash() will not set
sess->Preauth_HashValue and it will be NULL.
It is used as a material to create an encryption key in
ksmbd_gen_smb311_encryptionkey. ->Preauth_HashValue cause null pointer
dereference error from crypto_shash_update().
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 8 PID: 429254 Comm: kworker/8:39
Hardware name: LENOVO 20MAS08500/20MAS08500, BIOS N2CET69W (1.52 )
Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
RIP: 0010:lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
_sha256_update+0x77/0xa0 [sha256_ssse3]
sha256_avx2_update+0x15/0x30 [sha256_ssse3]
crypto_shash_update+0x1e/0x40
hmac_update+0x12/0x20
crypto_shash_update+0x1e/0x40
generate_key+0x234/0x380 [ksmbd]
generate_smb3encryptionkey+0x40/0x1c0 [ksmbd]
ksmbd_gen_smb311_encryptionkey+0x72/0xa0 [ksmbd]
ntlm_authenticate.isra.0+0x423/0x5d0 [ksmbd]
smb2_sess_setup+0x952/0xaa0 [ksmbd]
__process_request+0xa3/0x1d0 [ksmbd]
__handle_ksmbd_work+0x1c4/0x2f0 [ksmbd]
handle_ksmbd_work+0x2d/0xa0 [ksmbd]
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fixes: f5a544e3bab7 ("ksmbd: add support for SMB3 multichannel")
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Namjae Jeon <linkinjeon(a)kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/smb/server/smb2pdu.c b/fs/smb/server/smb2pdu.c
index 20846a4d3031..8bdc59251418 100644
--- a/fs/smb/server/smb2pdu.c
+++ b/fs/smb/server/smb2pdu.c
@@ -1690,6 +1690,8 @@ int smb2_sess_setup(struct ksmbd_work *work)
rc = ksmbd_session_register(conn, sess);
if (rc)
goto out_err;
+
+ conn->binding = false;
} else if (conn->dialect >= SMB30_PROT_ID &&
(server_conf.flags & KSMBD_GLOBAL_FLAG_SMB3_MULTICHANNEL) &&
req->Flags & SMB2_SESSION_REQ_FLAG_BINDING) {
@@ -1768,6 +1770,8 @@ int smb2_sess_setup(struct ksmbd_work *work)
sess = NULL;
goto out_err;
}
+
+ conn->binding = false;
}
work->sess = sess;
The patch titled
Subject: mm/codetag: add pgalloc_tag_copy()
has been added to the -mm mm-unstable branch. Its filename is
mm-codetag-add-pgalloc_tag_copy.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/codetag: add pgalloc_tag_copy()
Date: Thu, 5 Sep 2024 22:21:08 -0600
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the
new one during migration. This makes original allocation sites persist
cross migration rather than lump into the get_new_folio callbacks passed
into migrate_pages(), e.g., compaction_alloc():
# echo 1 >/proc/sys/vm/compact_memory
# grep compaction_alloc /proc/allocinfo
Before this patch:
132968448 32463 mm/compaction.c:1880 func:compaction_alloc
After this patch:
0 0 mm/compaction.c:1880 func:compaction_alloc
Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com
Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Muchun Song <muchun.song(a)linux.dev>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/alloc_tag.h | 24 ++++++++++--------------
include/linux/mm.h | 27 +++++++++++++++++++++++++++
mm/migrate.c | 1 +
3 files changed, 38 insertions(+), 14 deletions(-)
--- a/include/linux/alloc_tag.h~mm-codetag-add-pgalloc_tag_copy
+++ a/include/linux/alloc_tag.h
@@ -137,7 +137,16 @@ static inline void alloc_tag_sub_check(u
/* Caller should verify both ref and tag to be valid */
static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
{
+ alloc_tag_add_check(ref, tag);
+ if (!ref || !tag)
+ return;
+
ref->ct = &tag->ct;
+}
+
+static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
+{
+ __alloc_tag_ref_set(ref, tag);
/*
* We need in increment the call counter every time we have a new
* allocation or when we split a large allocation into smaller ones.
@@ -147,22 +156,9 @@ static inline void __alloc_tag_ref_set(u
this_cpu_inc(tag->counters->calls);
}
-static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
-{
- alloc_tag_add_check(ref, tag);
- if (!ref || !tag)
- return;
-
- __alloc_tag_ref_set(ref, tag);
-}
-
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
{
- alloc_tag_add_check(ref, tag);
- if (!ref || !tag)
- return;
-
- __alloc_tag_ref_set(ref, tag);
+ alloc_tag_ref_set(ref, tag);
this_cpu_add(tag->counters->bytes, bytes);
}
--- a/include/linux/mm.h~mm-codetag-add-pgalloc_tag_copy
+++ a/include/linux/mm.h
@@ -4161,10 +4161,37 @@ static inline void pgalloc_tag_split(str
}
}
}
+
+static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
+{
+ struct alloc_tag *tag;
+ union codetag_ref *ref;
+
+ tag = pgalloc_tag_get(&old->page);
+ if (!tag)
+ return;
+
+ ref = get_page_tag_ref(&new->page);
+ if (!ref)
+ return;
+
+ /* Clear the old ref to the original allocation tag. */
+ clear_page_tag_ref(&old->page);
+ /* Decrement the counters of the tag on get_new_folio. */
+ alloc_tag_sub(ref, folio_nr_pages(new));
+
+ __alloc_tag_ref_set(ref, tag);
+
+ put_page_tag_ref(ref);
+}
#else /* !CONFIG_MEM_ALLOC_PROFILING */
static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
{
}
+
+static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
+{
+}
#endif /* CONFIG_MEM_ALLOC_PROFILING */
#endif /* _LINUX_MM_H */
--- a/mm/migrate.c~mm-codetag-add-pgalloc_tag_copy
+++ a/mm/migrate.c
@@ -743,6 +743,7 @@ void folio_migrate_flags(struct folio *n
folio_set_readahead(newfolio);
folio_copy_owner(newfolio, folio);
+ pgalloc_tag_copy(newfolio, folio);
mem_cgroup_migrate(folio, newfolio);
}
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-remap-unused-subpages-to-shared-zeropage-when-splitting-isolated-thp.patch
mm-codetag-fix-a-typo.patch
mm-codetag-fix-pgalloc_tag_split.patch
mm-codetag-add-pgalloc_tag_copy.patch
The patch titled
Subject: mm/codetag: fix pgalloc_tag_split()
has been added to the -mm mm-unstable branch. Its filename is
mm-codetag-fix-pgalloc_tag_split.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/codetag: fix pgalloc_tag_split()
Date: Thu, 5 Sep 2024 22:21:07 -0600
The current assumption is that a large folio can only be split into
order-0 folios. That is not the case for hugeTLB demotion, nor for THP
split: see commit c010d47f107f ("mm: thp: split huge page to any lower
order pages").
When a large folio is split into ones of a lower non-zero order, only the
new head pages should be tagged. Tagging tail pages can cause imbalanced
"calls" counters, since only head pages are untagged by pgalloc_tag_sub()
and the "calls" counts on tail pages are leaked, e.g.,
# echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
# echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
# echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# grep alloc_gigantic_folio /proc/allocinfo
Before this patch:
0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m2.057s
user 0m0.000s
sys 0m2.051s
After this patch:
0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m1.711s
user 0m0.000s
sys 0m1.704s
Not tagging tail pages also improves the splitting time, e.g., by about
15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above.
Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com
Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Muchun Song <muchun.song(a)linux.dev>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 30 ++++++++++++++++++++++++++++++
include/linux/pgalloc_tag.h | 31 -------------------------------
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 4 ++--
5 files changed, 34 insertions(+), 35 deletions(-)
--- a/include/linux/mm.h~mm-codetag-fix-pgalloc_tag_split
+++ a/include/linux/mm.h
@@ -4137,4 +4137,34 @@ void vma_pgtable_walk_end(struct vm_area
int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
+{
+ int i;
+ struct alloc_tag *tag;
+ unsigned int nr_pages = 1 << new_order;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ tag = pgalloc_tag_get(&folio->page);
+ if (!tag)
+ return;
+
+ for (i = nr_pages; i < (1 << old_order); i += nr_pages) {
+ union codetag_ref *ref = get_page_tag_ref(folio_page(folio, i));
+
+ if (ref) {
+ /* Set new reference to point to the original tag */
+ alloc_tag_ref_set(ref, tag);
+ put_page_tag_ref(ref);
+ }
+ }
+}
+#else /* !CONFIG_MEM_ALLOC_PROFILING */
+static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
+{
+}
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#endif /* _LINUX_MM_H */
--- a/include/linux/pgalloc_tag.h~mm-codetag-fix-pgalloc_tag_split
+++ a/include/linux/pgalloc_tag.h
@@ -80,36 +80,6 @@ static inline void pgalloc_tag_sub(struc
}
}
-static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
-{
- int i;
- struct page_ext *first_page_ext;
- struct page_ext *page_ext;
- union codetag_ref *ref;
- struct alloc_tag *tag;
-
- if (!mem_alloc_profiling_enabled())
- return;
-
- first_page_ext = page_ext = page_ext_get(page);
- if (unlikely(!page_ext))
- return;
-
- ref = codetag_ref_from_page_ext(page_ext);
- if (!ref->ct)
- goto out;
-
- tag = ct_to_alloc_tag(ref->ct);
- page_ext = page_ext_next(page_ext);
- for (i = 1; i < nr; i++) {
- /* Set new reference to point to the original tag */
- alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag);
- page_ext = page_ext_next(page_ext);
- }
-out:
- page_ext_put(first_page_ext);
-}
-
static inline struct alloc_tag *pgalloc_tag_get(struct page *page)
{
struct alloc_tag *tag = NULL;
@@ -142,7 +112,6 @@ static inline void clear_page_tag_ref(st
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
unsigned int nr) {}
static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {}
-static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; }
static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
--- a/mm/huge_memory.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/huge_memory.c
@@ -3242,7 +3242,7 @@ static void __split_huge_page(struct pag
/* Caller disabled irqs, so they are still disabled here */
split_page_owner(head, order, new_order);
- pgalloc_tag_split(head, 1 << order);
+ pgalloc_tag_split(folio, order, new_order);
/* See comment in __split_huge_page_tail() */
if (folio_test_anon(folio)) {
--- a/mm/hugetlb.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/hugetlb.c
@@ -3795,7 +3795,7 @@ static long demote_free_hugetlb_folios(s
list_del(&folio->lru);
split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst));
- pgalloc_tag_split(&folio->page, 1 << huge_page_order(src));
+ pgalloc_tag_split(folio, huge_page_order(src), huge_page_order(dst));
for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) {
struct page *page = folio_page(folio, i);
--- a/mm/page_alloc.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/page_alloc.c
@@ -2783,7 +2783,7 @@ void split_page(struct page *page, unsig
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, order, 0);
- pgalloc_tag_split(page, 1 << order);
+ pgalloc_tag_split(page_folio(page), order, 0);
split_page_memcg(page, order, 0);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -4981,7 +4981,7 @@ static void *make_alloc_exact(unsigned l
struct page *last = page + nr;
split_page_owner(page, order, 0);
- pgalloc_tag_split(page, 1 << order);
+ pgalloc_tag_split(page_folio(page), order, 0);
split_page_memcg(page, order, 0);
while (page < --last)
set_page_refcounted(last);
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-remap-unused-subpages-to-shared-zeropage-when-splitting-isolated-thp.patch
mm-codetag-fix-a-typo.patch
mm-codetag-fix-pgalloc_tag_split.patch
mm-codetag-add-pgalloc_tag_copy.patch
From: Guillaume Stols <gstols(a)baylibre.com>
The current implementation attempts to recover from an eventual glitch
in the clock by checking frstdata state after reading the first
channel's sample: If frstdata is low, it will reset the chip and
return -EIO.
This will only work in parallel mode, where frstdata pin is set low
after the 2nd sample read starts.
For the serial mode, according to the datasheet, "The FRSTDATA output
returns to a logic low following the 16th SCLK falling edge.", thus
after the Xth pulse, X being the number of bits in a sample, the check
will always be true, and the driver will not work at all in serial
mode if frstdata(optional) is defined in the devicetree as it will
reset the chip, and return -EIO every time read_sample is called.
Hence, this check must be removed for serial mode.
Fixes: b9618c0cacd7 ("staging: IIO: ADC: New driver for AD7606/AD7606-6/AD7606-4")
Signed-off-by: Guillaume Stols <gstols(a)baylibre.com>
Reviewed-by: Nuno Sa <nuno.sa(a)analog.com>
Link: https://patch.msgid.link/20240702-cleanup-ad7606-v3-1-18d5ea18770e@baylibre…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ad7606.c | 28 ++-------------------
drivers/iio/adc/ad7606.h | 2 ++
drivers/iio/adc/ad7606_par.c | 48 +++++++++++++++++++++++++++++++++---
3 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/drivers/iio/adc/ad7606.c b/drivers/iio/adc/ad7606.c
index 539e4a8621fe..9b457472d49c 100644
--- a/drivers/iio/adc/ad7606.c
+++ b/drivers/iio/adc/ad7606.c
@@ -49,7 +49,7 @@ static const unsigned int ad7616_oversampling_avail[8] = {
1, 2, 4, 8, 16, 32, 64, 128,
};
-static int ad7606_reset(struct ad7606_state *st)
+int ad7606_reset(struct ad7606_state *st)
{
if (st->gpio_reset) {
gpiod_set_value(st->gpio_reset, 1);
@@ -60,6 +60,7 @@ static int ad7606_reset(struct ad7606_state *st)
return -ENODEV;
}
+EXPORT_SYMBOL_NS_GPL(ad7606_reset, IIO_AD7606);
static int ad7606_reg_access(struct iio_dev *indio_dev,
unsigned int reg,
@@ -86,31 +87,6 @@ static int ad7606_read_samples(struct ad7606_state *st)
{
unsigned int num = st->chip_info->num_channels - 1;
u16 *data = st->data;
- int ret;
-
- /*
- * The frstdata signal is set to high while and after reading the sample
- * of the first channel and low for all other channels. This can be used
- * to check that the incoming data is correctly aligned. During normal
- * operation the data should never become unaligned, but some glitch or
- * electrostatic discharge might cause an extra read or clock cycle.
- * Monitoring the frstdata signal allows to recover from such failure
- * situations.
- */
-
- if (st->gpio_frstdata) {
- ret = st->bops->read_block(st->dev, 1, data);
- if (ret)
- return ret;
-
- if (!gpiod_get_value(st->gpio_frstdata)) {
- ad7606_reset(st);
- return -EIO;
- }
-
- data++;
- num--;
- }
return st->bops->read_block(st->dev, num, data);
}
diff --git a/drivers/iio/adc/ad7606.h b/drivers/iio/adc/ad7606.h
index 0c6a88cc4695..6649e84d25de 100644
--- a/drivers/iio/adc/ad7606.h
+++ b/drivers/iio/adc/ad7606.h
@@ -151,6 +151,8 @@ int ad7606_probe(struct device *dev, int irq, void __iomem *base_address,
const char *name, unsigned int id,
const struct ad7606_bus_ops *bops);
+int ad7606_reset(struct ad7606_state *st);
+
enum ad7606_supported_device_ids {
ID_AD7605_4,
ID_AD7606_8,
diff --git a/drivers/iio/adc/ad7606_par.c b/drivers/iio/adc/ad7606_par.c
index b5975bbfcbe0..02d8c309304e 100644
--- a/drivers/iio/adc/ad7606_par.c
+++ b/drivers/iio/adc/ad7606_par.c
@@ -7,6 +7,7 @@
#include <linux/mod_devicetable.h>
#include <linux/module.h>
+#include <linux/gpio/consumer.h>
#include <linux/platform_device.h>
#include <linux/types.h>
#include <linux/err.h>
@@ -21,8 +22,29 @@ static int ad7606_par16_read_block(struct device *dev,
struct iio_dev *indio_dev = dev_get_drvdata(dev);
struct ad7606_state *st = iio_priv(indio_dev);
- insw((unsigned long)st->base_address, buf, count);
+ /*
+ * On the parallel interface, the frstdata signal is set to high while
+ * and after reading the sample of the first channel and low for all
+ * other channels. This can be used to check that the incoming data is
+ * correctly aligned. During normal operation the data should never
+ * become unaligned, but some glitch or electrostatic discharge might
+ * cause an extra read or clock cycle. Monitoring the frstdata signal
+ * allows to recover from such failure situations.
+ */
+ int num = count;
+ u16 *_buf = buf;
+
+ if (st->gpio_frstdata) {
+ insw((unsigned long)st->base_address, _buf, 1);
+ if (!gpiod_get_value(st->gpio_frstdata)) {
+ ad7606_reset(st);
+ return -EIO;
+ }
+ _buf++;
+ num--;
+ }
+ insw((unsigned long)st->base_address, _buf, num);
return 0;
}
@@ -35,8 +57,28 @@ static int ad7606_par8_read_block(struct device *dev,
{
struct iio_dev *indio_dev = dev_get_drvdata(dev);
struct ad7606_state *st = iio_priv(indio_dev);
-
- insb((unsigned long)st->base_address, buf, count * 2);
+ /*
+ * On the parallel interface, the frstdata signal is set to high while
+ * and after reading the sample of the first channel and low for all
+ * other channels. This can be used to check that the incoming data is
+ * correctly aligned. During normal operation the data should never
+ * become unaligned, but some glitch or electrostatic discharge might
+ * cause an extra read or clock cycle. Monitoring the frstdata signal
+ * allows to recover from such failure situations.
+ */
+ int num = count;
+ u16 *_buf = buf;
+
+ if (st->gpio_frstdata) {
+ insb((unsigned long)st->base_address, _buf, 2);
+ if (!gpiod_get_value(st->gpio_frstdata)) {
+ ad7606_reset(st);
+ return -EIO;
+ }
+ _buf++;
+ num--;
+ }
+ insb((unsigned long)st->base_address, _buf, num * 2);
return 0;
}
--
2.46.0
This fixes two problems in the handling of negative times:
• rem is signed, but the rem * c->sb.nsec_per_time_unit operation
produced a bogus unsigned result, because s32 * u32 = u32.
• The timespec was not normalized (it could contain more than a
billion nanoseconds).
For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).
Cc: stable(a)vger.kernel.org
Fixes: 595c1e9bab7f ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd(a)e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd(a)e43.eu>
Co-developed-by: Ryan Lahfa <ryan(a)lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan(a)lahfa.xyz>
Signed-off-by: Alyssa Ross <hi(a)alyssa.is>
---
I've submitted an RFC to fstests to add a regression test for this:
https://lore.kernel.org/fstests/20240907154527.604864-2-hi@alyssa.is/
fs/bcachefs/bcachefs.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 0c7086e00d18..81c4d935cca8 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -1195,12 +1195,15 @@ static inline bool btree_id_cached(const struct bch_fs *c, enum btree_id btree)
static inline struct timespec64 bch2_time_to_timespec(const struct bch_fs *c, s64 time)
{
struct timespec64 t;
+ s64 sec;
s32 rem;
time += c->sb.time_base_lo;
- t.tv_sec = div_s64_rem(time, c->sb.time_units_per_sec, &rem);
- t.tv_nsec = rem * c->sb.nsec_per_time_unit;
+ sec = div_s64_rem(time, c->sb.time_units_per_sec, &rem);
+
+ set_normalized_timespec64(&t, sec, rem * (s64)c->sb.nsec_per_time_unit);
+
return t;
}
base-commit: 53f6619554fb1edf8d7599b560d44dbea085c730
--
2.45.2
Hi Greg,
Thank you again for your support when we send patches for stable
versions for MPTCP!
Recently, I sent many patches for the stable versions, and I just wanted
to check if what I did was OK for you?
I tried to reply to all the 'FAILED: patch' emails you sent, either with
patches, or with reasons explaining why it is fine not to backport them.
Are you OK with that?
Or do you prefer only receiving the patches, and not the emails with the
reasons not to backport some of them?
About the patches, do you prefer to receive one big series per version
or individual patches sent in reply to the different 'FAILED: patch'
emails like I did?
Do not hesitate if there are things we can improve!
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
The submit queue polling threads are userland threads that just never
exit to the userland. In case the creating task is part of a cgroup
with the cpuset controller enabled, the poller should also stay within
that cpuset. This also holds, as the poller belongs to the same cgroup
as the task that started it.
With the current implementation, a process can "break out" of the
defined cpuset by creating sq pollers consuming CPU time on other CPUs,
which is especially problematic for realtime applications.
Part of this problem was fixed in a5fc1441 by dropping the
PF_NO_SETAFFINITY flag, but this only becomes effective after the first
modification of the cpuset (i.e. the pollers cpuset is correct after the
first update of the enclosing cgroups cpuset).
By inheriting the cpuset of the creating tasks, we ensure that the
poller is created with a cpumask that is a subset of the cgroups mask.
Inheriting the creators cpumask is reasonable, as other userland tasks
also inherit the mask.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Felix Moessbauer <felix.moessbauer(a)siemens.com>
---
Changes since v2:
- in v2 I accidentally sent the backport of this patch for v6.1. Will
resend that once this one is accepted. Anyways, now we know that this
also works on a v6.1 kernel.
Changes since v1:
- do not set poller thread cpuset in non-pinning case, as the default is already
correct (the mask is inherited from the parent).
- Remove incorrect term "kernel thread" from the commit message
I tested this without pinning, explicit pinning of the parent task and
non-all cgroup cpusets (and all combinations).
Best regards,
Felix Moessbauer
Siemens AG
io_uring/sqpoll.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 3b50dc9586d1..713be7c29388 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -289,7 +289,6 @@ static int io_sq_thread(void *data)
if (sqd->sq_cpu != -1) {
set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
} else {
- set_cpus_allowed_ptr(current, cpu_online_mask);
sqd->sq_cpu = raw_smp_processor_id();
}
--
2.39.2
The submit queue polling threads are userland threads that just never
exit to the userland. In case the creating task is part of a cgroup
with the cpuset controller enabled, the poller should also stay within
that cpuset. This also holds, as the poller belongs to the same cgroup
as the task that started it.
With the current implementation, a process can "break out" of the
defined cpuset by creating sq pollers consuming CPU time on other CPUs,
which is especially problematic for realtime applications.
Part of this problem was fixed in a5fc1441 by dropping the
PF_NO_SETAFFINITY flag, but this only becomes effective after the first
modification of the cpuset (i.e. the pollers cpuset is correct after the
first update of the enclosing cgroups cpuset).
By inheriting the cpuset of the creating tasks, we ensure that the
poller is created with a cpumask that is a subset of the cgroups mask.
Inheriting the creators cpumask is reasonable, as other userland tasks
also inherit the mask.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Felix Moessbauer <felix.moessbauer(a)siemens.com>
---
Changes since v1:
- do not set poller thread cpuset in non-pinning case, as the default is already
correct (the mask is inherited from the parent).
- Remove incorrect term "kernel thread" from the commit message
I tested this without pinning, explicit pinning of the parent task and
non-all cgroup cpusets (and all combinations).
Best regards,
Felix Moessbauer
Siemens AG
io_uring/sqpoll.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 6ea21b503113..5a002fa1d953 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -231,8 +231,6 @@ static int io_sq_thread(void *data)
if (sqd->sq_cpu != -1)
set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
- else
- set_cpus_allowed_ptr(current, cpu_online_mask);
/*
* Force audit context to get setup, in case we do prep side async
--
2.39.2
It may be possible for the sum of the values derived from
i915_ggtt_offset() and __get_parent_scratch_offset()/
i915_ggtt_offset() to go over the u32 limit before being assigned
to wq offsets of u64 type.
Mitigate these issues by expanding one of the right operands
to u64 to avoid any overflow issues just in case.
Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.
Fixes: 2584b3549f4c ("drm/i915/guc: Update to GuC version 70.1.1")
Cc: stable(a)vger.kernel.org
Signed-off-by: Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9400d0eb682b..908ebfa22933 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2842,9 +2842,9 @@ static void prepare_context_registration_info_v70(struct intel_context *ce,
ce->parallel.guc.wqi_tail = 0;
ce->parallel.guc.wqi_head = 0;
- wq_desc_offset = i915_ggtt_offset(ce->state) +
+ wq_desc_offset = (u64)i915_ggtt_offset(ce->state) +
__get_parent_scratch_offset(ce);
- wq_base_offset = i915_ggtt_offset(ce->state) +
+ wq_base_offset = (u64)i915_ggtt_offset(ce->state) +
__get_wq_offset(ce);
info->wq_desc_lo = lower_32_bits(wq_desc_offset);
info->wq_desc_hi = upper_32_bits(wq_desc_offset);
Commit 08d08e2e9f0a ("tpm: ibmvtpm: Call tpm2_sessions_init() to
initialize session support") adds call to tpm2_sessions_init() in ibmvtpm,
which could be built as a module. However, tpm2_sessions_init() wasn't
exported, causing libmvtpm to fail to build as a module:
ERROR: modpost: "tpm2_sessions_init" [drivers/char/tpm/tpm_ibmvtpm.ko] undefined!
Export tpm2_sessions_init() to resolve the issue.
Cc: stable(a)vger.kernel.org # v6.10+
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408051735.ZJkAPQ3b-lkp@intel.com/
Fixes: 08d08e2e9f0a ("tpm: ibmvtpm: Call tpm2_sessions_init() to initialize session support")
Signed-off-by: Kexy Biscuit <kexybiscuit(a)aosc.io>
Signed-off-by: Mingcong Bai <jeffbai(a)aosc.io>
---
V1 -> V2: Added Fixes tag and fixed email format
RESEND: The previous email was sent directly to stable-rc review
drivers/char/tpm/tpm2-sessions.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index d3521aadd43e..44f60730cff4 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -1362,4 +1362,5 @@ int tpm2_sessions_init(struct tpm_chip *chip)
return rc;
}
+EXPORT_SYMBOL(tpm2_sessions_init);
#endif /* CONFIG_TCG_TPM2_HMAC */
--
2.46.0
Hi Greg, Sasha,
A number of commits were identified[1] by syzbot as non-backported
fixes for the fuzzer-detected findings in various Linux LTS trees.
[1] https://syzkaller.appspot.com/upstream/backports
Please consider backporting the following commits to LTS v6.1:
9a8ec9e8ebb5a7c0cfbce2d6b4a6b67b2b78e8f3 "Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm"
(fixes 9a8ec9e) 3dcaa192ac2159193bc6ab57bc5369dcb84edd8e "Bluetooth: SCO: fix sco_conn related locking and validity issues"
3f5424790d4377839093b68c12b130077a4e4510 "ext4: fix inode tree inconsistency caused by ENOMEM"
7b0151caf73a656b75b550e361648430233455a0 "KVM: x86: Remove WARN sanity check on hypervisor timer vs. UNINITIALIZED vCPU"
c2efd13a2ed4f29bf9ef14ac2fbb7474084655f8 "udf: Limit file size to 4TB"
4b827b3f305d1fcf837265f1e12acc22ee84327c "xfs: remove WARN when dquot cache insertion fails"
These were verified to apply cleanly on top of v6.1.107 and to
build/boot.
The following commits to LTS v5.15:
8216776ccff6fcd40e3fdaa109aa4150ebe760b3 "ext4: reject casefold inode flag without casefold feature"
c2efd13a2ed4f29bf9ef14ac2fbb7474084655f8 "udf: Limit file size to 4TB"
These were verified to apply cleanly on top of v5.15.165 and to
build/boot.
The following commits to LTS v5.10:
04e568a3b31cfbd545c04c8bfc35c20e5ccfce0f "ext4: handle redirtying in ext4_bio_write_page()"
2a1fc7dc36260fbe74b6ca29dc6d9088194a2115 "KVM: x86: Suppress MMIO that is triggered during task switch emulation"
2454ad83b90afbc6ed2c22ec1310b624c40bf0d3 "fs: Restrict lock_two_nondirectories() to non-directory inodes"
(fixes 2454ad) 33ab231f83cc12d0157711bbf84e180c3be7d7bc "fs: don't assume arguments are non-NULL"
These were verified to apply cleanly on top of v5.10.224 and to
build/boot.
There are also a lot of syzbot-detected fix commits that did not apply
cleanly, but the conflicts seem to be quite straightforward to resolve
manually. Could you please share what the current process is with
respect to such fix patches? For example, are you sending emails
asking developers to adjust the non-applied patch (if they want), or
is it the other way around -- you expect the authors to be proactive
and send the adjusted patch versions themselves?
Some sample commits, which failed to apply to v6.1.107:
ff91059932401894e6c86341915615c5eb0eca48 "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
f8f210dc84709804c9f952297f2bfafa6ea6b4bd "btrfs: calculate the right space for delayed refs when updating global reserve"
--
Aleksandr
The submit queue polling threads are "kernel" threads that are started
from the userland. In case the userland task is part of a cgroup with
the cpuset controller enabled, the poller should also stay within that
cpuset. This also holds, as the poller belongs to the same cgroup as
the task that started it.
With the current implementation, a process can "break out" of the
defined cpuset by creating sq pollers consuming CPU time on other CPUs,
which is especially problematic for realtime applications.
Part of this problem was fixed in a5fc1441 by dropping the
PF_NO_SETAFFINITY flag, but this only becomes effective after the first
modification of the cpuset (i.e. the pollers cpuset is correct after the
first update of the enclosing cgroups cpuset).
By inheriting the cpuset of the creating tasks, we ensure that the
poller is created with a cpumask that is a subset of the cgroups mask.
Inheriting the creators cpumask is reasonable, as other userland tasks
also inherit the mask.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Felix Moessbauer <felix.moessbauer(a)siemens.com>
---
io_uring/sqpoll.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 3b50dc9586d1..4681b2c41a96 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -289,7 +289,7 @@ static int io_sq_thread(void *data)
if (sqd->sq_cpu != -1) {
set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
} else {
- set_cpus_allowed_ptr(current, cpu_online_mask);
+ set_cpus_allowed_ptr(current, sqd->thread->cpus_ptr);
sqd->sq_cpu = raw_smp_processor_id();
}
--
2.39.2
Zero and negative number is not a valid IRQ for in-kernel code and the
irq_of_parse_and_map() function returns zero on error. So this check for
valid IRQs should only accept values > 0.
Cc: stable(a)vger.kernel.org
Fixes: f7578496a671 ("of/irq: Use irq_of_parse_and_map()")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
drivers/i2c/busses/i2c-cpm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/i2c/busses/i2c-cpm.c b/drivers/i2c/busses/i2c-cpm.c
index 4794ec066eb0..41e3c95c0ef7 100644
--- a/drivers/i2c/busses/i2c-cpm.c
+++ b/drivers/i2c/busses/i2c-cpm.c
@@ -435,7 +435,7 @@ static int cpm_i2c_setup(struct cpm_i2c *cpm)
init_waitqueue_head(&cpm->i2c_wait);
cpm->irq = irq_of_parse_and_map(ofdev->dev.of_node, 0);
- if (!cpm->irq)
+ if (cpm->irq <= 0)
return -EINVAL;
/* Install interrupt handler. */
--
2.25.1
The qcom_geni_serial_poll_bit() can be used to wait for events like
command completion and is supposed to wait for the time it takes to
clear a full fifo before timing out.
As noted by Doug, the current implementation does not account for start,
stop and parity bits when determining the timeout. The helper also does
not currently account for the shift register and the two-word
intermediate transfer register.
Instead of determining the fifo timeout on every call, store the timeout
when updating it in set_termios() and wait for up to 19/16 the time it
takes to clear the 16 word fifo to account for the shift and
intermediate registers. Note that serial core has already added a 20 ms
margin to the fifo timeout.
Also note that the current uart_fifo_timeout() interface does
unnecessary calculations on every call and also did not exists in
earlier kernels so only store its result once. This also facilitates
backports as earlier kernels can derive the timeout from uport->timeout,
which has since been removed.
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Cc: stable(a)vger.kernel.org # 4.17
Reported-by: Douglas Anderson <dianders(a)chromium.org>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 69a632fefc41..e1926124339d 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -124,7 +124,7 @@ struct qcom_geni_serial_port {
dma_addr_t tx_dma_addr;
dma_addr_t rx_dma_addr;
bool setup;
- unsigned int baud;
+ unsigned long fifo_timeout_us;
unsigned long clk_rate;
void *rx_buf;
u32 loopback;
@@ -270,22 +270,21 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
{
u32 reg;
struct qcom_geni_serial_port *port;
- unsigned int baud;
- unsigned int fifo_bits;
unsigned long timeout_us = 20000;
struct qcom_geni_private_data *private_data = uport->private_data;
if (private_data->drv) {
port = to_dev_port(uport);
- baud = port->baud;
- if (!baud)
- baud = 115200;
- fifo_bits = port->tx_fifo_depth * port->tx_fifo_width;
+
/*
- * Total polling iterations based on FIFO worth of bytes to be
- * sent at current baud. Add a little fluff to the wait.
+ * Wait up to 19/16 the time it would take to clear a full
+ * FIFO, which accounts for the three words in the shift and
+ * intermediate registers.
+ *
+ * Note that fifo_timeout_us already has a 20 ms margin.
*/
- timeout_us = ((fifo_bits * USEC_PER_SEC) / baud) + 500;
+ if (port->fifo_timeout_us)
+ timeout_us = 19 * port->fifo_timeout_us / 16;
}
/*
@@ -1248,7 +1247,6 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
qcom_geni_serial_stop_rx(uport);
/* baud rate */
baud = uart_get_baud_rate(uport, termios, old, 300, 4000000);
- port->baud = baud;
sampling_rate = UART_OVERSAMPLING;
/* Sampling rate is halved for IP versions >= 2.5 */
@@ -1326,8 +1324,10 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
else
tx_trans_cfg |= UART_CTS_MASK;
- if (baud)
+ if (baud) {
uart_update_timeout(uport, termios->c_cflag, baud);
+ port->fifo_timeout_us = jiffies_to_usecs(uart_fifo_timeout(uport));
+ }
if (!uart_console(uport))
writel(port->loopback,
--
2.44.2
The polled UART operations are used by the kernel debugger (KDB, KGDB),
which can interrupt the kernel at any point in time. The current
Qualcomm GENI implementation does not really work when there is on-going
serial output as it inadvertently "hijacks" the current tx command,
which can result in both the initial debugger output being corrupted as
well as the corruption of any on-going serial output (up to 4k
characters) when execution resumes:
0190: abcdefghijklmnopqrstuvwxyz0123456789 0190: abcdefghijklmnopqrstuvwxyz0123456789
0191: abcdefghijklmnop[ 50.825552] sysrq: DEBUG
qrstuvwxyz0123456789 0191: abcdefghijklmnopqrstuvwxyz0123456789
Entering kdb (current=0xffff53510b4cd280, pid 640) on processor 2 due to Keyboard Entry
[2]kdb> go
omlji3h3h2g2g1f1f0e0ezdzdycycxbxbwawav :t72r2rp
o9n976k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawavu:t7t8s8s8r2r2q0q0p
o9n9n8ml6k6k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawav v u:u:t9t0s4s4rq0p
o9n9n8m8m7l7l6k6k5j5j40q0p p o
o9n9n8m8m7l7l6k6k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawav :t8t9s4s4r4r4q0q0p
Fix this by making sure that the polled output implementation waits for
the tx fifo to drain before cancelling any on-going longer transfers. As
the polled code cannot take any locks, leave the state variables as they
are and instead make sure that the interrupt handler always starts a new
tx command when there is data in the write buffer.
Since the debugger can interrupt the interrupt handler when it is
writing data to the tx fifo, it is currently not possible to fully
prevent losing up to 64 bytes of tty output on resume.
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Cc: stable(a)vger.kernel.org # 4.17
Reviewed-by: Douglas Anderson <dianders(a)chromium.org>
Tested-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index f23fd0ac3cfd..6f0db310cf69 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -145,6 +145,7 @@ static const struct uart_ops qcom_geni_uart_pops;
static struct uart_driver qcom_geni_console_driver;
static struct uart_driver qcom_geni_uart_driver;
+static void __qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
static inline struct qcom_geni_serial_port *to_dev_port(struct uart_port *uport)
@@ -384,13 +385,14 @@ static int qcom_geni_serial_get_char(struct uart_port *uport)
static void qcom_geni_serial_poll_put_char(struct uart_port *uport,
unsigned char c)
{
- writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
+ if (qcom_geni_serial_main_active(uport)) {
+ qcom_geni_serial_poll_tx_done(uport);
+ __qcom_geni_serial_cancel_tx_cmd(uport);
+ }
+
writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
qcom_geni_serial_setup_tx(uport, 1);
- WARN_ON(!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
- M_TX_FIFO_WATERMARK_EN, true));
writel(c, uport->membase + SE_GENI_TX_FIFOn);
- writel(M_TX_FIFO_WATERMARK_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
qcom_geni_serial_poll_tx_done(uport);
}
#endif
@@ -677,13 +679,10 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
}
-static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
+static void __qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
{
struct qcom_geni_serial_port *port = to_dev_port(uport);
- if (!qcom_geni_serial_main_active(uport))
- return;
-
geni_se_cancel_m_cmd(&port->se);
if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
M_CMD_CANCEL_EN, true)) {
@@ -693,6 +692,16 @@ static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
}
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+}
+
+static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
+{
+ struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+ if (!qcom_geni_serial_main_active(uport))
+ return;
+
+ __qcom_geni_serial_cancel_tx_cmd(uport);
port->tx_remaining = 0;
port->tx_queued = 0;
@@ -919,7 +928,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
if (!chunk)
goto out_write_wakeup;
- if (!port->tx_remaining) {
+ if (!active) {
qcom_geni_serial_setup_tx(uport, pending);
port->tx_remaining = pending;
port->tx_queued = 0;
--
2.44.2
The Qualcomm serial console implementation is broken and can lose
characters when the serial port is also used for tty output.
Specifically, the console code only waits for the current tx command to
complete when all data has already been written to the fifo. When there
are on-going longer transfers this often means that console output is
lost when the console code inadvertently "hijacks" the current tx
command instead of starting a new one.
This can, for example, be observed during boot when console output that
should have been interspersed with init output is truncated:
[ 9.462317] qcom-snps-eusb2-hsphy fde000.phy: Registered Qcom-eUSB2 phy
[ OK ] Found device KBG50ZNS256G KIOXIA Wi[ 9.471743ndows.
[ 9.539915] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
Add a new state variable to track how much data has been written to the
fifo and use it to determine when the fifo and shift register are both
empty. This is needed since there is currently no other known way to
determine when the shift register is empty.
This in turn allows the console code to interrupt long transfers without
losing data.
Note that the oops-in-progress case is similarly broken as it does not
cancel any active command and also waits for the wrong status flag when
attempting to drain the fifo (TX_FIFO_NOT_EMPTY_EN is only set when
cancelling a command leaves data in the fifo).
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Fixes: a1fee899e5be ("tty: serial: qcom_geni_serial: Fix softlock")
Fixes: 9e957a155005 ("serial: qcom-geni: Don't cancel/abort if we can't get the port lock")
Cc: stable(a)vger.kernel.org # 4.17
Reviewed-by: Douglas Anderson <dianders(a)chromium.org>
Tested-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 45 +++++++++++++--------------
1 file changed, 22 insertions(+), 23 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 7bbd70c30620..f8f6e9466b40 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -131,6 +131,7 @@ struct qcom_geni_serial_port {
bool brk;
unsigned int tx_remaining;
+ unsigned int tx_queued;
int wakeup_irq;
bool rx_tx_swap;
bool cts_rts_swap;
@@ -144,6 +145,8 @@ static const struct uart_ops qcom_geni_uart_pops;
static struct uart_driver qcom_geni_console_driver;
static struct uart_driver qcom_geni_uart_driver;
+static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
+
static inline struct qcom_geni_serial_port *to_dev_port(struct uart_port *uport)
{
return container_of(uport, struct qcom_geni_serial_port, uport);
@@ -393,6 +396,14 @@ static void qcom_geni_serial_poll_put_char(struct uart_port *uport,
#endif
#ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE
+static void qcom_geni_serial_drain_fifo(struct uart_port *uport)
+{
+ struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+ qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, GP_LENGTH,
+ port->tx_queued);
+}
+
static void qcom_geni_serial_wr_char(struct uart_port *uport, unsigned char ch)
{
struct qcom_geni_private_data *private_data = uport->private_data;
@@ -468,7 +479,6 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
struct qcom_geni_serial_port *port;
bool locked = true;
unsigned long flags;
- u32 geni_status;
WARN_ON(co->index < 0 || co->index >= GENI_UART_CONS_PORTS);
@@ -482,34 +492,20 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
else
uart_port_lock_irqsave(uport, &flags);
- geni_status = readl(uport->membase + SE_GENI_STATUS);
+ if (qcom_geni_serial_main_active(uport)) {
+ /* Wait for completion or drain FIFO */
+ if (!locked || port->tx_remaining == 0)
+ qcom_geni_serial_poll_tx_done(uport);
+ else
+ qcom_geni_serial_drain_fifo(uport);
- if (!locked) {
- /*
- * We can only get here if an oops is in progress then we were
- * unable to get the lock. This means we can't safely access
- * our state variables like tx_remaining. About the best we
- * can do is wait for the FIFO to be empty before we start our
- * transfer, so we'll do that.
- */
- qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
- M_TX_FIFO_NOT_EMPTY_EN, false);
- } else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
- /*
- * It seems we can't interrupt existing transfers if all data
- * has been sent, in which case we need to look for done first.
- */
- qcom_geni_serial_poll_tx_done(uport);
+ qcom_geni_serial_cancel_tx_cmd(uport);
}
__qcom_geni_serial_console_write(uport, s, count);
-
- if (locked) {
- if (port->tx_remaining)
- qcom_geni_serial_setup_tx(uport, port->tx_remaining);
+ if (locked)
uart_port_unlock_irqrestore(uport, flags);
- }
}
static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
@@ -690,6 +686,7 @@ static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
port->tx_remaining = 0;
+ port->tx_queued = 0;
}
static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -916,6 +913,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
if (!port->tx_remaining) {
qcom_geni_serial_setup_tx(uport, pending);
port->tx_remaining = pending;
+ port->tx_queued = 0;
irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
@@ -924,6 +922,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
}
qcom_geni_serial_send_chunk_fifo(uport, chunk);
+ port->tx_queued += chunk;
/*
* The tx fifo watermark is level triggered and latched. Though we had
--
2.44.2
Hi, Greg
Could you please help to cherry-pick the following commit to the 5.4.y branch?
697fa27dc5fb ("reset: hi6220: Add support for AO reset controller")
It's been there since the 5.10 kernel, and this along with the clk patch
are needed for Hikey devices to work with the 5.4.y kernels, otherwise
it will reboot again and again during the boot.
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
linaro-android(a)lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-android
If an interrupt occurs in queued_spin_lock_slowpath() after we increment
qnodesp->count and before node->lock is initialized, another CPU might
see stale lock values in get_tail_qnode(). If the stale lock value happens
to match the lock on that CPU, then we write to the "next" pointer of
the wrong qnode. This causes a deadlock as the former CPU, once it becomes
the head of the MCS queue, will spin indefinitely until it's "next" pointer
is set by its successor in the queue.
Running stress-ng on a 16 core (16EC/16VP) shared LPAR, results in
occasional lockups similar to the following:
$ stress-ng --all 128 --vm-bytes 80% --aggressive \
--maximize --oomable --verify --syslog \
--metrics --times --timeout 5m
watchdog: CPU 15 Hard LOCKUP
......
NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
Call Trace:
0xc000002cfffa3bf0 (unreliable)
_raw_spin_lock+0x6c/0x90
raw_spin_rq_lock_nested.part.135+0x4c/0xd0
sched_ttwu_pending+0x60/0x1f0
__flush_smp_call_function_queue+0x1dc/0x670
smp_ipi_demux_relaxed+0xa4/0x100
xive_muxed_ipi_action+0x20/0x40
__handle_irq_event_percpu+0x80/0x240
handle_irq_event_percpu+0x2c/0x80
handle_percpu_irq+0x84/0xd0
generic_handle_irq+0x54/0x80
__do_irq+0xac/0x210
__do_IRQ+0x74/0xd0
0x0
do_IRQ+0x8c/0x170
hardware_interrupt_common_virt+0x29c/0x2a0
--- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
......
NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
--- interrupt: 500
0xc0000029c1a41d00 (unreliable)
_raw_spin_lock+0x6c/0x90
futex_wake+0x100/0x260
do_futex+0x21c/0x2a0
sys_futex+0x98/0x270
system_call_exception+0x14c/0x2f0
system_call_vectored_common+0x15c/0x2ec
The following code flow illustrates how the deadlock occurs.
For the sake of brevity, assume that both locks (A and B) are
contended and we call the queued_spin_lock_slowpath() function.
CPU0 CPU1
---- ----
spin_lock_irqsave(A) |
spin_unlock_irqrestore(A) |
spin_lock(B) |
| |
▼ |
id = qnodesp->count++; |
(Note that nodes[0].lock == A) |
| |
▼ |
Interrupt |
(happens before "nodes[0].lock = B") |
| |
▼ |
spin_lock_irqsave(A) |
| |
▼ |
id = qnodesp->count++ |
nodes[1].lock = A |
| |
▼ |
Tail of MCS queue |
| spin_lock_irqsave(A)
▼ |
Head of MCS queue ▼
| CPU0 is previous tail
▼ |
Spin indefinitely ▼
(until "nodes[1].next != NULL") prev = get_tail_qnode(A, CPU0)
|
▼
prev == &qnodes[CPU0].nodes[0]
(as qnodes[CPU0].nodes[0].lock == A)
|
▼
WRITE_ONCE(prev->next, node)
|
▼
Spin indefinitely
(until nodes[0].locked == 1)
Thanks to Saket Kumar Bhaskar for help with recreating the issue
Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
Cc: stable(a)vger.kernel.org # v6.2+
Reported-by: Geetika Moolchandani <geetika(a)linux.ibm.com>
Reported-by: Vaishnavi Bhat <vaish123(a)in.ibm.com>
Reported-by: Jijo Varghese <vargjijo(a)in.ibm.com>
Signed-off-by: Nysal Jan K.A. <nysal(a)linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin(a)gmail.com>
---
arch/powerpc/lib/qspinlock.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 5de4dd549f6e..bcc7e4dff8c3 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -697,7 +697,15 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
}
release:
- qnodesp->count--; /* release the node */
+ /*
+ * Clear the lock before releasing the node, as another CPU might see stale
+ * values if an interrupt occurs after we increment qnodesp->count
+ * but before node->lock is initialized. The barrier ensures that
+ * there are no further stores to the node after it has been released.
+ */
+ node->lock = NULL;
+ barrier();
+ qnodesp->count--;
}
void queued_spin_lock_slowpath(struct qspinlock *lock)
--
2.46.0
From: Filipe Manana <fdmanana(a)suse.com>
commit 28b21c558a3753171097193b6f6602a94169093a upstream.
At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
then attach it to the transaction's list of pending snapshots. After that
we call btrfs_commit_transaction(), and if that returns an error we jump
to 'fail' label, where we kfree() the pending snapshot structure. This can
result in a later use-after-free of the pending snapshot:
1) We allocated the pending snapshot and added it to the transaction's
list of pending snapshots;
2) We call btrfs_commit_transaction(), and it fails either at the first
call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
In both cases, we don't abort the transaction and we release our
transaction handle. We jump to the 'fail' label and free the pending
snapshot structure. We return with the pending snapshot still in the
transaction's list;
3) Another task commits the transaction. This time there's no error at
all, and then during the transaction commit it accesses a pointer
to the pending snapshot structure that the snapshot creation task
has already freed, resulting in a user-after-free.
This issue could actually be detected by smatch, which produced the
following warning:
fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
So fix this by not having the snapshot creation ioctl directly add the
pending snapshot to the transaction's list. Instead add the pending
snapshot to the transaction handle, and then at btrfs_commit_transaction()
we add the snapshot to the list only when we can guarantee that any error
returned after that point will result in a transaction abort, in which
case the ioctl code can safely free the pending snapshot and no one can
access it anymore.
CC: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Hugo SIMELIERE <hsimeliere.opensource(a)witekio.com>
---
fs/btrfs/ioctl.c | 5 +----
fs/btrfs/transaction.c | 24 ++++++++++++++++++++++++
fs/btrfs/transaction.h | 2 ++
3 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ab8ed187746e..24c4d059cfab 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -853,10 +853,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
goto fail;
}
- spin_lock(&fs_info->trans_lock);
- list_add(&pending_snapshot->list,
- &trans->transaction->pending_snapshots);
- spin_unlock(&fs_info->trans_lock);
+ trans->pending_snapshot = pending_snapshot;
ret = btrfs_commit_transaction(trans);
if (ret)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8cefe11c57db..8878aa7cbdc5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2075,6 +2075,27 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
}
}
+/*
+ * Add a pending snapshot associated with the given transaction handle to the
+ * respective handle. This must be called after the transaction commit started
+ * and while holding fs_info->trans_lock.
+ * This serves to guarantee a caller of btrfs_commit_transaction() that it can
+ * safely free the pending snapshot pointer in case btrfs_commit_transaction()
+ * returns an error.
+ */
+static void add_pending_snapshot(struct btrfs_trans_handle *trans)
+{
+ struct btrfs_transaction *cur_trans = trans->transaction;
+
+ if (!trans->pending_snapshot)
+ return;
+
+ lockdep_assert_held(&trans->fs_info->trans_lock);
+ ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_START);
+
+ list_add(&trans->pending_snapshot->list, &cur_trans->pending_snapshots);
+}
+
int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
@@ -2161,6 +2182,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
spin_lock(&fs_info->trans_lock);
if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+ add_pending_snapshot(trans);
+
spin_unlock(&fs_info->trans_lock);
refcount_inc(&cur_trans->use_count);
ret = btrfs_end_transaction(trans);
@@ -2243,6 +2266,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
* COMMIT_DOING so make sure to wait for num_writers to == 1 again.
*/
spin_lock(&fs_info->trans_lock);
+ add_pending_snapshot(trans);
cur_trans->state = TRANS_STATE_COMMIT_DOING;
spin_unlock(&fs_info->trans_lock);
wait_event(cur_trans->writer_wait,
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index f73654d93fa0..eb26eb068fe8 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -122,6 +122,8 @@ struct btrfs_trans_handle {
struct btrfs_transaction *transaction;
struct btrfs_block_rsv *block_rsv;
struct btrfs_block_rsv *orig_rsv;
+ /* Set by a task that wants to create a snapshot. */
+ struct btrfs_pending_snapshot *pending_snapshot;
refcount_t use_count;
unsigned int type;
/*
--
2.43.0
On Fri, Sep 06, 2024 at 04:07:59PM +0530, Meetakshi Setiya wrote:
> Upstream commit 66d45ca1350a3bb8d5f4db8879ccad3ed492337a
>
> Yes, you are right, it would be good to backport to 6.1 as well.
> Please let me know if you need anything from me.
I need a working backport for 6.1.y before we can take this one. Oh
wait, it's already in 6.1.16, a long time ago.
This will be fine, thanks.
greg k-h
This patch fixes a directory lease bug on the smb client and
prevents it from incorrectly caching the directories if the
server returns an invalid lease state. The patch is in
6.3 kernel, requesting backport to stable 5.15.
I have cherry-picked the patch for 5.15 kernel below
From 2bb51b129ceb884145c3527f8c04817cc00d0e6e Mon Sep 17 00:00:00 2001
From: Ronnie Sahlberg <lsahlber(a)redhat.com>
Date: Fri, 17 Feb 2023 13:35:00 +1000
Subject: [PATCH] cifs: Check the lease context if we actually got a lease
Some servers may return that we got a lease in rsp->OplockLevel
but then in the lease context contradict this and say we got no lease
at all. Thus we need to check the context if we have a lease.
Additionally, If we do not get a lease we need to make sure we close
the handle before we return an error to the caller.
Signed-off-by: Ronnie Sahlberg <lsahlber(a)redhat.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Bharath SM <bharathsm(a)microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc(a)manguebit.com>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
Signed-off-by: Meetakshi Setiya <msetiya(a)microsoft.com>
---
fs/cifs/smb2ops.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index b725bd3144fb..6c30fff8a029 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -886,8 +886,6 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
goto oshr_exit;
}
- atomic_inc(&tcon->num_remote_opens);
-
o_rsp = (struct smb2_create_rsp *)rsp_iov[0].iov_base;
oparms.fid->persistent_fid = o_rsp->PersistentFileId;
oparms.fid->volatile_fid = o_rsp->VolatileFileId;
@@ -897,8 +895,6 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
tcon->crfid.tcon = tcon;
tcon->crfid.is_valid = true;
- tcon->crfid.dentry = dentry;
- dget(dentry);
kref_init(&tcon->crfid.refcount);
/* BB TBD check to see if oplock level check can be removed below */
@@ -907,14 +903,16 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
* See commit 2f94a3125b87. Increment the refcount when we
* get a lease for root, release it if lease break occurs
*/
- kref_get(&tcon->crfid.refcount);
- tcon->crfid.has_lease = true;
rc = smb2_parse_contexts(server, rsp_iov,
&oparms.fid->epoch,
oparms.fid->lease_key, &oplock,
NULL, NULL);
if (rc)
goto oshr_exit;
+
+ if (!(oplock & SMB2_LEASE_READ_CACHING_HE))
+ goto oshr_exit;
+
} else
goto oshr_exit;
@@ -928,7 +926,10 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
(char *)&tcon->crfid.file_all_info))
tcon->crfid.file_all_info_is_valid = true;
tcon->crfid.time = jiffies;
-
+ tcon->crfid.dentry = dentry;
+ dget(dentry);
+ kref_get(&tcon->crfid.refcount);
+ tcon->crfid.has_lease = true;
oshr_exit:
mutex_unlock(&tcon->crfid.fid_mutex);
@@ -937,8 +938,15 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
SMB2_query_info_free(&rqst[1]);
free_rsp_buf(resp_buftype[0], rsp_iov[0].iov_base);
free_rsp_buf(resp_buftype[1], rsp_iov[1].iov_base);
- if (rc == 0)
+ if (rc) {
+ if (tcon->crfid.is_valid)
+ SMB2_close(0, tcon, oparms.fid->persistent_fid,
+ oparms.fid->volatile_fid);
+ }
+ if (rc == 0) {
*cfid = &tcon->crfid;
+ atomic_inc(&tcon->num_remote_opens);
+ }
return rc;
}
--
2.46.0.46.g406f326d27
devm_kasprintf() can return a NULL pointer on failure but this returned
value is not checked. Fix this lack and check the returned value.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 32c170ff15b0 ("pinctrl: stm32: set default gpio line names using pin names")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- modified the patch according to suggestions, added braces;
- modified the typo.
---
drivers/pinctrl/stm32/pinctrl-stm32.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/pinctrl/stm32/pinctrl-stm32.c b/drivers/pinctrl/stm32/pinctrl-stm32.c
index a8673739871d..f23b081f31b3 100644
--- a/drivers/pinctrl/stm32/pinctrl-stm32.c
+++ b/drivers/pinctrl/stm32/pinctrl-stm32.c
@@ -1374,10 +1374,16 @@ static int stm32_gpiolib_register_bank(struct stm32_pinctrl *pctl, struct fwnode
for (i = 0; i < npins; i++) {
stm32_pin = stm32_pctrl_get_desc_pin_from_gpio(pctl, bank, i);
- if (stm32_pin && stm32_pin->pin.name)
+ if (stm32_pin && stm32_pin->pin.name) {
names[i] = devm_kasprintf(dev, GFP_KERNEL, "%s", stm32_pin->pin.name);
- else
+ if (!names[i]) {
+ err = -ENOMEM;
+ goto err_clk;
+ }
+ }
+
+ else {
names[i] = NULL;
+ }
}
bank->gpio_chip.names = (const char * const *)names;
--
2.25.1
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 8b8ed1b429f8fa7ebd5632555e7b047bc0620075
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083041-irate-headless-590c@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8b8ed1b429f8fa7ebd5632555e7b047bc0620075 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:24 +0200
Subject: [PATCH] mptcp: pm: reuse ID 0 after delete and re-add
When the endpoint used by the initial subflow is removed and re-added
later, the PM has to force the ID 0, it is a special case imposed by the
MPTCP specs.
Note that the endpoint should then need to be re-added reusing the same
ID.
Fixes: 3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 8d2f97854c64..ec45ab4c66ab 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -585,6 +585,11 @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk)
__clear_bit(local.addr.id, msk->pm.id_avail_bitmap);
msk->pm.add_addr_signaled++;
+
+ /* Special case for ID0: set the correct ID */
+ if (local.addr.id == msk->mpc_endpoint_id)
+ local.addr.id = 0;
+
mptcp_pm_announce_addr(msk, &local.addr, false);
mptcp_pm_nl_addr_send_ack(msk);
@@ -609,6 +614,11 @@ static void mptcp_pm_create_subflow_or_signal_addr(struct mptcp_sock *msk)
msk->pm.local_addr_used++;
__clear_bit(local.addr.id, msk->pm.id_avail_bitmap);
+
+ /* Special case for ID0: set the correct ID */
+ if (local.addr.id == msk->mpc_endpoint_id)
+ local.addr.id = 0;
+
nr = fill_remote_addresses_vec(msk, &local.addr, fullmesh, addrs);
if (nr == 0)
continue;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 57f86203b41c98b322119dfdbb1ec54ce5e3369b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083025-evoke-catering-3aab@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
57f86203b41c ("mptcp: pm: ADD_ADDR 0 is not a new address")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
14b06811bec6 ("mptcp: Bypass kernel PM when userspace PM is enabled")
a88c9e496937 ("mptcp: do not block subflows creation on errors")
86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
f7d6a237d742 ("mptcp: fix per socket endpoint accounting")
b29fcfb54cd7 ("mptcp: full disconnect implementation")
59060a47ca50 ("mptcp: clean up harmless false expressions")
3ce0852c86b9 ("mptcp: enforce HoL-blocking estimation")
6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path")
765ff425528f ("mptcp: use lockdep_assert_held_once() instead of open-coding it")
1094c6fe7280 ("mptcp: fix possible divide by zero")
33c563ad28e3 ("selftests: mptcp: add_addr and echo race test")
2843ff6f36db ("mptcp: remote addresses fullmesh")
ee285257a9c1 ("mptcp: drop flags and ifindex arguments")
ff5a0b421cb2 ("mptcp: faster active backup recovery")
6da14d74e2bd ("mptcp: cleanup sysctl data and helpers")
1e1d9d6f119c ("mptcp: handle pending data on closed subflow")
71b7dec27f34 ("mptcp: less aggressive retransmission strategy")
33d41c9cd74c ("mptcp: more accurate timeout")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 57f86203b41c98b322119dfdbb1ec54ce5e3369b Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:37 +0200
Subject: [PATCH] mptcp: pm: ADD_ADDR 0 is not a new address
The ADD_ADDR 0 with the address from the initial subflow should not be
considered as a new address: this is not something new. If the host
receives it, it simply means that the address is available again.
When receiving an ADD_ADDR for the ID 0, the PM already doesn't consider
it as new by not incrementing the 'add_addr_accepted' counter. But the
'accept_addr' might not be set if the limit has already been reached:
this can be bypassed in this case. But before, it is important to check
that this ADD_ADDR for the ID 0 is for the same address as the initial
subflow. If not, it is not something that should happen, and the
ADD_ADDR can be ignored.
Note that if an ADD_ADDR is received while there is already a subflow
opened using the same address, this ADD_ADDR is ignored as well. It
means that if multiple ADD_ADDR for ID 0 are received, there will not be
any duplicated subflows created by the client.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 3f8dbde243f1..37f6dbcd8434 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -226,7 +226,9 @@ void mptcp_pm_add_addr_received(const struct sock *ssk,
} else {
__MPTCP_INC_STATS(sock_net((struct sock *)msk), MPTCP_MIB_ADDADDRDROP);
}
- } else if (!READ_ONCE(pm->accept_addr)) {
+ /* id0 should not have a different address */
+ } else if ((addr->id == 0 && !mptcp_pm_nl_is_init_remote_addr(msk, addr)) ||
+ (addr->id > 0 && !READ_ONCE(pm->accept_addr))) {
mptcp_pm_announce_addr(msk, addr, true);
mptcp_pm_add_addr_send_ack(msk);
} else if (mptcp_pm_schedule_work(msk, MPTCP_PM_ADD_ADDR_RECEIVED)) {
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index a93450ded50a..f891bc714668 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -760,6 +760,15 @@ static void mptcp_pm_nl_add_addr_received(struct mptcp_sock *msk)
}
}
+bool mptcp_pm_nl_is_init_remote_addr(struct mptcp_sock *msk,
+ const struct mptcp_addr_info *remote)
+{
+ struct mptcp_addr_info mpc_remote;
+
+ remote_address((struct sock_common *)msk, &mpc_remote);
+ return mptcp_addresses_equal(&mpc_remote, remote, remote->port);
+}
+
void mptcp_pm_nl_addr_send_ack(struct mptcp_sock *msk)
{
struct mptcp_subflow_context *subflow;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 26eb898a202b..3b22313d1b86 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -993,6 +993,8 @@ void mptcp_pm_add_addr_received(const struct sock *ssk,
void mptcp_pm_add_addr_echoed(struct mptcp_sock *msk,
const struct mptcp_addr_info *addr);
void mptcp_pm_add_addr_send_ack(struct mptcp_sock *msk);
+bool mptcp_pm_nl_is_init_remote_addr(struct mptcp_sock *msk,
+ const struct mptcp_addr_info *remote);
void mptcp_pm_nl_addr_send_ack(struct mptcp_sock *msk);
void mptcp_pm_rm_addr_received(struct mptcp_sock *msk,
const struct mptcp_rm_list *rm_list);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 0137a3c7c2ea3f9df8ebfc65d78b4ba712a187bb
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082609-vivacious-jaywalker-cfac@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
0137a3c7c2ea ("mptcp: pm: check add_addr_accept_max before accepting new ADD_ADDR")
1c1f72137598 ("mptcp: pm: only decrement add_addr_accepted for MPJ req")
322ea3778965 ("mptcp: pm: only mark 'subflow' endp as available")
f448451aa62d ("mptcp: pm: remove mptcp_pm_remove_subflow()")
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
6a09788c1a66 ("mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID")
9bbec87ecfe8 ("mptcp: unify pm get_local_id interfaces")
dc886bce753c ("mptcp: export local_address")
8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove")
3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
843b5e75efff ("mptcp: fix local endpoint accounting")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0137a3c7c2ea3f9df8ebfc65d78b4ba712a187bb Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:28 +0200
Subject: [PATCH] mptcp: pm: check add_addr_accept_max before accepting new
ADD_ADDR
The limits might have changed in between, it is best to check them
before accepting new ADD_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-10-38035d40de5…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 882781571c7b..28a9a3726146 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -848,8 +848,8 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
/* Note: if the subflow has been closed before, this
* add_addr_accepted counter will not be decremented.
*/
- msk->pm.add_addr_accepted--;
- WRITE_ONCE(msk->pm.accept_addr, true);
+ if (--msk->pm.add_addr_accepted < mptcp_pm_get_add_addr_accept_max(msk))
+ WRITE_ONCE(msk->pm.accept_addr, true);
}
}
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 1c1f721375989579e46741f59523e39ec9b2a9bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082653-scenic-deprive-65e4@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
1c1f72137598 ("mptcp: pm: only decrement add_addr_accepted for MPJ req")
6a09788c1a66 ("mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID")
3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
843b5e75efff ("mptcp: fix local endpoint accounting")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
14b06811bec6 ("mptcp: Bypass kernel PM when userspace PM is enabled")
0530020a7c8f ("mptcp: track and update contiguous data status")
0348c690ed37 ("mptcp: add the fallback check")
1761fed25678 ("mptcp: don't send RST for single subflow")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
ae7bd9ccecc3 ("selftests: mptcp: join: option to execute specific tests")
e59300ce3ff8 ("selftests: mptcp: join: reset failing links")
3afd0280e7d3 ("selftests: mptcp: join: define tests groups once")
3c082695e78b ("selftests: mptcp: drop msg argument of chk_csum_nr")
69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1c1f721375989579e46741f59523e39ec9b2a9bd Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:27 +0200
Subject: [PATCH] mptcp: pm: only decrement add_addr_accepted for MPJ req
Adding the following warning ...
WARN_ON_ONCE(msk->pm.add_addr_accepted == 0)
... before decrementing the add_addr_accepted counter helped to find a
bug when running the "remove single subflow" subtest from the
mptcp_join.sh selftest.
Removing a 'subflow' endpoint will first trigger a RM_ADDR, then the
subflow closure. Before this patch, and upon the reception of the
RM_ADDR, the other peer will then try to decrement this
add_addr_accepted. That's not correct because the attached subflows have
not been created upon the reception of an ADD_ADDR.
A way to solve that is to decrement the counter only if the attached
subflow was an MP_JOIN to a remote id that was not 0, and initiated by
the host receiving the RM_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-9-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 4cf7cc851f80..882781571c7b 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -829,7 +829,7 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
mptcp_close_ssk(sk, ssk, subflow);
spin_lock_bh(&msk->pm.lock);
- removed = true;
+ removed |= subflow->request_join;
if (rm_type == MPTCP_MIB_RMSUBFLOW)
__MPTCP_INC_STATS(sock_net(sk), rm_type);
}
@@ -843,7 +843,11 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
if (!mptcp_pm_is_kernel(msk))
continue;
- if (rm_type == MPTCP_MIB_RMADDR) {
+ if (rm_type == MPTCP_MIB_RMADDR && rm_id &&
+ !WARN_ON_ONCE(msk->pm.add_addr_accepted == 0)) {
+ /* Note: if the subflow has been closed before, this
+ * add_addr_accepted counter will not be decremented.
+ */
msk->pm.add_addr_accepted--;
WRITE_ONCE(msk->pm.accept_addr, true);
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x ef34a6ea0cab1800f4b3c9c3c2cefd5091e03379
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082631-emerald-impotence-a278@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
141694df6573 ("mptcp: remove address when netlink flushes addrs")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ef34a6ea0cab1800f4b3c9c3c2cefd5091e03379 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:23 +0200
Subject: [PATCH] mptcp: pm: re-using ID of unused flushed subflows
If no subflows are attached to the 'subflow' endpoints that are being
flushed, the corresponding addr IDs will not be marked as available
again.
Mark all ID as being available when flushing all the 'subflow'
endpoints, and reset local_addr_used counter to cover these cases.
Note that mptcp_pm_remove_addrs_and_subflows() helper is only called for
flushing operations, not to remove a specific set of addresses and
subflows.
Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-5-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 8b232a210a06..2c26696b820e 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1623,8 +1623,15 @@ static void mptcp_pm_remove_addrs_and_subflows(struct mptcp_sock *msk,
mptcp_pm_remove_addr(msk, &alist);
spin_unlock_bh(&msk->pm.lock);
}
+
if (slist.nr)
mptcp_pm_remove_subflow(msk, &slist);
+
+ /* Reset counters: maybe some subflows have been removed before */
+ spin_lock_bh(&msk->pm.lock);
+ bitmap_fill(msk->pm.id_avail_bitmap, MPTCP_PM_MAX_ADDR_ID + 1);
+ msk->pm.local_addr_used = 0;
+ spin_unlock_bh(&msk->pm.lock);
}
static void mptcp_nl_remove_addrs_list(struct net *net,
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x f448451aa62d54be16acb0034223c17e0d12bc69
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082657-quartered-obtuse-ce45@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
f448451aa62d ("mptcp: pm: remove mptcp_pm_remove_subflow()")
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
9bbec87ecfe8 ("mptcp: unify pm get_local_id interfaces")
dc886bce753c ("mptcp: export local_address")
8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
90d930882139 ("mptcp: constify a bunch of of helpers")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f448451aa62d54be16acb0034223c17e0d12bc69 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:25 +0200
Subject: [PATCH] mptcp: pm: remove mptcp_pm_remove_subflow()
This helper is confusing. It is in pm.c, but it is specific to the
in-kernel PM and it cannot be used by the userspace one. Also, it simply
calls one in-kernel specific function with the PM lock, while the
similar mptcp_pm_remove_addr() helper requires the PM lock.
What's left is the pr_debug(), which is not that useful, because a
similar one is present in the only function called by this helper:
mptcp_pm_nl_rm_subflow_received()
After these modifications, this helper can be marked as 'static', and
the lock can be taken only once in mptcp_pm_flush_addrs_and_subflows().
Note that it is not a bug fix, but it will help backporting the
following commits.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-7-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 23bb89c94e90..925123e99889 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -60,16 +60,6 @@ int mptcp_pm_remove_addr(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_
return 0;
}
-int mptcp_pm_remove_subflow(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list)
-{
- pr_debug("msk=%p, rm_list_nr=%d", msk, rm_list->nr);
-
- spin_lock_bh(&msk->pm.lock);
- mptcp_pm_nl_rm_subflow_received(msk, rm_list);
- spin_unlock_bh(&msk->pm.lock);
- return 0;
-}
-
/* path manager event handlers */
void mptcp_pm_new_connection(struct mptcp_sock *msk, const struct sock *ssk, int server_side)
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 2c26696b820e..44fc1c5959ac 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -857,8 +857,8 @@ static void mptcp_pm_nl_rm_addr_received(struct mptcp_sock *msk)
mptcp_pm_nl_rm_addr_or_subflow(msk, &msk->pm.rm_list_rx, MPTCP_MIB_RMADDR);
}
-void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
- const struct mptcp_rm_list *rm_list)
+static void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
+ const struct mptcp_rm_list *rm_list)
{
mptcp_pm_nl_rm_addr_or_subflow(msk, rm_list, MPTCP_MIB_RMSUBFLOW);
}
@@ -1471,7 +1471,9 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
!(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT));
if (remove_subflow) {
- mptcp_pm_remove_subflow(msk, &list);
+ spin_lock_bh(&msk->pm.lock);
+ mptcp_pm_nl_rm_subflow_received(msk, &list);
+ spin_unlock_bh(&msk->pm.lock);
} else if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
/* If the subflow has been used, but now closed */
spin_lock_bh(&msk->pm.lock);
@@ -1617,18 +1619,14 @@ static void mptcp_pm_remove_addrs_and_subflows(struct mptcp_sock *msk,
alist.ids[alist.nr++] = entry->addr.id;
}
+ spin_lock_bh(&msk->pm.lock);
if (alist.nr) {
- spin_lock_bh(&msk->pm.lock);
msk->pm.add_addr_signaled -= alist.nr;
mptcp_pm_remove_addr(msk, &alist);
- spin_unlock_bh(&msk->pm.lock);
}
-
if (slist.nr)
- mptcp_pm_remove_subflow(msk, &slist);
-
+ mptcp_pm_nl_rm_subflow_received(msk, &slist);
/* Reset counters: maybe some subflows have been removed before */
- spin_lock_bh(&msk->pm.lock);
bitmap_fill(msk->pm.id_avail_bitmap, MPTCP_PM_MAX_ADDR_ID + 1);
msk->pm.local_addr_used = 0;
spin_unlock_bh(&msk->pm.lock);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 60c6b073d65f..a1c1b0ff1ce1 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -1026,7 +1026,6 @@ int mptcp_pm_announce_addr(struct mptcp_sock *msk,
const struct mptcp_addr_info *addr,
bool echo);
int mptcp_pm_remove_addr(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list);
-int mptcp_pm_remove_subflow(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list);
void mptcp_pm_remove_addrs(struct mptcp_sock *msk, struct list_head *rm_list);
void mptcp_free_local_addr_list(struct mptcp_sock *msk);
@@ -1133,8 +1132,6 @@ static inline u8 subflow_get_local_id(const struct mptcp_subflow_context *subflo
void __init mptcp_pm_nl_init(void);
void mptcp_pm_nl_work(struct mptcp_sock *msk);
-void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
- const struct mptcp_rm_list *rm_list);
unsigned int mptcp_pm_get_add_addr_signal_max(const struct mptcp_sock *msk);
unsigned int mptcp_pm_get_add_addr_accept_max(const struct mptcp_sock *msk);
unsigned int mptcp_pm_get_subflows_max(const struct mptcp_sock *msk);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x f448451aa62d54be16acb0034223c17e0d12bc69
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082656-shield-daily-d746@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
f448451aa62d ("mptcp: pm: remove mptcp_pm_remove_subflow()")
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
9bbec87ecfe8 ("mptcp: unify pm get_local_id interfaces")
dc886bce753c ("mptcp: export local_address")
8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
90d930882139 ("mptcp: constify a bunch of of helpers")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f448451aa62d54be16acb0034223c17e0d12bc69 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:25 +0200
Subject: [PATCH] mptcp: pm: remove mptcp_pm_remove_subflow()
This helper is confusing. It is in pm.c, but it is specific to the
in-kernel PM and it cannot be used by the userspace one. Also, it simply
calls one in-kernel specific function with the PM lock, while the
similar mptcp_pm_remove_addr() helper requires the PM lock.
What's left is the pr_debug(), which is not that useful, because a
similar one is present in the only function called by this helper:
mptcp_pm_nl_rm_subflow_received()
After these modifications, this helper can be marked as 'static', and
the lock can be taken only once in mptcp_pm_flush_addrs_and_subflows().
Note that it is not a bug fix, but it will help backporting the
following commits.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-7-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 23bb89c94e90..925123e99889 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -60,16 +60,6 @@ int mptcp_pm_remove_addr(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_
return 0;
}
-int mptcp_pm_remove_subflow(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list)
-{
- pr_debug("msk=%p, rm_list_nr=%d", msk, rm_list->nr);
-
- spin_lock_bh(&msk->pm.lock);
- mptcp_pm_nl_rm_subflow_received(msk, rm_list);
- spin_unlock_bh(&msk->pm.lock);
- return 0;
-}
-
/* path manager event handlers */
void mptcp_pm_new_connection(struct mptcp_sock *msk, const struct sock *ssk, int server_side)
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 2c26696b820e..44fc1c5959ac 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -857,8 +857,8 @@ static void mptcp_pm_nl_rm_addr_received(struct mptcp_sock *msk)
mptcp_pm_nl_rm_addr_or_subflow(msk, &msk->pm.rm_list_rx, MPTCP_MIB_RMADDR);
}
-void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
- const struct mptcp_rm_list *rm_list)
+static void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
+ const struct mptcp_rm_list *rm_list)
{
mptcp_pm_nl_rm_addr_or_subflow(msk, rm_list, MPTCP_MIB_RMSUBFLOW);
}
@@ -1471,7 +1471,9 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
!(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT));
if (remove_subflow) {
- mptcp_pm_remove_subflow(msk, &list);
+ spin_lock_bh(&msk->pm.lock);
+ mptcp_pm_nl_rm_subflow_received(msk, &list);
+ spin_unlock_bh(&msk->pm.lock);
} else if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
/* If the subflow has been used, but now closed */
spin_lock_bh(&msk->pm.lock);
@@ -1617,18 +1619,14 @@ static void mptcp_pm_remove_addrs_and_subflows(struct mptcp_sock *msk,
alist.ids[alist.nr++] = entry->addr.id;
}
+ spin_lock_bh(&msk->pm.lock);
if (alist.nr) {
- spin_lock_bh(&msk->pm.lock);
msk->pm.add_addr_signaled -= alist.nr;
mptcp_pm_remove_addr(msk, &alist);
- spin_unlock_bh(&msk->pm.lock);
}
-
if (slist.nr)
- mptcp_pm_remove_subflow(msk, &slist);
-
+ mptcp_pm_nl_rm_subflow_received(msk, &slist);
/* Reset counters: maybe some subflows have been removed before */
- spin_lock_bh(&msk->pm.lock);
bitmap_fill(msk->pm.id_avail_bitmap, MPTCP_PM_MAX_ADDR_ID + 1);
msk->pm.local_addr_used = 0;
spin_unlock_bh(&msk->pm.lock);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 60c6b073d65f..a1c1b0ff1ce1 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -1026,7 +1026,6 @@ int mptcp_pm_announce_addr(struct mptcp_sock *msk,
const struct mptcp_addr_info *addr,
bool echo);
int mptcp_pm_remove_addr(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list);
-int mptcp_pm_remove_subflow(struct mptcp_sock *msk, const struct mptcp_rm_list *rm_list);
void mptcp_pm_remove_addrs(struct mptcp_sock *msk, struct list_head *rm_list);
void mptcp_free_local_addr_list(struct mptcp_sock *msk);
@@ -1133,8 +1132,6 @@ static inline u8 subflow_get_local_id(const struct mptcp_subflow_context *subflo
void __init mptcp_pm_nl_init(void);
void mptcp_pm_nl_work(struct mptcp_sock *msk);
-void mptcp_pm_nl_rm_subflow_received(struct mptcp_sock *msk,
- const struct mptcp_rm_list *rm_list);
unsigned int mptcp_pm_get_add_addr_signal_max(const struct mptcp_sock *msk);
unsigned int mptcp_pm_get_add_addr_accept_max(const struct mptcp_sock *msk);
unsigned int mptcp_pm_get_subflows_max(const struct mptcp_sock *msk);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x f09b0ad55a1196f5891663f8888463c0541059cb
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083037-germproof-omnivore-ecf0@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
f09b0ad55a11 ("mptcp: close subflow when receiving TCP+FIN")
a5cb752b1257 ("mptcp: use mptcp_schedule_work instead of open-coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f09b0ad55a1196f5891663f8888463c0541059cb Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 26 Aug 2024 19:11:18 +0200
Subject: [PATCH] mptcp: close subflow when receiving TCP+FIN
When a peer decides to close one subflow in the middle of a connection
having multiple subflows, the receiver of the first FIN should accept
that, and close the subflow on its side as well. If not, the subflow
will stay half closed, and would even continue to be used until the end
of the MPTCP connection or a reset from the network.
The issue has not been seen before, probably because the in-kernel
path-manager always sends a RM_ADDR before closing the subflow. Upon the
reception of this RM_ADDR, the other peer will initiate the closure on
its side as well. On the other hand, if the RM_ADDR is lost, or if the
path-manager of the other peer only closes the subflow without sending a
RM_ADDR, the subflow would switch to TCP_CLOSE_WAIT, but that's it,
leaving the subflow half-closed.
So now, when the subflow switches to the TCP_CLOSE_WAIT state, and if
the MPTCP connection has not been closed before with a DATA_FIN, the
kernel owning the subflow schedules its worker to initiate the closure
on its side as well.
This issue can be easily reproduced with packetdrill, as visible in [1],
by creating an additional subflow, injecting a FIN+ACK before sending
the DATA_FIN, and expecting a FIN+ACK in return.
Fixes: 40947e13997a ("mptcp: schedule worker when subflow is closed")
Cc: stable(a)vger.kernel.org
Link: https://github.com/multipath-tcp/packetdrill/pull/154 [1]
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-1-905199f…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0d536b183a6c..151e82e2ff2e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2533,8 +2533,11 @@ static void __mptcp_close_subflow(struct sock *sk)
mptcp_for_each_subflow_safe(msk, subflow, tmp) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ int ssk_state = inet_sk_state_load(ssk);
- if (inet_sk_state_load(ssk) != TCP_CLOSE)
+ if (ssk_state != TCP_CLOSE &&
+ (ssk_state != TCP_CLOSE_WAIT ||
+ inet_sk_state_load(sk) != TCP_ESTABLISHED))
continue;
/* 'subflow_data_ready' will re-sched once rx queue is empty */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a21c712350c3..4834e7fc2fb6 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1255,12 +1255,16 @@ static void mptcp_subflow_discard_data(struct sock *ssk, struct sk_buff *skb,
/* sched mptcp worker to remove the subflow if no more data is pending */
static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
{
- if (likely(ssk->sk_state != TCP_CLOSE))
+ struct sock *sk = (struct sock *)msk;
+
+ if (likely(ssk->sk_state != TCP_CLOSE &&
+ (ssk->sk_state != TCP_CLOSE_WAIT ||
+ inet_sk_state_load(sk) != TCP_ESTABLISHED)))
return;
if (skb_queue_empty(&ssk->sk_receive_queue) &&
!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
- mptcp_schedule_work((struct sock *)msk);
+ mptcp_schedule_work(sk);
}
static bool subflow_can_fallback(struct mptcp_subflow_context *subflow)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 57f86203b41c98b322119dfdbb1ec54ce5e3369b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083024-distort-gallantly-afea@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
57f86203b41c ("mptcp: pm: ADD_ADDR 0 is not a new address")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
14b06811bec6 ("mptcp: Bypass kernel PM when userspace PM is enabled")
a88c9e496937 ("mptcp: do not block subflows creation on errors")
86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
f7d6a237d742 ("mptcp: fix per socket endpoint accounting")
b29fcfb54cd7 ("mptcp: full disconnect implementation")
59060a47ca50 ("mptcp: clean up harmless false expressions")
3ce0852c86b9 ("mptcp: enforce HoL-blocking estimation")
6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path")
765ff425528f ("mptcp: use lockdep_assert_held_once() instead of open-coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 57f86203b41c98b322119dfdbb1ec54ce5e3369b Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:37 +0200
Subject: [PATCH] mptcp: pm: ADD_ADDR 0 is not a new address
The ADD_ADDR 0 with the address from the initial subflow should not be
considered as a new address: this is not something new. If the host
receives it, it simply means that the address is available again.
When receiving an ADD_ADDR for the ID 0, the PM already doesn't consider
it as new by not incrementing the 'add_addr_accepted' counter. But the
'accept_addr' might not be set if the limit has already been reached:
this can be bypassed in this case. But before, it is important to check
that this ADD_ADDR for the ID 0 is for the same address as the initial
subflow. If not, it is not something that should happen, and the
ADD_ADDR can be ignored.
Note that if an ADD_ADDR is received while there is already a subflow
opened using the same address, this ADD_ADDR is ignored as well. It
means that if multiple ADD_ADDR for ID 0 are received, there will not be
any duplicated subflows created by the client.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 3f8dbde243f1..37f6dbcd8434 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -226,7 +226,9 @@ void mptcp_pm_add_addr_received(const struct sock *ssk,
} else {
__MPTCP_INC_STATS(sock_net((struct sock *)msk), MPTCP_MIB_ADDADDRDROP);
}
- } else if (!READ_ONCE(pm->accept_addr)) {
+ /* id0 should not have a different address */
+ } else if ((addr->id == 0 && !mptcp_pm_nl_is_init_remote_addr(msk, addr)) ||
+ (addr->id > 0 && !READ_ONCE(pm->accept_addr))) {
mptcp_pm_announce_addr(msk, addr, true);
mptcp_pm_add_addr_send_ack(msk);
} else if (mptcp_pm_schedule_work(msk, MPTCP_PM_ADD_ADDR_RECEIVED)) {
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index a93450ded50a..f891bc714668 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -760,6 +760,15 @@ static void mptcp_pm_nl_add_addr_received(struct mptcp_sock *msk)
}
}
+bool mptcp_pm_nl_is_init_remote_addr(struct mptcp_sock *msk,
+ const struct mptcp_addr_info *remote)
+{
+ struct mptcp_addr_info mpc_remote;
+
+ remote_address((struct sock_common *)msk, &mpc_remote);
+ return mptcp_addresses_equal(&mpc_remote, remote, remote->port);
+}
+
void mptcp_pm_nl_addr_send_ack(struct mptcp_sock *msk)
{
struct mptcp_subflow_context *subflow;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 26eb898a202b..3b22313d1b86 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -993,6 +993,8 @@ void mptcp_pm_add_addr_received(const struct sock *ssk,
void mptcp_pm_add_addr_echoed(struct mptcp_sock *msk,
const struct mptcp_addr_info *addr);
void mptcp_pm_add_addr_send_ack(struct mptcp_sock *msk);
+bool mptcp_pm_nl_is_init_remote_addr(struct mptcp_sock *msk,
+ const struct mptcp_addr_info *remote);
void mptcp_pm_nl_addr_send_ack(struct mptcp_sock *msk);
void mptcp_pm_rm_addr_received(struct mptcp_sock *msk,
const struct mptcp_rm_list *rm_list);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x d82809b6c5f2676b382f77a5cbeb1a5d91ed2235
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083029-shrank-thirstily-a532@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
d82809b6c5f2 ("mptcp: avoid duplicated SUB_CLOSED events")
a7cfe7766370 ("mptcp: fix data races on local_id")
84c531f54ad9 ("mptcp: userspace pm send RM_ADDR for ID 0")
f1f26512a9bf ("mptcp: use plain bool instead of custom binary enum")
1e07938e29c5 ("net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}")
1d0507f46843 ("net: mptcp: convert netlink from small_ops to ops")
fce68b03086f ("mptcp: add scheduled in mptcp_subflow_context")
1730b2b2c5a5 ("mptcp: add sched in mptcp_sock")
740ebe35bd3f ("mptcp: add struct mptcp_sched_ops")
a7384f391875 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d82809b6c5f2676b382f77a5cbeb1a5d91ed2235 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:35 +0200
Subject: [PATCH] mptcp: avoid duplicated SUB_CLOSED events
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The initial subflow might have already been closed, but still in the
connection list. When the worker is instructed to close the subflows
that have been marked as closed, it might then try to close the initial
subflow again.
A consequence of that is that the SUB_CLOSED event can be seen twice:
# ip mptcp endpoint
1.1.1.1 id 1 subflow dev eth0
2.2.2.2 id 2 subflow dev eth1
# ip mptcp monitor &
[ CREATED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ ESTABLISHED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_ESTABLISHED] remid=0 locid=2 saddr4=2.2.2.2 daddr4=9.9.9.9
# ip mptcp endpoint delete id 1
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
The first one is coming from mptcp_pm_nl_rm_subflow_received(), and the
second one from __mptcp_close_subflow().
To avoid doing the post-closed processing twice, the subflow is now
marked as closed the first time.
Note that it is not enough to check if we are dealing with the first
subflow and check its sk_state: the subflow might have been reset or
closed before calling mptcp_close_ssk().
Fixes: b911c97c7dc7 ("mptcp: add netlink event support")
Cc: stable(a)vger.kernel.org
Tested-by: Arınç ÜNAL <arinc.unal(a)arinc9.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b571fba88a2f..37ebcb7640eb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2508,6 +2508,12 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
struct mptcp_subflow_context *subflow)
{
+ /* The first subflow can already be closed and still in the list */
+ if (subflow->close_event_done)
+ return;
+
+ subflow->close_event_done = true;
+
if (sk->sk_state == TCP_ESTABLISHED)
mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 240d7c2ea551..26eb898a202b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -524,7 +524,8 @@ struct mptcp_subflow_context {
stale : 1, /* unable to snd/rcv data, do not use for xmit */
valid_csum_seen : 1, /* at least one csum validated */
is_mptfo : 1, /* subflow is doing TFO */
- __unused : 10;
+ close_event_done : 1, /* has done the post-closed part */
+ __unused : 9;
bool data_avail;
bool scheduled;
u32 remote_nonce;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 58e1b66b4e4b8a602d3f2843e8eba00a969ecce2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083057-italics-abrasion-ab81@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
58e1b66b4e4b ("mptcp: pm: do not remove already closed subflows")
967d3c27127e ("mptcp: fix data races on remote_id")
a7cfe7766370 ("mptcp: fix data races on local_id")
84c531f54ad9 ("mptcp: userspace pm send RM_ADDR for ID 0")
f1f26512a9bf ("mptcp: use plain bool instead of custom binary enum")
1e07938e29c5 ("net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}")
1d0507f46843 ("net: mptcp: convert netlink from small_ops to ops")
fce68b03086f ("mptcp: add scheduled in mptcp_subflow_context")
1730b2b2c5a5 ("mptcp: add sched in mptcp_sock")
740ebe35bd3f ("mptcp: add struct mptcp_sched_ops")
a7384f391875 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 58e1b66b4e4b8a602d3f2843e8eba00a969ecce2 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:32 +0200
Subject: [PATCH] mptcp: pm: do not remove already closed subflows
It is possible to have in the list already closed subflows, e.g. the
initial subflow has been already closed, but still in the list. No need
to try to close it again, and increments the related counters again.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 5a84a55e37cc..3ff273e219f2 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -838,6 +838,8 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
int how = RCV_SHUTDOWN | SEND_SHUTDOWN;
u8 id = subflow_get_local_id(subflow);
+ if (inet_sk_state_load(ssk) == TCP_CLOSE)
+ continue;
if (rm_type == MPTCP_MIB_RMADDR && remote_id != rm_id)
continue;
if (rm_type == MPTCP_MIB_RMSUBFLOW && id != rm_id)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x c07cc3ed895f9bfe0c53b5ed6be710c133b4271c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083031-diffuser-strongbox-b85f@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
c07cc3ed895f ("mptcp: pm: send ACK on an active subflow")
f5360e9b314c ("mptcp: introduce and use mptcp_pm_send_ack()")
892f396c8e68 ("mptcp: netlink: issue MP_PRIO signals from userspace PMs")
a657430260e5 ("mptcp: Acquire the subflow socket lock before modifying MP_PRIO flags")
c21b50d5912b ("mptcp: Avoid acquiring PM lock for subflow priority changes")
702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
0e203c324752 ("mptcp: reset the packet scheduler on PRIO change")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
90d930882139 ("mptcp: constify a bunch of of helpers")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
09f12c3ab7a5 ("mptcp: allow to use port and non-signal in set_flags")
6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
73c762c1f07d ("mptcp: set fullmesh flag in pm_netlink")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c07cc3ed895f9bfe0c53b5ed6be710c133b4271c Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:27 +0200
Subject: [PATCH] mptcp: pm: send ACK on an active subflow
Taking the first one on the list doesn't work in some cases, e.g. if the
initial subflow is being removed. Pick another one instead of not
sending anything.
Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 42d4e7b5f65d..ed2205ef7208 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -765,9 +765,12 @@ void mptcp_pm_nl_addr_send_ack(struct mptcp_sock *msk)
!mptcp_pm_should_rm_signal(msk))
return;
- subflow = list_first_entry_or_null(&msk->conn_list, typeof(*subflow), node);
- if (subflow)
- mptcp_pm_send_ack(msk, subflow, false, false);
+ mptcp_for_each_subflow(msk, subflow) {
+ if (__mptcp_subflow_active(subflow)) {
+ mptcp_pm_send_ack(msk, subflow, false, false);
+ break;
+ }
+ }
}
int mptcp_pm_nl_mp_prio_send_ack(struct mptcp_sock *msk,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x bc19ff57637ff563d2bdf2b385b48c41e6509e0d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083003-chowder-scurvy-2e9c@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
bc19ff57637f ("mptcp: pm: skip connecting to already established sf")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
90d930882139 ("mptcp: constify a bunch of of helpers")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
09f12c3ab7a5 ("mptcp: allow to use port and non-signal in set_flags")
6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
8e9eacad7ec7 ("mptcp: fix msk traversal in mptcp_nl_cmd_set_flags()")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
a88c9e496937 ("mptcp: do not block subflows creation on errors")
86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
f7d6a237d742 ("mptcp: fix per socket endpoint accounting")
b29fcfb54cd7 ("mptcp: full disconnect implementation")
59060a47ca50 ("mptcp: clean up harmless false expressions")
3ce0852c86b9 ("mptcp: enforce HoL-blocking estimation")
602837e8479d ("mptcp: allow changing the "backup" bit by endpoint id")
6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path")
dd9a887b35b0 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bc19ff57637ff563d2bdf2b385b48c41e6509e0d Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:28 +0200
Subject: [PATCH] mptcp: pm: skip connecting to already established sf
The lookup_subflow_by_daddr() helper checks if there is already a
subflow connected to this address. But there could be a subflow that is
closing, but taking time due to some reasons: latency, losses, data to
process, etc.
If an ADD_ADDR is received while the endpoint is being closed, it is
better to try connecting to it, instead of rejecting it: the peer which
has sent the ADD_ADDR will not be notified that the ADD_ADDR has been
rejected for this reason, and the expected subflow will not be created
at the end.
This helper should then only look for subflows that are established, or
going to be, but not the ones being closed.
Fixes: d84ad04941c3 ("mptcp: skip connecting the connected address")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index ed2205ef7208..0134b6273c54 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -130,12 +130,15 @@ static bool lookup_subflow_by_daddr(const struct list_head *list,
{
struct mptcp_subflow_context *subflow;
struct mptcp_addr_info cur;
- struct sock_common *skc;
list_for_each_entry(subflow, list, node) {
- skc = (struct sock_common *)mptcp_subflow_tcp_sock(subflow);
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
- remote_address(skc, &cur);
+ if (!((1 << inet_sk_state_load(ssk)) &
+ (TCPF_ESTABLISHED | TCPF_SYN_SENT | TCPF_SYN_RECV)))
+ continue;
+
+ remote_address((struct sock_common *)ssk, &cur);
if (mptcp_addresses_equal(&cur, daddr, daddr->port))
return true;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 09355f7abb9fbfc1a240be029837921ea417bf4f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082641-cauterize-slum-9eb2@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
09355f7abb9f ("mptcp: pm: fullmesh: select the right ID later")
b9d69db87fb7 ("mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses")
837cf45df163 ("mptcp: fix race in incoming ADD_ADDR option processing")
a88c9e496937 ("mptcp: do not block subflows creation on errors")
86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
f7d6a237d742 ("mptcp: fix per socket endpoint accounting")
b29fcfb54cd7 ("mptcp: full disconnect implementation")
59060a47ca50 ("mptcp: clean up harmless false expressions")
3ce0852c86b9 ("mptcp: enforce HoL-blocking estimation")
6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path")
765ff425528f ("mptcp: use lockdep_assert_held_once() instead of open-coding it")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 09355f7abb9fbfc1a240be029837921ea417bf4f Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:30 +0200
Subject: [PATCH] mptcp: pm: fullmesh: select the right ID later
When reacting upon the reception of an ADD_ADDR, the in-kernel PM first
looks for fullmesh endpoints. If there are some, it will pick them,
using their entry ID.
It should set the ID 0 when using the endpoint corresponding to the
initial subflow, it is a special case imposed by the MPTCP specs.
Note that msk->mpc_endpoint_id might not be set when receiving the first
ADD_ADDR from the server. So better to compare the addresses.
Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-12-38035d40de5…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index d0a80f537fc3..a2e37ab1c40f 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -636,6 +636,7 @@ static unsigned int fill_local_addresses_vec(struct mptcp_sock *msk,
{
struct sock *sk = (struct sock *)msk;
struct mptcp_pm_addr_entry *entry;
+ struct mptcp_addr_info mpc_addr;
struct pm_nl_pernet *pernet;
unsigned int subflows_max;
int i = 0;
@@ -643,6 +644,8 @@ static unsigned int fill_local_addresses_vec(struct mptcp_sock *msk,
pernet = pm_nl_get_pernet_from_msk(msk);
subflows_max = mptcp_pm_get_subflows_max(msk);
+ mptcp_local_address((struct sock_common *)msk, &mpc_addr);
+
rcu_read_lock();
list_for_each_entry_rcu(entry, &pernet->local_addr_list, list) {
if (!(entry->flags & MPTCP_PM_ADDR_FLAG_FULLMESH))
@@ -653,7 +656,13 @@ static unsigned int fill_local_addresses_vec(struct mptcp_sock *msk,
if (msk->pm.subflows < subflows_max) {
msk->pm.subflows++;
- addrs[i++] = entry->addr;
+ addrs[i] = entry->addr;
+
+ /* Special case for ID0: set the correct ID */
+ if (mptcp_addresses_equal(&entry->addr, &mpc_addr, entry->addr.port))
+ addrs[i].id = 0;
+
+ i++;
}
}
rcu_read_unlock();
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 0137a3c7c2ea3f9df8ebfc65d78b4ba712a187bb
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082608-cornmeal-stoop-7021@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
0137a3c7c2ea ("mptcp: pm: check add_addr_accept_max before accepting new ADD_ADDR")
1c1f72137598 ("mptcp: pm: only decrement add_addr_accepted for MPJ req")
322ea3778965 ("mptcp: pm: only mark 'subflow' endp as available")
f448451aa62d ("mptcp: pm: remove mptcp_pm_remove_subflow()")
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
6a09788c1a66 ("mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID")
9bbec87ecfe8 ("mptcp: unify pm get_local_id interfaces")
dc886bce753c ("mptcp: export local_address")
8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove")
3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
843b5e75efff ("mptcp: fix local endpoint accounting")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0137a3c7c2ea3f9df8ebfc65d78b4ba712a187bb Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:28 +0200
Subject: [PATCH] mptcp: pm: check add_addr_accept_max before accepting new
ADD_ADDR
The limits might have changed in between, it is best to check them
before accepting new ADD_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-10-38035d40de5…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 882781571c7b..28a9a3726146 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -848,8 +848,8 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
/* Note: if the subflow has been closed before, this
* add_addr_accepted counter will not be decremented.
*/
- msk->pm.add_addr_accepted--;
- WRITE_ONCE(msk->pm.accept_addr, true);
+ if (--msk->pm.add_addr_accepted < mptcp_pm_get_add_addr_accept_max(msk))
+ WRITE_ONCE(msk->pm.accept_addr, true);
}
}
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 1c1f721375989579e46741f59523e39ec9b2a9bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082652-polka-escapist-f8f8@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
1c1f72137598 ("mptcp: pm: only decrement add_addr_accepted for MPJ req")
6a09788c1a66 ("mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID")
3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
843b5e75efff ("mptcp: fix local endpoint accounting")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
14b06811bec6 ("mptcp: Bypass kernel PM when userspace PM is enabled")
0530020a7c8f ("mptcp: track and update contiguous data status")
0348c690ed37 ("mptcp: add the fallback check")
1761fed25678 ("mptcp: don't send RST for single subflow")
c682bf536cf4 ("mptcp: add pm_nl_pernet helpers")
ae7bd9ccecc3 ("selftests: mptcp: join: option to execute specific tests")
e59300ce3ff8 ("selftests: mptcp: join: reset failing links")
3afd0280e7d3 ("selftests: mptcp: join: define tests groups once")
3c082695e78b ("selftests: mptcp: drop msg argument of chk_csum_nr")
69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
4cf86ae84c71 ("mptcp: strict local address ID selection")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1c1f721375989579e46741f59523e39ec9b2a9bd Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:27 +0200
Subject: [PATCH] mptcp: pm: only decrement add_addr_accepted for MPJ req
Adding the following warning ...
WARN_ON_ONCE(msk->pm.add_addr_accepted == 0)
... before decrementing the add_addr_accepted counter helped to find a
bug when running the "remove single subflow" subtest from the
mptcp_join.sh selftest.
Removing a 'subflow' endpoint will first trigger a RM_ADDR, then the
subflow closure. Before this patch, and upon the reception of the
RM_ADDR, the other peer will then try to decrement this
add_addr_accepted. That's not correct because the attached subflows have
not been created upon the reception of an ADD_ADDR.
A way to solve that is to decrement the counter only if the attached
subflow was an MP_JOIN to a remote id that was not 0, and initiated by
the host receiving the RM_ADDR.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-9-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 4cf7cc851f80..882781571c7b 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -829,7 +829,7 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
mptcp_close_ssk(sk, ssk, subflow);
spin_lock_bh(&msk->pm.lock);
- removed = true;
+ removed |= subflow->request_join;
if (rm_type == MPTCP_MIB_RMSUBFLOW)
__MPTCP_INC_STATS(sock_net(sk), rm_type);
}
@@ -843,7 +843,11 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
if (!mptcp_pm_is_kernel(msk))
continue;
- if (rm_type == MPTCP_MIB_RMADDR) {
+ if (rm_type == MPTCP_MIB_RMADDR && rm_id &&
+ !WARN_ON_ONCE(msk->pm.add_addr_accepted == 0)) {
+ /* Note: if the subflow has been closed before, this
+ * add_addr_accepted counter will not be decremented.
+ */
msk->pm.add_addr_accepted--;
WRITE_ONCE(msk->pm.accept_addr, true);
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 4878f9f8421f4587bee7b232c1c8a9d3a7d4d782
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082646-slang-phoniness-d213@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
4878f9f8421f ("selftests: mptcp: join: validate fullmesh endp on 1st sf")
e571fb09c893 ("selftests: mptcp: add speed env var")
4aadde088a58 ("selftests: mptcp: add fullmesh env var")
080b7f5733fd ("selftests: mptcp: add fastclose env var")
662aa22d7dcd ("selftests: mptcp: set all env vars as local ones")
9e9d176df8e9 ("selftests: mptcp: add pm_nl_set_endpoint helper")
1534f87ee0dc ("selftests: mptcp: drop sflags parameter")
595ef566a2ef ("selftests: mptcp: drop addr_nr_ns1/2 parameters")
0c93af1f8907 ("selftests: mptcp: drop test_linkfail parameter")
be7e9786c915 ("selftests: mptcp: set FAILING_LINKS in run_tests")
4369c198e599 ("selftests: mptcp: test userspace pm out of transfer")
ae947bb2c253 ("selftests: mptcp: join: skip Fastclose tests if not supported")
d4c81bbb8600 ("selftests: mptcp: join: support local endpoint being tracked or not")
4a0b866a3f7d ("selftests: mptcp: join: skip test if iptables/tc cmds fail")
0c4cd3f86a40 ("selftests: mptcp: join: use 'iptables-legacy' if available")
6c160b636c91 ("selftests: mptcp: update userspace pm subflow tests")
48d73f609dcc ("selftests: mptcp: update userspace pm addr tests")
8697a258ae24 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4878f9f8421f4587bee7b232c1c8a9d3a7d4d782 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:31 +0200
Subject: [PATCH] selftests: mptcp: join: validate fullmesh endp on 1st sf
This case was not covered, and the wrong ID was set before the previous
commit.
The rest is not modified, it is just that it will increase the code
coverage.
The right address ID can be verified by looking at the packet traces. We
could automate that using Netfilter with some cBPF code for example, but
that's always a bit cryptic. Packetdrill seems better fitted for that.
Fixes: 4f49d63352da ("selftests: mptcp: add fullmesh testcases")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-13-38035d40de5…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index f609c02c6123..89e553e0e0c2 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3059,6 +3059,7 @@ fullmesh_tests()
pm_nl_set_limits $ns1 1 3
pm_nl_set_limits $ns2 1 3
pm_nl_add_endpoint $ns1 10.0.2.1 flags signal
+ pm_nl_add_endpoint $ns2 10.0.1.2 flags subflow,fullmesh
fullmesh=1 speed=slow \
run_tests $ns1 $ns2 10.0.1.1
chk_join_nr 3 3 3
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x f18fa2abf81099d822d842a107f8c9889c86043c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083034-parched-driller-96e5@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
f18fa2abf810 ("selftests: mptcp: join: check re-re-adding ID 0 signal")
20ccc7c5f7a3 ("selftests: mptcp: join: validate event numbers")
d397d7246c11 ("selftests: mptcp: join: check re-re-adding ID 0 endp")
1c2326fcae4f ("selftests: mptcp: join: check re-adding init endp with != id")
5f94b08c0012 ("selftests: mptcp: join: check removing ID 0 endpoint")
65fb58afa341 ("selftests: mptcp: join: check re-using ID of closed subflow")
a13d5aad4dd9 ("selftests: mptcp: join: check re-using ID of unused ADD_ADDR")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
40061817d95b ("selftests: mptcp: join: fix dev in check_endpoint")
23a0485d1c04 ("selftests: mptcp: declare event macros in mptcp_lib")
35bc143a8514 ("selftests: mptcp: add mptcp_lib_events helper")
3a0f9bed3c28 ("selftests: mptcp: add mptcp_lib_ns_init/exit helpers")
3fb8c33ef4b9 ("selftests: mptcp: add mptcp_lib_check_tools helper")
7c2eac649054 ("selftests: mptcp: stop forcing iptables-legacy")
4cc5cc7ca052 ("selftests: mptcp: userspace pm get addr tests")
38f027fca1b7 ("selftests: mptcp: dump userspace addrs list")
2d0c1d27ea4e ("selftests: mptcp: add mptcp_lib_check_output helper")
9480f388a2ef ("selftests: mptcp: join: add ss mptcp support check")
7092dbee2328 ("selftests: mptcp: rm subflow with v4/v4mapped addr")
04b57c9e096a ("selftests: mptcp: join: stop transfer when check is done (part 2)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f18fa2abf81099d822d842a107f8c9889c86043c Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:38 +0200
Subject: [PATCH] selftests: mptcp: join: check re-re-adding ID 0 signal
This test extends "delete re-add signal" to validate the previous
commit: when the 'signal' endpoint linked to the initial subflow (ID 0)
is re-added multiple times, it will re-send the ADD_ADDR with id 0. The
client should still be able to re-create this subflow, even if the
add_addr_accepted limit has been reached as this special address is not
considered as a new address.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index a8ea0fe200fb..a4762c49a878 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3688,7 +3688,7 @@ endpoint_tests()
# broadcast IP: no packet for this address will be received on ns1
pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
pm_nl_add_endpoint $ns1 10.0.1.1 id 42 flags signal
- test_linkfail=4 speed=20 \
+ test_linkfail=4 speed=5 \
run_tests $ns1 $ns2 10.0.1.1 &
local tests_pid=$!
@@ -3717,7 +3717,17 @@ endpoint_tests()
pm_nl_add_endpoint $ns1 10.0.1.1 id 99 flags signal
wait_mpj $ns2
- chk_subflow_nr "after re-add" 3
+ chk_subflow_nr "after re-add ID 0" 3
+ chk_mptcp_info subflows 3 subflows 3
+
+ pm_nl_del_endpoint $ns1 99 10.0.1.1
+ sleep 0.5
+ chk_subflow_nr "after re-delete ID 0" 2
+ chk_mptcp_info subflows 2 subflows 2
+
+ pm_nl_add_endpoint $ns1 10.0.1.1 id 88 flags signal
+ wait_mpj $ns2
+ chk_subflow_nr "after re-re-add ID 0" 3
chk_mptcp_info subflows 3 subflows 3
mptcp_lib_kill_wait $tests_pid
@@ -3727,19 +3737,19 @@ endpoint_tests()
chk_evt_nr ns1 MPTCP_LIB_EVENT_ESTABLISHED 1
chk_evt_nr ns1 MPTCP_LIB_EVENT_ANNOUNCED 0
chk_evt_nr ns1 MPTCP_LIB_EVENT_REMOVED 0
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 3
chk_evt_nr ns2 MPTCP_LIB_EVENT_CREATED 1
chk_evt_nr ns2 MPTCP_LIB_EVENT_ESTABLISHED 1
- chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 5
- chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 3
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 6
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 4
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 3
- chk_join_nr 4 4 4
- chk_add_nr 5 5
- chk_rm_nr 3 2 invert
+ chk_join_nr 5 5 5
+ chk_add_nr 6 6
+ chk_rm_nr 4 3 invert
fi
# flush and re-add
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x f18fa2abf81099d822d842a107f8c9889c86043c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083033-figure-spelling-0a00@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
f18fa2abf810 ("selftests: mptcp: join: check re-re-adding ID 0 signal")
20ccc7c5f7a3 ("selftests: mptcp: join: validate event numbers")
d397d7246c11 ("selftests: mptcp: join: check re-re-adding ID 0 endp")
1c2326fcae4f ("selftests: mptcp: join: check re-adding init endp with != id")
5f94b08c0012 ("selftests: mptcp: join: check removing ID 0 endpoint")
65fb58afa341 ("selftests: mptcp: join: check re-using ID of closed subflow")
a13d5aad4dd9 ("selftests: mptcp: join: check re-using ID of unused ADD_ADDR")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
40061817d95b ("selftests: mptcp: join: fix dev in check_endpoint")
23a0485d1c04 ("selftests: mptcp: declare event macros in mptcp_lib")
35bc143a8514 ("selftests: mptcp: add mptcp_lib_events helper")
3a0f9bed3c28 ("selftests: mptcp: add mptcp_lib_ns_init/exit helpers")
3fb8c33ef4b9 ("selftests: mptcp: add mptcp_lib_check_tools helper")
7c2eac649054 ("selftests: mptcp: stop forcing iptables-legacy")
4cc5cc7ca052 ("selftests: mptcp: userspace pm get addr tests")
38f027fca1b7 ("selftests: mptcp: dump userspace addrs list")
2d0c1d27ea4e ("selftests: mptcp: add mptcp_lib_check_output helper")
9480f388a2ef ("selftests: mptcp: join: add ss mptcp support check")
7092dbee2328 ("selftests: mptcp: rm subflow with v4/v4mapped addr")
04b57c9e096a ("selftests: mptcp: join: stop transfer when check is done (part 2)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f18fa2abf81099d822d842a107f8c9889c86043c Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:38 +0200
Subject: [PATCH] selftests: mptcp: join: check re-re-adding ID 0 signal
This test extends "delete re-add signal" to validate the previous
commit: when the 'signal' endpoint linked to the initial subflow (ID 0)
is re-added multiple times, it will re-send the ADD_ADDR with id 0. The
client should still be able to re-create this subflow, even if the
add_addr_accepted limit has been reached as this special address is not
considered as a new address.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index a8ea0fe200fb..a4762c49a878 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3688,7 +3688,7 @@ endpoint_tests()
# broadcast IP: no packet for this address will be received on ns1
pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
pm_nl_add_endpoint $ns1 10.0.1.1 id 42 flags signal
- test_linkfail=4 speed=20 \
+ test_linkfail=4 speed=5 \
run_tests $ns1 $ns2 10.0.1.1 &
local tests_pid=$!
@@ -3717,7 +3717,17 @@ endpoint_tests()
pm_nl_add_endpoint $ns1 10.0.1.1 id 99 flags signal
wait_mpj $ns2
- chk_subflow_nr "after re-add" 3
+ chk_subflow_nr "after re-add ID 0" 3
+ chk_mptcp_info subflows 3 subflows 3
+
+ pm_nl_del_endpoint $ns1 99 10.0.1.1
+ sleep 0.5
+ chk_subflow_nr "after re-delete ID 0" 2
+ chk_mptcp_info subflows 2 subflows 2
+
+ pm_nl_add_endpoint $ns1 10.0.1.1 id 88 flags signal
+ wait_mpj $ns2
+ chk_subflow_nr "after re-re-add ID 0" 3
chk_mptcp_info subflows 3 subflows 3
mptcp_lib_kill_wait $tests_pid
@@ -3727,19 +3737,19 @@ endpoint_tests()
chk_evt_nr ns1 MPTCP_LIB_EVENT_ESTABLISHED 1
chk_evt_nr ns1 MPTCP_LIB_EVENT_ANNOUNCED 0
chk_evt_nr ns1 MPTCP_LIB_EVENT_REMOVED 0
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 3
chk_evt_nr ns2 MPTCP_LIB_EVENT_CREATED 1
chk_evt_nr ns2 MPTCP_LIB_EVENT_ESTABLISHED 1
- chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 5
- chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 3
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 6
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 4
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 3
- chk_join_nr 4 4 4
- chk_add_nr 5 5
- chk_rm_nr 3 2 invert
+ chk_join_nr 5 5 5
+ chk_add_nr 6 6
+ chk_rm_nr 4 3 invert
fi
# flush and re-add
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x a13d5aad4dd9a309eecdc33cfd75045bd5f376a3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082629-tapping-motivator-444a@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
a13d5aad4dd9 ("selftests: mptcp: join: check re-using ID of unused ADD_ADDR")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
ae7bd9ccecc3 ("selftests: mptcp: join: option to execute specific tests")
e59300ce3ff8 ("selftests: mptcp: join: reset failing links")
3afd0280e7d3 ("selftests: mptcp: join: define tests groups once")
3c082695e78b ("selftests: mptcp: drop msg argument of chk_csum_nr")
69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
f98c2bca7b2b ("selftests: mptcp: Rename wait function")
826d7bdca833 ("selftests: mptcp: join: allow running -cCi")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
26516e10c433 ("selftests: mptcp: add more arguments for chk_join_nr")
8117dac3e7c3 ("selftests: mptcp: add invert check in check_transfer")
01542c9bf9ab ("selftests: mptcp: add fastclose testcase")
cbfafac4cf8f ("selftests: mptcp: add extra_args in do_transfer")
922fd2b39e5a ("selftests: mptcp: add the MP_RST mibs check")
e8e947ef50f6 ("selftests: mptcp: add the MP_FASTCLOSE mibs check")
9a0a93672c14 ("selftests: mptcp: adjust output alignment for more tests")
aaa25a2fa796 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a13d5aad4dd9a309eecdc33cfd75045bd5f376a3 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:20 +0200
Subject: [PATCH] selftests: mptcp: join: check re-using ID of unused ADD_ADDR
This test extends "delete re-add signal" to validate the previous
commit. An extra address is announced by the server, but this address
cannot be used by the client. The result is that no subflow will be
established to this address.
Later, the server will delete this extra endpoint, and set a new one,
with a valid address, but re-using the same ID. Before the previous
commit, the server would not have been able to announce this new
address.
While at it, extra checks have been added to validate the expected
numbers of MPJ, ADD_ADDR and RM_ADDR.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-2-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 9ea6d698e9d3..25077ccf31d2 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3601,9 +3601,11 @@ endpoint_tests()
# remove and re-add
if reset "delete re-add signal" &&
mptcp_lib_kallsyms_has "subflow_rebuild_header$"; then
- pm_nl_set_limits $ns1 1 1
- pm_nl_set_limits $ns2 1 1
+ pm_nl_set_limits $ns1 0 2
+ pm_nl_set_limits $ns2 2 2
pm_nl_add_endpoint $ns1 10.0.2.1 id 1 flags signal
+ # broadcast IP: no packet for this address will be received on ns1
+ pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
test_linkfail=4 speed=20 \
run_tests $ns1 $ns2 10.0.1.1 &
local tests_pid=$!
@@ -3615,15 +3617,21 @@ endpoint_tests()
chk_mptcp_info subflows 1 subflows 1
pm_nl_del_endpoint $ns1 1 10.0.2.1
+ pm_nl_del_endpoint $ns1 2 224.0.0.1
sleep 0.5
chk_subflow_nr "after delete" 1
chk_mptcp_info subflows 0 subflows 0
- pm_nl_add_endpoint $ns1 10.0.2.1 flags signal
+ pm_nl_add_endpoint $ns1 10.0.2.1 id 1 flags signal
+ pm_nl_add_endpoint $ns1 10.0.3.1 id 2 flags signal
wait_mpj $ns2
- chk_subflow_nr "after re-add" 2
- chk_mptcp_info subflows 1 subflows 1
+ chk_subflow_nr "after re-add" 3
+ chk_mptcp_info subflows 2 subflows 2
mptcp_lib_kill_wait $tests_pid
+
+ chk_join_nr 3 3 3
+ chk_add_nr 4 4
+ chk_rm_nr 2 1 invert
fi
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x a13d5aad4dd9a309eecdc33cfd75045bd5f376a3
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082628-stapling-dad-555c@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
a13d5aad4dd9 ("selftests: mptcp: join: check re-using ID of unused ADD_ADDR")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
ae7bd9ccecc3 ("selftests: mptcp: join: option to execute specific tests")
e59300ce3ff8 ("selftests: mptcp: join: reset failing links")
3afd0280e7d3 ("selftests: mptcp: join: define tests groups once")
3c082695e78b ("selftests: mptcp: drop msg argument of chk_csum_nr")
69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
f98c2bca7b2b ("selftests: mptcp: Rename wait function")
826d7bdca833 ("selftests: mptcp: join: allow running -cCi")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
26516e10c433 ("selftests: mptcp: add more arguments for chk_join_nr")
8117dac3e7c3 ("selftests: mptcp: add invert check in check_transfer")
01542c9bf9ab ("selftests: mptcp: add fastclose testcase")
cbfafac4cf8f ("selftests: mptcp: add extra_args in do_transfer")
922fd2b39e5a ("selftests: mptcp: add the MP_RST mibs check")
e8e947ef50f6 ("selftests: mptcp: add the MP_FASTCLOSE mibs check")
9a0a93672c14 ("selftests: mptcp: adjust output alignment for more tests")
aaa25a2fa796 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a13d5aad4dd9a309eecdc33cfd75045bd5f376a3 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:20 +0200
Subject: [PATCH] selftests: mptcp: join: check re-using ID of unused ADD_ADDR
This test extends "delete re-add signal" to validate the previous
commit. An extra address is announced by the server, but this address
cannot be used by the client. The result is that no subflow will be
established to this address.
Later, the server will delete this extra endpoint, and set a new one,
with a valid address, but re-using the same ID. Before the previous
commit, the server would not have been able to announce this new
address.
While at it, extra checks have been added to validate the expected
numbers of MPJ, ADD_ADDR and RM_ADDR.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-2-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 9ea6d698e9d3..25077ccf31d2 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3601,9 +3601,11 @@ endpoint_tests()
# remove and re-add
if reset "delete re-add signal" &&
mptcp_lib_kallsyms_has "subflow_rebuild_header$"; then
- pm_nl_set_limits $ns1 1 1
- pm_nl_set_limits $ns2 1 1
+ pm_nl_set_limits $ns1 0 2
+ pm_nl_set_limits $ns2 2 2
pm_nl_add_endpoint $ns1 10.0.2.1 id 1 flags signal
+ # broadcast IP: no packet for this address will be received on ns1
+ pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
test_linkfail=4 speed=20 \
run_tests $ns1 $ns2 10.0.1.1 &
local tests_pid=$!
@@ -3615,15 +3617,21 @@ endpoint_tests()
chk_mptcp_info subflows 1 subflows 1
pm_nl_del_endpoint $ns1 1 10.0.2.1
+ pm_nl_del_endpoint $ns1 2 224.0.0.1
sleep 0.5
chk_subflow_nr "after delete" 1
chk_mptcp_info subflows 0 subflows 0
- pm_nl_add_endpoint $ns1 10.0.2.1 flags signal
+ pm_nl_add_endpoint $ns1 10.0.2.1 id 1 flags signal
+ pm_nl_add_endpoint $ns1 10.0.3.1 id 2 flags signal
wait_mpj $ns2
- chk_subflow_nr "after re-add" 2
- chk_mptcp_info subflows 1 subflows 1
+ chk_subflow_nr "after re-add" 3
+ chk_mptcp_info subflows 2 subflows 2
mptcp_lib_kill_wait $tests_pid
+
+ chk_join_nr 3 3 3
+ chk_add_nr 4 4
+ chk_rm_nr 2 1 invert
fi
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x e06959e9eebdfea4654390f53b65cff57691872e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082619-chair-irritable-03aa@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
e06959e9eebd ("selftests: mptcp: join: test for flush/re-add endpoints")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
ae7bd9ccecc3 ("selftests: mptcp: join: option to execute specific tests")
e59300ce3ff8 ("selftests: mptcp: join: reset failing links")
3afd0280e7d3 ("selftests: mptcp: join: define tests groups once")
3c082695e78b ("selftests: mptcp: drop msg argument of chk_csum_nr")
69c6ce7b6eca ("selftests: mptcp: add implicit endpoint test case")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
f98c2bca7b2b ("selftests: mptcp: Rename wait function")
826d7bdca833 ("selftests: mptcp: join: allow running -cCi")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
26516e10c433 ("selftests: mptcp: add more arguments for chk_join_nr")
8117dac3e7c3 ("selftests: mptcp: add invert check in check_transfer")
01542c9bf9ab ("selftests: mptcp: add fastclose testcase")
cbfafac4cf8f ("selftests: mptcp: add extra_args in do_transfer")
922fd2b39e5a ("selftests: mptcp: add the MP_RST mibs check")
e8e947ef50f6 ("selftests: mptcp: add the MP_FASTCLOSE mibs check")
9a0a93672c14 ("selftests: mptcp: adjust output alignment for more tests")
aaa25a2fa796 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e06959e9eebdfea4654390f53b65cff57691872e Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:24 +0200
Subject: [PATCH] selftests: mptcp: join: test for flush/re-add endpoints
After having flushed endpoints that didn't cause the creation of new
subflows, it is important to check endpoints can be re-created, re-using
previously used IDs.
Before the previous commit, the client would not have been able to
re-create the subflow that was previously rejected.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-6-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index fbb0174145ad..f609c02c6123 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3651,6 +3651,36 @@ endpoint_tests()
chk_rm_nr 2 1 invert
fi
+ # flush and re-add
+ if reset_with_tcp_filter "flush re-add" ns2 10.0.3.2 REJECT OUTPUT &&
+ mptcp_lib_kallsyms_has "subflow_rebuild_header$"; then
+ pm_nl_set_limits $ns1 0 2
+ pm_nl_set_limits $ns2 1 2
+ # broadcast IP: no packet for this address will be received on ns1
+ pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ test_linkfail=4 speed=20 \
+ run_tests $ns1 $ns2 10.0.1.1 &
+ local tests_pid=$!
+
+ wait_attempt_fail $ns2
+ chk_subflow_nr "before flush" 1
+ chk_mptcp_info subflows 0 subflows 0
+
+ pm_nl_flush_endpoint $ns2
+ pm_nl_flush_endpoint $ns1
+ wait_rm_addr $ns2 0
+ ip netns exec "${ns2}" ${iptables} -D OUTPUT -s "10.0.3.2" -p tcp -j REJECT
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ wait_mpj $ns2
+ pm_nl_add_endpoint $ns1 10.0.3.1 id 2 flags signal
+ wait_mpj $ns2
+ mptcp_lib_kill_wait $tests_pid
+
+ chk_join_nr 2 2 2
+ chk_add_nr 2 2
+ chk_rm_nr 1 0 invert
+ fi
}
# [$1: error message]
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 65fb58afa341ad68e71e5c4d816b407e6a683a66
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082602-dimmed-relay-f039@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
65fb58afa341 ("selftests: mptcp: join: check re-using ID of closed subflow")
04b57c9e096a ("selftests: mptcp: join: stop transfer when check is done (part 2)")
b9fb176081fb ("selftests: mptcp: userspace pm send RM_ADDR for ID 0")
e3b47e460b4b ("selftests: mptcp: userspace pm remove initial subflow")
e571fb09c893 ("selftests: mptcp: add speed env var")
4aadde088a58 ("selftests: mptcp: add fullmesh env var")
080b7f5733fd ("selftests: mptcp: add fastclose env var")
662aa22d7dcd ("selftests: mptcp: set all env vars as local ones")
9e9d176df8e9 ("selftests: mptcp: add pm_nl_set_endpoint helper")
1534f87ee0dc ("selftests: mptcp: drop sflags parameter")
595ef566a2ef ("selftests: mptcp: drop addr_nr_ns1/2 parameters")
0c93af1f8907 ("selftests: mptcp: drop test_linkfail parameter")
be7e9786c915 ("selftests: mptcp: set FAILING_LINKS in run_tests")
d7ced753aa85 ("selftests: mptcp: check subflow and addr infos")
4369c198e599 ("selftests: mptcp: test userspace pm out of transfer")
36c4127ae8dd ("selftests: mptcp: join: skip implicit tests if not supported")
ae947bb2c253 ("selftests: mptcp: join: skip Fastclose tests if not supported")
d4c81bbb8600 ("selftests: mptcp: join: support local endpoint being tracked or not")
4a0b866a3f7d ("selftests: mptcp: join: skip test if iptables/tc cmds fail")
0c4cd3f86a40 ("selftests: mptcp: join: use 'iptables-legacy' if available")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 65fb58afa341ad68e71e5c4d816b407e6a683a66 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:22 +0200
Subject: [PATCH] selftests: mptcp: join: check re-using ID of closed subflow
This test extends "delete and re-add" to validate the previous commit. A
new 'subflow' endpoint is added, but the subflow request will be
rejected. The result is that no subflow will be established from this
address.
Later, the endpoint is removed and re-added after having cleared the
firewall rule. Before the previous commit, the client would not have
been able to create this new subflow.
While at it, extra checks have been added to validate the expected
numbers of MPJ and RM_ADDR.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-4-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 25077ccf31d2..fbb0174145ad 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -436,9 +436,10 @@ reset_with_tcp_filter()
local ns="${!1}"
local src="${2}"
local target="${3}"
+ local chain="${4:-INPUT}"
if ! ip netns exec "${ns}" ${iptables} \
- -A INPUT \
+ -A "${chain}" \
-s "${src}" \
-p tcp \
-j "${target}"; then
@@ -3571,10 +3572,10 @@ endpoint_tests()
mptcp_lib_kill_wait $tests_pid
fi
- if reset "delete and re-add" &&
+ if reset_with_tcp_filter "delete and re-add" ns2 10.0.3.2 REJECT OUTPUT &&
mptcp_lib_kallsyms_has "subflow_rebuild_header$"; then
- pm_nl_set_limits $ns1 1 1
- pm_nl_set_limits $ns2 1 1
+ pm_nl_set_limits $ns1 0 2
+ pm_nl_set_limits $ns2 0 2
pm_nl_add_endpoint $ns2 10.0.2.2 id 2 dev ns2eth2 flags subflow
test_linkfail=4 speed=20 \
run_tests $ns1 $ns2 10.0.1.1 &
@@ -3591,11 +3592,27 @@ endpoint_tests()
chk_subflow_nr "after delete" 1
chk_mptcp_info subflows 0 subflows 0
- pm_nl_add_endpoint $ns2 10.0.2.2 dev ns2eth2 flags subflow
+ pm_nl_add_endpoint $ns2 10.0.2.2 id 2 dev ns2eth2 flags subflow
wait_mpj $ns2
chk_subflow_nr "after re-add" 2
chk_mptcp_info subflows 1 subflows 1
+
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ wait_attempt_fail $ns2
+ chk_subflow_nr "after new reject" 2
+ chk_mptcp_info subflows 1 subflows 1
+
+ ip netns exec "${ns2}" ${iptables} -D OUTPUT -s "10.0.3.2" -p tcp -j REJECT
+ pm_nl_del_endpoint $ns2 3 10.0.3.2
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ wait_mpj $ns2
+ chk_subflow_nr "after no reject" 3
+ chk_mptcp_info subflows 2 subflows 2
+
mptcp_lib_kill_wait $tests_pid
+
+ chk_join_nr 3 3 3
+ chk_rm_nr 1 1
fi
# remove and re-add
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 65fb58afa341ad68e71e5c4d816b407e6a683a66
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082601-scurvy-decade-b887@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
65fb58afa341 ("selftests: mptcp: join: check re-using ID of closed subflow")
04b57c9e096a ("selftests: mptcp: join: stop transfer when check is done (part 2)")
b9fb176081fb ("selftests: mptcp: userspace pm send RM_ADDR for ID 0")
e3b47e460b4b ("selftests: mptcp: userspace pm remove initial subflow")
e571fb09c893 ("selftests: mptcp: add speed env var")
4aadde088a58 ("selftests: mptcp: add fullmesh env var")
080b7f5733fd ("selftests: mptcp: add fastclose env var")
662aa22d7dcd ("selftests: mptcp: set all env vars as local ones")
9e9d176df8e9 ("selftests: mptcp: add pm_nl_set_endpoint helper")
1534f87ee0dc ("selftests: mptcp: drop sflags parameter")
595ef566a2ef ("selftests: mptcp: drop addr_nr_ns1/2 parameters")
0c93af1f8907 ("selftests: mptcp: drop test_linkfail parameter")
be7e9786c915 ("selftests: mptcp: set FAILING_LINKS in run_tests")
d7ced753aa85 ("selftests: mptcp: check subflow and addr infos")
4369c198e599 ("selftests: mptcp: test userspace pm out of transfer")
36c4127ae8dd ("selftests: mptcp: join: skip implicit tests if not supported")
ae947bb2c253 ("selftests: mptcp: join: skip Fastclose tests if not supported")
d4c81bbb8600 ("selftests: mptcp: join: support local endpoint being tracked or not")
4a0b866a3f7d ("selftests: mptcp: join: skip test if iptables/tc cmds fail")
0c4cd3f86a40 ("selftests: mptcp: join: use 'iptables-legacy' if available")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 65fb58afa341ad68e71e5c4d816b407e6a683a66 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:22 +0200
Subject: [PATCH] selftests: mptcp: join: check re-using ID of closed subflow
This test extends "delete and re-add" to validate the previous commit. A
new 'subflow' endpoint is added, but the subflow request will be
rejected. The result is that no subflow will be established from this
address.
Later, the endpoint is removed and re-added after having cleared the
firewall rule. Before the previous commit, the client would not have
been able to create this new subflow.
While at it, extra checks have been added to validate the expected
numbers of MPJ and RM_ADDR.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-4-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 25077ccf31d2..fbb0174145ad 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -436,9 +436,10 @@ reset_with_tcp_filter()
local ns="${!1}"
local src="${2}"
local target="${3}"
+ local chain="${4:-INPUT}"
if ! ip netns exec "${ns}" ${iptables} \
- -A INPUT \
+ -A "${chain}" \
-s "${src}" \
-p tcp \
-j "${target}"; then
@@ -3571,10 +3572,10 @@ endpoint_tests()
mptcp_lib_kill_wait $tests_pid
fi
- if reset "delete and re-add" &&
+ if reset_with_tcp_filter "delete and re-add" ns2 10.0.3.2 REJECT OUTPUT &&
mptcp_lib_kallsyms_has "subflow_rebuild_header$"; then
- pm_nl_set_limits $ns1 1 1
- pm_nl_set_limits $ns2 1 1
+ pm_nl_set_limits $ns1 0 2
+ pm_nl_set_limits $ns2 0 2
pm_nl_add_endpoint $ns2 10.0.2.2 id 2 dev ns2eth2 flags subflow
test_linkfail=4 speed=20 \
run_tests $ns1 $ns2 10.0.1.1 &
@@ -3591,11 +3592,27 @@ endpoint_tests()
chk_subflow_nr "after delete" 1
chk_mptcp_info subflows 0 subflows 0
- pm_nl_add_endpoint $ns2 10.0.2.2 dev ns2eth2 flags subflow
+ pm_nl_add_endpoint $ns2 10.0.2.2 id 2 dev ns2eth2 flags subflow
wait_mpj $ns2
chk_subflow_nr "after re-add" 2
chk_mptcp_info subflows 1 subflows 1
+
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ wait_attempt_fail $ns2
+ chk_subflow_nr "after new reject" 2
+ chk_mptcp_info subflows 1 subflows 1
+
+ ip netns exec "${ns2}" ${iptables} -D OUTPUT -s "10.0.3.2" -p tcp -j REJECT
+ pm_nl_del_endpoint $ns2 3 10.0.3.2
+ pm_nl_add_endpoint $ns2 10.0.3.2 id 3 flags subflow
+ wait_mpj $ns2
+ chk_subflow_nr "after no reject" 3
+ chk_mptcp_info subflows 2 subflows 2
+
mptcp_lib_kill_wait $tests_pid
+
+ chk_join_nr 3 3 3
+ chk_rm_nr 1 1
fi
# remove and re-add
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 322ea3778965da72862cca2a0c50253aacf65fe6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082627-devotion-chewer-87af@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
322ea3778965 ("mptcp: pm: only mark 'subflow' endp as available")
f448451aa62d ("mptcp: pm: remove mptcp_pm_remove_subflow()")
ef34a6ea0cab ("mptcp: pm: re-using ID of unused flushed subflows")
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
6a09788c1a66 ("mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID")
9bbec87ecfe8 ("mptcp: unify pm get_local_id interfaces")
dc886bce753c ("mptcp: export local_address")
8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove")
3ad14f54bd74 ("mptcp: more accurate MPC endpoint tracking")
c157bbe776b7 ("mptcp: allow the in kernel PM to set MPC subflow priority")
843b5e75efff ("mptcp: fix local endpoint accounting")
d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
982f17ba1a25 ("mptcp: netlink: split mptcp_pm_parse_addr into two functions")
8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs")
4638de5aefe5 ("mptcp: handle local addrs announced by userspace PMs")
0530020a7c8f ("mptcp: track and update contiguous data status")
0348c690ed37 ("mptcp: add the fallback check")
1761fed25678 ("mptcp: don't send RST for single subflow")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 322ea3778965da72862cca2a0c50253aacf65fe6 Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:26 +0200
Subject: [PATCH] mptcp: pm: only mark 'subflow' endp as available
Adding the following warning ...
WARN_ON_ONCE(msk->pm.local_addr_used == 0)
... before decrementing the local_addr_used counter helped to find a bug
when running the "remove single address" subtest from the mptcp_join.sh
selftests.
Removing a 'signal' endpoint will trigger the removal of all subflows
linked to this endpoint via mptcp_pm_nl_rm_addr_or_subflow() with
rm_type == MPTCP_MIB_RMSUBFLOW. This will decrement the local_addr_used
counter, which is wrong in this case because this counter is linked to
'subflow' endpoints, and here it is a 'signal' endpoint that is being
removed.
Now, the counter is decremented, only if the ID is being used outside
of mptcp_pm_nl_rm_addr_or_subflow(), only for 'subflow' endpoints, and
if the ID is not 0 -- local_addr_used is not taking into account these
ones. This marking of the ID as being available, and the decrement is
done no matter if a subflow using this ID is currently available,
because the subflow could have been closed before.
Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-8-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 44fc1c5959ac..4cf7cc851f80 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -833,10 +833,10 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
if (rm_type == MPTCP_MIB_RMSUBFLOW)
__MPTCP_INC_STATS(sock_net(sk), rm_type);
}
- if (rm_type == MPTCP_MIB_RMSUBFLOW)
- __set_bit(rm_id ? rm_id : msk->mpc_endpoint_id, msk->pm.id_avail_bitmap);
- else if (rm_type == MPTCP_MIB_RMADDR)
+
+ if (rm_type == MPTCP_MIB_RMADDR)
__MPTCP_INC_STATS(sock_net(sk), rm_type);
+
if (!removed)
continue;
@@ -846,8 +846,6 @@ static void mptcp_pm_nl_rm_addr_or_subflow(struct mptcp_sock *msk,
if (rm_type == MPTCP_MIB_RMADDR) {
msk->pm.add_addr_accepted--;
WRITE_ONCE(msk->pm.accept_addr, true);
- } else if (rm_type == MPTCP_MIB_RMSUBFLOW) {
- msk->pm.local_addr_used--;
}
}
}
@@ -1441,6 +1439,14 @@ static bool mptcp_pm_remove_anno_addr(struct mptcp_sock *msk,
return ret;
}
+static void __mark_subflow_endp_available(struct mptcp_sock *msk, u8 id)
+{
+ /* If it was marked as used, and not ID 0, decrement local_addr_used */
+ if (!__test_and_set_bit(id ? : msk->mpc_endpoint_id, msk->pm.id_avail_bitmap) &&
+ id && !WARN_ON_ONCE(msk->pm.local_addr_used == 0))
+ msk->pm.local_addr_used--;
+}
+
static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
const struct mptcp_pm_addr_entry *entry)
{
@@ -1474,11 +1480,11 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
spin_lock_bh(&msk->pm.lock);
mptcp_pm_nl_rm_subflow_received(msk, &list);
spin_unlock_bh(&msk->pm.lock);
- } else if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
- /* If the subflow has been used, but now closed */
+ }
+
+ if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
spin_lock_bh(&msk->pm.lock);
- if (!__test_and_set_bit(entry->addr.id, msk->pm.id_avail_bitmap))
- msk->pm.local_addr_used--;
+ __mark_subflow_endp_available(msk, list.ids[0]);
spin_unlock_bh(&msk->pm.lock);
}
@@ -1516,6 +1522,7 @@ static int mptcp_nl_remove_id_zero_address(struct net *net,
spin_lock_bh(&msk->pm.lock);
mptcp_pm_remove_addr(msk, &list);
mptcp_pm_nl_rm_subflow_received(msk, &list);
+ __mark_subflow_endp_available(msk, 0);
spin_unlock_bh(&msk->pm.lock);
release_sock(sk);
@@ -1917,6 +1924,7 @@ static void mptcp_pm_nl_fullmesh(struct mptcp_sock *msk,
spin_lock_bh(&msk->pm.lock);
mptcp_pm_nl_rm_subflow_received(msk, &list);
+ __mark_subflow_endp_available(msk, list.ids[0]);
mptcp_pm_create_subflow_or_signal_addr(msk);
spin_unlock_bh(&msk->pm.lock);
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x edd8b5d868a4d459f3065493001e293901af758d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082613-gerbil-channel-c3e1@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
09f12c3ab7a5 ("mptcp: allow to use port and non-signal in set_flags")
6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
6bb3ab4913e9 ("selftests: mptcp: add MP_FAIL mibs check")
f444fea7896d ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From edd8b5d868a4d459f3065493001e293901af758d Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:21 +0200
Subject: [PATCH] mptcp: pm: re-using ID of unused removed subflows
If no subflow is attached to the 'subflow' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the 'subflow' endpoint if
no subflow is attached to it.
While at it, the local_addr_used counter is decremented if the ID was
marked as being used to reflect the reality, but also to allow adding
new endpoints after that.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-3-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 26f0329e16bb..8b232a210a06 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1469,8 +1469,17 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
remove_subflow = lookup_subflow_by_saddr(&msk->conn_list, addr);
mptcp_pm_remove_anno_addr(msk, addr, remove_subflow &&
!(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT));
- if (remove_subflow)
+
+ if (remove_subflow) {
mptcp_pm_remove_subflow(msk, &list);
+ } else if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
+ /* If the subflow has been used, but now closed */
+ spin_lock_bh(&msk->pm.lock);
+ if (!__test_and_set_bit(entry->addr.id, msk->pm.id_avail_bitmap))
+ msk->pm.local_addr_used--;
+ spin_unlock_bh(&msk->pm.lock);
+ }
+
release_sock(sk);
next:
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x edd8b5d868a4d459f3065493001e293901af758d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082612-hydrated-overdress-ea6a@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
edd8b5d868a4 ("mptcp: pm: re-using ID of unused removed subflows")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
09f12c3ab7a5 ("mptcp: allow to use port and non-signal in set_flags")
6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From edd8b5d868a4d459f3065493001e293901af758d Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:21 +0200
Subject: [PATCH] mptcp: pm: re-using ID of unused removed subflows
If no subflow is attached to the 'subflow' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the 'subflow' endpoint if
no subflow is attached to it.
While at it, the local_addr_used counter is decremented if the ID was
marked as being used to reflect the reality, but also to allow adding
new endpoints after that.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-3-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 26f0329e16bb..8b232a210a06 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1469,8 +1469,17 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
remove_subflow = lookup_subflow_by_saddr(&msk->conn_list, addr);
mptcp_pm_remove_anno_addr(msk, addr, remove_subflow &&
!(entry->flags & MPTCP_PM_ADDR_FLAG_IMPLICIT));
- if (remove_subflow)
+
+ if (remove_subflow) {
mptcp_pm_remove_subflow(msk, &list);
+ } else if (entry->flags & MPTCP_PM_ADDR_FLAG_SUBFLOW) {
+ /* If the subflow has been used, but now closed */
+ spin_lock_bh(&msk->pm.lock);
+ if (!__test_and_set_bit(entry->addr.id, msk->pm.id_avail_bitmap))
+ msk->pm.local_addr_used--;
+ spin_unlock_bh(&msk->pm.lock);
+ }
+
release_sock(sk);
next:
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x e255683c06df572ead96db5efb5d21be30c0efaa
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082646-gondola-dainty-6096@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
e255683c06df ("mptcp: pm: re-using ID of unused removed ADD_ADDR")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
6bb3ab4913e9 ("selftests: mptcp: add MP_FAIL mibs check")
f7713dd5d23a ("selftests: mptcp: delete uncontinuous removing ids")
4f49d63352da ("selftests: mptcp: add fullmesh testcases")
0cddb4a6f4e3 ("selftests: mptcp: add deny_join_id0 testcases")
af66d3e1c3fa ("selftests: mptcp: enable checksum in mptcp_join.sh")
5e287fe76149 ("selftests: mptcp: remove id 0 address testcases")
ef360019db40 ("selftests: mptcp: signal addresses testcases")
efd13b71a3fa ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e255683c06df572ead96db5efb5d21be30c0efaa Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:19 +0200
Subject: [PATCH] mptcp: pm: re-using ID of unused removed ADD_ADDR
If no subflow is attached to the 'signal' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the address entry from the
list to cover this case.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-1-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 4cae2aa7be5c..26f0329e16bb 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1431,7 +1431,10 @@ static bool mptcp_pm_remove_anno_addr(struct mptcp_sock *msk,
ret = remove_anno_list_by_saddr(msk, addr);
if (ret || force) {
spin_lock_bh(&msk->pm.lock);
- msk->pm.add_addr_signaled -= ret;
+ if (ret) {
+ __set_bit(addr->id, msk->pm.id_avail_bitmap);
+ msk->pm.add_addr_signaled--;
+ }
mptcp_pm_remove_addr(msk, &list);
spin_unlock_bh(&msk->pm.lock);
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x e255683c06df572ead96db5efb5d21be30c0efaa
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024082621-unluckily-aghast-028b@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
e255683c06df ("mptcp: pm: re-using ID of unused removed ADD_ADDR")
4b317e0eb287 ("mptcp: fix NL PM announced address accounting")
6fa0174a7c86 ("mptcp: more careful RM_ADDR generation")
7d9bf018f907 ("selftests: mptcp: update output info of chk_rm_nr")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e255683c06df572ead96db5efb5d21be30c0efaa Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Mon, 19 Aug 2024 21:45:19 +0200
Subject: [PATCH] mptcp: pm: re-using ID of unused removed ADD_ADDR
If no subflow is attached to the 'signal' endpoint that is being
removed, the addr ID will not be marked as available again.
Mark the linked ID as available when removing the address entry from the
list to cover this case.
Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-1-38035d40de5b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 4cae2aa7be5c..26f0329e16bb 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1431,7 +1431,10 @@ static bool mptcp_pm_remove_anno_addr(struct mptcp_sock *msk,
ret = remove_anno_list_by_saddr(msk, addr);
if (ret || force) {
spin_lock_bh(&msk->pm.lock);
- msk->pm.add_addr_signaled -= ret;
+ if (ret) {
+ __set_bit(addr->id, msk->pm.id_avail_bitmap);
+ msk->pm.add_addr_signaled--;
+ }
mptcp_pm_remove_addr(msk, &list);
spin_unlock_bh(&msk->pm.lock);
}
Fix changes incorrect usb_request->status returned during disabling
endpoints. Before fix the status returned during dequeuing requests
while disabling endpoint was ECONNRESET.
Patch change it to ESHUTDOWN.
Patch fixes issue detected during testing UVC gadget.
During stopping streaming the class starts dequeuing usb requests and
controller driver returns the -ECONNRESET status. After completion
requests the class or application "uvc-gadget" try to queue this
request again. Changing this status to ESHUTDOWN cause that UVC assumes
that endpoint is disabled, or device is disconnected and stops
re-queuing usb requests.
Fixes: 3d82904559f4 ("usb: cdnsp: cdns3 Add main part of Cadence USBSSP DRD Driver")
cc: stable(a)vger.kernel.org
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
Changelog:
v2:
- added explanation of issue
drivers/usb/cdns3/cdnsp-ring.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/cdns3/cdnsp-ring.c b/drivers/usb/cdns3/cdnsp-ring.c
index 1e011560e3ae..bccc8fc143d0 100644
--- a/drivers/usb/cdns3/cdnsp-ring.c
+++ b/drivers/usb/cdns3/cdnsp-ring.c
@@ -718,7 +718,8 @@ int cdnsp_remove_request(struct cdnsp_device *pdev,
seg = cdnsp_trb_in_td(pdev, cur_td->start_seg, cur_td->first_trb,
cur_td->last_trb, hw_deq);
- if (seg && (pep->ep_state & EP_ENABLED))
+ if (seg && (pep->ep_state & EP_ENABLED) &&
+ !(pep->ep_state & EP_DIS_IN_RROGRESS))
cdnsp_find_new_dequeue_state(pdev, pep, preq->request.stream_id,
cur_td, &deq_state);
else
@@ -736,7 +737,8 @@ int cdnsp_remove_request(struct cdnsp_device *pdev,
* During disconnecting all endpoint will be disabled so we don't
* have to worry about updating dequeue pointer.
*/
- if (pdev->cdnsp_state & CDNSP_STATE_DISCONNECT_PENDING) {
+ if (pdev->cdnsp_state & CDNSP_STATE_DISCONNECT_PENDING ||
+ pep->ep_state & EP_DIS_IN_RROGRESS) {
status = -ESHUTDOWN;
ret = cdnsp_cmd_set_deq(pdev, pep, &deq_state);
}
--
2.43.0
Good day.
I am seeking a reliable and experienced partner to manage our
real estate investments in your country. The ideal partner will
possess:
- In-depth knowledge of the local real estate market
- Proven track record in property management and development
- Strong network and connections in the industry
- Ability to navigate regulatory requirements
- Transparency, integrity, and a commitment to delivering results
Responsibilities:
The partner will be responsible for:
- Sourcing and evaluating investment opportunities
- Conducting due diligence and risk assessments
- Managing property acquisition, development, and sales
- Ensuring compliance with local laws and regulations
- Providing regular updates and performance reports
Benefits:
By partnering with us, you will benefit from:
- Access to substantial investment capital
- Opportunity to collaborate with a reputable UK-based company
- Shared success and returns on investment.
I look forward to the possibility of working together and
achieving mutual success in the real estate market.
If you are interested in exploring this partnership opportunity,
I would be delighted to schedule a call or meeting to discuss
further.
Best regards
Croitoru Vasile.
The dwc2_handle_usb_suspend_intr() function disables gadget clocks in USB
peripheral mode when no other power-down mode is available (introduced by
commit 0112b7ce68ea ("usb: dwc2: Update dwc2_handle_usb_suspend_intr function.")).
However, the dwc2_drd_role_sw_set() USB role update handler attempts to
read DWC2 registers if the USB role has changed while the USB is in suspend
mode (when the clocks are gated). This causes the system to hang.
Release the gadget clocks before handling the USB role update.
Fixes: 0112b7ce68ea ("usb: dwc2: Update dwc2_handle_usb_suspend_intr function.")
Cc: stable(a)vger.kernel.org
Signed-off-by: Tomas Marek <tomas.marek(a)elrest.cz>
---
Changes in v2:
- CC stable(a)vger.kernel.org
- merge ifdef and nested if statements into one if statement
---
drivers/usb/dwc2/drd.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/usb/dwc2/drd.c b/drivers/usb/dwc2/drd.c
index a8605b02115b..1ad8fa3f862a 100644
--- a/drivers/usb/dwc2/drd.c
+++ b/drivers/usb/dwc2/drd.c
@@ -127,6 +127,15 @@ static int dwc2_drd_role_sw_set(struct usb_role_switch *sw, enum usb_role role)
role = USB_ROLE_DEVICE;
}
+ if ((IS_ENABLED(CONFIG_USB_DWC2_PERIPHERAL) ||
+ IS_ENABLED(CONFIG_USB_DWC2_DUAL_ROLE)) &&
+ dwc2_is_device_mode(hsotg) &&
+ hsotg->lx_state == DWC2_L2 &&
+ hsotg->params.power_down == DWC2_POWER_DOWN_PARAM_NONE &&
+ hsotg->bus_suspended &&
+ !hsotg->params.no_clock_gating)
+ dwc2_gadget_exit_clock_gating(hsotg, 0);
+
if (role == USB_ROLE_HOST) {
already = dwc2_ovr_avalid(hsotg, true);
} else if (role == USB_ROLE_DEVICE) {
--
2.25.1
nfs41_init_clientid does not signal a failure condition from
nfs4_proc_exchange_id and nfs4_proc_create_session to a client which may
lead to mount syscall indefinitely blocked in the following stack trace:
nfs_wait_client_init_complete
nfs41_discover_server_trunking
nfs4_discover_server_trunking
nfs4_init_client
nfs4_set_client
nfs4_create_server
nfs4_try_get_tree
vfs_get_tree
do_new_mount
__se_sys_mount
and the client stuck in uninitialized state.
In addition to this all subsequent mount calls would also get blocked in
nfs_match_client waiting for the uninitialized client to finish
initialization:
nfs_wait_client_init_complete
nfs_match_client
nfs_get_client
nfs4_set_client
nfs4_create_server
nfs4_try_get_tree
vfs_get_tree
do_new_mount
__se_sys_mount
To avoid this situation propagate error condition to the mount thread
and let mount syscall fail properly.
To: Trond Myklebust <trondmy(a)kernel.org>
To: Anna Schumaker <anna(a)kernel.org>
Cc: linux-nfs(a)vger.kernel.org
Cc: jbongio(a)google.com
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/stable/20240906-nfs-mount-deadlock-fix-v1-1-ea1aef5…
Signed-off-by: Oleksandr Tymoshenko <ovt(a)google.com>
---
Changes in v2:
- Added correct To/Cc entries to the commit message
- Link to v1: https://lore.kernel.org/r/20240906-nfs-mount-deadlock-fix-v1-1-ea1aef533f9c…
---
fs/nfs/nfs4state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 877f682b45f2..54ad3440ad2b 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -335,8 +335,8 @@ int nfs41_init_clientid(struct nfs_client *clp, const struct cred *cred)
if (!(clp->cl_exchange_flags & EXCHGID4_FLAG_CONFIRMED_R))
nfs4_state_start_reclaim_reboot(clp);
nfs41_finish_session_reset(clp);
- nfs_mark_client_ready(clp, NFS_CS_READY);
out:
+ nfs_mark_client_ready(clp, status == 0 ? NFS_CS_READY : status);
return status;
}
---
base-commit: ad618736883b8970f66af799e34007475fe33a68
change-id: 20240906-nfs-mount-deadlock-fix-55c14b38e088
Best regards,
--
Oleksandr Tymoshenko <ovt(a)google.com>
Fix changes incorrect usb_request->status returned during disabling
endpoints. Before fix the status returned during dequeuing requests
while disabling endpoint was ECONNRESET.
Patch changes it to ESHUTDOWN.
Fixes: 3d82904559f4 ("usb: cdnsp: cdns3 Add main part of Cadence USBSSP DRD Driver")
cc: stable(a)vger.kernel.org
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
drivers/usb/cdns3/cdnsp-ring.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/cdns3/cdnsp-ring.c b/drivers/usb/cdns3/cdnsp-ring.c
index 1e011560e3ae..bccc8fc143d0 100644
--- a/drivers/usb/cdns3/cdnsp-ring.c
+++ b/drivers/usb/cdns3/cdnsp-ring.c
@@ -718,7 +718,8 @@ int cdnsp_remove_request(struct cdnsp_device *pdev,
seg = cdnsp_trb_in_td(pdev, cur_td->start_seg, cur_td->first_trb,
cur_td->last_trb, hw_deq);
- if (seg && (pep->ep_state & EP_ENABLED))
+ if (seg && (pep->ep_state & EP_ENABLED) &&
+ !(pep->ep_state & EP_DIS_IN_RROGRESS))
cdnsp_find_new_dequeue_state(pdev, pep, preq->request.stream_id,
cur_td, &deq_state);
else
@@ -736,7 +737,8 @@ int cdnsp_remove_request(struct cdnsp_device *pdev,
* During disconnecting all endpoint will be disabled so we don't
* have to worry about updating dequeue pointer.
*/
- if (pdev->cdnsp_state & CDNSP_STATE_DISCONNECT_PENDING) {
+ if (pdev->cdnsp_state & CDNSP_STATE_DISCONNECT_PENDING ||
+ pep->ep_state & EP_DIS_IN_RROGRESS) {
status = -ESHUTDOWN;
ret = cdnsp_cmd_set_deq(pdev, pep, &deq_state);
}
--
2.43.0
v3:
- Amends the commit log for patch #1 per Johan's suggestion.
- Link to v2: https://lore.kernel.org/r/20240716-linux-next-24-07-13-camss-fixes-v2-0-e60…
v2:
- Updates commits with Johan's Review/Reported tags
- Adds Closes: https://lore.kernel.org/lkml/ZoVNHOTI0PKMNt4_@hovoldconsulting.com
- Cc's stable
- Adds in suggested kernel log to allow others to more easily match kernel
log to fixes
- Link to v1: https://lore.kernel.org/r/20240714-linux-next-24-07-13-camss-fixes-v1-0-8f8…
V1:
Dogfooding with SoftISP has uncovered two bugs in this series which I'm
posting fixes for.
- The first error:
A simple race condition which to be honest I'm surprised I haven't found
earlier nor has anybody else. Simply stated the order we typically
end up loading CAMSS on boot has masked out the pm_runtime_enable() race
condition that has been present in CAMSS for a long time.
If you blacklist qcom-camss in modules.d and then modprobe after boot,
the race condition shows up easily.
Moving the pm_runtime_enable prior to subdevice registration fixes the
problem.
The second error:
Nomenclature:
- CSIPHY: CSI Physical layer analogue to digital domain serialiser
- CSID: CSI Decoder
- VFE: Video Front End
- RDI: Raw Data Interface
- VC: Virtual Channel
In order to support streaming multiple virtual-channels on the same RDI a
V4L2 provided use_count variable is used to decide whether or not to actually
terminate streaming and release buffers for 'msm_vfe_rdiX'.
Unfortunately use_count indicates the number of times msm_vfe_rdiX has
been opened by user-space not the number of concurrent streams on
msm_vfe_rdiX.
Simply stated use_count and stream_count are two different things.
The silicon enabling code to select between VCs is valid but, a different
solution needs to be found to support _concurrent_ VC streams.
Right now the upstream use_count as-is is breaking the non concurrent VC
case and I don't believe there are upstream users of concurrent VCs on
CAMSS.
This series implements a revert for the invalid use_count check,
retaining the ability to select which VC is active on the RDI.
Dogfooding with libcamera's SoftISP in Hangouts, Zoom and multiple runs
of libcamera's "qcam" application is a very different test-case to the
simple capture of frames we previously did when validating the
'use_count' change.
A partial revert in expectation of a renewed push to fixup that
concurrent VC issue is included.
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
Bryan O'Donoghue (2):
media: qcom: camss: Remove use_count guard in stop_streaming
media: qcom: camss: Fix ordering of pm_runtime_enable
drivers/media/platform/qcom/camss/camss-video.c | 6 ------
drivers/media/platform/qcom/camss/camss.c | 5 +++--
2 files changed, 3 insertions(+), 8 deletions(-)
---
base-commit: c6ce8f9ab92edc9726996a0130bfc1c408132d47
change-id: 20240713-linux-next-24-07-13-camss-fixes-fa98c0965a5d
Best regards,
--
Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
#regzbot introduced: 3ee1a1fc3981
Dear maintainers,
I think I have found a cifs regression in the 6.10 kernel series, which leads
certain programs to write corrupt data.
After upgrading from kernel 6.9.12 to 6.10.6, flatpak and ostree are now
writing bad gpg signatures when exporting signed packages or signing their
repository metadata/summary files, whenever the repository is on a cifs mount.
Instead of writing the signature data, null bytes are written in its place.
Furthermore, ffmpeg and mkvmerge are now intermittently writing corrupt files
to cifs mounts.
No error is reported by the applications or the kernel when it happens.
In the case of flatpak, the problem isn't revealed until something tries to use
the repository and finds signatures full of null bytes. (Of course, this means
the affected repositories have been rendered useless.) In the case of ffmpeg
and mkvmerge, the problem isn't revealed until someone plays the video file and
reaches a corrupt section.
A kernel bisect reveals this:
3ee1a1fc39819906f04d6c62c180e760cd3a689d is the first bad commit
commit 3ee1a1fc39819906f04d6c62c180e760cd3a689d
Author: David Howells <dhowells(a)redhat.com>
Date: Fri Oct 6 18:29:59 2023 +0100
cifs: Cut over to using netfslib
I was unable to determine whether 6.11.0-rc4 fixes it, due to another cifs bug
in that version (which I hope to report soon).
An strace of flatpak (which uses libostree) shows it generating correct
signatures internally, but behaving differently on cifs vs. ext4 when working
with memory-mapped temp files, in which the signatures are stored before being
written to their final outputs. Here's where I reported my initial findings to
those projects:
https://github.com/flatpak/flatpak/issues/5911https://github.com/ostreedev/ostree/issues/3288
Debian Testing and Unstable kernels (6.10.4-1 and 6.10.6-1) are affected:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1079394
The following reproducer script consistently triggers the problem for me. Run
it with two arguments: a path on a cifs mount where an ostree repo should be
created, and a GPG key ID with which to sign a commit.
#!/bin/sh
set -e
if [ "$#" -lt 2 ] || [ "$1" = "-h" ] ; then
echo "usage: $(basename "$0") <repo-dir> <gpg-key-id>"
exit 2
fi
repo=$1
keyid=$2
src="./foo"
echo "creating ostree repo at $repo"
ostree init --repo="$repo"
echo "creating source file tree at $src"
mkdir -p "$src"
echo hi > "$src"/hello
ostree commit --repo="$repo" --branch=foo --gpg-sign="$keyid" "$src"
if ostree show --repo="$repo" foo; then
echo ---
echo success!
else
echo ---
ostree show --repo="$repo" --print-detached-metadata-key=ostree.gpgsigs foo
echo failure!
echo look for null bytes in the above commit signature
fi
I get an "out of memory" error when building Linux kernels 5.15.164,
5.15.165 and 5.15.166-rc1:
...
LD [M] drivers/mtd/tests/mtd_stresstest.o
LD [M] drivers/pcmcia/pcmcia_core.o
LD [M] drivers/mtd/tests/mtd_subpagetest.o
cc1: out of memory allocating 180705472 bytes after a total of 283914240
bytes
LD [M] drivers/mtd/tests/mtd_torturetest.o
CC [M] drivers/mtd/ubi/wl.o
LD [M] drivers/pcmcia/pcmcia.o
CC [M] drivers/gpu/drm/nouveau/nvkm/engine/disp/headgv100.o
CC [M] drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_hw_lock_mgr.o
LD [M] drivers/mtd/tests/mtd_nandbiterrs.o
CC [M] drivers/mtd/ubi/attach.o
LD [M] drivers/staging/qlge/qlge.o
make[4]: *** [scripts/Makefile.build:289:
drivers/staging/media/atomisp/pci/isp/kernels/ynr/ynr_1.0/ia_css_ynr.host.o]
Error 1
make[3]: *** [scripts/Makefile.build:552: drivers/staging/media/atomisp]
Error 2
make[2]: *** [scripts/Makefile.build:552: drivers/staging/media] Error 2
make[2]: *** Waiting for unfinished jobs....
LD [M] drivers/pcmcia/pcmcia_rsrc.o
CC [M] drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_outbox.o
make[1]: *** [scripts/Makefile.build:552: drivers/staging] Error 2
make[1]: *** Waiting for unfinished jobs....
CC [M] drivers/gpu/drm/amd/amdgpu/../display/dc/dml/calcs/dce_calcs.o
CC [M] drivers/gpu/drm/amd/amdgpu/../display/dc/dml/calcs/custom_float.o
CC [M] drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.o
...
#uname -a
Linux aragorn 5.15.166-rc1-smp #1 SMP PREEMPT Mon Sep 2 14:03:00 PDT 2024
i686 AMD Ryzen 9 5900X 12-Core Processor AuthenticAMD GNU/Linux
Attached is my config file.
I found a work around for this problem.
Remove the six minmax patches introduced with kernel 5.15.164:
minmax: allow comparisons of 'int' against 'unsigned char/short'
minmax: allow min()/max()/clamp() if the arguments have the same
minmax: clamp more efficiently by avoiding extra comparison
minmax: fix header inclusions
minmax: relax check to allow comparison between unsigned arguments
minmax: sanity check constant bounds when clamping
Can these 6 patches be removed or fixed?
The patch titled
Subject: mm/codetag: add pgalloc_tag_copy()
has been added to the -mm mm-unstable branch. Its filename is
mm-codetag-add-pgalloc_tag_copy.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/codetag: add pgalloc_tag_copy()
Date: Tue, 3 Sep 2024 15:36:49 -0600
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the
new one during migration. This makes original allocation sites persist
cross migration rather than lump into compaction_alloc, e.g.,
# echo 1 >/proc/sys/vm/compact_memory
# grep compaction_alloc /proc/allocinfo
Before this patch:
132968448 32463 mm/compaction.c:1880 func:compaction_alloc
After this patch:
0 0 mm/compaction.c:1880 func:compaction_alloc
Link: https://lkml.kernel.org/r/20240903213649.3566695-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/alloc_tag.h | 24 ++++++++++--------------
include/linux/mm.h | 25 +++++++++++++++++++++++++
mm/migrate.c | 1 +
3 files changed, 36 insertions(+), 14 deletions(-)
--- a/include/linux/alloc_tag.h~mm-codetag-add-pgalloc_tag_copy
+++ a/include/linux/alloc_tag.h
@@ -137,7 +137,16 @@ static inline void alloc_tag_sub_check(u
/* Caller should verify both ref and tag to be valid */
static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
{
+ alloc_tag_add_check(ref, tag);
+ if (!ref || !tag)
+ return;
+
ref->ct = &tag->ct;
+}
+
+static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
+{
+ __alloc_tag_ref_set(ref, tag);
/*
* We need in increment the call counter every time we have a new
* allocation or when we split a large allocation into smaller ones.
@@ -147,22 +156,9 @@ static inline void __alloc_tag_ref_set(u
this_cpu_inc(tag->counters->calls);
}
-static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
-{
- alloc_tag_add_check(ref, tag);
- if (!ref || !tag)
- return;
-
- __alloc_tag_ref_set(ref, tag);
-}
-
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
{
- alloc_tag_add_check(ref, tag);
- if (!ref || !tag)
- return;
-
- __alloc_tag_ref_set(ref, tag);
+ alloc_tag_ref_set(ref, tag);
this_cpu_add(tag->counters->bytes, bytes);
}
--- a/include/linux/mm.h~mm-codetag-add-pgalloc_tag_copy
+++ a/include/linux/mm.h
@@ -4161,10 +4161,35 @@ static inline void pgalloc_tag_split(str
}
}
}
+
+static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
+{
+ struct alloc_tag *tag;
+ union codetag_ref *ref;
+
+ tag = pgalloc_tag_get(&old->page);
+ if (!tag)
+ return;
+
+ ref = get_page_tag_ref(&new->page);
+ if (!ref)
+ return;
+
+ /* Clear the old ref to the original allocation site. */
+ clear_page_tag_ref(&old->page);
+ /* Decrement the counters of the tag on get_new_folio. */
+ alloc_tag_sub(ref, folio_nr_pages(new));
+ __alloc_tag_ref_set(ref, tag);
+ put_page_tag_ref(ref);
+}
#else /* !CONFIG_MEM_ALLOC_PROFILING */
static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
{
}
+
+static inline void pgalloc_tag_copy(struct folio *new, struct folio *old)
+{
+}
#endif /* CONFIG_MEM_ALLOC_PROFILING */
#endif /* _LINUX_MM_H */
--- a/mm/migrate.c~mm-codetag-add-pgalloc_tag_copy
+++ a/mm/migrate.c
@@ -743,6 +743,7 @@ void folio_migrate_flags(struct folio *n
folio_set_readahead(newfolio);
folio_copy_owner(newfolio, folio);
+ pgalloc_tag_copy(newfolio, folio);
mem_cgroup_migrate(folio, newfolio);
}
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-remap-unused-subpages-to-shared-zeropage-when-splitting-isolated-thp.patch
mm-codetag-fix-a-typo.patch
mm-codetag-fix-pgalloc_tag_split.patch
mm-codetag-add-pgalloc_tag_copy.patch
The patch titled
Subject: mm/codetag: fix pgalloc_tag_split()
has been added to the -mm mm-unstable branch. Its filename is
mm-codetag-fix-pgalloc_tag_split.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/codetag: fix pgalloc_tag_split()
Date: Tue, 3 Sep 2024 15:36:48 -0600
Only tag the new head pages when splitting one large folio to multiple
ones of a lower order. Tagging tail pages can cause imbalanced "calls"
counters, since only head pages are untagged by pgalloc_tag_sub() and
reference counts on tail pages are leaked, e.g.,
# echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
# echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
# grep alloc_gigantic_folio /proc/allocinfo
Before this patch:
0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m2.057s
user 0m0.000s
sys 0m2.051s
After this patch:
0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m1.711s
user 0m0.000s
sys 0m1.704s
Not tagging tail pages also improves the splitting time, e.g., by about
15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above.
Link: https://lkml.kernel.org/r/20240903213649.3566695-2-yuzhao@google.com
Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 30 ++++++++++++++++++++++++++++++
include/linux/pgalloc_tag.h | 31 -------------------------------
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 4 ++--
5 files changed, 34 insertions(+), 35 deletions(-)
--- a/include/linux/mm.h~mm-codetag-fix-pgalloc_tag_split
+++ a/include/linux/mm.h
@@ -4137,4 +4137,34 @@ void vma_pgtable_walk_end(struct vm_area
int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
+{
+ int i;
+ struct alloc_tag *tag;
+ unsigned int nr_pages = 1 << new_order;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ tag = pgalloc_tag_get(&folio->page);
+ if (!tag)
+ return;
+
+ for (i = nr_pages; i < (1 << old_order); i += nr_pages) {
+ union codetag_ref *ref = get_page_tag_ref(folio_page(folio, i));
+
+ if (ref) {
+ /* Set new reference to point to the original tag */
+ alloc_tag_ref_set(ref, tag);
+ put_page_tag_ref(ref);
+ }
+ }
+}
+#else /* !CONFIG_MEM_ALLOC_PROFILING */
+static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order)
+{
+}
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#endif /* _LINUX_MM_H */
--- a/include/linux/pgalloc_tag.h~mm-codetag-fix-pgalloc_tag_split
+++ a/include/linux/pgalloc_tag.h
@@ -80,36 +80,6 @@ static inline void pgalloc_tag_sub(struc
}
}
-static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
-{
- int i;
- struct page_ext *first_page_ext;
- struct page_ext *page_ext;
- union codetag_ref *ref;
- struct alloc_tag *tag;
-
- if (!mem_alloc_profiling_enabled())
- return;
-
- first_page_ext = page_ext = page_ext_get(page);
- if (unlikely(!page_ext))
- return;
-
- ref = codetag_ref_from_page_ext(page_ext);
- if (!ref->ct)
- goto out;
-
- tag = ct_to_alloc_tag(ref->ct);
- page_ext = page_ext_next(page_ext);
- for (i = 1; i < nr; i++) {
- /* Set new reference to point to the original tag */
- alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag);
- page_ext = page_ext_next(page_ext);
- }
-out:
- page_ext_put(first_page_ext);
-}
-
static inline struct alloc_tag *pgalloc_tag_get(struct page *page)
{
struct alloc_tag *tag = NULL;
@@ -142,7 +112,6 @@ static inline void clear_page_tag_ref(st
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
unsigned int nr) {}
static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {}
-static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; }
static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
--- a/mm/huge_memory.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/huge_memory.c
@@ -3242,7 +3242,7 @@ static void __split_huge_page(struct pag
/* Caller disabled irqs, so they are still disabled here */
split_page_owner(head, order, new_order);
- pgalloc_tag_split(head, 1 << order);
+ pgalloc_tag_split(folio, order, new_order);
/* See comment in __split_huge_page_tail() */
if (folio_test_anon(folio)) {
--- a/mm/hugetlb.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/hugetlb.c
@@ -3795,7 +3795,7 @@ static long demote_free_hugetlb_folios(s
list_del(&folio->lru);
split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst));
- pgalloc_tag_split(&folio->page, 1 << huge_page_order(src));
+ pgalloc_tag_split(folio, huge_page_order(src), huge_page_order(dst));
for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) {
struct page *page = folio_page(folio, i);
--- a/mm/page_alloc.c~mm-codetag-fix-pgalloc_tag_split
+++ a/mm/page_alloc.c
@@ -2783,7 +2783,7 @@ void split_page(struct page *page, unsig
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, order, 0);
- pgalloc_tag_split(page, 1 << order);
+ pgalloc_tag_split(page_folio(page), order, 0);
split_page_memcg(page, order, 0);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -4981,7 +4981,7 @@ static void *make_alloc_exact(unsigned l
struct page *last = page + nr;
split_page_owner(page, order, 0);
- pgalloc_tag_split(page, 1 << order);
+ pgalloc_tag_split(page_folio(page), order, 0);
split_page_memcg(page, order, 0);
while (page < --last)
set_page_refcounted(last);
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-remap-unused-subpages-to-shared-zeropage-when-splitting-isolated-thp.patch
mm-codetag-fix-a-typo.patch
mm-codetag-fix-pgalloc_tag_split.patch
mm-codetag-add-pgalloc_tag_copy.patch
The patch titled
Subject: mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-vaddr-protect-vma-traversal-in-__damon_va_thre_regions-with-rcu-read-lock.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
Date: Wed, 4 Sep 2024 17:12:04 -0700
Traversing VMAs of a given maple tree should be protected by rcu read
lock. However, __damon_va_three_regions() is not doing the protection.
Hold the lock.
Link: https://lkml.kernel.org/r/20240905001204.1481-1-sj@kernel.org
Fixes: d0cf3dd47f0d ("damon: convert __damon_va_three_regions to use the VMA iterator")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Closes: https://lore.kernel.org/b83651a0-5b24-4206-b860-cb54ffdf209b@roeck-us.net
Tested-by: Guenter Roeck <linux(a)roeck-us.net>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/vaddr.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/damon/vaddr.c~mm-damon-vaddr-protect-vma-traversal-in-__damon_va_thre_regions-with-rcu-read-lock
+++ a/mm/damon/vaddr.c
@@ -126,6 +126,7 @@ static int __damon_va_three_regions(stru
* If this is too slow, it can be optimised to examine the maple
* tree gaps.
*/
+ rcu_read_lock();
for_each_vma(vmi, vma) {
unsigned long gap;
@@ -146,6 +147,7 @@ static int __damon_va_three_regions(stru
next:
prev = vma;
}
+ rcu_read_unlock();
if (!sz_range(&second_gap) || !sz_range(&first_gap))
return -EINVAL;
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
mm-damon-vaddr-protect-vma-traversal-in-__damon_va_thre_regions-with-rcu-read-lock.patch
The patch titled
Subject: mm: use unique zsmalloc caches names
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-use-unique-zsmalloc-caches-names.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Subject: mm: use unique zsmalloc caches names
Date: Thu, 5 Sep 2024 15:47:23 +0900
Each zsmalloc pool maintains several named kmem-caches for zs_handle-s and
zspage-s. On a system with multiple zsmalloc pools and CONFIG_DEBUG_VM
this triggers kmem_cache_sanity_check():
kmem_cache of name 'zspage' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
kmem_cache of name 'zs_handle' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
We provide zram device name when init its zsmalloc pool, so we can
use that same name for zsmalloc caches and, hence, create unique
names that can easily be linked to zram device that has created
them.
So instead of having this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage 46 46 ...
zs_handle 128 128 ...
zspage 34270 34270 ...
zs_handle 34816 34816 ...
zspage 0 0 ...
zs_handle 0 0 ...
We now have this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage-zram2 46 46 ...
zs_handle-zram2 128 128 ...
zspage-zram0 34270 34270 ...
zs_handle-zram0 34816 34816 ...
zspage-zram1 0 0 ...
zs_handle-zram1 0 0 ...
Link: https://lkml.kernel.org/r/20240905064736.2250735-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/zsmalloc.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
--- a/mm/zsmalloc.c~mm-use-unique-zsmalloc-caches-names
+++ a/mm/zsmalloc.c
@@ -293,13 +293,17 @@ static void SetZsPageMovable(struct zs_p
static int create_cache(struct zs_pool *pool)
{
- pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
- 0, 0, NULL);
+ char name[32];
+
+ snprintf(name, sizeof(name), "zs_handle-%s", pool->name);
+ pool->handle_cachep = kmem_cache_create(name, ZS_HANDLE_SIZE,
+ 0, 0, NULL);
if (!pool->handle_cachep)
return 1;
- pool->zspage_cachep = kmem_cache_create("zspage", sizeof(struct zspage),
- 0, 0, NULL);
+ snprintf(name, sizeof(name), "zspage-%s", pool->name);
+ pool->zspage_cachep = kmem_cache_create(name, sizeof(struct zspage),
+ 0, 0, NULL);
if (!pool->zspage_cachep) {
kmem_cache_destroy(pool->handle_cachep);
pool->handle_cachep = NULL;
_
Patches currently in -mm which might be from senozhatsky(a)chromium.org are
mm-use-unique-zsmalloc-caches-names.patch
lib-zstd-export-api-needed-for-dictionary-support.patch
lib-lz4hc-export-lz4_resetstreamhc-symbol.patch
lib-zstd-fix-null-deref-in-zstd_createcdict_advanced2.patch
zram-introduce-custom-comp-backends-api.patch
zram-add-lzo-and-lzorle-compression-backends-support.patch
zram-add-lz4-compression-backend-support.patch
zram-add-lz4hc-compression-backend-support.patch
zram-add-zstd-compression-backend-support.patch
zram-pass-estimated-src-size-hint-to-zstd.patch
zram-add-zlib-compression-backend-support.patch
zram-add-842-compression-backend-support.patch
zram-check-that-backends-array-has-at-least-one-backend.patch
zram-introduce-zcomp_params-structure.patch
zram-recalculate-zstd-compression-params-once.patch
zram-introduce-algorithm_params-device-attribute.patch
zram-add-support-for-dict-comp-config.patch
zram-introduce-zcomp_req-structure.patch
zram-introduce-zcomp_ctx-structure.patch
zram-move-immutable-comp-params-away-from-per-cpu-context.patch
zram-add-dictionary-support-to-lz4.patch
zram-add-dictionary-support-to-lz4hc.patch
zram-add-dictionary-support-to-zstd-backend.patch
documentation-zram-add-documentation-for-algorithm-parameters.patch
documentation-zram-add-documentation-for-algorithm-parameters-fix.patch
zram-support-priority-parameter-in-recompression.patch
mm-kconfig-fixup-zsmalloc-configuration.patch
From: Steven Rostedt <rostedt(a)goodmis.org>
The timerlat interface will get and put the task that is part of the
"kthread" field of the osn_var to keep it around until all references are
released. But here's a race in the "stop_kthread()" code that will call
put_task_struct() on the kthread if it is not a kernel thread. This can
race with the releasing of the references to that task struct and the
put_task_struct() can be called twice when it should have been called just
once.
Take the interface_lock() in stop_kthread() to synchronize this change.
But to do so, the function stop_per_cpu_kthreads() needs to change the
loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
cpu_read_lock(), as the interface_lock can not be taken while the cpu
locks are held. The only side effect of this change is that it may do some
extra work, as the per_cpu variables of the offline CPUs would not be set
anyway, and would simply be skipped in the loop.
Remove unneeded "return;" in stop_kthread().
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Tomas Glozar <tglozar(a)redhat.com>
Cc: John Kacur <jkacur(a)redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv(a)redhat.com>
Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_osnoise.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 48e5014dd4ab..bbe47781617e 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1953,8 +1953,12 @@ static void stop_kthread(unsigned int cpu)
{
struct task_struct *kthread;
+ mutex_lock(&interface_lock);
kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
if (kthread) {
+ per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+ mutex_unlock(&interface_lock);
+
if (cpumask_test_and_clear_cpu(cpu, &kthread_cpumask) &&
!WARN_ON(!test_bit(OSN_WORKLOAD, &osnoise_options))) {
kthread_stop(kthread);
@@ -1967,8 +1971,8 @@ static void stop_kthread(unsigned int cpu)
kill_pid(kthread->thread_pid, SIGKILL, 1);
put_task_struct(kthread);
}
- per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
} else {
+ mutex_unlock(&interface_lock);
/* if no workload, just return */
if (!test_bit(OSN_WORKLOAD, &osnoise_options)) {
/*
@@ -1976,7 +1980,6 @@ static void stop_kthread(unsigned int cpu)
*/
per_cpu(per_cpu_osnoise_var, cpu).sampling = false;
barrier();
- return;
}
}
}
@@ -1991,12 +1994,8 @@ static void stop_per_cpu_kthreads(void)
{
int cpu;
- cpus_read_lock();
-
- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu)
stop_kthread(cpu);
-
- cpus_read_unlock();
}
/*
--
2.43.0
From: Steven Rostedt <rostedt(a)goodmis.org>
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shutdown one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shutdown and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.
Only cancel the hrtimer if the associated thread is still around. Also add
the interface_lock around the resetting of the tlat_var->kthread.
Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.
Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv(a)redhat.com>
Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_osnoise.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index d770927efcd9..48e5014dd4ab 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -252,6 +252,11 @@ static inline struct timerlat_variables *this_cpu_tmr_var(void)
return this_cpu_ptr(&per_cpu_timerlat_var);
}
+/*
+ * Protect the interface.
+ */
+static struct mutex interface_lock;
+
/*
* tlat_var_reset - Reset the values of the given timerlat_variables
*/
@@ -259,14 +264,20 @@ static inline void tlat_var_reset(void)
{
struct timerlat_variables *tlat_var;
int cpu;
+
+ /* Synchronize with the timerlat interfaces */
+ mutex_lock(&interface_lock);
/*
* So far, all the values are initialized as 0, so
* zeroing the structure is perfect.
*/
for_each_cpu(cpu, cpu_online_mask) {
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
}
+ mutex_unlock(&interface_lock);
}
#else /* CONFIG_TIMERLAT_TRACER */
#define tlat_var_reset() do {} while (0)
@@ -331,11 +342,6 @@ struct timerlat_sample {
};
#endif
-/*
- * Protect the interface.
- */
-static struct mutex interface_lock;
-
/*
* Tracer data.
*/
@@ -2591,7 +2597,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
- hrtimer_cancel(&tlat_var->timer);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
osn_var->sampling = 0;
--
2.43.0
devm_kasprintf() can return a NULL pointer on failure but this returned
value is not checked. Fix this lack and check the returned value.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 32c170ff15b0 ("pinctrl: stm32: set default gpio line names using pin names")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
drivers/pinctrl/stm32/pinctrl-stm32.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/pinctrl/stm32/pinctrl-stm32.c b/drivers/pinctrl/stm32/pinctrl-stm32.c
index a8673739871d..53306d939d14 100644
--- a/drivers/pinctrl/stm32/pinctrl-stm32.c
+++ b/drivers/pinctrl/stm32/pinctrl-stm32.c
@@ -1374,8 +1374,13 @@ static int stm32_gpiolib_register_bank(struct stm32_pinctrl *pctl, struct fwnode
for (i = 0; i < npins; i++) {
stm32_pin = stm32_pctrl_get_desc_pin_from_gpio(pctl, bank, i);
- if (stm32_pin && stm32_pin->pin.name)
+ if (stm32_pin && stm32_pin->pin.name) {
names[i] = devm_kasprintf(dev, GFP_KERNEL, "%s", stm32_pin->pin.name);
+ if (!name[i]) {
+ err = -ENOMEM;
+ goto err_clk;
+ }
+ }
else
names[i] = NULL;
}
--
2.25.1
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: venus: fix use after free bug in venus_remove due to race condition
Author: Zheng Wang <zyytlz.wz(a)163.com>
Date: Tue Jun 18 14:55:59 2024 +0530
in venus_probe, core->work is bound with venus_sys_error_handler, which is
used to handle error. The code use core->sys_err_done to make sync work.
The core->work is started in venus_event_notify.
If we call venus_remove, there might be an unfished work. The possible
sequence is as follows:
CPU0 CPU1
|venus_sys_error_handler
venus_remove |
hfi_destroy |
venus_hfi_destroy |
kfree(hdev); |
|hfi_reinit
|venus_hfi_queues_reinit
|//use hdev
Fix it by canceling the work in venus_remove.
Cc: stable(a)vger.kernel.org
Fixes: af2c3834c8ca ("[media] media: venus: adding core part and helper functions")
Signed-off-by: Zheng Wang <zyytlz.wz(a)163.com>
Signed-off-by: Dikshita Agarwal <quic_dikshita(a)quicinc.com>
Signed-off-by: Stanimir Varbanov <stanimir.k.varbanov(a)gmail.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
drivers/media/platform/qcom/venus/core.c | 1 +
1 file changed, 1 insertion(+)
---
diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c
index 165c947a6703..84e95a46dfc9 100644
--- a/drivers/media/platform/qcom/venus/core.c
+++ b/drivers/media/platform/qcom/venus/core.c
@@ -430,6 +430,7 @@ static void venus_remove(struct platform_device *pdev)
struct device *dev = core->dev;
int ret;
+ cancel_delayed_work_sync(&core->work);
ret = pm_runtime_get_sync(dev);
WARN_ON(ret < 0);
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: videobuf2: Drop minimum allocation requirement of 2 buffers
Author: Laurent Pinchart <laurent.pinchart+renesas(a)ideasonboard.com>
Date: Mon Aug 26 02:24:49 2024 +0300
When introducing the ability for drivers to indicate the minimum number
of buffers they require an application to allocate, commit 6662edcd32cc
("media: videobuf2: Add min_reqbufs_allocation field to vb2_queue
structure") also introduced a global minimum of 2 buffers. It turns out
this breaks the Renesas R-Car VSP test suite, where a test that
allocates a single buffer fails when two buffers are used.
One may consider debatable whether test suite failures without failures
in production use cases should be considered as a regression, but
operation with a single buffer is a valid use case. While full frame
rate can't be maintained, memory-to-memory devices can still be used
with a decent efficiency, and requiring applications to allocate
multiple buffers for single-shot use cases with capture devices would
just waste memory.
For those reasons, fix the regression by dropping the global minimum of
buffers. Individual drivers can still set their own minimum.
Fixes: 6662edcd32cc ("media: videobuf2: Add min_reqbufs_allocation field to vb2_queue structure")
Cc: stable(a)vger.kernel.org
Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas(a)ideasonboard.com>
Reviewed-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
Acked-by: Tomasz Figa <tfiga(a)chromium.org>
Link: https://lore.kernel.org/r/20240825232449.25905-1-laurent.pinchart+renesas@i…
Signed-off-by: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
drivers/media/common/videobuf2/videobuf2-core.c | 7 -------
1 file changed, 7 deletions(-)
---
diff --git a/drivers/media/common/videobuf2/videobuf2-core.c b/drivers/media/common/videobuf2/videobuf2-core.c
index 500a4e0c84ab..29a8d876e6c2 100644
--- a/drivers/media/common/videobuf2/videobuf2-core.c
+++ b/drivers/media/common/videobuf2/videobuf2-core.c
@@ -2632,13 +2632,6 @@ int vb2_core_queue_init(struct vb2_queue *q)
if (WARN_ON(q->supports_requests && q->min_queued_buffers))
return -EINVAL;
- /*
- * The minimum requirement is 2: one buffer is used
- * by the hardware while the other is being processed by userspace.
- */
- if (q->min_reqbufs_allocation < 2)
- q->min_reqbufs_allocation = 2;
-
/*
* If the driver needs 'min_queued_buffers' in the queue before
* calling start_streaming() then the minimum requirement is
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: ov5675: Fix power on/off delay timings
Author: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Date: Sat Jul 13 23:33:29 2024 +0100
The ov5675 specification says that the gap between XSHUTDN deassert and the
first I2C transaction should be a minimum of 8192 XVCLK cycles.
Right now we use a usleep_rage() that gives a sleep time of between about
430 and 860 microseconds.
On the Lenovo X13s we have observed that in about 1/20 cases the current
timing is too tight and we start transacting before the ov5675's reset
cycle completes, leading to I2C bus transaction failures.
The reset racing is sometimes triggered at initial chip probe but, more
usually on a subsequent power-off/power-on cycle e.g.
[ 71.451662] ov5675 24-0010: failed to write reg 0x0103. error = -5
[ 71.451686] ov5675 24-0010: failed to set plls
The current quiescence period we have is too tight. Instead of expressing
the post reset delay in terms of the current XVCLK this patch converts the
power-on and power-off delays to the maximum theoretical delay @ 6 MHz with
an additional buffer.
1.365 milliseconds on the power-on path is 1.5 milliseconds with grace.
85.3 microseconds on the power-off path is 90 microseconds with grace.
Fixes: 49d9ad719e89 ("media: ov5675: add device-tree support and support runtime PM")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Tested-by: Johan Hovold <johan+linaro(a)kernel.org>
Reviewed-by: Quentin Schulz <quentin.schulz(a)cherry.de>
Tested-by: Quentin Schulz <quentin.schulz(a)cherry.de> # RK3399 Puma with
Signed-off-by: Sakari Ailus <sakari.ailus(a)linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
drivers/media/i2c/ov5675.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
---
diff --git a/drivers/media/i2c/ov5675.c b/drivers/media/i2c/ov5675.c
index 3641911bc73f..5b5127f8953f 100644
--- a/drivers/media/i2c/ov5675.c
+++ b/drivers/media/i2c/ov5675.c
@@ -972,12 +972,10 @@ static int ov5675_set_stream(struct v4l2_subdev *sd, int enable)
static int ov5675_power_off(struct device *dev)
{
- /* 512 xvclk cycles after the last SCCB transation or MIPI frame end */
- u32 delay_us = DIV_ROUND_UP(512, OV5675_XVCLK_19_2 / 1000 / 1000);
struct v4l2_subdev *sd = dev_get_drvdata(dev);
struct ov5675 *ov5675 = to_ov5675(sd);
- usleep_range(delay_us, delay_us * 2);
+ usleep_range(90, 100);
clk_disable_unprepare(ov5675->xvclk);
gpiod_set_value_cansleep(ov5675->reset_gpio, 1);
@@ -988,7 +986,6 @@ static int ov5675_power_off(struct device *dev)
static int ov5675_power_on(struct device *dev)
{
- u32 delay_us = DIV_ROUND_UP(8192, OV5675_XVCLK_19_2 / 1000 / 1000);
struct v4l2_subdev *sd = dev_get_drvdata(dev);
struct ov5675 *ov5675 = to_ov5675(sd);
int ret;
@@ -1014,8 +1011,11 @@ static int ov5675_power_on(struct device *dev)
gpiod_set_value_cansleep(ov5675->reset_gpio, 0);
- /* 8192 xvclk cycles prior to the first SCCB transation */
- usleep_range(delay_us, delay_us * 2);
+ /* Worst case quiesence gap is 1.365 milliseconds @ 6MHz XVCLK
+ * Add an additional threshold grace period to ensure reset
+ * completion before initiating our first I2C transaction.
+ */
+ usleep_range(1500, 1600);
return 0;
}
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: imx335: Fix reset-gpio handling
Author: Umang Jain <umang.jain(a)ideasonboard.com>
Date: Fri Aug 30 11:41:52 2024 +0530
Rectify the logical value of reset-gpio so that it is set to
0 (disabled) during power-on and to 1 (enabled) during power-off.
Set the reset-gpio to GPIO_OUT_HIGH at initialization time to make
sure it starts off in reset. Also drop the "Set XCLR" comment which
is not-so-informative.
The existing usage of imx335 had reset-gpios polarity inverted
(GPIO_ACTIVE_HIGH) in their device-tree sources. With this patch
included, those DTS will not be able to stream imx335 anymore. The
reset-gpio polarity will need to be rectified in the device-tree
sources as shown in [1] example, in order to get imx335 functional
again (as it remains in reset prior to this fix).
Cc: stable(a)vger.kernel.org
Fixes: 45d19b5fb9ae ("media: i2c: Add imx335 camera sensor driver")
Reviewed-by: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
Link: https://lore.kernel.org/linux-media/20240729110437.199428-1-umang.jain@idea…
Signed-off-by: Umang Jain <umang.jain(a)ideasonboard.com>
Signed-off-by: Sakari Ailus <sakari.ailus(a)linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
drivers/media/i2c/imx335.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
---
diff --git a/drivers/media/i2c/imx335.c b/drivers/media/i2c/imx335.c
index 990d74214cc2..54a1de53d497 100644
--- a/drivers/media/i2c/imx335.c
+++ b/drivers/media/i2c/imx335.c
@@ -997,7 +997,7 @@ static int imx335_parse_hw_config(struct imx335 *imx335)
/* Request optional reset pin */
imx335->reset_gpio = devm_gpiod_get_optional(imx335->dev, "reset",
- GPIOD_OUT_LOW);
+ GPIOD_OUT_HIGH);
if (IS_ERR(imx335->reset_gpio)) {
dev_err(imx335->dev, "failed to get reset gpio %ld\n",
PTR_ERR(imx335->reset_gpio));
@@ -1110,8 +1110,7 @@ static int imx335_power_on(struct device *dev)
usleep_range(500, 550); /* Tlow */
- /* Set XCLR */
- gpiod_set_value_cansleep(imx335->reset_gpio, 1);
+ gpiod_set_value_cansleep(imx335->reset_gpio, 0);
ret = clk_prepare_enable(imx335->inclk);
if (ret) {
@@ -1124,7 +1123,7 @@ static int imx335_power_on(struct device *dev)
return 0;
error_reset:
- gpiod_set_value_cansleep(imx335->reset_gpio, 0);
+ gpiod_set_value_cansleep(imx335->reset_gpio, 1);
regulator_bulk_disable(ARRAY_SIZE(imx335_supply_name), imx335->supplies);
return ret;
@@ -1141,7 +1140,7 @@ static int imx335_power_off(struct device *dev)
struct v4l2_subdev *sd = dev_get_drvdata(dev);
struct imx335 *imx335 = to_imx335(sd);
- gpiod_set_value_cansleep(imx335->reset_gpio, 0);
+ gpiod_set_value_cansleep(imx335->reset_gpio, 1);
clk_disable_unprepare(imx335->inclk);
regulator_bulk_disable(ARRAY_SIZE(imx335_supply_name), imx335->supplies);
[ Upstream commit 801ebf1043ae7b182588554cc9b9ad3c14bc2ab5 ]
The recent USB core code performs sanity checks for the given pipe and
EP types, and it can be hit by manipulated USB descriptors by syzbot.
For making syzbot happier, this patch introduces a local helper for a
sanity check in the driver side and calls it at each place before the
message handling, so that we can avoid the WARNING splats.
Reported-by: syzbot+d952e5e28f5fb7718d23(a)syzkaller.appspotmail.com
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
---
sound/usb/helper.c | 17 +++++++++++++++++
sound/usb/helper.h | 1 +
sound/usb/quirks.c | 14 +++++++++++---
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/sound/usb/helper.c b/sound/usb/helper.c
index 7712e2b84183..b1cc9499c57e 100644
--- a/sound/usb/helper.c
+++ b/sound/usb/helper.c
@@ -76,6 +76,20 @@ void *snd_usb_find_csint_desc(void *buffer, int buflen, void *after, u8 dsubtype
return NULL;
}
+/* check the validity of pipe and EP types */
+int snd_usb_pipe_sanity_check(struct usb_device *dev, unsigned int pipe)
+{
+ static const int pipetypes[4] = {
+ PIPE_CONTROL, PIPE_ISOCHRONOUS, PIPE_BULK, PIPE_INTERRUPT
+ };
+ struct usb_host_endpoint *ep;
+
+ ep = usb_pipe_endpoint(dev, pipe);
+ if (usb_pipetype(pipe) != pipetypes[usb_endpoint_type(&ep->desc)])
+ return -EINVAL;
+ return 0;
+}
+
/*
* Wrapper for usb_control_msg().
* Allocates a temp buffer to prevent dmaing from/to the stack.
@@ -88,6 +102,9 @@ int snd_usb_ctl_msg(struct usb_device *dev, unsigned int pipe, __u8 request,
void *buf = NULL;
int timeout;
+ if (snd_usb_pipe_sanity_check(dev, pipe))
+ return -EINVAL;
+
if (size > 0) {
buf = kmemdup(data, size, GFP_KERNEL);
if (!buf)
diff --git a/sound/usb/helper.h b/sound/usb/helper.h
index f5b4c6647e4d..5e8a18b4e7b9 100644
--- a/sound/usb/helper.h
+++ b/sound/usb/helper.h
@@ -7,6 +7,7 @@ unsigned int snd_usb_combine_bytes(unsigned char *bytes, int size);
void *snd_usb_find_desc(void *descstart, int desclen, void *after, u8 dtype);
void *snd_usb_find_csint_desc(void *descstart, int desclen, void *after, u8 dsubtype);
+int snd_usb_pipe_sanity_check(struct usb_device *dev, unsigned int pipe);
int snd_usb_ctl_msg(struct usb_device *dev, unsigned int pipe,
__u8 request, __u8 requesttype, __u16 value, __u16 index,
void *data, __u16 size);
diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
index 43cbaaff163f..087eef5e249d 100644
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -743,11 +743,13 @@ static int snd_usb_novation_boot_quirk(struct usb_device *dev)
static int snd_usb_accessmusic_boot_quirk(struct usb_device *dev)
{
int err, actual_length;
-
/* "midi send" enable */
static const u8 seq[] = { 0x4e, 0x73, 0x52, 0x01 };
+ void *buf;
- void *buf = kmemdup(seq, ARRAY_SIZE(seq), GFP_KERNEL);
+ if (snd_usb_pipe_sanity_check(dev, usb_sndintpipe(dev, 0x05)))
+ return -EINVAL;
+ buf = kmemdup(seq, ARRAY_SIZE(seq), GFP_KERNEL);
if (!buf)
return -ENOMEM;
err = usb_interrupt_msg(dev, usb_sndintpipe(dev, 0x05), buf,
@@ -772,7 +774,11 @@ static int snd_usb_accessmusic_boot_quirk(struct usb_device *dev)
static int snd_usb_nativeinstruments_boot_quirk(struct usb_device *dev)
{
- int ret = usb_control_msg(dev, usb_sndctrlpipe(dev, 0),
+ int ret;
+
+ if (snd_usb_pipe_sanity_check(dev, usb_sndctrlpipe(dev, 0)))
+ return -EINVAL;
+ ret = usb_control_msg(dev, usb_sndctrlpipe(dev, 0),
0xaf, USB_TYPE_VENDOR | USB_RECIP_DEVICE,
1, 0, NULL, 0, 1000);
@@ -879,6 +885,8 @@ static int snd_usb_axefx3_boot_quirk(struct usb_device *dev)
dev_dbg(&dev->dev, "Waiting for Axe-Fx III to boot up...\n");
+ if (snd_usb_pipe_sanity_check(dev, usb_sndctrlpipe(dev, 0)))
+ return -EINVAL;
/* If the Axe-Fx III has not fully booted, it will timeout when trying
* to enable the audio streaming interface. A more generous timeout is
* used here to detect when the Axe-Fx III has finished booting as the
--
2.45.2
This is something that I've been thinking about for a while. We had a
discussion at LPC 2020 about this[1] but the proposals suggested there
never materialised.
In short, it is quite difficult for userspace to detect the feature
capability of syscalls at runtime. This is something a lot of programs
want to do, but they are forced to create elaborate scenarios to try to
figure out if a feature is supported without causing damage to the
system. For the vast majority of cases, each individual feature also
needs to be tested individually (because syscall results are
all-or-nothing), so testing even a single syscall's feature set can
easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this
talk[2] and in some old LKML threads (though I can't find the exact
references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be
no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the
top bits are free for us to use as makeshift flag bits. This is
preferable to using the per-syscall flag field inside the structure
because seccomp can easily detect the bit in the flag and allow the
probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid
bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid
flags or bits. For pointer fields or fields that take an arbitrary
value, the field has every bit set (0xFF... to fill the field) to
indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will
only ever be used for this purpose (so userspace can be sure that
the request succeeded).
On older kernels, the syscall will return a different error (usually
-E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by
looking at the fields in the returned structure. Flags are checked
by doing an AND with the flags field, and field support can checked
by comparing to 0. In principle you could just AND the entire
structure if you wanted to do this check generically without caring
about the structure contents (this is what libraries might consider
doing).
Userspace can even find out the internal kernel structure size by
passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and
backwards- compatible.
This allows programas to get a one-shot understanding of what features a
syscall supports without having to do any elaborate setups or tricks to
detect support for destructive features. Flags can simply be ANDed to
check if they are in the supported set, and fields can just be checked
to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to
introspect the feature set of extensible struct (copy_struct_from_user)
syscalls. It doesn't preclude the chance of a more generic mechanism
being added later.
The intended way of using this interface to get feature information
looks something like the following (imagine that openat2 has gained a
new field and a new flag in the future):
static bool openat2_no_automount_supported;
static bool openat2_cwd_fd_supported;
int check_openat2_support(void)
{
int err;
struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how));
assert(err < 0);
switch (errno) {
case EFAULT: case E2BIG:
/* Old kernel... */
check_support_the_old_way();
break;
case EEXTSYS_NOOP:
openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT);
openat2_cwd_fd_supported = (how.cwd_fd != 0);
break;
}
}
This series adds CHECK_FIELDS support for the following extensible
struct syscalls, as they are quite likely to grow flags in the near
future:
* openat2
* clone3
* mount_setattr
[1]: https://lwn.net/Articles/830666/
[2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- Add CHECK_FIELDS support to mount_setattr(2).
- Fix build failure on architectures with custom errno values.
- Rework selftests to use the tools/ uAPI headers rather than custom
defining EEXTSYS_NOOP.
- Make sure we return -EINVAL and -E2BIG for invalid sizes even if
CHECK_FIELDS is set, and add some tests for that.
- v1: <https://lore.kernel.org/r/20240902-extensible-structs-check_fields-v1-0-545…>
---
Aleksa Sarai (10):
uaccess: add copy_struct_to_user helper
sched_getattr: port to copy_struct_to_user
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
openat2: add CHECK_FIELDS flag to usize argument
selftests: openat2: add 0xFF poisoned data after misaligned struct
selftests: openat2: add CHECK_FIELDS selftests
clone3: add CHECK_FIELDS flag to usize argument
selftests: clone3: add CHECK_FIELDS selftests
mount_setattr: add CHECK_FIELDS flag to usize argument
selftests: mount_setattr: add CHECK_FIELDS selftest
arch/alpha/include/uapi/asm/errno.h | 3 +
arch/mips/include/uapi/asm/errno.h | 3 +
arch/parisc/include/uapi/asm/errno.h | 3 +
arch/sparc/include/uapi/asm/errno.h | 3 +
fs/namespace.c | 17 ++
fs/open.c | 18 ++
include/linux/uaccess.h | 98 ++++++++
include/uapi/asm-generic/errno.h | 3 +
include/uapi/linux/openat2.h | 2 +
kernel/fork.c | 30 ++-
kernel/sched/syscalls.c | 42 +---
tools/arch/alpha/include/uapi/asm/errno.h | 3 +
tools/arch/mips/include/uapi/asm/errno.h | 3 +
tools/arch/parisc/include/uapi/asm/errno.h | 3 +
tools/arch/sparc/include/uapi/asm/errno.h | 3 +
tools/include/uapi/asm-generic/errno.h | 3 +
tools/include/uapi/asm-generic/posix_types.h | 101 ++++++++
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++
tools/testing/selftests/mount_setattr/Makefile | 2 +-
.../selftests/mount_setattr/mount_setattr_test.c | 53 ++++-
tools/testing/selftests/openat2/Makefile | 2 +
tools/testing/selftests/openat2/openat2_test.c | 165 ++++++++++++-
24 files changed, 778 insertions(+), 51 deletions(-)
---
base-commit: 431c1646e1f86b949fa3685efc50b660a364c2b6
change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
From: Zijun Hu <quic_zijuhu(a)quicinc.com>
It is bad for current match_free_decoder()'s logic to find a free switch
cxl decoder as explained below:
- If all child decoders are sorted by ID in ascending order, then
current logic can be simplified as below one:
static int match_free_decoder(struct device *dev, void *data)
{
struct cxl_decoder *cxld;
if (!is_switch_decoder(dev))
return 0;
cxld = to_cxl_decoder(dev);
return cxld->region ? 0 : 1;
}
dev = device_find_child(&port->dev, NULL, match_free_decoder);
which does not also need to modify device_find_child()'s match data.
- If all child decoders are NOT sorted by ID in ascending order, then
current logic are wrong as explained below:
F: free, (cxld->region == NULL) B: busy, (cxld->region != NULL)
S(n)F : State of switch cxl_decoder with ID n is Free
S(n)B : State of switch cxl_decoder with ID n is Busy
Provided there are 2 child decoders: S(1)F -> S(0)B, then current logic
will fail to find a free decoder even if there are a free one with ID 1
Anyway, current logic is not good, fixed by finding a free switch cxl
decoder with minimal ID.
Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders")
Closes: https://lore.kernel.org/all/cdfc6f98-1aa0-4cb5-bd7d-93256552c39b@icloud.com/
Cc: stable(a)vger.kernel.org
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v2:
- Correct title and commit message
- Link to v1: https://lore.kernel.org/r/20240903-fix_cxld-v1-1-61acba7198ae@quicinc.com
---
drivers/cxl/core/region.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 21ad5f242875..b9607b4fc40b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -797,21 +797,26 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
static int match_free_decoder(struct device *dev, void *data)
{
struct cxl_decoder *cxld;
- int *id = data;
+ struct cxl_decoder *target_cxld;
+ struct device **target_device = data;
if (!is_switch_decoder(dev))
return 0;
cxld = to_cxl_decoder(dev);
-
- /* enforce ordered allocation */
- if (cxld->id != *id)
+ if (cxld->region)
return 0;
- if (!cxld->region)
- return 1;
-
- (*id)++;
+ if (!*target_device) {
+ *target_device = get_device(dev);
+ return 0;
+ }
+ /* enforce ordered allocation */
+ target_cxld = to_cxl_decoder(*target_device);
+ if (cxld->id < target_cxld->id) {
+ put_device(*target_device);
+ *target_device = get_device(dev);
+ }
return 0;
}
@@ -839,8 +844,7 @@ cxl_region_find_decoder(struct cxl_port *port,
struct cxl_endpoint_decoder *cxled,
struct cxl_region *cxlr)
{
- struct device *dev;
- int id = 0;
+ struct device *dev = NULL;
if (port == cxled_to_port(cxled))
return &cxled->cxld;
@@ -849,7 +853,8 @@ cxl_region_find_decoder(struct cxl_port *port,
dev = device_find_child(&port->dev, &cxlr->params,
match_auto_decoder);
else
- dev = device_find_child(&port->dev, &id, match_free_decoder);
+ /* Need to put_device(@dev) after use */
+ device_for_each_child(&port->dev, &dev, match_free_decoder);
if (!dev)
return NULL;
/*
---
base-commit: 67784a74e258a467225f0e68335df77acd67b7ab
change-id: 20240903-fix_cxld-4f6575a90619
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
Backporting the sanity checks on pipes seems like a good idea. These
are basically the same as Takashi Iwai's patches upstream. The
difference is that upstream added two sanity checks to code that
doesn't exist in 4.19.
I had talked about backporting fcc2cc1f3561 ("USB: move
snd_usb_pipe_sanity_check into the USB core") but that's just a
refactor and not a bug fix.
Hillf Danton (1):
ALSA: usb-audio: Fix gpf in snd_usb_pipe_sanity_check
Takashi Iwai (1):
ALSA: usb-audio: Sanity checks for each pipe and EP types
sound/usb/helper.c | 17 +++++++++++++++++
sound/usb/helper.h | 1 +
sound/usb/quirks.c | 14 +++++++++++---
3 files changed, 29 insertions(+), 3 deletions(-)
--
2.45.2
We have met a deadlock issue on our device which use 5.15.y when resuming.
After applying this patch which is picked from mainline, issue solved.
Backport to 6.6.y also.
Rafael J. Wysocki (1):
PM: sleep: Restore asynchronous device resume optimization
drivers/base/power/main.c | 117 +++++++++++++++++++++-----------------
include/linux/pm.h | 1 +
2 files changed, 65 insertions(+), 53 deletions(-)
--
2.18.0
The Qualcomm serial console implementation is broken and can lose
characters when the serial port is also used for tty output.
Specifically, the console code only waits for the current tx command to
complete when all data has already been written to the fifo. When there
are on-going longer transfers this often means that console output is
lost when the console code inadvertently "hijacks" the current tx
command instead of starting a new one.
This can, for example, be observed during boot when console output that
should have been interspersed with init output is truncated:
[ 9.462317] qcom-snps-eusb2-hsphy fde000.phy: Registered Qcom-eUSB2 phy
[ OK ] Found device KBG50ZNS256G KIOXIA Wi[ 9.471743ndows.
[ 9.539915] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
Add a new state variable to track how much data has been written to the
fifo and use it to determine when the fifo and shift register are both
empty. This is needed since there is currently no other known way to
determine when the shift register is empty.
This in turn allows the console code to interrupt long transfers without
losing data.
Note that the oops-in-progress case is similarly broken as it does not
cancel any active command and also waits for the wrong status flag when
attempting to drain the fifo (TX_FIFO_NOT_EMPTY_EN is only set when
cancelling a command leaves data in the fifo).
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Fixes: a1fee899e5be ("tty: serial: qcom_geni_serial: Fix softlock")
Fixes: 9e957a155005 ("serial: qcom-geni: Don't cancel/abort if we can't get the port lock")
Cc: stable(a)vger.kernel.org # 4.17
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 48 ++++++++++++++-------------
1 file changed, 25 insertions(+), 23 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 7029c39a9a21..be620c5703f5 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -131,6 +131,7 @@ struct qcom_geni_serial_port {
bool brk;
unsigned int tx_remaining;
+ unsigned int tx_queued;
int wakeup_irq;
bool rx_tx_swap;
bool cts_rts_swap;
@@ -144,6 +145,8 @@ static const struct uart_ops qcom_geni_uart_pops;
static struct uart_driver qcom_geni_console_driver;
static struct uart_driver qcom_geni_uart_driver;
+static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
+
static inline struct qcom_geni_serial_port *to_dev_port(struct uart_port *uport)
{
return container_of(uport, struct qcom_geni_serial_port, uport);
@@ -308,6 +311,17 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
return qcom_geni_serial_poll_bitfield(uport, offset, field, set ? field : 0);
}
+static void qcom_geni_serial_drain_fifo(struct uart_port *uport)
+{
+ struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+ if (!qcom_geni_serial_main_active(uport))
+ return;
+
+ qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, GP_LENGTH,
+ port->tx_queued);
+}
+
static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
{
u32 m_cmd;
@@ -476,7 +490,6 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
struct qcom_geni_serial_port *port;
bool locked = true;
unsigned long flags;
- u32 geni_status;
WARN_ON(co->index < 0 || co->index >= GENI_UART_CONS_PORTS);
@@ -490,34 +503,20 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
else
uart_port_lock_irqsave(uport, &flags);
- geni_status = readl(uport->membase + SE_GENI_STATUS);
+ if (qcom_geni_serial_main_active(uport)) {
+ /* Wait for completion or drain FIFO */
+ if (!locked || port->tx_remaining == 0)
+ qcom_geni_serial_poll_tx_done(uport);
+ else
+ qcom_geni_serial_drain_fifo(uport);
- if (!locked) {
- /*
- * We can only get here if an oops is in progress then we were
- * unable to get the lock. This means we can't safely access
- * our state variables like tx_remaining. About the best we
- * can do is wait for the FIFO to be empty before we start our
- * transfer, so we'll do that.
- */
- qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
- M_TX_FIFO_NOT_EMPTY_EN, false);
- } else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
- /*
- * It seems we can't interrupt existing transfers if all data
- * has been sent, in which case we need to look for done first.
- */
- qcom_geni_serial_poll_tx_done(uport);
+ qcom_geni_serial_cancel_tx_cmd(uport);
}
__qcom_geni_serial_console_write(uport, s, count);
-
- if (locked) {
- if (port->tx_remaining)
- qcom_geni_serial_setup_tx(uport, port->tx_remaining);
+ if (locked)
uart_port_unlock_irqrestore(uport, flags);
- }
}
static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
@@ -698,6 +697,7 @@ static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
port->tx_remaining = 0;
+ port->tx_queued = 0;
}
static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -924,6 +924,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
if (!port->tx_remaining) {
qcom_geni_serial_setup_tx(uport, pending);
port->tx_remaining = pending;
+ port->tx_queued = 0;
irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
@@ -932,6 +933,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
}
qcom_geni_serial_send_chunk_fifo(uport, chunk);
+ port->tx_queued += chunk;
/*
* The tx fifo watermark is level triggered and latched. Though we had
--
2.44.2
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: qcom: camss: Fix ordering of pm_runtime_enable
Author: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Date: Mon Jul 29 13:42:03 2024 +0100
pm_runtime_enable() should happen prior to vfe_get() since vfe_get() calls
pm_runtime_resume_and_get().
This is a basic race condition that doesn't show up for most users so is
not widely reported. If you blacklist qcom-camss in modules.d and then
subsequently modprobe the module post-boot it is possible to reliably show
this error up.
The kernel log for this error looks like this:
qcom-camss ac5a000.camss: Failed to power up pipeline: -13
Fixes: 02afa816dbbf ("media: camss: Add basic runtime PM support")
Reported-by: Johan Hovold <johan+linaro(a)kernel.org>
Closes: https://lore.kernel.org/lkml/ZoVNHOTI0PKMNt4_@hovoldconsulting.com/
Tested-by: Johan Hovold <johan+linaro(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Reviewed-by: Konrad Dybcio <konradybcio(a)kernel.org>
Signed-off-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
drivers/media/platform/qcom/camss/camss.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
---
diff --git a/drivers/media/platform/qcom/camss/camss.c b/drivers/media/platform/qcom/camss/camss.c
index 51b1d3550421..d64985ca6e88 100644
--- a/drivers/media/platform/qcom/camss/camss.c
+++ b/drivers/media/platform/qcom/camss/camss.c
@@ -2283,6 +2283,8 @@ static int camss_probe(struct platform_device *pdev)
v4l2_async_nf_init(&camss->notifier, &camss->v4l2_dev);
+ pm_runtime_enable(dev);
+
num_subdevs = camss_of_parse_ports(camss);
if (num_subdevs < 0) {
ret = num_subdevs;
@@ -2323,8 +2325,6 @@ static int camss_probe(struct platform_device *pdev)
}
}
- pm_runtime_enable(dev);
-
return 0;
err_register_subdevs:
@@ -2332,6 +2332,7 @@ err_register_subdevs:
err_v4l2_device_unregister:
v4l2_device_unregister(&camss->v4l2_dev);
v4l2_async_nf_cleanup(&camss->notifier);
+ pm_runtime_disable(dev);
err_genpd_cleanup:
camss_genpd_cleanup(camss);
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: qcom: camss: Remove use_count guard in stop_streaming
Author: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Date: Mon Jul 29 13:42:02 2024 +0100
The use_count check was introduced so that multiple concurrent Raw Data
Interfaces RDIs could be driven by different virtual channels VCs on the
CSIPHY input driving the video pipeline.
This is an invalid use of use_count though as use_count pertains to the
number of times a video entity has been opened by user-space not the number
of active streams.
If use_count and stream-on count don't agree then stop_streaming() will
break as is currently the case and has become apparent when using CAMSS
with libcamera's released softisp 0.3.
The use of use_count like this is a bit hacky and right now breaks regular
usage of CAMSS for a single stream case. Stopping qcam results in the splat
below, and then it cannot be started again and any attempts to do so fails
with -EBUSY.
[ 1265.509831] WARNING: CPU: 5 PID: 919 at drivers/media/common/videobuf2/videobuf2-core.c:2183 __vb2_queue_cancel+0x230/0x2c8 [videobuf2_common]
...
[ 1265.510630] Call trace:
[ 1265.510636] __vb2_queue_cancel+0x230/0x2c8 [videobuf2_common]
[ 1265.510648] vb2_core_streamoff+0x24/0xcc [videobuf2_common]
[ 1265.510660] vb2_ioctl_streamoff+0x5c/0xa8 [videobuf2_v4l2]
[ 1265.510673] v4l_streamoff+0x24/0x30 [videodev]
[ 1265.510707] __video_do_ioctl+0x190/0x3f4 [videodev]
[ 1265.510732] video_usercopy+0x304/0x8c4 [videodev]
[ 1265.510757] video_ioctl2+0x18/0x34 [videodev]
[ 1265.510782] v4l2_ioctl+0x40/0x60 [videodev]
...
[ 1265.510944] videobuf2_common: driver bug: stop_streaming operation is leaving buffer 0 in active state
[ 1265.511175] videobuf2_common: driver bug: stop_streaming operation is leaving buffer 1 in active state
[ 1265.511398] videobuf2_common: driver bug: stop_streaming operation is leaving buffer 2 in active st
One CAMSS specific way to handle multiple VCs on the same RDI might be:
- Reference count each pipeline enable for CSIPHY, CSID, VFE and RDIx.
- The video buffers are already associated with msm_vfeN_rdiX so
release video buffers when told to do so by stop_streaming.
- Only release the power-domains for the CSIPHY, CSID and VFE when
their internal refcounts drop.
Either way refusing to release video buffers based on use_count is
erroneous and should be reverted. The silicon enabling code for selecting
VCs is perfectly fine. Its a "known missing feature" that concurrent VCs
won't work with CAMSS right now.
Initial testing with this code didn't show an error but, SoftISP and "real"
usage with Google Hangouts breaks the upstream code pretty quickly, we need
to do a partial revert and take another pass at VCs.
This commit partially reverts commit 89013969e232 ("media: camss: sm8250:
Pipeline starting and stopping for multiple virtual channels")
Fixes: 89013969e232 ("media: camss: sm8250: Pipeline starting and stopping for multiple virtual channels")
Reported-by: Johan Hovold <johan+linaro(a)kernel.org>
Closes: https://lore.kernel.org/lkml/ZoVNHOTI0PKMNt4_@hovoldconsulting.com/
Tested-by: Johan Hovold <johan+linaro(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
Signed-off-by: Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
drivers/media/platform/qcom/camss/camss-video.c | 6 ------
1 file changed, 6 deletions(-)
---
diff --git a/drivers/media/platform/qcom/camss/camss-video.c b/drivers/media/platform/qcom/camss/camss-video.c
index cd72feca618c..3b8fc31d957c 100644
--- a/drivers/media/platform/qcom/camss/camss-video.c
+++ b/drivers/media/platform/qcom/camss/camss-video.c
@@ -297,12 +297,6 @@ static void video_stop_streaming(struct vb2_queue *q)
ret = v4l2_subdev_call(subdev, video, s_stream, 0);
- if (entity->use_count > 1) {
- /* Don't stop if other instances of the pipeline are still running */
- dev_dbg(video->camss->dev, "Video pipeline still used, don't stop streaming.\n");
- return;
- }
-
if (ret) {
dev_err(video->camss->dev, "Video pipeline stop failed: %d\n", ret);
return;
From: Mathieu Tortuyaux <mtortuyaux(a)microsoft.com>
[ Upstream commit 89add40066f9ed9abe5f7f886fe5789ff7e0c50e ]
Tighten csum_start and csum_offset checks in virtio_net_hdr_to_skb
for GSO packets.
The function already checks that a checksum requested with
VIRTIO_NET_HDR_F_NEEDS_CSUM is in skb linear. But for GSO packets
this might not hold for segs after segmentation.
Syzkaller demonstrated to reach this warning in skb_checksum_help
offset = skb_checksum_start_offset(skb);
ret = -EINVAL;
if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
By injecting a TSO packet:
WARNING: CPU: 1 PID: 3539 at net/core/dev.c:3284 skb_checksum_help+0x3d0/0x5b0
ip_do_fragment+0x209/0x1b20 net/ipv4/ip_output.c:774
ip_finish_output_gso net/ipv4/ip_output.c:279 [inline]
__ip_finish_output+0x2bd/0x4b0 net/ipv4/ip_output.c:301
iptunnel_xmit+0x50c/0x930 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x2296/0x2c70 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x759/0xa60 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4850 [inline]
netdev_start_xmit include/linux/netdevice.h:4864 [inline]
xmit_one net/core/dev.c:3595 [inline]
dev_hard_start_xmit+0x261/0x8c0 net/core/dev.c:3611
__dev_queue_xmit+0x1b97/0x3c90 net/core/dev.c:4261
packet_snd net/packet/af_packet.c:3073 [inline]
The geometry of the bad input packet at tcp_gso_segment:
[ 52.003050][ T8403] skb len=12202 headroom=244 headlen=12093 tailroom=0
[ 52.003050][ T8403] mac=(168,24) mac_len=24 net=(192,52) trans=244
[ 52.003050][ T8403] shinfo(txflags=0 nr_frags=1 gso(size=1552 type=3 segs=0))
[ 52.003050][ T8403] csum(0x60000c7 start=199 offset=1536
ip_summed=3 complete_sw=0 valid=0 level=0)
Mitigate with stricter input validation.
csum_offset: for GSO packets, deduce the correct value from gso_type.
This is already done for USO. Extend it to TSO. Let UFO be:
udp[46]_ufo_fragment ignores these fields and always computes the
checksum in software.
csum_start: finding the real offset requires parsing to the transport
header. Do not add a parser, use existing segmentation parsing. Thanks
to SKB_GSO_DODGY, that also catches bad packets that are hw offloaded.
Again test both TSO and USO. Do not test UFO for the above reason, and
do not test UDP tunnel offload.
GSO packet are almost always CHECKSUM_PARTIAL. USO packets may be
CHECKSUM_NONE since commit 10154db ("udp: Allow GSO transmit
from devices with no checksum offload"), but then still these fields
are initialized correctly in udp4_hwcsum/udp6_hwcsum_outgoing. So no
need to test for ip_summed == CHECKSUM_PARTIAL first.
This revises an existing fix mentioned in the Fixes tag, which broke
small packets with GSO offload, as detected by kselftests.
Link: https://syzkaller.appspot.com/bug?extid=e1db31216c789f552871
Link: https://lore.kernel.org/netdev/20240723223109.2196886-1-kuba@kernel.org
Fixes: e269d79 ("net: missing check virtio")
Cc: stable(a)vger.kernel.org
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
Link: https://patch.msgid.link/20240729201108.1615114-1-willemdebruijn.kernel@gma…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Mathieu Tortuyaux <mtortuyaux(a)microsoft.com>
---
Hi,
This patch fixes network failures on OpenStack VMs running with Kernel
5.15.165.
In 5.15.165, the commit "net: missing check virtio" is breaking networks
on VMs that uses virtio in some conditions.
I slightly adapted the patch to have it fitting this branch (5.15.y).
Once patched and compiled it has been successfully tested on Flatcar CI
with Kernel 5.15.165.
NOTE: This patch has already been backported on other stable branches
(like 6.6.y)
Thanks,
Mathieu - @tormath1
include/linux/virtio_net.h | 17 ++++++-----------
net/ipv4/tcp_offload.c | 3 +++
net/ipv4/udp_offload.c | 4 ++++
3 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 29b19d0a324c..d9410d97158d 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,7 +51,6 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
unsigned int thlen = 0;
unsigned int p_off = 0;
unsigned int ip_proto;
- u64 ret, remainder, gso_size;
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -88,16 +87,6 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
- if (hdr->gso_size) {
- gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
- ret = div64_u64_rem(skb->len, gso_size, &remainder);
- if (!(ret && (hdr->gso_size > needed) &&
- ((remainder > needed) || (remainder == 0)))) {
- return -EINVAL;
- }
- skb_shinfo(skb)->tx_flags |= SKBFL_SHARED_FRAG;
- }
-
if (!pskb_may_pull(skb, needed))
return -EINVAL;
@@ -163,6 +152,12 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
if (gso_size == GSO_BY_FRAGS)
return -EINVAL;
+ if ((gso_size & SKB_GSO_TCPV4) ||
+ (gso_size & SKB_GSO_TCPV6)) {
+ if (skb->csum_offset != offsetof(struct tcphdr, check))
+ return -EINVAL;
+ }
+
/* Too small packets are not really GSO ones. */
if (skb->len - nh_off > gso_size) {
shinfo->gso_size = gso_size;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index fc61cd3fea65..76684cbd63a4 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -71,6 +71,9 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
if (thlen < sizeof(*th))
goto out;
+ if (unlikely(skb_checksum_start(skb) != skb_transport_header(skb)))
+ goto out;
+
if (!pskb_may_pull(skb, thlen))
goto out;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c61268849948..61773a26fb34 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -279,6 +279,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
if (gso_skb->len <= sizeof(*uh) + mss)
return ERR_PTR(-EINVAL);
+ if (unlikely(skb_checksum_start(gso_skb) !=
+ skb_transport_header(gso_skb)))
+ return ERR_PTR(-EINVAL);
+
skb_pull(gso_skb, sizeof(*uh));
/* clear destructor to avoid skb_segment assigning it to tail */
--
2.44.2
devm_kasprintf() can return a NULL pointer on failure but this returned
value is not checked. Fix this lack and check the returned value.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 0bb4e9187ea4 ("mt76: mt7615: fix hwmon temp sensor mem use-after-free")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
drivers/net/wireless/mediatek/mt76/mt7615/init.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/wireless/mediatek/mt76/mt7615/init.c b/drivers/net/wireless/mediatek/mt76/mt7615/init.c
index f7722f67db57..0b9ebdcda221 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7615/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7615/init.c
@@ -56,6 +56,9 @@ int mt7615_thermal_init(struct mt7615_dev *dev)
name = devm_kasprintf(&wiphy->dev, GFP_KERNEL, "mt7615_%s",
wiphy_name(wiphy));
+ if (!name)
+ return -ENOMEM;
+
hwmon = devm_hwmon_device_register_with_groups(&wiphy->dev, name, dev,
mt7615_hwmon_groups);
return PTR_ERR_OR_ZERO(hwmon);
--
2.25.1
From: Breno Leitao <leitao(a)debian.org>
[ Upstream commit f8321fa75102246d7415a6af441872f6637c93ab ]
After the commit bdacf3e34945 ("net: Use nested-BH locking for
napi_alloc_cache.") was merged, the following warning began to appear:
WARNING: CPU: 5 PID: 1 at net/core/skbuff.c:1451 napi_skb_cache_put+0x82/0x4b0
__warn+0x12f/0x340
napi_skb_cache_put+0x82/0x4b0
napi_skb_cache_put+0x82/0x4b0
report_bug+0x165/0x370
handle_bug+0x3d/0x80
exc_invalid_op+0x1a/0x50
asm_exc_invalid_op+0x1a/0x20
__free_old_xmit+0x1c8/0x510
napi_skb_cache_put+0x82/0x4b0
__free_old_xmit+0x1c8/0x510
__free_old_xmit+0x1c8/0x510
__pfx___free_old_xmit+0x10/0x10
The issue arises because virtio is assuming it's running in NAPI context
even when it's not, such as in the netpoll case.
To resolve this, modify virtnet_poll_tx() to only set NAPI when budget
is available. Same for virtnet_poll_cleantx(), which always assumed that
it was in a NAPI context.
Fixes: df133f3f9625 ("virtio_net: bulk free tx skbs")
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Reviewed-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Michael S. Tsirkin <mst(a)redhat.com>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Reviewed-by: Heng Qi <hengqi(a)linux.alibaba.com>
Link: https://patch.msgid.link/20240712115325.54175-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[Shivani: Modified to apply on v4.19.y-v5.10.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal(a)broadcom.com>
---
drivers/net/virtio_net.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f7ed99561..99dea89b2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1497,7 +1497,7 @@ static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
return false;
}
-static void virtnet_poll_cleantx(struct receive_queue *rq)
+static void virtnet_poll_cleantx(struct receive_queue *rq, int budget)
{
struct virtnet_info *vi = rq->vq->vdev->priv;
unsigned int index = vq2rxq(rq->vq);
@@ -1508,7 +1508,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
return;
if (__netif_tx_trylock(txq)) {
- free_old_xmit_skbs(sq, true);
+ free_old_xmit_skbs(sq, !!budget);
__netif_tx_unlock(txq);
}
@@ -1525,7 +1525,7 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
unsigned int received;
unsigned int xdp_xmit = 0;
- virtnet_poll_cleantx(rq);
+ virtnet_poll_cleantx(rq, budget);
received = virtnet_receive(rq, budget, &xdp_xmit);
@@ -1598,7 +1598,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
virtqueue_disable_cb(sq->vq);
- free_old_xmit_skbs(sq, true);
+ free_old_xmit_skbs(sq, !!budget);
opaque = virtqueue_enable_cb_prepare(sq->vq);
--
2.39.4
Streams should flush their TRB cache, re-read TRBs, and start executing
TRBs from the beginning of the new dequeue pointer after a 'Set TR Dequeue
Pointer' command.
Cadence controllers may fail to start from the beginning of the dequeue
TRB as it doesn't clear the Opaque 'RsvdO' field of the stream context
during 'Set TR Dequeue' command. This stream context area is where xHC
stores information about the last partially executed TD when a stream
is stopped. xHC uses this information to resume the transfer where it left
mid TD, when the stream is restarted.
Patch fixes this by clearing out all RsvdO fields before initializing new
Stream transfer using a 'Set TR Dequeue Pointer' command.
Fixes: 3d82904559f4 ("usb: cdnsp: cdns3 Add main part of Cadence USBSSP DRD Driver")
cc: <stable(a)vger.kernel.org>
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
Changelog:
v3:
- changed patch to patch Cadence specific
v2:
- removed restoring of EDTLA field
drivers/usb/cdns3/host.c | 4 +++-
drivers/usb/host/xhci-pci.c | 7 +++++++
drivers/usb/host/xhci-ring.c | 14 ++++++++++++++
drivers/usb/host/xhci.h | 1 +
4 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/cdns3/host.c b/drivers/usb/cdns3/host.c
index ceca4d839dfd..7ba760ee62e3 100644
--- a/drivers/usb/cdns3/host.c
+++ b/drivers/usb/cdns3/host.c
@@ -62,7 +62,9 @@ static const struct xhci_plat_priv xhci_plat_cdns3_xhci = {
.resume_quirk = xhci_cdns3_resume_quirk,
};
-static const struct xhci_plat_priv xhci_plat_cdnsp_xhci;
+static const struct xhci_plat_priv xhci_plat_cdnsp_xhci = {
+ .quirks = XHCI_CDNS_SCTX_QUIRK,
+};
static int __cdns_host_init(struct cdns *cdns)
{
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index b9ae5c2a2527..9199dbfcea07 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -74,6 +74,9 @@
#define PCI_DEVICE_ID_ASMEDIA_2142_XHCI 0x2142
#define PCI_DEVICE_ID_ASMEDIA_3242_XHCI 0x3242
+#define PCI_DEVICE_ID_CADENCE 0x17CD
+#define PCI_DEVICE_ID_CADENCE_SSP 0x0200
+
static const char hcd_name[] = "xhci_hcd";
static struct hc_driver __read_mostly xhci_pci_hc_driver;
@@ -532,6 +535,10 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
xhci->quirks |= XHCI_ZHAOXIN_TRB_FETCH;
}
+ if (pdev->vendor == PCI_DEVICE_ID_CADENCE &&
+ pdev->device == PCI_DEVICE_ID_CADENCE_SSP)
+ xhci->quirks |= XHCI_CDNS_SCTX_QUIRK;
+
/* xHC spec requires PCI devices to support D3hot and D3cold */
if (xhci->hci_version >= 0x120)
xhci->quirks |= XHCI_DEFAULT_PM_RUNTIME_ALLOW;
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 1dde53f6eb31..a1ad2658c0c7 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1386,6 +1386,20 @@ static void xhci_handle_cmd_set_deq(struct xhci_hcd *xhci, int slot_id,
struct xhci_stream_ctx *ctx =
&ep->stream_info->stream_ctx_array[stream_id];
deq = le64_to_cpu(ctx->stream_ring) & SCTX_DEQ_MASK;
+
+ /*
+ * Cadence xHCI controllers store some endpoint state
+ * information within Rsvd0 fields of Stream Endpoint
+ * context. This field is not cleared during Set TR
+ * Dequeue Pointer command which causes XDMA to skip
+ * over transfer ring and leads to data loss on stream
+ * pipe.
+ * To fix this issue driver must clear Rsvd0 field.
+ */
+ if (xhci->quirks & XHCI_CDNS_SCTX_QUIRK) {
+ ctx->reserved[0] = 0;
+ ctx->reserved[1] = 0;
+ }
} else {
deq = le64_to_cpu(ep_ctx->deq) & ~EP_CTX_CYCLE_MASK;
}
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 101e74c9060f..4cbd58eed214 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1907,6 +1907,7 @@ struct xhci_hcd {
#define XHCI_ZHAOXIN_TRB_FETCH BIT_ULL(45)
#define XHCI_ZHAOXIN_HOST BIT_ULL(46)
#define XHCI_WRITE_64_HI_LO BIT_ULL(47)
+#define XHCI_CDNS_SCTX_QUIRK BIT_ULL(48)
unsigned int num_active_eps;
unsigned int limit_active_eps;
--
2.43.0
In the off-chance that waiting for the firmware to signal its booted status
timed out in the fast reset path, one must flush the cache lines for the
entire FW VM address space before reloading the regions, otherwise stale
values eventually lead to a scheduler job timeout.
Fixes: 647810ec2476 ("drm/panthor: Add the MMU/VM logical block")
Cc: stable(a)vger.kernel.org
Signed-off-by: Adrián Larumbe <adrian.larumbe(a)collabora.com>
Acked-by: Liviu Dudau <liviu.dudau(a)arm.com>
---
drivers/gpu/drm/panthor/panthor_fw.c | 8 +++++++-
drivers/gpu/drm/panthor/panthor_mmu.c | 21 ++++++++++++++++++---
drivers/gpu/drm/panthor/panthor_mmu.h | 1 +
3 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index 857f3f11258a..ef232c0c2049 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -1089,6 +1089,12 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
panthor_fw_stop(ptdev);
ptdev->fw->fast_reset = false;
drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
+
+ ret = panthor_vm_flush_all(ptdev->fw->vm);
+ if (ret) {
+ drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
+ return ret;
+ }
}
/* Reload all sections, including RO ones. We're not supposed
@@ -1099,7 +1105,7 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
ret = panthor_fw_start(ptdev);
if (ret) {
- drm_err(&ptdev->base, "FW slow reset failed");
+ drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
return ret;
}
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index d47972806d50..bbc12728437f 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -576,6 +576,12 @@ static int mmu_hw_do_operation_locked(struct panthor_device *ptdev, int as_nr,
if (as_nr < 0)
return 0;
+ /*
+ * If the AS number is greater than zero, then we can be sure
+ * the device is up and running, so we don't need to explicitly
+ * power it up
+ */
+
if (op != AS_COMMAND_UNLOCK)
lock_region(ptdev, as_nr, iova, size);
@@ -874,14 +880,23 @@ static int panthor_vm_flush_range(struct panthor_vm *vm, u64 iova, u64 size)
if (!drm_dev_enter(&ptdev->base, &cookie))
return 0;
- /* Flush the PTs only if we're already awake */
- if (pm_runtime_active(ptdev->base.dev))
- ret = mmu_hw_do_operation(vm, iova, size, AS_COMMAND_FLUSH_PT);
+ ret = mmu_hw_do_operation(vm, iova, size, AS_COMMAND_FLUSH_PT);
drm_dev_exit(cookie);
return ret;
}
+/**
+ * panthor_vm_flush_all() - Flush L2 caches for the entirety of a VM's AS
+ * @vm: VM whose cache to flush
+ *
+ * Return: 0 on success, a negative error code if flush failed.
+ */
+int panthor_vm_flush_all(struct panthor_vm *vm)
+{
+ return panthor_vm_flush_range(vm, vm->base.mm_start, vm->base.mm_range);
+}
+
static int panthor_vm_unmap_pages(struct panthor_vm *vm, u64 iova, u64 size)
{
struct panthor_device *ptdev = vm->ptdev;
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.h b/drivers/gpu/drm/panthor/panthor_mmu.h
index f3c1ed19f973..6788771071e3 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.h
+++ b/drivers/gpu/drm/panthor/panthor_mmu.h
@@ -31,6 +31,7 @@ panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset);
int panthor_vm_active(struct panthor_vm *vm);
void panthor_vm_idle(struct panthor_vm *vm);
int panthor_vm_as(struct panthor_vm *vm);
+int panthor_vm_flush_all(struct panthor_vm *vm);
struct panthor_heap_pool *
panthor_vm_get_heap_pool(struct panthor_vm *vm, bool create);
--
2.46.0
On Mon, 29 Jul 2024 16:32:36 +0200, Greg Kroah-Hartman wrote:
> In the Linux kernel, the following vulnerability has been resolved:
>
> udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
>
> [...]
>
> We had the same bug in TCP and fixed it in commit 871019b22d1b ("net:
> set SOCK_RCU_FREE before inserting socket into hashtable").
>
> Let's apply the same fix for UDP.
>
> [...]
>
> The Linux kernel CVE team has assigned CVE-2024-41041 to this issue.
>
>
> Affected and fixed versions
> ===========================
>
> Issue introduced in 4.20 with commit 6acc9b432e67 and fixed in 5.4.280 with commit 7a67c4e47626
> Issue introduced in 4.20 with commit 6acc9b432e67 and fixed in 5.10.222 with commit 9f965684c57c
These versions don't have the TCP fix backported. Please do so.
Thanks,
Siddh
From: Daniel Borkmann <daniel(a)iogearbox.net>
commit 8520e224f547cd070c7c8f97b1fc6d58cff7ccaa upstream.
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Stanislav Fomichev <sdf(a)google.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
[resolve trivial conflicts]
Signed-off-by: Connor O'Brien <connor.obrien(a)crowdstrike.com>
---
Hello,
Requesting that these patches be applied to 5.10-stable. Tested to confirm that
the cgroup_v1v2 bpf selftest for this issue passes on 5.10 with the first patch
applied and fails without it. The syzkaller crash referenced in the second patch
reproduces on 5.10 after applying just the first patch, but not with both
patches applied.
Conflicts were due to unrelated changes to the surrounding context; the actual
code change remains the same as in the upstream patch.
Thanks,
Connor O'Brien
include/linux/cgroup-defs.h | 107 +++++++++--------------------------
include/linux/cgroup.h | 22 +------
kernel/cgroup/cgroup.c | 50 ++++------------
net/core/netclassid_cgroup.c | 7 +--
net/core/netprio_cgroup.c | 10 +---
5 files changed, 41 insertions(+), 155 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c9fafca1c30c..6c6323a01d43 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -764,107 +764,54 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
* sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
* per-socket cgroup information except for memcg association.
*
- * On legacy hierarchies, net_prio and net_cls controllers directly set
- * attributes on each sock which can then be tested by the network layer.
- * On the default hierarchy, each sock is associated with the cgroup it was
- * created in and the networking layer can match the cgroup directly.
- *
- * To avoid carrying all three cgroup related fields separately in sock,
- * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
- * On boot, sock_cgroup_data records the cgroup that the sock was created
- * in so that cgroup2 matches can be made; however, once either net_prio or
- * net_cls starts being used, the area is overriden to carry prioidx and/or
- * classid. The two modes are distinguished by whether the lowest bit is
- * set. Clear bit indicates cgroup pointer while set bit prioidx and
- * classid.
- *
- * While userland may start using net_prio or net_cls at any time, once
- * either is used, cgroup2 matching no longer works. There is no reason to
- * mix the two and this is in line with how legacy and v2 compatibility is
- * handled. On mode switch, cgroup references which are already being
- * pointed to by socks may be leaked. While this can be remedied by adding
- * synchronization around sock_cgroup_data, given that the number of leaked
- * cgroups is bound and highly unlikely to be high, this seems to be the
- * better trade-off.
+ * On legacy hierarchies, net_prio and net_cls controllers directly
+ * set attributes on each sock which can then be tested by the network
+ * layer. On the default hierarchy, each sock is associated with the
+ * cgroup it was created in and the networking layer can match the
+ * cgroup directly.
*/
struct sock_cgroup_data {
- union {
-#ifdef __LITTLE_ENDIAN
- struct {
- u8 is_data : 1;
- u8 no_refcnt : 1;
- u8 unused : 6;
- u8 padding;
- u16 prioidx;
- u32 classid;
- } __packed;
-#else
- struct {
- u32 classid;
- u16 prioidx;
- u8 padding;
- u8 unused : 6;
- u8 no_refcnt : 1;
- u8 is_data : 1;
- } __packed;
+ struct cgroup *cgroup; /* v2 */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+ u32 classid; /* v1 */
+#endif
+#ifdef CONFIG_CGROUP_NET_PRIO
+ u16 prioidx; /* v1 */
#endif
- u64 val;
- };
};
-/*
- * There's a theoretical window where the following accessors race with
- * updaters and return part of the previous pointer as the prioidx or
- * classid. Such races are short-lived and the result isn't critical.
- */
static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
{
- /* fallback to 1 which is always the ID of the root cgroup */
- return (skcd->is_data & 1) ? skcd->prioidx : 1;
+#ifdef CONFIG_CGROUP_NET_PRIO
+ return READ_ONCE(skcd->prioidx);
+#else
+ return 1;
+#endif
}
static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
{
- /* fallback to 0 which is the unconfigured default classid */
- return (skcd->is_data & 1) ? skcd->classid : 0;
+#ifdef CONFIG_CGROUP_NET_CLASSID
+ return READ_ONCE(skcd->classid);
+#else
+ return 0;
+#endif
}
-/*
- * If invoked concurrently, the updaters may clobber each other. The
- * caller is responsible for synchronization.
- */
static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
u16 prioidx)
{
- struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
- if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
- return;
-
- if (!(skcd_buf.is_data & 1)) {
- skcd_buf.val = 0;
- skcd_buf.is_data = 1;
- }
-
- skcd_buf.prioidx = prioidx;
- WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_PRIO
+ WRITE_ONCE(skcd->prioidx, prioidx);
+#endif
}
static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
u32 classid)
{
- struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
- if (sock_cgroup_classid(&skcd_buf) == classid)
- return;
-
- if (!(skcd_buf.is_data & 1)) {
- skcd_buf.val = 0;
- skcd_buf.is_data = 1;
- }
-
- skcd_buf.classid = classid;
- WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+ WRITE_ONCE(skcd->classid, classid);
+#endif
}
#else /* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c9c430712d47..15c27a2c98e2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -816,33 +816,13 @@ static inline void cgroup_account_cputime_field(struct task_struct *task,
*/
#ifdef CONFIG_SOCK_CGROUP_DATA
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-extern spinlock_t cgroup_sk_update_lock;
-#endif
-
-void cgroup_sk_alloc_disable(void);
void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
void cgroup_sk_clone(struct sock_cgroup_data *skcd);
void cgroup_sk_free(struct sock_cgroup_data *skcd);
static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
- unsigned long v;
-
- /*
- * @skcd->val is 64bit but the following is safe on 32bit too as we
- * just need the lower ulong to be written and read atomically.
- */
- v = READ_ONCE(skcd->val);
-
- if (v & 3)
- return &cgrp_dfl_root.cgrp;
-
- return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
-#else
- return (struct cgroup *)(unsigned long)skcd->val;
-#endif
+ return skcd->cgroup;
}
#else /* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 11400eba6124..3ec531ef50d8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6557,74 +6557,44 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
*/
#ifdef CONFIG_SOCK_CGROUP_DATA
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-
-DEFINE_SPINLOCK(cgroup_sk_update_lock);
-static bool cgroup_sk_alloc_disabled __read_mostly;
-
-void cgroup_sk_alloc_disable(void)
-{
- if (cgroup_sk_alloc_disabled)
- return;
- pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
- cgroup_sk_alloc_disabled = true;
-}
-
-#else
-
-#define cgroup_sk_alloc_disabled false
-
-#endif
-
void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
- if (cgroup_sk_alloc_disabled) {
- skcd->no_refcnt = 1;
- return;
- }
-
/* Don't associate the sock with unrelated interrupted task's cgroup. */
if (in_interrupt())
return;
rcu_read_lock();
-
while (true) {
struct css_set *cset;
cset = task_css_set(current);
if (likely(cgroup_tryget(cset->dfl_cgrp))) {
- skcd->val = (unsigned long)cset->dfl_cgrp;
+ skcd->cgroup = cset->dfl_cgrp;
cgroup_bpf_get(cset->dfl_cgrp);
break;
}
cpu_relax();
}
-
rcu_read_unlock();
}
void cgroup_sk_clone(struct sock_cgroup_data *skcd)
{
- if (skcd->val) {
- if (skcd->no_refcnt)
- return;
- /*
- * We might be cloning a socket which is left in an empty
- * cgroup and the cgroup might have already been rmdir'd.
- * Don't use cgroup_get_live().
- */
- cgroup_get(sock_cgroup_ptr(skcd));
- cgroup_bpf_get(sock_cgroup_ptr(skcd));
- }
+ struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+
+ /*
+ * We might be cloning a socket which is left in an empty
+ * cgroup and the cgroup might have already been rmdir'd.
+ * Don't use cgroup_get_live().
+ */
+ cgroup_get(cgrp);
+ cgroup_bpf_get(cgrp);
}
void cgroup_sk_free(struct sock_cgroup_data *skcd)
{
struct cgroup *cgrp = sock_cgroup_ptr(skcd);
- if (skcd->no_refcnt)
- return;
cgroup_bpf_put(cgrp);
cgroup_put(cgrp);
}
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 41b24cd31562..b6de5ee22391 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -72,11 +72,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
struct update_classid_context *ctx = (void *)v;
struct socket *sock = sock_from_file(file, &err);
- if (sock) {
- spin_lock(&cgroup_sk_update_lock);
+ if (sock)
sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
- spin_unlock(&cgroup_sk_update_lock);
- }
if (--ctx->batch == 0) {
ctx->batch = UPDATE_CLASSID_BATCH;
return n + 1;
@@ -122,8 +119,6 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
struct css_task_iter it;
struct task_struct *p;
- cgroup_sk_alloc_disable();
-
cs->classid = (u32)value;
css_task_iter_start(css, 0, &it);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 9bd4cab7d510..d4c71e382a13 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -207,8 +207,6 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
if (!dev)
return -ENODEV;
- cgroup_sk_alloc_disable();
-
rtnl_lock();
ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,12 +220,10 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
{
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock) {
- spin_lock(&cgroup_sk_update_lock);
+
+ if (sock)
sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
(unsigned long)v);
- spin_unlock(&cgroup_sk_update_lock);
- }
return 0;
}
@@ -236,8 +232,6 @@ static void net_prio_attach(struct cgroup_taskset *tset)
struct task_struct *p;
struct cgroup_subsys_state *css;
- cgroup_sk_alloc_disable();
-
cgroup_taskset_for_each(p, css, tset) {
void *v = (void *)(unsigned long)css->id;
--
2.34.1
From: Christoph Hellwig <hch(a)lst.de>
[ Upstream commit 899ee2c3829c5ac14bfc7d3c4a5846c0b709b78f ]
Metadata added by bio_integrity_prep is using plain kmalloc, which leads
to random kernel memory being written media. For PI metadata this is
limited to the app tag that isn't used by kernel generated metadata,
but for non-PI metadata the entire buffer leaks kernel memory.
Fix this by adding the __GFP_ZERO flag to allocations for writes.
Fixes: 7ba1ba12eeef ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k(a)samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch(a)nvidia.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Shivani Agarwal <shivani.agarwal(a)broadcom.com>
---
block/bio-integrity.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index a4cfc9727..499697330 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -216,6 +216,7 @@ bool bio_integrity_prep(struct bio *bio)
unsigned int bytes, offset, i;
unsigned int intervals;
blk_status_t status;
+ gfp_t gfp = GFP_NOIO;
if (!bi)
return true;
@@ -238,12 +239,20 @@ bool bio_integrity_prep(struct bio *bio)
if (!bi->profile->generate_fn ||
!(bi->flags & BLK_INTEGRITY_GENERATE))
return true;
+
+ /*
+ * Zero the memory allocated to not leak uninitialized kernel
+ * memory to disk. For PI this only affects the app tag, but
+ * for non-integrity metadata it affects the entire metadata
+ * buffer.
+ */
+ gfp |= __GFP_ZERO;
}
intervals = bio_integrity_intervals(bi, bio_sectors(bio));
/* Allocate kernel buffer for protection data */
len = intervals * bi->tuple_size;
- buf = kmalloc(len, GFP_NOIO | q->bounce_gfp);
+ buf = kmalloc(len, gfp | q->bounce_gfp);
status = BLK_STS_RESOURCE;
if (unlikely(buf == NULL)) {
printk(KERN_ERR "could not allocate integrity buffer\n");
--
2.39.4
Fix a few issues in rescind handling in uio_hv_generic driver.
Patches are based on latest linux-next tip.
Steps to reproduce issue:
* Probe uio_hv_generic driver and create channels to use fcopy
* Disable the guest service on host and then Enable it.
or
* repeatedly do cat "/dev/uioX" on the device created for fcopy.
Changes since v1:
https://lore.kernel.org/all/20240822110912.13735-1-namjain@linux.microsoft.…
* Added stable kernel list to cc
* Updated commit messages for more information
* Explicitly handle rescind callback for primary channel only, and add
comment: Saurabh, Michael.
* Rebase to latest tip.
Naman Jain (1):
Drivers: hv: vmbus: Fix rescind handling in uio_hv_generic
Saurabh Sengar (1):
uio_hv_generic: Fix kernel NULL pointer dereference in hv_uio_rescind
drivers/hv/vmbus_drv.c | 1 +
drivers/uio/uio_hv_generic.c | 11 ++++++++++-
2 files changed, 11 insertions(+), 1 deletion(-)
base-commit: 195a402a75791e6e0d96d9da27ca77671bc656a8
--
2.34.1
We were allowing any users to create a high priority group without any
permission checks. As a result, this was allowing possible denial of
service.
We now only allow the DRM master or users with the CAP_SYS_NICE
capability to set higher priorities than PANTHOR_GROUP_PRIORITY_MEDIUM.
As the sole user of that uAPI lives in Mesa and hardcode a value of
MEDIUM [1], this should be safe to do.
Additionally, as those checks are performed at the ioctl level,
panthor_group_create now only check for priority level validity.
[1]https://gitlab.freedesktop.org/mesa/mesa/-/blob/f390835074bdf162a63deb031…
Signed-off-by: Mary Guillemard <mary.guillemard(a)collabora.com>
Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/panthor/panthor_drv.c | 23 +++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
include/uapi/drm/panthor_drm.h | 6 +++++-
3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index b5e7b919f241..34182f67136c 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -10,6 +10,7 @@
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
+#include <drm/drm_auth.h>
#include <drm/drm_debugfs.h>
#include <drm/drm_drv.h>
#include <drm/drm_exec.h>
@@ -996,6 +997,24 @@ static int panthor_ioctl_group_destroy(struct drm_device *ddev, void *data,
return panthor_group_destroy(pfile, args->group_handle);
}
+static int group_priority_permit(struct drm_file *file,
+ u8 priority)
+{
+ /* Ensure that priority is valid */
+ if (priority > PANTHOR_GROUP_PRIORITY_HIGH)
+ return -EINVAL;
+
+ /* Medium priority and below are always allowed */
+ if (priority <= PANTHOR_GROUP_PRIORITY_MEDIUM)
+ return 0;
+
+ /* Higher priorities require CAP_SYS_NICE or DRM_MASTER */
+ if (capable(CAP_SYS_NICE) || drm_is_current_master(file))
+ return 0;
+
+ return -EACCES;
+}
+
static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
struct drm_file *file)
{
@@ -1011,6 +1030,10 @@ static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
if (ret)
return ret;
+ ret = group_priority_permit(file, args->priority);
+ if (ret)
+ return ret;
+
ret = panthor_group_create(pfile, args, queue_args);
if (ret >= 0) {
args->group_handle = ret;
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index c426a392b081..91a31b70c037 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3092,7 +3092,7 @@ int panthor_group_create(struct panthor_file *pfile,
if (group_args->pad)
return -EINVAL;
- if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
+ if (group_args->priority >= PANTHOR_CSG_PRIORITY_COUNT)
return -EINVAL;
if ((group_args->compute_core_mask & ~ptdev->gpu_info.shader_present) ||
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 926b1deb1116..e23a7f9b0eac 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -692,7 +692,11 @@ enum drm_panthor_group_priority {
/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
PANTHOR_GROUP_PRIORITY_MEDIUM,
- /** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
+ /**
+ * @PANTHOR_GROUP_PRIORITY_HIGH: High priority group.
+ *
+ * Requires CAP_SYS_NICE or DRM_MASTER.
+ */
PANTHOR_GROUP_PRIORITY_HIGH,
};
base-commit: a15710027afb40c7c1e352902fa5b8c949f021de
--
2.46.0
From: Anirudh Rayabharam (Microsoft) <anirudh(a)anirudhrb.com>
commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when
CPUs go online/offline") introduces a new cpuhp state for hyperv
initialization.
cpuhp_setup_state() returns the state number if state is
CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states.
For the hyperv case, since a new cpuhp state was introduced it would
return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call
is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and
so hv_cpu_die() won't be called on all CPUs. This means the VP assist page
won't be reset. When the kexec kernel tries to setup the VP assist page
again, the hypervisor corrupts the memory region of the old VP assist page
causing a panic in case the kexec kernel is using that memory elsewhere.
This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec
panic/hang issues").
Get rid of hyperv_init_cpuhp entirely since we are no longer using a
dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with
cpuhp_remove_state().
Cc: stable(a)vger.kernel.org
Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh(a)anirudhrb.com>
---
v1->v2:
- Remove hyperv_init_cpuhp entirely and use CPUHP_AP_HYPERV_ONLINE directly
with cpuhp_remove_state().
v1: https://lore.kernel.org/linux-hyperv/87wmk2xt5i.fsf@redhat.com/T/#m54b8ae17…
---
arch/x86/hyperv/hv_init.c | 5 +----
arch/x86/include/asm/mshyperv.h | 1 -
arch/x86/kernel/cpu/mshyperv.c | 4 ++--
3 files changed, 3 insertions(+), 7 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 17a71e92a343..95eada2994e1 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -35,7 +35,6 @@
#include <clocksource/hyperv_timer.h>
#include <linux/highmem.h>
-int hyperv_init_cpuhp;
u64 hv_current_partition_id = ~0ull;
EXPORT_SYMBOL_GPL(hv_current_partition_id);
@@ -607,8 +606,6 @@ void __init hyperv_init(void)
register_syscore_ops(&hv_syscore_ops);
- hyperv_init_cpuhp = cpuhp;
-
if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_ACCESS_PARTITION_ID)
hv_get_partition_id();
@@ -637,7 +634,7 @@ void __init hyperv_init(void)
clean_guest_os_id:
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
- cpuhp_remove_state(cpuhp);
+ cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
free_ghcb_page:
free_percpu(hv_ghcb_pg);
free_vp_assist_page:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 390c4d13956d..5f0bc6a6d025 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -40,7 +40,6 @@ static inline unsigned char hv_get_nmi_reason(void)
}
#if IS_ENABLED(CONFIG_HYPERV)
-extern int hyperv_init_cpuhp;
extern bool hyperv_paravisor_present;
extern void *hv_hypercall_pg;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba84..e98db51f25ba 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -199,8 +199,8 @@ static void hv_machine_shutdown(void)
* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
* corrupts the old VP Assist Pages and can crash the kexec kernel.
*/
- if (kexec_in_progress && hyperv_init_cpuhp > 0)
- cpuhp_remove_state(hyperv_init_cpuhp);
+ if (kexec_in_progress)
+ cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
/* The function calls stop_other_cpus(). */
native_machine_shutdown();
--
2.45.2
[2024-09-04 19:50] Sasha Levin:
> This is a note to let you know that I've just added the patch titled
>
> drm/drm-bridge: Drop conditionals around of_node pointers
>
> to the 6.6-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> drm-drm-bridge-drop-conditionals-around-of_node-poin.patch
> and it can be found in the queue-6.6 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 74f5f42c35daf9aedbc96283321c30fc591c634f
> Author: Sui Jingfeng <sui.jingfeng(a)linux.dev>
> Date: Wed May 8 02:00:00 2024 +0800
>
> drm/drm-bridge: Drop conditionals around of_node pointers
>
> [ Upstream commit ad3323a6ccb7d43bbeeaa46d5311c43d5d361fc7 ]
>
> Having conditional around the of_node pointer of the drm_bridge structure
> is not necessary, since drm_bridge structure always has the of_node as its
> member.
>
> Let's drop the conditional to get a better looks, please also note that
> this is following the already accepted commitments. see commit d8dfccde2709
> ("drm/bridge: Drop conditionals around of_node pointers") for reference.
>
> Signed-off-by: Sui Jingfeng <sui.jingfeng(a)linux.dev>
> Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas(a)ideasonboard.com>
> Signed-off-by: Robert Foss <rfoss(a)kernel.org>
> Link: https://patchwork.freedesktop.org/patch/msgid/20240507180001.1358816-1-sui.…
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/drivers/gpu/drm/drm_bridge.c b/drivers/gpu/drm/drm_bridge.c
> index 62d8a291c49c..70b05582e616 100644
> --- a/drivers/gpu/drm/drm_bridge.c
> +++ b/drivers/gpu/drm/drm_bridge.c
> @@ -353,13 +353,8 @@ int drm_bridge_attach(struct drm_encoder *encoder, struct drm_bridge *bridge,
> bridge->encoder = NULL;
> list_del(&bridge->chain_node);
>
> -#ifdef CONFIG_OF
> DRM_ERROR("failed to attach bridge %pOF to encoder %s: %d\n",
> bridge->of_node, encoder->name, ret);
> -#else
> - DRM_ERROR("failed to attach bridge to encoder %s: %d\n",
> - encoder->name, ret);
> -#endif
>
> return ret;
> }
Hi Sasha,
this breaks the x86_64 build for me.
AFAICT this patch cannot work without commit
d8dfccde2709de4327c3d62b50e5dc012f08836f "drm/bridge: Drop conditionals
around of_node pointers", but that commit is only present in Linux >= 6.7.
This issue affects the 6.6, 6.1 and 5.15 branches.
Regards
Pascal
From: yangge <yangge1116(a)126.com>
If a large number of CMA memory are configured in system (for example, the
CMA memory accounts for 50% of the system memory), starting a virtual
virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
...) to pin memory. Normally if a page is present and in CMA area,
pin_user_pages_remote() will migrate the page from CMA area to non-CMA
area because of FOLL_LONGTERM flag. But the current code will cause the
migration failure due to unexpected page refcounts, and eventually cause
the virtual machine fail to start.
If a page is added in LRU batch, its refcount increases one, remove the
page from LRU batch decreases one. Page migration requires the page is not
referenced by others except page mapping. Before migrating a page, we
should try to drain the page from LRU batch in case the page is in it,
however, folio_test_lru() is not sufficient to tell whether the page is
in LRU batch or not, if the page is in LRU batch, the migration will fail.
To solve the problem above, we modify the logic of adding to LRU batch.
Before adding a page to LRU batch, we clear the LRU flag of the page so
that we can check whether the page is in LRU batch by folio_test_lru(page).
Seems making the LRU flag of the page invisible a long time is no problem,
because a new page is allocated from buddy and added to the lru batch,
its LRU flag is also not visible for a long time.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index dc205bd..9caf6b0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
- /* block memcg migration while the folio moves between lru */
- if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
- continue;
-
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
- !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+ !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
unsigned long flags;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock_irqsave(&lru_rotate.lock, flags);
fbatch = this_cpu_ptr(&lru_rotate.fbatch);
folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
@@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_active(folio) &&
- !folio_test_unevictable(folio)) {
+ if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.activate);
folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
@@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
return;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
@@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
- (folio_test_active(folio) || lru_gen_enabled())) {
+ if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
+ lru_gen_enabled())) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
@@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
*/
void folio_mark_lazyfree(struct folio *folio)
{
- if (folio_test_lru(folio) && folio_test_anon(folio) &&
- folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
- !folio_test_unevictable(folio)) {
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+ !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
struct folio_batch *fbatch;
folio_get(folio);
+ if (!folio_test_clear_lru(folio)) {
+ folio_put(folio);
+ return;
+ }
+
local_lock(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);
--
2.7.4
SOC-integrated devices on some platforms require their PCI ATS enabled
for operation when the IOMMU is in scalable mode. Those devices are
reported via ACPI/SATC table with the ATC_REQUIRED bit set in the Flags
field.
The PCI subsystem offers the 'pci=noats' kernel command to disable PCI
ATS on all devices. Using 'pci=noat' with devices that require PCI ATS
can cause a conflict, leading to boot failure, especially if the device
is a graphics device.
To prevent this issue, check PCI ATS support before enumerating the IOMMU
devices. If any device requires PCI ATS, but PCI ATS is disabled by
'pci=noats', switch the IOMMU to operate in legacy mode to ensure
successful booting.
Fixes: 97f2f2c5317f ("iommu/vt-d: Enable ATS for the devices in SATC table")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12036
Cc: stable(a)vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
---
drivers/iommu/intel/iommu.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4aa070cf56e7..8f275e046e91 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3127,10 +3127,26 @@ int dmar_iommu_notify_scope_dev(struct dmar_pci_notify_info *info)
(void *)satc + satc->header.length,
satc->segment, satcu->devices,
satcu->devices_cnt);
- if (ret > 0)
- break;
- else if (ret < 0)
+ if (ret < 0)
return ret;
+
+ if (ret > 0) {
+ /*
+ * The device requires PCI/ATS when the IOMMU
+ * works in the scalable mode. If PCI/ATS is
+ * disabled using the pci=noats kernel parameter,
+ * the IOMMU will default to legacy mode. Users
+ * are informed of this change.
+ */
+ if (intel_iommu_sm && satcu->atc_required &&
+ !pci_ats_supported(info->dev)) {
+ pci_warn(info->dev,
+ "PCI/ATS not supported, system working in IOMMU legacy mode\n");
+ intel_iommu_sm = 0;
+ }
+
+ break;
+ }
} else if (info->event == BUS_NOTIFY_REMOVED_DEVICE) {
if (dmar_remove_dev_scope(info, satc->segment,
satcu->devices, satcu->devices_cnt))
--
2.34.1
From: Zheng Yejian <zhengyejian(a)huaweicloud.com>
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).
Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
Cc: stable(a)vger.kernel.org
Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian(a)huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ebe7ce2f5f4a..edf6bc817aa1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3958,6 +3958,8 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
break;
entries++;
ring_buffer_iter_advance(buf_iter);
+ /* This could be a big loop */
+ cond_resched();
}
per_cpu_ptr(iter->array_buffer->data, cpu)->skipped_entries = entries;
--
2.43.0
From: Steven Rostedt <rostedt(a)goodmis.org>
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shutdown one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shutdown and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.
Only cancel the hrtimer if the associated thread is still around.
Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.
Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv(a)redhat.com>
Link: https://lore.kernel.org/20240903111642.35292e70@gandalf.local.home
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_osnoise.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..400a72cd6ab5 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -265,6 +265,8 @@ static inline void tlat_var_reset(void)
*/
for_each_cpu(cpu, cpu_online_mask) {
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
}
}
@@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
- hrtimer_cancel(&tlat_var->timer);
+ if (tlat_var->kthread)
+ hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
osn_var->sampling = 0;
--
2.43.0
The polled UART operations are used by the kernel debugger (KDB, KGDB),
which can interrupt the kernel at any point in time. The current
Qualcomm GENI implementation does not really work when there is on-going
serial output as it inadvertently "hijacks" the current tx command,
which can result in both the initial debugger output being corrupted as
well as the corruption of any on-going serial output (up to 4k
characters) when execution resumes:
0190: abcdefghijklmnopqrstuvwxyz0123456789 0190: abcdefghijklmnopqrstuvwxyz0123456789
0191: abcdefghijklmnop[ 50.825552] sysrq: DEBUG
qrstuvwxyz0123456789 0191: abcdefghijklmnopqrstuvwxyz0123456789
Entering kdb (current=0xffff53510b4cd280, pid 640) on processor 2 due to Keyboard Entry
[2]kdb> go
omlji3h3h2g2g1f1f0e0ezdzdycycxbxbwawav :t72r2rp
o9n976k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawavu:t7t8s8s8r2r2q0q0p
o9n9n8ml6k6k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawav v u:u:t9t0s4s4rq0p
o9n9n8m8m7l7l6k6k5j5j40q0p p o
o9n9n8m8m7l7l6k6k5j5j4i4i3h3h2g2g1f1f0e0ezdzdycycxbxbwawav :t8t9s4s4r4r4q0q0p
Fix this by making sure that the polled output implementation waits for
the tx fifo to drain before cancelling any on-going longer transfers. As
the polled code cannot take any locks, leave the state variables as they
are and instead make sure that the interrupt handler always starts a new
tx command when there is data in the write buffer.
Since the debugger can interrupt the interrupt handler when it is
writing data to the tx fifo, it is currently not possible to fully
prevent losing up to 64 bytes of tty output on resume.
Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Cc: stable(a)vger.kernel.org # 4.17
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/tty/serial/qcom_geni_serial.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index fbed143c90a3..cf8bafd99a09 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -145,6 +145,7 @@ static const struct uart_ops qcom_geni_uart_pops;
static struct uart_driver qcom_geni_console_driver;
static struct uart_driver qcom_geni_uart_driver;
+static void __qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport);
static inline struct qcom_geni_serial_port *to_dev_port(struct uart_port *uport)
@@ -403,13 +404,14 @@ static int qcom_geni_serial_get_char(struct uart_port *uport)
static void qcom_geni_serial_poll_put_char(struct uart_port *uport,
unsigned char c)
{
- writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
+ if (qcom_geni_serial_main_active(uport)) {
+ qcom_geni_serial_poll_tx_done(uport);
+ __qcom_geni_serial_cancel_tx_cmd(uport);
+ }
+
writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
qcom_geni_serial_setup_tx(uport, 1);
- WARN_ON(!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
- M_TX_FIFO_WATERMARK_EN, true));
writel(c, uport->membase + SE_GENI_TX_FIFOn);
- writel(M_TX_FIFO_WATERMARK_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
qcom_geni_serial_poll_tx_done(uport);
}
#endif
@@ -688,13 +690,10 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
}
-static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
+static void __qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
{
struct qcom_geni_serial_port *port = to_dev_port(uport);
- if (!qcom_geni_serial_main_active(uport))
- return;
-
geni_se_cancel_m_cmd(&port->se);
if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
M_CMD_CANCEL_EN, true)) {
@@ -704,6 +703,16 @@ static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
}
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+}
+
+static void qcom_geni_serial_cancel_tx_cmd(struct uart_port *uport)
+{
+ struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+ if (!qcom_geni_serial_main_active(uport))
+ return;
+
+ __qcom_geni_serial_cancel_tx_cmd(uport);
port->tx_remaining = 0;
port->tx_queued = 0;
@@ -930,7 +939,7 @@ static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
if (!chunk)
goto out_write_wakeup;
- if (!port->tx_remaining) {
+ if (!active) {
qcom_geni_serial_setup_tx(uport, pending);
port->tx_remaining = pending;
port->tx_queued = 0;
--
2.44.2
The patch titled
Subject: ocfs2: cancel dqi_sync_work before freeing oinfo
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
ocfs2-cancel-dqi_sync_work-before-freeing-oinfo.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Subject: ocfs2: cancel dqi_sync_work before freeing oinfo
Date: Wed, 4 Sep 2024 15:10:03 +0800
ocfs2_global_read_info() will initialize and schedule dqi_sync_work at the
end, if error occurs after successfully reading global quota, it will
trigger the following warning with CONFIG_DEBUG_OBJECTS_* enabled:
ODEBUG: free active (active state 0) object: 00000000d8b0ce28 object type: timer_list hint: qsync_work_fn+0x0/0x16c
This reports that there is an active delayed work when freeing oinfo in
error handling, so cancel dqi_sync_work first. BTW, return status instead
of -1 when .read_file_info fails.
Link: https://syzkaller.appspot.com/bug?extid=f7af59df5d6b25f0febd
Link: https://lkml.kernel.org/r/20240904071004.2067695-1-joseph.qi@linux.alibaba.…
Fixes: 171bf93ce11f ("ocfs2: Periodic quota syncing")
Signed-off-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao(a)suse.com>
Reported-by: syzbot+f7af59df5d6b25f0febd(a)syzkaller.appspotmail.com
Tested-by: syzbot+f7af59df5d6b25f0febd(a)syzkaller.appspotmail.com
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/quota_local.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/quota_local.c~ocfs2-cancel-dqi_sync_work-before-freeing-oinfo
+++ a/fs/ocfs2/quota_local.c
@@ -692,7 +692,7 @@ static int ocfs2_local_read_info(struct
int status;
struct buffer_head *bh = NULL;
struct ocfs2_quota_recovery *rec;
- int locked = 0;
+ int locked = 0, global_read = 0;
info->dqi_max_spc_limit = 0x7fffffffffffffffLL;
info->dqi_max_ino_limit = 0x7fffffffffffffffLL;
@@ -700,6 +700,7 @@ static int ocfs2_local_read_info(struct
if (!oinfo) {
mlog(ML_ERROR, "failed to allocate memory for ocfs2 quota"
" info.");
+ status = -ENOMEM;
goto out_err;
}
info->dqi_priv = oinfo;
@@ -712,6 +713,7 @@ static int ocfs2_local_read_info(struct
status = ocfs2_global_read_info(sb, type);
if (status < 0)
goto out_err;
+ global_read = 1;
status = ocfs2_inode_lock(lqinode, &oinfo->dqi_lqi_bh, 1);
if (status < 0) {
@@ -782,10 +784,12 @@ out_err:
if (locked)
ocfs2_inode_unlock(lqinode, 1);
ocfs2_release_local_quota_bitmaps(&oinfo->dqi_chunk);
+ if (global_read)
+ cancel_delayed_work_sync(&oinfo->dqi_sync_work);
kfree(oinfo);
}
brelse(bh);
- return -1;
+ return status;
}
/* Write local info to quota file */
_
Patches currently in -mm which might be from joseph.qi(a)linux.alibaba.com are
ocfs2-cancel-dqi_sync_work-before-freeing-oinfo.patch
No upstream commit exists for this patch.
Remove kfree(&vi->smb_vol), since &vi->smb_vol
is a pointer to an area inside the allocated memory.
The issue was fixed on the way by upstream commit 837e3a1bbfdc
("cifs: rename dup_vol to smb3_fs_context_dup and move it into fs_context.c")
but this commit is not material for stable branches.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 54be1f6c1c37 ("cifs: Add DFS cache routines")
Signed-off-by: Alexandra Diupina <adiupina(a)astralinux.ru>
---
fs/cifs/dfs_cache.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/cifs/dfs_cache.c b/fs/cifs/dfs_cache.c
index 7b6db272fd0b..da6d775102f2 100644
--- a/fs/cifs/dfs_cache.c
+++ b/fs/cifs/dfs_cache.c
@@ -1194,7 +1194,6 @@ static int dup_vol(struct smb_vol *vol, struct smb_vol *new)
kfree_sensitive(new->password);
err_free_username:
kfree(new->username);
- kfree(new);
return -ENOMEM;
}
--
2.30.2
Hi, all.
Recently syzbot reported a bug as following:
kernel BUG at fs/f2fs/inode.c:896!
CPU: 1 UID: 0 PID: 5217 Comm: syz-executor605 Not tainted 6.11.0-rc4-syzkaller-00033-g872cf28b8df9 #0
RIP: 0010:f2fs_evict_inode+0x1598/0x15c0 fs/f2fs/inode.c:896
Call Trace:
<TASK>
evict+0x532/0x950 fs/inode.c:704
dispose_list fs/inode.c:747 [inline]
evict_inodes+0x5f9/0x690 fs/inode.c:797
generic_shutdown_super+0x9d/0x2d0 fs/super.c:627
kill_block_super+0x44/0x90 fs/super.c:1696
kill_f2fs_super+0x344/0x690 fs/f2fs/super.c:4898
deactivate_locked_super+0xc4/0x130 fs/super.c:473
cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
task_work_run+0x24f/0x310 kernel/task_work.c:228
ptrace_notify+0x2d2/0x380 kernel/signal.c:2402
ptrace_report_syscall include/linux/ptrace.h:415 [inline]
ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
syscall_exit_work+0xc6/0x190 kernel/entry/common.c:173
syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
syscall_exit_to_user_mode+0x279/0x370 kernel/entry/common.c:218
do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The syzbot constructed the following scenario: concurrently
creating directories and setting the file system to read-only.
In this case, while f2fs was making dir, the filesystem switched to
readonly, and when it tried to clear the dirty flag, it triggered this
code path: f2fs_mkdir()-> f2fs_sync_fs()->f2fs_write_checkpoint()
->f2fs_readonly(). This resulted FI_DIRTY_INODE flag not being cleared,
which eventually led to a bug being triggered during the FI_DIRTY_INODE
check in f2fs_evict_inode().
In this case, we cannot do anything further, so if filesystem is readonly,
do not trigger the BUG. Instead, clean up resources to the best of our
ability to prevent triggering subsequent resource leak checks.
If there is anything important I'm missing, please let me know, thanks.
Reported-by: syzbot+ebea2790904673d7c618(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ebea2790904673d7c618
Fixes: ca7d802a7d8e ("f2fs: detect dirty inode in evict_inode")
CC: stable(a)vger.kernel.org
Signed-off-by: Julian Sun <sunjunchao2870(a)gmail.com>
---
fs/f2fs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index aef57172014f..ebf825dba0a5 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -892,7 +892,8 @@ void f2fs_evict_inode(struct inode *inode)
atomic_read(&fi->i_compr_blocks));
if (likely(!f2fs_cp_error(sbi) &&
- !is_sbi_flag_set(sbi, SBI_CP_DISABLED)))
+ !is_sbi_flag_set(sbi, SBI_CP_DISABLED)) &&
+ !f2fs_readonly(sbi->sb))
f2fs_bug_on(sbi, is_inode_flag_set(inode, FI_DIRTY_INODE));
else
f2fs_inode_synced(inode);
--
2.39.2
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f18fa2abf81099d822d842a107f8c9889c86043c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024083033-whoopee-visa-f9dd@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
f18fa2abf810 ("selftests: mptcp: join: check re-re-adding ID 0 signal")
20ccc7c5f7a3 ("selftests: mptcp: join: validate event numbers")
d397d7246c11 ("selftests: mptcp: join: check re-re-adding ID 0 endp")
1c2326fcae4f ("selftests: mptcp: join: check re-adding init endp with != id")
5f94b08c0012 ("selftests: mptcp: join: check removing ID 0 endpoint")
65fb58afa341 ("selftests: mptcp: join: check re-using ID of closed subflow")
a13d5aad4dd9 ("selftests: mptcp: join: check re-using ID of unused ADD_ADDR")
b5e2fb832f48 ("selftests: mptcp: add explicit test case for remove/readd")
40061817d95b ("selftests: mptcp: join: fix dev in check_endpoint")
23a0485d1c04 ("selftests: mptcp: declare event macros in mptcp_lib")
35bc143a8514 ("selftests: mptcp: add mptcp_lib_events helper")
3a0f9bed3c28 ("selftests: mptcp: add mptcp_lib_ns_init/exit helpers")
3fb8c33ef4b9 ("selftests: mptcp: add mptcp_lib_check_tools helper")
7c2eac649054 ("selftests: mptcp: stop forcing iptables-legacy")
4cc5cc7ca052 ("selftests: mptcp: userspace pm get addr tests")
38f027fca1b7 ("selftests: mptcp: dump userspace addrs list")
2d0c1d27ea4e ("selftests: mptcp: add mptcp_lib_check_output helper")
9480f388a2ef ("selftests: mptcp: join: add ss mptcp support check")
7092dbee2328 ("selftests: mptcp: rm subflow with v4/v4mapped addr")
04b57c9e096a ("selftests: mptcp: join: stop transfer when check is done (part 2)")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f18fa2abf81099d822d842a107f8c9889c86043c Mon Sep 17 00:00:00 2001
From: "Matthieu Baerts (NGI0)" <matttbe(a)kernel.org>
Date: Wed, 28 Aug 2024 08:14:38 +0200
Subject: [PATCH] selftests: mptcp: join: check re-re-adding ID 0 signal
This test extends "delete re-add signal" to validate the previous
commit: when the 'signal' endpoint linked to the initial subflow (ID 0)
is re-added multiple times, it will re-send the ADD_ADDR with id 0. The
client should still be able to re-create this subflow, even if the
add_addr_accepted limit has been reached as this special address is not
considered as a new address.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index a8ea0fe200fb..a4762c49a878 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -3688,7 +3688,7 @@ endpoint_tests()
# broadcast IP: no packet for this address will be received on ns1
pm_nl_add_endpoint $ns1 224.0.0.1 id 2 flags signal
pm_nl_add_endpoint $ns1 10.0.1.1 id 42 flags signal
- test_linkfail=4 speed=20 \
+ test_linkfail=4 speed=5 \
run_tests $ns1 $ns2 10.0.1.1 &
local tests_pid=$!
@@ -3717,7 +3717,17 @@ endpoint_tests()
pm_nl_add_endpoint $ns1 10.0.1.1 id 99 flags signal
wait_mpj $ns2
- chk_subflow_nr "after re-add" 3
+ chk_subflow_nr "after re-add ID 0" 3
+ chk_mptcp_info subflows 3 subflows 3
+
+ pm_nl_del_endpoint $ns1 99 10.0.1.1
+ sleep 0.5
+ chk_subflow_nr "after re-delete ID 0" 2
+ chk_mptcp_info subflows 2 subflows 2
+
+ pm_nl_add_endpoint $ns1 10.0.1.1 id 88 flags signal
+ wait_mpj $ns2
+ chk_subflow_nr "after re-re-add ID 0" 3
chk_mptcp_info subflows 3 subflows 3
mptcp_lib_kill_wait $tests_pid
@@ -3727,19 +3737,19 @@ endpoint_tests()
chk_evt_nr ns1 MPTCP_LIB_EVENT_ESTABLISHED 1
chk_evt_nr ns1 MPTCP_LIB_EVENT_ANNOUNCED 0
chk_evt_nr ns1 MPTCP_LIB_EVENT_REMOVED 0
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns1 MPTCP_LIB_EVENT_SUB_CLOSED 3
chk_evt_nr ns2 MPTCP_LIB_EVENT_CREATED 1
chk_evt_nr ns2 MPTCP_LIB_EVENT_ESTABLISHED 1
- chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 5
- chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 3
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 4
- chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 2
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_ANNOUNCED 6
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_REMOVED 4
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_ESTABLISHED 5
+ chk_evt_nr ns2 MPTCP_LIB_EVENT_SUB_CLOSED 3
- chk_join_nr 4 4 4
- chk_add_nr 5 5
- chk_rm_nr 3 2 invert
+ chk_join_nr 5 5 5
+ chk_add_nr 6 6
+ chk_rm_nr 4 3 invert
fi
# flush and re-add
commit e93681afcb96864ec26c3b2ce94008ce93577373 upstream.
Thanks to the previous commit, the MPTCP subflows are now closed on both
directions even when only the MPTCP path-manager of one peer asks for
their closure.
In the two tests modified here -- "userspace pm add & remove address"
and "userspace pm create destroy subflow" -- one peer is controlled by
the userspace PM, and the other one by the in-kernel PM. When the
userspace PM sends a RM_ADDR notification, the in-kernel PM will
automatically react by closing all subflows using this address. Now,
thanks to the previous commit, the subflows are properly closed on both
directions, the userspace PM can then no longer closes the same
subflows if they are already closed. Before, it was OK to do that,
because the subflows were still half-opened, still OK to send a RM_ADDR.
In other words, thanks to the previous commit closing the subflows, an
error will be returned to the userspace if it tries to close a subflow
that has already been closed. So no need to run this command, which mean
that the linked counters will then not be incremented.
These tests are then no longer sending both a RM_ADDR, then closing the
linked subflow just after. The test with the userspace PM on the server
side is now removing one subflow linked to one address, then sending
a RM_ADDR for another address. The test with the userspace PM on the
client side is now only removing the subflow that was previously
created.
Cc: stable(a)vger.kernel.org
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-2-905199f…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Fixes: 97040cf9806e ("selftests: mptcp: userspace pm address tests")
Fixes: 5e986ec46874 ("selftests: mptcp: userspace pm subflow tests")
[ It looks like this patch is needed for the same reasons as mentioned
above, but the resolution is different: the subflows and addresses are
removed elsewhere. The same type of adaptations have been applied
here. The Fixes tag has been replaced by better appropriated ones. ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 20561d569697..446b8daa23e0 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -984,8 +984,6 @@ do_transfer()
dp=$(grep "type:10" "$evts_ns1" |
sed -n 's/.*\(dport:\)\([[:digit:]]*\).*$/\2/p;q')
ip netns exec ${listener_ns} ./pm_nl_ctl rem token $tk id $id
- ip netns exec ${listener_ns} ./pm_nl_ctl dsf lip "$addr" \
- lport $sp rip $da rport $dp token $tk
fi
counter=$((counter + 1))
@@ -1051,7 +1049,6 @@ do_transfer()
sleep 1
sp=$(grep "type:10" "$evts_ns2" |
sed -n 's/.*\(sport:\)\([[:digit:]]*\).*$/\2/p;q')
- ip netns exec ${connector_ns} ./pm_nl_ctl rem token $tk id $id
ip netns exec ${connector_ns} ./pm_nl_ctl dsf lip $addr lport $sp \
rip $da rport $dp token $tk
fi
@@ -3280,7 +3277,7 @@ userspace_tests()
run_tests $ns1 $ns2 10.0.1.1 0 userspace_1 0 slow
chk_join_nr 1 1 1
chk_add_nr 1 1
- chk_rm_nr 1 1 invert
+ chk_rm_nr 1 0 invert
fi
# userspace pm create destroy subflow
@@ -3290,7 +3287,7 @@ userspace_tests()
pm_nl_set_limits $ns1 0 1
run_tests $ns1 $ns2 10.0.1.1 0 0 userspace_1 slow
chk_join_nr 1 1 1
- chk_rm_nr 1 1
+ chk_rm_nr 0 1
fi
}
--
2.45.2