From: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
[ Upstream commit 7222f5841ff49709ca666b05ff336776e0664a20 ]
[Why & How]
DC now uses a new commit sequence which is more robust since it
addresses cases where we need to reorganize pipes based on planes and
other parameters. As a result, this new commit sequence resets the DC
state by cleaning plane states and re-creating them as needed. For this
reason, dce_transform_set_pixel_storage_depth can be invoked after a
plane state is destroyed and before its re-creation. In this situation,
on DCE devices, DC will hit a condition that triggers a dmesg log that
looks like this:
Console: switching to colour frame buffer device 240x67
------------[ cut here ]------------
[..]
Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
RIP: 0010:dce_transform_set_pixel_storage_depth+0x3f8/0x480 [amdgpu]
[..]
RSP: 0018:ffffc9000202b850 EFLAGS: 00010293
RAX: ffffffffa081d100 RBX: ffff888110790000 RCX: 000000000000000c
RDX: ffff888100bedbf8 RSI: 0000000000001a50 RDI: ffff88810463c900
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000f00 R12: ffff88810f500010
R13: ffff888100bedbf8 R14: ffff88810f515688 R15: 0000000000000000
FS: 00007ff0159249c0(0000) GS:ffff88840e940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff01528e550 CR3: 0000000002a10000 CR4: 00000000003506e0
Call Trace:
<TASK>
? dm_write_reg_func+0x21/0x80 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
dc_stream_set_dither_option+0xfb/0x130 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
amdgpu_dm_crtc_configure_crc_source+0x10b/0x190 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
amdgpu_dm_atomic_commit_tail+0x20a8/0x2a90 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
? free_unref_page_commit+0x98/0x170
? free_unref_page+0xcc/0x150
commit_tail+0x94/0x120
drm_atomic_helper_commit+0x10f/0x140
drm_atomic_commit+0x94/0xc0
? drm_plane_get_damage_clips.cold+0x1c/0x1c
drm_client_modeset_commit_atomic+0x203/0x250
drm_client_modeset_commit_locked+0x56/0x150
drm_client_modeset_commit+0x21/0x40
drm_fb_helper_lastclose+0x42/0x70
amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
drm_release+0xda/0x110
__fput+0x89/0x240
task_work_run+0x5c/0x90
do_exit+0x333/0xae0
do_group_exit+0x2d/0x90
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x5b/0x80
? exit_to_user_mode_prepare+0x1e/0x140
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7ff016ceaca1
Code: Unable to access opcode bytes at RIP 0x7ff016ceac77.
RSP: 002b:00007ffe7a2357e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007ff016e15a00 RCX: 00007ff016ceaca1
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffffffffff78 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ff016e15a00
R13: 0000000000000000 R14: 00007ff016e1aee8 R15: 00007ff016e1af00
</TASK>
Since this issue only happens in a transition state on DC, this commit
replaces BREAK_TO_DEBUGGER with DC_LOG_DC and lowers the "Capability not
supported" message from DC_LOG_WARNING to DC_LOG_DC.
Reviewed-by: Harry Wentland <Harry.Wentland(a)amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo(a)amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/amd/display/dc/dce/dce_transform.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c b/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
index 6fd57cfb112f5..96fdc18ecb3bf 100644
--- a/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
@@ -778,7 +778,7 @@ static void dce_transform_set_pixel_storage_depth(
color_depth = COLOR_DEPTH_101010;
pixel_depth = 0;
expan_mode = 1;
- BREAK_TO_DEBUGGER();
+ DC_LOG_DC("The pixel depth %d is not valid, set COLOR_DEPTH_101010 instead.", depth);
break;
}
@@ -792,8 +792,7 @@ static void dce_transform_set_pixel_storage_depth(
if (!(xfm_dce->lb_pixel_depth_supported & depth)) {
/*we should use unsupported capabilities
* unless it is required by w/a*/
- DC_LOG_WARNING("%s: Capability not supported",
- __func__);
+ DC_LOG_DC("%s: Capability not supported", __func__);
}
}
--
2.39.2
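The effect of the first hunk above can be sketched in user space as
follows. This is only an illustration of the pattern (log and fall back
to 10:10:10 instead of trapping), not the driver code: every model_*
name, the enum values, and the mapping of the non-default cases are
assumptions made for the example; only the default-case fallback to
COLOR_DEPTH_101010 comes from the diff.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the driver's enums; values are illustrative. */
enum model_lb_depth {
	MODEL_LB_18BPP = 1,
	MODEL_LB_24BPP = 2,
	MODEL_LB_30BPP = 4,
	MODEL_LB_36BPP = 8,
};

enum model_color_depth {
	MODEL_COLOR_666,
	MODEL_COLOR_888,
	MODEL_COLOR_101010,
	MODEL_COLOR_121212,
};

/* Counts informational messages so callers can observe the fallback. */
int model_log_count;

/* Stand-in for a low-severity log call: it reports and returns. */
void model_log(const char *msg, int depth)
{
	model_log_count++;
	fprintf(stderr, "%s (depth=%d)\n", msg, depth);
}

/*
 * Map a line-buffer depth to a color depth. An unexpected depth, such
 * as one seen while a plane state is being re-created, is logged and
 * replaced by the 10:10:10 default rather than treated as fatal.
 */
enum model_color_depth model_pick_color_depth(int depth)
{
	switch (depth) {
	case MODEL_LB_18BPP:
		return MODEL_COLOR_666;
	case MODEL_LB_24BPP:
		return MODEL_COLOR_888;
	case MODEL_LB_30BPP:
		return MODEL_COLOR_101010;
	case MODEL_LB_36BPP:
		return MODEL_COLOR_121212;
	default:
		model_log("pixel depth is not valid, using 10:10:10", depth);
		return MODEL_COLOR_101010;
	}
}
```

Calling model_pick_color_depth(0) returns MODEL_COLOR_101010 and bumps
model_log_count by one, while a valid depth leaves the counter untouched.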
From: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
[ Upstream commit 7222f5841ff49709ca666b05ff336776e0664a20 ]
[Why & How]
DC now uses a new commit sequence which is more robust since it
addresses cases where we need to reorganize pipes based on planes and
other parameters. As a result, this new commit sequence resets the DC
state by cleaning plane states and re-creating them as needed. For this
reason, dce_transform_set_pixel_storage_depth can be invoked after a
plane state is destroyed and before its re-creation. In this situation,
on DCE devices, DC will hit a condition that triggers a dmesg log that
looks like this:
Console: switching to colour frame buffer device 240x67
------------[ cut here ]------------
[..]
Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
RIP: 0010:dce_transform_set_pixel_storage_depth+0x3f8/0x480 [amdgpu]
[..]
RSP: 0018:ffffc9000202b850 EFLAGS: 00010293
RAX: ffffffffa081d100 RBX: ffff888110790000 RCX: 000000000000000c
RDX: ffff888100bedbf8 RSI: 0000000000001a50 RDI: ffff88810463c900
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000f00 R12: ffff88810f500010
R13: ffff888100bedbf8 R14: ffff88810f515688 R15: 0000000000000000
FS: 00007ff0159249c0(0000) GS:ffff88840e940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff01528e550 CR3: 0000000002a10000 CR4: 00000000003506e0
Call Trace:
<TASK>
? dm_write_reg_func+0x21/0x80 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
dc_stream_set_dither_option+0xfb/0x130 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
amdgpu_dm_crtc_configure_crc_source+0x10b/0x190 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
amdgpu_dm_atomic_commit_tail+0x20a8/0x2a90 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
? free_unref_page_commit+0x98/0x170
? free_unref_page+0xcc/0x150
commit_tail+0x94/0x120
drm_atomic_helper_commit+0x10f/0x140
drm_atomic_commit+0x94/0xc0
? drm_plane_get_damage_clips.cold+0x1c/0x1c
drm_client_modeset_commit_atomic+0x203/0x250
drm_client_modeset_commit_locked+0x56/0x150
drm_client_modeset_commit+0x21/0x40
drm_fb_helper_lastclose+0x42/0x70
amdgpu_driver_lastclose_kms+0xa/0x10 [amdgpu 340dadd3f7c8cf4be11cf0bdc850245e99abe0e8]
drm_release+0xda/0x110
__fput+0x89/0x240
task_work_run+0x5c/0x90
do_exit+0x333/0xae0
do_group_exit+0x2d/0x90
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x5b/0x80
? exit_to_user_mode_prepare+0x1e/0x140
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7ff016ceaca1
Code: Unable to access opcode bytes at RIP 0x7ff016ceac77.
RSP: 002b:00007ffe7a2357e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007ff016e15a00 RCX: 00007ff016ceaca1
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffffffffff78 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ff016e15a00
R13: 0000000000000000 R14: 00007ff016e1aee8 R15: 00007ff016e1af00
</TASK>
Since this issue only happens in a transition state on DC, this commit
replaces BREAK_TO_DEBUGGER with DC_LOG_DC and lowers the "Capability not
supported" message from DC_LOG_WARNING to DC_LOG_DC.
Reviewed-by: Harry Wentland <Harry.Wentland(a)amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo(a)amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/amd/display/dc/dce/dce_transform.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c b/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
index e2e79025825f8..a54a309879246 100644
--- a/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dce_transform.c
@@ -1011,7 +1011,7 @@ static void dce_transform_set_pixel_storage_depth(
color_depth = COLOR_DEPTH_101010;
pixel_depth = 0;
expan_mode = 1;
- BREAK_TO_DEBUGGER();
+ DC_LOG_DC("The pixel depth %d is not valid, set COLOR_DEPTH_101010 instead.", depth);
break;
}
@@ -1025,8 +1025,7 @@ static void dce_transform_set_pixel_storage_depth(
if (!(xfm_dce->lb_pixel_depth_supported & depth)) {
/*we should use unsupported capabilities
* unless it is required by w/a*/
- DC_LOG_WARNING("%s: Capability not supported",
- __func__);
+ DC_LOG_DC("%s: Capability not supported", __func__);
}
}
--
2.39.2
I'm announcing the release of the 6.2.14 kernel.
All users of the 6.2 kernel series must upgrade.
The updated 6.2.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.2.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/riscv/vm-layout.rst | 6
Makefile | 2
arch/riscv/include/asm/fixmap.h | 8
arch/riscv/include/asm/pgtable.h | 8
arch/riscv/kernel/setup.c | 6
arch/riscv/mm/init.c | 82 +++-----
arch/x86/Makefile.um | 11 +
drivers/base/dd.c | 7
drivers/gpio/gpiolib-acpi.c | 13 +
drivers/gpu/drm/drm_fb_helper.c | 3
drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcmsdh.c | 9
drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 5
drivers/usb/serial/option.c | 6
fs/btrfs/send.c | 2
fs/btrfs/volumes.c | 2
include/linux/mmc/sdio_ids.h | 5
kernel/rcu/tree.c | 27 +-
mm/mempolicy.c | 115 +++++-------
net/bluetooth/hci_sock.c | 9
19 files changed, 193 insertions(+), 133 deletions(-)
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Arınç ÜNAL (1):
USB: serial: option: add UNISOC vendor and TOZED LT70C product
Daniel Vetter (1):
drm/fb-helper: set x/yres_virtual in drm_fb_helper_check_var
David Gow (2):
rust: arch/um: Disable FP/SIMD instruction to match x86
um: Only disable SSE on clang to work around old GCC bugs
Genjian Zhang (1):
btrfs: fix uninitialized variable warnings
Greg Kroah-Hartman (1):
Linux 6.2.14
Jisoo Jang (1):
wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies()
Liam R. Howlett (1):
mm/mempolicy: fix use-after-free of VMA iterator
Marek Vasut (1):
wifi: brcmfmac: add Cypress 43439 SDIO ids
Ruihan Li (1):
bluetooth: Perform careful capability checks in hci_sock_ioctl()
Stephen Boyd (1):
driver core: Don't require dynamic_debug for initcall_debug probe timing
Werner Sembach (1):
gpiolib: acpi: Add a ignore wakeup quirk for Clevo NL5xNU
Ziwei Dai (1):
rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
I'm announcing the release of the 5.15.110 kernel.
All users of the 5.15 kernel series must upgrade.
The updated 5.15.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.15.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/riscv/vm-layout.rst | 2
Makefile | 2
arch/arm64/kvm/mmu.c | 47 +++-----
arch/arm64/kvm/psci.c | 2
arch/riscv/include/asm/fixmap.h | 8 +
arch/riscv/include/asm/pgtable.h | 8 +
arch/riscv/kernel/setup.c | 6 -
arch/riscv/mm/init.c | 68 ++++++------
drivers/base/dd.c | 7 +
drivers/gpu/drm/drm_fb_helper.c | 3
drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 5
drivers/pci/pci.c | 3
drivers/pci/pci.h | 2
drivers/pci/pcie/aspm.c | 19 ---
drivers/usb/serial/option.c | 6 +
net/bluetooth/hci_sock.c | 9 +
tools/testing/selftests/kselftest/runner.sh | 28 +++-
tools/testing/selftests/net/mptcp/mptcp_join.sh | 2
18 files changed, 123 insertions(+), 104 deletions(-)
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Arınç ÜNAL (1):
USB: serial: option: add UNISOC vendor and TOZED LT70C product
Dan Carpenter (1):
KVM: arm64: Fix buffer overflow in kvm_arm_set_fw_reg()
Daniel Vetter (1):
drm/fb-helper: set x/yres_virtual in drm_fb_helper_check_var
David Matlack (1):
KVM: arm64: Retry fault if vma_lookup() results become invalid
Greg Kroah-Hartman (1):
Linux 5.15.110
Jisoo Jang (1):
wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies()
Kai-Heng Feng (1):
PCI/ASPM: Remove pcie_aspm_pm_state_change()
Matthieu Baerts (1):
selftests: mptcp: join: fix "invalid address, ADD_ADDR timeout"
Ruihan Li (1):
bluetooth: Perform careful capability checks in hci_sock_ioctl()
SeongJae Park (1):
selftests/kselftest/runner/run_one(): allow running non-executable files
Stephen Boyd (1):
driver core: Don't require dynamic_debug for initcall_debug probe timing
The patch titled
Subject: mm/mempolicy: correctly update prev when policy is equal on mbind
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mempolicy-correctly-update-prev-when-policy-is-equal-on-mbind.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lorenzo Stoakes <lstoakes(a)gmail.com>
Subject: mm/mempolicy: correctly update prev when policy is equal on mbind
Date: Sun, 30 Apr 2023 16:07:07 +0100
The refactoring in commit f4e9e0e69468 ("mm/mempolicy: fix use-after-free
of VMA iterator") introduces a subtle bug which arises when attempting to
apply a new NUMA policy across a range of VMAs in mbind_range().
The refactoring passes a **prev pointer to keep track of the previous VMA
in order to reduce duplication, and in all but one case it keeps this
correctly updated.
The bug arises when a VMA within the specified range has an equivalent
policy as determined by mpol_equal() - which unlike other cases, does not
update prev.
This can result in a situation where, later in the iteration, a VMA is
found whose policy does need to change. At this point, vma_merge() is
invoked with prev pointing to a VMA which is before the previous VMA.
Since vma_merge() discovers the curr VMA by looking for the one
immediately after prev, it will now be in a situation where this VMA is
incorrect and the merge will not proceed correctly.
This is checked in the VM_WARN_ON() invariant case with end >
curr->vm_end, which, if a merge is possible, results in a warning (if
CONFIG_DEBUG_VM is specified).
I note that vma_merge() performs these invariant checks only after
merge_prev/merge_next are checked, which is debatable as it hides this
issue if no merge is possible even though a buggy situation has arisen.
The solution is simply to update the prev pointer even when policies are
equal.
This caused a bug to arise in the 6.2.y stable tree, and this patch
resolves this bug.
Link: https://lkml.kernel.org/r/83f1d612acb519d777bebf7f3359317c4e7f4265.16828666…
Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
Signed-off-by: Lorenzo Stoakes <lstoakes(a)gmail.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/mm/mempolicy.c~mm-mempolicy-correctly-update-prev-when-policy-is-equal-on-mbind
+++ a/mm/mempolicy.c
@@ -808,8 +808,10 @@ static int mbind_range(struct vma_iterat
vmstart = vma->vm_start;
}
- if (mpol_equal(vma_policy(vma), new_pol))
+ if (mpol_equal(vma_policy(vma), new_pol)) {
+ *prev = vma;
return 0;
+ }
pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
_
Patches currently in -mm which might be from lstoakes(a)gmail.com are
mm-mempolicy-correctly-update-prev-when-policy-is-equal-on-mbind.patch
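The stale-prev condition described in this patch can be reproduced with a
toy model. The sketch below is not kernel code: VMAs are reduced to an
array of policy ids, prev is an array index, merging itself is not
modelled, and the function name and its update_prev_when_equal switch are
made up for the illustration. It only counts how often the merge step
would be handed a prev that is not the VMA immediately before the current
one.

```c
#include <stddef.h>

/*
 * Toy model of the mbind loop. Each entry of policy[] stands for one
 * VMA in the range; prev == -1 means "the VMA before the range".
 * Returns the number of times the merge step would see a stale prev.
 */
int model_mbind_stale_prev(int *policy, size_t n, int new_pol,
			   int update_prev_when_equal)
{
	long prev = -1;
	int stale = 0;

	for (size_t i = 0; i < n; i++) {
		if (policy[i] == new_pol) {
			/* The fix: remember this VMA even though nothing changes. */
			if (update_prev_when_equal)
				prev = (long)i;
			continue;
		}

		/* The merge step assumes curr is the VMA right after prev. */
		if (prev != (long)i - 1)
			stale++;

		policy[i] = new_pol;
		prev = (long)i;
	}
	return stale;
}
```

With policies {1, 2, 1} and new_pol == 2, the middle VMA already matches.
Without the update, the third VMA is processed with prev still pointing
at the first one, which is the situation the VM_WARN_ON() catches.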
The refactoring in commit f4e9e0e69468 ("mm/mempolicy: fix use-after-free
of VMA iterator") introduces a subtle bug which arises when attempting to
apply a new NUMA policy across a range of VMAs in mbind_range().
The refactoring passes a **prev pointer to keep track of the previous VMA
in order to reduce duplication, and in all but one case it keeps this
correctly updated.
The bug arises when a VMA within the specified range has an equivalent
policy as determined by mpol_equal() - which unlike other cases, does not
update prev.
This can result in a situation where, later in the iteration, a VMA is
found whose policy does need to change. At this point, vma_merge() is
invoked with prev pointing to a VMA which is before the previous VMA.
Since vma_merge() discovers the curr VMA by looking for the one immediately
after prev, it will now be in a situation where this VMA is incorrect and
the merge will not proceed correctly.
This is checked in the VM_WARN_ON() invariant case with end > curr->vm_end,
which, if a merge is possible, results in a warning (if CONFIG_DEBUG_VM is
specified).
I note that vma_merge() performs these invariant checks only after
merge_prev/merge_next are checked, which is debatable as it hides this
issue if no merge is possible even though a buggy situation has arisen.
The solution is simply to update the prev pointer even when policies are
equal.
This caused a bug to arise in the 6.2.y stable tree, and this patch
resolves this bug.
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com
Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
Signed-off-by: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
---
v2: updated to correctly cc the stable list :)
v1:
https://lore.kernel.org/all/db42467a692d78c654ec5c1953329401bd8a9c34.168285…
mm/mempolicy.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2068b594dc88..1756389a0609 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -808,8 +808,10 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
vmstart = vma->vm_start;
}
- if (mpol_equal(vma_policy(vma), new_pol))
+ if (mpol_equal(vma_policy(vma), new_pol)) {
+ *prev = vma;
return 0;
+ }
pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
--
2.40.1
From: Christian Brauner <brauner(a)kernel.org>
[ Upstream commit 43b450632676fb60e9faeddff285d9fac94a4f58 ]
After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: create regular file
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: create regular file
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
On kernels since v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.
The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.
Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.
The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].
There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.
So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possibility to make that flag combination do something more intuitive
in the future.
Starting with this commit the following semantics apply:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
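The three tables above can be folded into one small lookup, which makes
the differences between the eras easy to compare. This is a user-space
model of the documented outcomes only, not the VFS logic; the model_*
names, enums, and the outcome struct are invented for the sketch. Errors
are reported as positive errno values, and creates_file records the
side effect noted in parentheses for v5.7 and later.

```c
#include <errno.h>
#include <stdbool.h>

enum model_era { MODEL_PRE_V5_7, MODEL_SINCE_V5_7, MODEL_WITH_THIS_COMMIT };
enum model_target { MODEL_ABSENT, MODEL_REGULAR_FILE, MODEL_DIRECTORY };
enum model_combo {
	MODEL_DIR_CREAT,	/* (1) O_DIRECTORY | O_CREAT */
	MODEL_DIR_CREAT_EXCL,	/* (2) O_DIRECTORY | O_CREAT | O_EXCL */
	MODEL_DIR_EXCL,		/* (3) O_DIRECTORY | O_EXCL */
};

struct model_outcome {
	int err;		/* 0 on success, otherwise an errno value */
	bool creates_file;	/* true if a regular file ends up on disk */
};

struct model_outcome model_open(enum model_era era, enum model_target target,
				enum model_combo combo)
{
	struct model_outcome out = { 0, false };

	/* Case (3) is identical in all three tables. */
	if (combo == MODEL_DIR_EXCL) {
		if (target == MODEL_ABSENT)
			out.err = ENOENT;
		else if (target == MODEL_REGULAR_FILE)
			out.err = ENOTDIR;
		return out;
	}

	/* With this commit, cases (1) and (2) fail before anything is created. */
	if (era == MODEL_WITH_THIS_COMMIT) {
		out.err = EINVAL;
		return out;
	}

	switch (target) {
	case MODEL_ABSENT:
		/* Both older eras create the file; v5.7+ also reports ENOTDIR. */
		out.creates_file = true;
		out.err = (era == MODEL_SINCE_V5_7) ? ENOTDIR : 0;
		break;
	case MODEL_REGULAR_FILE:
		out.err = (combo == MODEL_DIR_CREAT_EXCL) ? EEXIST : ENOTDIR;
		break;
	case MODEL_DIRECTORY:
		out.err = (combo == MODEL_DIR_CREAT_EXCL) ? EEXIST : EISDIR;
		break;
	}
	return out;
}
```

For example, model_open(MODEL_SINCE_V5_7, MODEL_ABSENT, MODEL_DIR_CREAT)
yields ENOTDIR with creates_file set, which is the combination the commit
message calls especially weird.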
One additional note, O_TMPFILE is implemented as:
#define __O_TMPFILE 020000000
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is
raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't
specified. Since O_DIRECTORY | O_CREAT could be used to create a regular
file, allowing that combination together with __O_TMPFILE would've meant
that false positives were possible, i.e., that a regular file was created
instead of an O_TMPFILE. This could've been used to trick userspace into
thinking it operated on an O_TMPFILE when it wasn't.
Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT
in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
can be dropped. Instead we can simply verify that O_DIRECTORY is
raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.
As Aleksa pointed out, O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.
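The claim that the O_TMPFILE_MASK test becomes redundant can be checked
exhaustively over the three flag bits involved. The sketch below is a
user-space model, not fs/open.c: the MODEL_* constants mirror the
asm-generic values (only __O_TMPFILE's value appears in this message, the
other two are assumed), the function names are invented, and the separate
MAY_WRITE check is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Model constants; assumed to match the asm-generic fcntl.h values. */
#define MODEL_O_CREAT		00000100
#define MODEL_O_DIRECTORY	00200000
#define MODEL_TMPFILE_BIT	020000000
#define MODEL_O_TMPFILE		(MODEL_TMPFILE_BIT | MODEL_O_DIRECTORY)
#define MODEL_O_TMPFILE_MASK	(MODEL_TMPFILE_BIT | MODEL_O_DIRECTORY | MODEL_O_CREAT)

/* The blanket rejection this commit adds ahead of the O_TMPFILE branch. */
static bool model_is_dir_creat(int flags)
{
	int both = MODEL_O_DIRECTORY | MODEL_O_CREAT;

	return (flags & both) == both;
}

/* Early rejection followed by the old mask-based O_TMPFILE test. */
int model_check_with_mask(int flags)
{
	if (model_is_dir_creat(flags))
		return -EINVAL;
	if ((flags & MODEL_TMPFILE_BIT) &&
	    (flags & MODEL_O_TMPFILE_MASK) != MODEL_O_TMPFILE)
		return -EINVAL;
	return 0;
}

/* Early rejection followed by the simplified O_DIRECTORY test. */
int model_check_simplified(int flags)
{
	if (model_is_dir_creat(flags))
		return -EINVAL;
	if ((flags & MODEL_TMPFILE_BIT) && !(flags & MODEL_O_DIRECTORY))
		return -EINVAL;
	return 0;
}

/* Try all eight combinations of the three bits and compare results. */
bool model_checks_agree(void)
{
	for (int bits = 0; bits < 8; bits++) {
		int flags = 0;

		if (bits & 1)
			flags |= MODEL_O_CREAT;
		if (bits & 2)
			flags |= MODEL_O_DIRECTORY;
		if (bits & 4)
			flags |= MODEL_TMPFILE_BIT;
		if (model_check_with_mask(flags) != model_check_simplified(flags))
			return false;
	}
	return true;
}
```

Once O_DIRECTORY | O_CREAT is rejected up front, the only way the mask
comparison can still fail is a missing O_DIRECTORY, so both variants give
the same answer for every combination.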
Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-di… [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/… [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324… [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato(a)gmail.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/open.c | 18 +++++++++++++-----
include/uapi/asm-generic/fcntl.h | 1 -
tools/include/uapi/asm-generic/fcntl.h | 1 -
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 20717ec510c07..9541430ec5b30 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1158,13 +1158,21 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
}
/*
- * In order to ensure programs get explicit errors when trying to use
- * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
- * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
- * have to require userspace to explicitly set it.
+ * Block bugs where O_DIRECTORY | O_CREAT created regular files.
+ * Note, that blocking O_DIRECTORY | O_CREAT here also protects
+ * O_TMPFILE below which requires O_DIRECTORY being raised.
*/
+ if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
+ return -EINVAL;
+
+ /* Now handle the creative implementation of O_TMPFILE. */
if (flags & __O_TMPFILE) {
- if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
+ /*
+ * In order to ensure programs get explicit errors when trying
+ * to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
+ * is raised alongside __O_TMPFILE.
+ */
+ if (!(flags & O_DIRECTORY))
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8d..80f37a0d40d7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index b02c8e0f40575..1c7a0f6632c09 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
--
2.39.2
From: Christian Brauner <brauner(a)kernel.org>
[ Upstream commit 43b450632676fb60e9faeddff285d9fac94a4f58 ]
After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: create regular file
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: create regular file
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
On kernels since v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.
The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.
Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.
The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].
There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.
So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possibility to make that flag combination do something more intuitive
in the future.
Starting with this commit the following semantics apply:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
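As a reading aid, case (1) from the three tables above can be written down as a small userspace model. This is only a sketch: the names model_era, model_state, model_result and model_open_dir_creat() are hypothetical and exist neither in the kernel nor in libc, and the function merely encodes the outcomes listed above rather than calling open().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names for this sketch; none of them exist in the kernel. */
enum model_era { MODEL_PRE_5_7, MODEL_SINCE_5_7, MODEL_THIS_COMMIT };
enum model_state { MODEL_ABSENT, MODEL_REGULAR_FILE, MODEL_DIRECTORY };

struct model_result {
	int err;          /* 0 on success, otherwise a positive errno value */
	int created_file; /* 1 if a regular file was created as a side effect */
};

/* Models case (1): open("/tmp/d", O_DIRECTORY | O_CREAT). */
struct model_result model_open_dir_creat(enum model_era era,
					  enum model_state state)
{
	struct model_result r = { 0, 0 };

	assert(state == MODEL_ABSENT || state == MODEL_REGULAR_FILE ||
	       state == MODEL_DIRECTORY);

	/* With this commit the combination is rejected unconditionally. */
	if (era == MODEL_THIS_COMMIT) {
		r.err = EINVAL;
		return r;
	}

	switch (state) {
	case MODEL_ABSENT:
		/* Both older eras create a regular file ... */
		r.created_file = 1;
		/* ... but since v5.7 the call still reports ENOTDIR. */
		r.err = (era == MODEL_SINCE_5_7) ? ENOTDIR : 0;
		break;
	case MODEL_REGULAR_FILE:
		r.err = ENOTDIR;
		break;
	case MODEL_DIRECTORY:
		r.err = EISDIR;
		break;
	}
	return r;
}
```

Calling model_open_dir_creat(MODEL_SINCE_5_7, MODEL_ABSENT) yields ENOTDIR together with created_file set, which is the inconsistency described earlier in this message.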
One additional note, O_TMPFILE is implemented as:
#define __O_TMPFILE 020000000
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is
raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't
specified. Since O_DIRECTORY | O_CREAT could be used to create a regular
file, allowing that combination together with __O_TMPFILE would've meant
that false positives were possible, i.e., that a regular file was created
instead of an O_TMPFILE. This could've been used to trick userspace into
thinking it operated on an O_TMPFILE when it wasn't.
Now that we block O_DIRECTORY | O_CREAT completely, the check for O_CREAT
in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
can be dropped. Instead we can simply verify that O_DIRECTORY is
raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.
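The equivalence claimed here can be checked with a small userspace model of the two code paths. This is a sketch under assumptions: the MODEL_* constants mirror the asm-generic flag values (some architectures use different ones), and model_old_check(), model_new_check() and model_verify_tmpfile_equivalence() are hypothetical names that only cover the part of build_open_flags() touched by this patch.

```c
#include <assert.h>
#include <errno.h>

/* Values as in asm-generic/fcntl.h; assumed here, architectures may differ. */
#define MODEL_O_CREAT        0100u
#define MODEL_O_DIRECTORY    0200000u
#define MODEL_TMPFILE_BIT    020000000u /* stands in for __O_TMPFILE */
#define MODEL_O_TMPFILE      (MODEL_TMPFILE_BIT | MODEL_O_DIRECTORY)
#define MODEL_O_TMPFILE_MASK \
	(MODEL_TMPFILE_BIT | MODEL_O_DIRECTORY | MODEL_O_CREAT)

/* Old logic: only the __O_TMPFILE branch looked at O_CREAT. */
int model_old_check(unsigned int flags)
{
	if (flags & MODEL_TMPFILE_BIT) {
		if ((flags & MODEL_O_TMPFILE_MASK) != MODEL_O_TMPFILE)
			return -EINVAL;
	}
	return 0;
}

/* New logic: reject O_DIRECTORY | O_CREAT first, then require O_DIRECTORY. */
int model_new_check(unsigned int flags)
{
	const unsigned int dir_creat = MODEL_O_DIRECTORY | MODEL_O_CREAT;

	if ((flags & dir_creat) == dir_creat)
		return -EINVAL;
	if (flags & MODEL_TMPFILE_BIT) {
		if (!(flags & MODEL_O_DIRECTORY))
			return -EINVAL;
	}
	return 0;
}

/* Asserts that both checks agree whenever the __O_TMPFILE bit is set. */
void model_verify_tmpfile_equivalence(void)
{
	for (unsigned int i = 0; i < 4; i++) {
		unsigned int flags = MODEL_TMPFILE_BIT;

		if (i & 1)
			flags |= MODEL_O_DIRECTORY;
		if (i & 2)
			flags |= MODEL_O_CREAT;
		assert(model_old_check(flags) == model_new_check(flags));
	}
}
```

In this model the two versions differ only for O_DIRECTORY | O_CREAT without the __O_TMPFILE bit, which the old check let through and the new one rejects.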
As Aleksa pointed out O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.
Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-di… [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/… [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324… [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato(a)gmail.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/open.c | 18 +++++++++++++-----
include/uapi/asm-generic/fcntl.h | 1 -
tools/include/uapi/asm-generic/fcntl.h | 1 -
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index ceb88ac0ca3b2..f652833feffb5 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1158,13 +1158,21 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
}
/*
- * In order to ensure programs get explicit errors when trying to use
- * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
- * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
- * have to require userspace to explicitly set it.
+ * Block bugs where O_DIRECTORY | O_CREAT created regular files.
+ * Note, that blocking O_DIRECTORY | O_CREAT here also protects
+ * O_TMPFILE below which requires O_DIRECTORY being raised.
*/
+ if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
+ return -EINVAL;
+
+ /* Now handle the creative implementation of O_TMPFILE. */
if (flags & __O_TMPFILE) {
- if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
+ /*
+ * In order to ensure programs get explicit errors when trying
+ * to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
+ * is raised alongside __O_TMPFILE.
+ */
+ if (!(flags & O_DIRECTORY))
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8d..80f37a0d40d7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index b02c8e0f40575..1c7a0f6632c09 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
--
2.39.2
From: Christian Brauner <brauner(a)kernel.org>
[ Upstream commit 43b450632676fb60e9faeddff285d9fac94a4f58 ]
After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: create regular file
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: create regular file
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
On kernels since v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.
The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.
Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.
The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].
There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.
So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possibility to make that flag combination do something more intuitive
in the future.
Starting with this commit the following semantics apply:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
One additional note, O_TMPFILE is implemented as:
#define __O_TMPFILE 020000000
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is
raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't
specified. Since O_DIRECTORY | O_CREAT could be used to create a regular
file, allowing that combination together with __O_TMPFILE would've meant
that false positives were possible, i.e., that a regular file was created
instead of an O_TMPFILE. This could've been used to trick userspace into
thinking it operated on an O_TMPFILE when it wasn't.
Now that we block O_DIRECTORY | O_CREAT completely, the check for O_CREAT
in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
can be dropped. Instead we can simply verify that O_DIRECTORY is
raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.
As Aleksa pointed out O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.
Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-di… [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/… [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324… [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato(a)gmail.com>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/open.c | 18 +++++++++++++-----
include/uapi/asm-generic/fcntl.h | 1 -
tools/include/uapi/asm-generic/fcntl.h | 1 -
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032d..4478adcc4f3a0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1196,13 +1196,21 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
}
/*
- * In order to ensure programs get explicit errors when trying to use
- * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
- * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
- * have to require userspace to explicitly set it.
+ * Block bugs where O_DIRECTORY | O_CREAT created regular files.
+ * Note, that blocking O_DIRECTORY | O_CREAT here also protects
+ * O_TMPFILE below which requires O_DIRECTORY being raised.
*/
+ if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
+ return -EINVAL;
+
+ /* Now handle the creative implementation of O_TMPFILE. */
if (flags & __O_TMPFILE) {
- if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
+ /*
+ * In order to ensure programs get explicit errors when trying
+ * to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
+ * is raised alongside __O_TMPFILE.
+ */
+ if (!(flags & O_DIRECTORY))
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8d..80f37a0d40d7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index b02c8e0f40575..1c7a0f6632c09 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -91,7 +91,6 @@
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
--
2.39.2
Sir/Madam, I would like to know whether you are able to take on a
business investment project in your country, because I need a serious
partnership with a good background. Kindly reply to me to discuss
details immediately. I would appreciate it if you contacted me on the
email below.
Thanks and awaiting for your quick response,
Amos!!
The igc_configure_rx_ring() function is called as part of XDP program
setup. If Rx hardware timestamping is enabled prior to XDP program setup,
the timestamp enablement is overwritten when the buffer size is
written into the SRRCTL register.
Thus, this commit reads the register value before writing to the SRRCTL
register. This commit was tested using the xdp_hw_metadata bpf selftest
tool. The tool enables Rx hardware timestamping and then attaches an XDP
program to the igc driver. It displays the hardware timestamp of UDP
packets with port number 9092. Below are the details of the test steps
and results.
Command on DUT:
sudo ./xdp_hw_metadata <interface name>
Command on Link Partner:
echo -n skb | nc -u -q1 <destination IPv4 addr> 9092
Result before this patch:
skb hwtstamp is not found!
Result after this patch:
found skb hwtstamp = 1677762212.590696226
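The difference between overwriting SRRCTL and the read-modify-write used by this patch can be sketched in userspace as follows. The three masks, both shifts and the ADV_ONEBUF value are taken from the diff below. Everything else is an assumption made for illustration: the MODEL_ prefix, the header length of 256, the position of the timestamp enable bit (bit 30), and the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK() on 32-bit values. */
#define MODEL_GENMASK(h, l) (((~0u) << (l)) & ((~0u) >> (31 - (h))))

#define MODEL_BSIZEPKT_MASK       MODEL_GENMASK(6, 0)
#define MODEL_BSIZEPKT_SHIFT      10 /* shift right */
#define MODEL_BSIZEHDRSIZE_MASK   MODEL_GENMASK(13, 8)
#define MODEL_BSIZEHDRSIZE_SHIFT  2  /* shift left */
#define MODEL_DESCTYPE_MASK       MODEL_GENMASK(27, 25)
#define MODEL_DESCTYPE_ADV_ONEBUF 0x02000000u
#define MODEL_RX_HDR_LEN          256u       /* assumed header buffer length */
#define MODEL_TIMESTAMP_BIT       (1u << 30) /* assumed timestamp enable bit */

/* Builds the three fields that igc_configure_rx_ring() programs. */
static uint32_t model_size_fields(uint32_t buf_size)
{
	/* The packet buffer size is expressed in 1 KiB units. */
	assert(buf_size % 1024 == 0);

	return (MODEL_RX_HDR_LEN << MODEL_BSIZEHDRSIZE_SHIFT) |
	       (buf_size >> MODEL_BSIZEPKT_SHIFT) |
	       MODEL_DESCTYPE_ADV_ONEBUF;
}

/* Before the patch: the register value is rebuilt from scratch. */
uint32_t model_srrctl_overwrite(uint32_t buf_size)
{
	return model_size_fields(buf_size);
}

/* After the patch: only the three owned fields are cleared and rewritten. */
uint32_t model_srrctl_rmw(uint32_t current, uint32_t buf_size)
{
	current &= ~(MODEL_BSIZEPKT_MASK | MODEL_BSIZEHDRSIZE_MASK |
		     MODEL_DESCTYPE_MASK);
	return current | model_size_fields(buf_size);
}
```

Any bit outside the three masks, such as the assumed timestamp bit, survives model_srrctl_rmw() but is lost by model_srrctl_overwrite().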
Fixes: fc9df2a0b520 ("igc: Enable RX via AF_XDP zero-copy")
Cc: <stable(a)vger.kernel.org> # 5.14+
Signed-off-by: Song Yoong Siang <yoong.siang.song(a)intel.com>
---
drivers/net/ethernet/intel/igc/igc_base.h | 7 +++++--
drivers/net/ethernet/intel/igc/igc_main.c | 5 ++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/igc/igc_base.h b/drivers/net/ethernet/intel/igc/igc_base.h
index 7a992befca24..b95007d51d13 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.h
+++ b/drivers/net/ethernet/intel/igc/igc_base.h
@@ -87,8 +87,11 @@ union igc_adv_rx_desc {
#define IGC_RXDCTL_SWFLUSH 0x04000000 /* Receive Software Flush */
/* SRRCTL bit definitions */
-#define IGC_SRRCTL_BSIZEPKT_SHIFT 10 /* Shift _right_ */
-#define IGC_SRRCTL_BSIZEHDRSIZE_SHIFT 2 /* Shift _left_ */
+#define IGC_SRRCTL_BSIZEPKT_MASK GENMASK(6, 0)
+#define IGC_SRRCTL_BSIZEPKT_SHIFT 10 /* Shift _right_ */
+#define IGC_SRRCTL_BSIZEHDRSIZE_MASK GENMASK(13, 8)
+#define IGC_SRRCTL_BSIZEHDRSIZE_SHIFT 2 /* Shift _left_ */
+#define IGC_SRRCTL_DESCTYPE_MASK GENMASK(27, 25)
#define IGC_SRRCTL_DESCTYPE_ADV_ONEBUF 0x02000000
#endif /* _IGC_BASE_H */
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 25fc6c65209b..de7b21c2ccd6 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -641,7 +641,10 @@ static void igc_configure_rx_ring(struct igc_adapter *adapter,
else
buf_size = IGC_RXBUFFER_2048;
- srrctl = IGC_RX_HDR_LEN << IGC_SRRCTL_BSIZEHDRSIZE_SHIFT;
+ srrctl = rd32(IGC_SRRCTL(reg_idx));
+ srrctl &= ~(IGC_SRRCTL_BSIZEPKT_MASK | IGC_SRRCTL_BSIZEHDRSIZE_MASK |
+ IGC_SRRCTL_DESCTYPE_MASK);
+ srrctl |= IGC_RX_HDR_LEN << IGC_SRRCTL_BSIZEHDRSIZE_SHIFT;
srrctl |= buf_size >> IGC_SRRCTL_BSIZEPKT_SHIFT;
srrctl |= IGC_SRRCTL_DESCTYPE_ADV_ONEBUF;
--
2.34.1
Every time I retest your email, it tells me to check with my ISP or
Log onto incoming mail server (POP3): Your e-mail server rejected .
Kindly verify if your email is still valid for us to talk.
My dear,
I am Miss Joline William, from Canada, writing from Ivory Coast.
Please, I want you to help me invest a total of 8.5
million dollars that I inherited from my late father. I will offer you
35% immediately after receiving the funds in your account.
I will update you when I hear from you.
Your love forever,
Miss Joline William
We used to map the dtb differently between early_pg_dir and
swapper_pg_dir, which caused issues when we referenced addresses from
the early mapping with swapper_pg_dir (reserved_mem). Patch 1 moves the
dtb mapping to the fixmap region, which allows dtb handling to be
simplified in patch 2.
base-commit-tag: v6.2.11
Changes in v2:
- Add missing SoB
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Documentation/riscv/vm-layout.rst | 6 +--
arch/riscv/include/asm/fixmap.h | 8 +++
arch/riscv/include/asm/pgtable.h | 8 ++-
arch/riscv/kernel/setup.c | 6 +--
arch/riscv/mm/init.c | 82 ++++++++++++++-----------------
5 files changed, 54 insertions(+), 56 deletions(-)
--
2.37.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ]
The ref_scale_shutdown() kthread/function uses wait_event() to wait for
the refscale test to complete. However, although the read-side tests
are normally extremely fast, there is no law against specifying a very
large value for the refscale.loops module parameter or against having
a slow read-side primitive. Either way, this might well trigger the
hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.
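Why the idle variant avoids the timeout can be sketched with a simplified userspace model of the task state bits. wait_event() sleeps in TASK_UNINTERRUPTIBLE and wait_event_idle() sleeps in TASK_IDLE, which adds TASK_NOLOAD. The numeric values, the MODEL_ names and the reduced candidate test below are assumptions for illustration and do not reproduce the real hung-task detector.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; the kernel's definitions live in linux/sched.h. */
#define MODEL_TASK_UNINTERRUPTIBLE 0x0002u
#define MODEL_TASK_NOLOAD          0x0400u
#define MODEL_TASK_IDLE \
	(MODEL_TASK_UNINTERRUPTIBLE | MODEL_TASK_NOLOAD)

/* State a task sleeps in for wait_event() (false) or wait_event_idle() (true). */
unsigned int model_wait_state(bool idle)
{
	unsigned int state = idle ? MODEL_TASK_IDLE : MODEL_TASK_UNINTERRUPTIBLE;

	/* Both variants sleep uninterruptibly; only the NOLOAD bit differs. */
	assert(state & MODEL_TASK_UNINTERRUPTIBLE);
	return state;
}

/* Simplified: only uninterruptible sleepers without NOLOAD are examined. */
bool model_hung_task_candidate(unsigned int state)
{
	return (state & MODEL_TASK_UNINTERRUPTIBLE) &&
	       !(state & MODEL_TASK_NOLOAD);
}
```

In this model a kthread waiting via wait_event() is a candidate for the hung-task report however legitimate its wait is, while the same kthread waiting via wait_event_idle() is skipped.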
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/refscale.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 952595c678b37..4e419ca6d6114 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -625,7 +625,7 @@ ref_scale_cleanup(void)
static int
ref_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq, shutdown_start);
+ wait_event_idle(shutdown_wq, shutdown_start);
smp_mb(); // Wake before output.
ref_scale_cleanup();
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ]
The ref_scale_shutdown() kthread/function uses wait_event() to wait for
the refscale test to complete. However, although the read-side tests
are normally extremely fast, there is no law against specifying a very
large value for the refscale.loops module parameter or against having
a slow read-side primitive. Either way, this might well trigger the
hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/refscale.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 66dc14cf5687e..5abb0cf52803a 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -777,7 +777,7 @@ ref_scale_cleanup(void)
static int
ref_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq, shutdown_start);
+ wait_event_idle(shutdown_wq, shutdown_start);
smp_mb(); // Wake before output.
ref_scale_cleanup();
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ]
The ref_scale_shutdown() kthread/function uses wait_event() to wait for
the refscale test to complete. However, although the read-side tests
are normally extremely fast, there is no law against specifying a very
large value for the refscale.loops module parameter or against having
a slow read-side primitive. Either way, this might well trigger the
hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/refscale.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 435c884c02b5c..d49a9d66e0000 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -795,7 +795,7 @@ ref_scale_cleanup(void)
static int
ref_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq, shutdown_start);
+ wait_event_idle(shutdown_wq, shutdown_start);
smp_mb(); // Wake before output.
ref_scale_cleanup();
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ]
The ref_scale_shutdown() kthread/function uses wait_event() to wait for
the refscale test to complete. However, although the read-side tests
are normally extremely fast, there is no law against specifying a very
large value for the refscale.loops module parameter or against having
a slow read-side primitive. Either way, this might well trigger the
hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to
wait_event_idle(), which do not trigger the hung-task timeout.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Boqun Feng <boqun.feng(a)gmail.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/refscale.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index afa3e1a2f6902..1970ce5f22d40 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -1031,7 +1031,7 @@ ref_scale_cleanup(void)
static int
ref_scale_shutdown(void *arg)
{
- wait_event(shutdown_wq, shutdown_start);
+ wait_event_idle(shutdown_wq, shutdown_start);
smp_mb(); // Wake before output.
ref_scale_cleanup();
--
2.39.2
[BUG]
Syzbot reported that an ASSERT() got triggered during a scrub repair
running along with balance:
BTRFS info (device loop5): balance: start -d -m
BTRFS info (device loop5): relocating block group 6881280 flags data|metadata
BTRFS info (device loop5): found 3 extents, stage: move data extents
BTRFS info (device loop5): scrub: started on devid 1
BTRFS info (device loop5): relocating block group 5242880 flags data|metadata
BTRFS info (device loop5): found 6 extents, stage: move data extents
BTRFS info (device loop5): found 1 extents, stage: update data pointers
BTRFS warning (device loop5): tree block 5500928 mirror 1 has bad bytenr, has 0 want 5500928
BTRFS info (device loop5): balance: ended with status: 0
BTRFS warning (device loop5): tree block 5435392 mirror 1 has bad bytenr, has 0 want 5435392
BTRFS warning (device loop5): tree block 5423104 mirror 1 has bad bytenr, has 0 want 5423104
assertion failed: 0, in fs/btrfs/scrub.c:614
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.c:259!
invalid opcode: 0000 [#2] PREEMPT SMP KASAN
Call Trace:
<TASK>
lock_full_stripe fs/btrfs/scrub.c:614 [inline]
scrub_handle_errored_block+0x1ee1/0x4730 fs/btrfs/scrub.c:1067
scrub_bio_end_io_worker+0x9bb/0x1370 fs/btrfs/scrub.c:2559
process_one_work+0x8a0/0x10e0 kernel/workqueue.c:2390
worker_thread+0xa63/0x1210 kernel/workqueue.c:2537
kthread+0x270/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
[CAUSE]
Btrfs can delete empty block groups either through auto-cleanup or
relocation.
Scrub normally is able to handle this situation well by doing extra
checking, and holding the block group cache pointer during the whole
scrub lifespan.
But unfortunately for lock_full_stripe() and unlock_full_stripe()
functions, due to the context restriction, they have to do an extra
search on the block group cache.
(The main scrub thread holds a proper btrfs_block_group, but we
have no way to directly use that in repair context.)
Thus it can happen that the target block group is already deleted by
relocation.
In that case, we trigger the above ASSERT().
[FIX]
Instead of triggering the ASSERT(), let's just return 0 and continue.
This leaves @locked_ret false, so we won't try to unlock
later.
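The control flow of the fix can be sketched with a userspace model. This is not the btrfs code: struct model_block_group, model_lookup(), model_lock_full_stripe() and MODEL_RAID56_FLAG are hypothetical stand-ins, and the actual locking is reduced to setting *locked_ret.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BTRFS_BLOCK_GROUP_RAID56_MASK. */
#define MODEL_RAID56_FLAG 0x1u

struct model_block_group {
	uint64_t start;
	uint64_t length;
	unsigned int flags;
};

/* Returns the group covering bytenr, or NULL if it no longer exists. */
static const struct model_block_group *
model_lookup(const struct model_block_group *groups, size_t count,
	     uint64_t bytenr)
{
	for (size_t i = 0; i < count; i++) {
		if (bytenr >= groups[i].start &&
		    bytenr - groups[i].start < groups[i].length)
			return &groups[i];
	}
	return NULL;
}

int model_lock_full_stripe(const struct model_block_group *groups,
			   size_t count, uint64_t bytenr, bool *locked_ret)
{
	const struct model_block_group *bg;

	assert(locked_ret != NULL);
	*locked_ret = false;

	bg = model_lookup(groups, count, bytenr);
	/* The block group is removed, no need to do any lock. */
	if (!bg)
		return 0;

	/* Profiles not based on parity don't need the full stripe lock. */
	if (!(bg->flags & MODEL_RAID56_FLAG))
		return 0;

	*locked_ret = true; /* the real code takes the stripe lock here */
	return 0;
}
```

Since *locked_ret stays false when the lookup fails, a caller that unlocks only when it is true skips the unlock for a deleted block group.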
CC: stable(a)vger.kernel.org
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
There would be no upstream commit, as upstream has completely rewritten
the scrub code in v6.4 merge window, and gets rid of the
lock_full_stripe()/unlock_full_stripe() functions.
I hope we don't have more scrub fixes which would only apply to older
kernels.
---
fs/btrfs/scrub.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 69c93ae333f6..43d0613c0dd3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -610,10 +610,9 @@ static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr,
*locked_ret = false;
bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
- if (!bg_cache) {
- ASSERT(0);
- return -ENOENT;
- }
+ /* The block group is removed, no need to do any lock. */
+ if (!bg_cache)
+ return 0;
/* Profiles not based on parity don't need full stripe lock */
if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
--
2.39.2
Since recent commit a5b2781dcab2c77979a4b8adda781d2543580901, the
backlight on my Thinkpad W530 is dimmer than it was before that
commit.
Downgrading to an older kernel version fixes the issue.
I also realise that this recent commit replaces the intel_backlight
folder in /sys/class/backlight with acpi_video0.
It is also worth mentioning that I'm using Nouveau on my W530. Below
are the screenshots highlighting the issue.
Kernel 6.1.22 with 10% brightness - https://i.imgur.com/7znm7xg.jpg
Kernel 6.1.22 with 35% brightness and grub parameter
acpi_backlight=video manually set - https://i.imgur.com/nD9O7pD.jpg
The "Fixes" commit mentioned below adds new MIBs counters to track some
particular cases that have been fixed by its parent commit 150d1e06c4f1
("mptcp: fix race in incoming ADD_ADDR option processing").
Unfortunately, one of the new MIB counters (AddAddrDrop) shares the same
prefix as an older one (AddAddr). This breaks one selftest because it
was doing a grep on "AddAddr" and it now gets 2 counters instead of 1.
This issue has been fixed upstream in a commit that was part of the same
set but not backported to v5.15, see commit 6ef84b1517e0 ("selftests:
mptcp: more robust signal race test"). It has not been backported
because it was fixing multiple things, some of which were for >v5.15.
This patch then simply extracts the only bit needed for v5.15. Now the
test passes when validating the last stable v5.15 kernel.
Fixes: f25ae162f4b3 ("mptcp: add mibs counter for ignored incoming options")
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Hi Greg, Sasha,
Here is a fix just for v5.15, where f73c11946345 ("mptcp: add mibs
counter for ignored incoming options") has been backported but not
6ef84b1517e0 ("selftests: mptcp: more robust signal race test").
Thanks!
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 3be615ab1588..96a090e7f47e 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -732,7 +732,7 @@ chk_add_nr()
local dump_stats
printf "%-39s %s" " " "add"
- count=`ip netns exec $ns2 nstat -as | grep MPTcpExtAddAddr | awk '{print $2}'`
+ count=`ip netns exec $ns2 nstat -as MPTcpExtAddAddr | grep MPTcpExtAddAddr | awk '{print $2}'`
[ -z "$count" ] && count=0
if [ "$count" != "$add_nr" ]; then
echo "[fail] got $count ADD_ADDR[s] expected $add_nr"
---
base-commit: f48aeeaaa64c628519273f6007a745cf55b68d95
change-id: 20230428-upstream-stable-20230428-mptcp-addaddrdropmib-b078a0442078
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
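The prefix collision described in the selftest fix above can be reproduced with a small C sketch. It does not model how nstat or grep work internally; the two helper functions are hypothetical and only contrast an unanchored substring match with an exact name match on counter names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts names containing `pattern` anywhere, like an unanchored grep. */
size_t count_substring_matches(const char *const names[], size_t n,
			       const char *pattern)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (strstr(names[i], pattern))
			hits++;
	return hits;
}

/* Counts names that are exactly equal to `name`. */
size_t count_exact_matches(const char *const names[], size_t n,
			   const char *name)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (strcmp(names[i], name) == 0)
			hits++;
	return hits;
}
```

With the names `MPTcpExtAddAddr` and `MPTcpExtAddAddrDrop` in the list, the substring helper reports two hits for `MPTcpExtAddAddr` while the exact helper reports one, which is the situation that broke the test.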
We used to map the dtb differently between early_pg_dir and
swapper_pg_dir which caused issues when we referenced addresses from
the early mapping with swapper_pg_dir (reserved_mem): move the dtb mapping
to the fixmap region in patch 1, which allows simplifying dtb handling in
patch 2.
base-commit-tag: v6.1.24
Changes in v2:
- Add missing SoB
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Documentation/riscv/vm-layout.rst | 4 +-
arch/riscv/include/asm/fixmap.h | 8 +++
arch/riscv/include/asm/pgtable.h | 8 ++-
arch/riscv/kernel/setup.c | 6 +--
arch/riscv/mm/init.c | 82 ++++++++++++++-----------------
5 files changed, 53 insertions(+), 55 deletions(-)
--
2.37.2
We used to map the dtb differently between early_pg_dir and
swapper_pg_dir which caused issues when we referenced addresses from
the early mapping with swapper_pg_dir (reserved_mem): move the dtb mapping
to the fixmap region in patch 1, which allows simplifying dtb handling in
patch 2.
base-commit-tag: v5.15.108
Changes in v3:
- Add missing SoB
Changes in v2:
- Fix upstream commit line position
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Documentation/riscv/vm-layout.rst | 2 +-
arch/riscv/include/asm/fixmap.h | 8 ++++
arch/riscv/include/asm/pgtable.h | 8 +++-
arch/riscv/kernel/setup.c | 6 +--
arch/riscv/mm/init.c | 68 ++++++++++++++++---------------
5 files changed, 52 insertions(+), 40 deletions(-)
--
2.37.2
On 03.04.23 08:14, Purohit, Kaushal wrote:
> Hi,
>
> Referring to patch with commit ID (*e10dcb1b6ba714243ad5a35a11b91cc14103a9a9*).
>
> This is a spec violation for CDC NCM class driver. Driver clearly says the significance of network capabilities. (snapshot below)
>
> However, with the mentioned patch these values are disrespected and commands specific to these capabilities are sent from the host regardless of device' capabilities to handle them.
>
> Currently we are setting these bits to 0 indicating no capabilities on our device and still we observe that Host (Linux kernel host cdc driver) has been sending requests specific to these capabilities.
Hi,
please test the patch I've attached to kernel.org's bugzilla.
Regards
Oliver
After upgrading build guests to v6.3, rpm started segfaulting for
specific packages, which was bisected to commit 0503ea8f5ba7 ("mm/mmap:
remove __vma_adjust()"). rpm is doing many mremap() operations with file
mappings of its db. The problem is that in vma_merge() case 3 (we merge
with the next vma, expanding it downwards) vm_pgoff is not adjusted as
it should when vm_start changes. As a result the rpm process most likely
sees data from the wrong offset of the file. Fix the vm_pgoff
calculation.
For case 8 this is a non-functional change as the resulting vm_pgoff is
the same.
Reported-and-bisected-by: Jiri Slaby <jirislaby(a)kernel.org>
Reported-and-tested-by: Fabian Vogt <fvogt(a)suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1210903
Fixes: 0503ea8f5ba7 ("mm/mmap: remove __vma_adjust()")
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
---
Hi, I'm sending this patch on top of v6.3 as I think it should be
applied and backported to 6.3-stable sooner rather than later.
This means there would be a small conflict when merging mm/mm-stable
later. Alternatively it could be added to mm/mm-stable and upcoming 6.4
pull request, but then the stable backport would need adjustment.
It's up to Linus and Andrew.
#regzbot introduced: 0503ea8f5ba7 https://bugzilla.suse.com/show_bug.cgi?id=1210903
mm/mmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index d5475fbf5729..eefa6f0cda28 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -978,7 +978,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
vma = next; /* case 3 */
vma_start = addr;
vma_end = next->vm_end;
- vma_pgoff = mid->vm_pgoff;
+ vma_pgoff = next->vm_pgoff - pglen;
err = 0;
if (mid != next) { /* case 8 */
remove = mid;
--
2.40.0
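The vm_pgoff adjustment in the vma_merge() fix above can be checked with a userspace model. This is a sketch under assumptions: `struct vma_model`, `expand_next_downwards()` and `file_offset_of()` are hypothetical names, pages are assumed to be 4 KiB, and `addr` is assumed to be page aligned and below `next->vm_start`. It shows that when a file mapping is expanded downwards, subtracting the added page count from vm_pgoff keeps every existing address backed by the same file offset.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct vma_model {
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Expand `next` downwards so that it starts at addr (as in case 3). */
void expand_next_downwards(struct vma_model *next, uint64_t addr)
{
	uint64_t pglen = (next->vm_start - addr) >> SKETCH_PAGE_SHIFT;

	next->vm_pgoff -= pglen;	/* mirrors next->vm_pgoff - pglen */
	next->vm_start = addr;
}

/* File offset in bytes backing a virtual address inside the mapping. */
uint64_t file_offset_of(const struct vma_model *vma, uint64_t vaddr)
{
	return (vma->vm_pgoff << SKETCH_PAGE_SHIFT) + (vaddr - vma->vm_start);
}
```

If vm_pgoff were left unchanged while vm_start moved down, `file_offset_of()` would return a larger offset for the same address, which matches the "data from the wrong offset of the file" symptom described in the commit message.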
On Tue, Apr 25, 2023 at 04:54:12PM +0200, David Sterba wrote:
> On Sun, Apr 23, 2023 at 09:27:30AM +0700, Ammar Faizi wrote:
> > On 2/21/23 4:02 AM, Linus Torvalds wrote:
> > > On Mon, Feb 20, 2023 at 11:26 AM David Sterba <dsterba(a)suse.com> wrote:
> > >> Other:
> > >>
> > >> - locally enable -Wmaybe-uninitialized after fixing all warnings
> > >
> > > I've pulled this, but I strongly suspect this change will get reverted.
> > >
> > > I bet neither you nor linux-next is testing even _remotely_ a big
> > > chunk of the different compiler versions that are out there, and the
> > > reason flags like '-Wmaybe-uninitialized' get undone is because some
> > > random compiler version on some random config and target architecture
> > > gives completely nonsensical warnings for odd reasons.
> > >
> > > But hey, maybe the btrfs code is special.
> >
> > Maybe it's too late for 6.3. So please fix this in 6.4 and backport it to
> > 6.3 stable.
>
> Fix for this warning is in 6.4 pull request, there's no CC:stable tag
> but we can ask to add it once the code lands in master.
It landed in master.
[ Adding stable team to the Cc list ]
Hi Greg and stable team, could you please backport:
commit 8ba7d5f5ba931be68a94b8c91bcced1622934e7a upstream
("btrfs: fix uninitialized variable warnings")
to v6.3 to fix the gcc (10, 9, 7) build error?
The fs/btrfs/volumes.c hunk won't apply cleanly, but it can auto-merge:
$ git cherry-pick 8ba7d5f5ba931be68a94b8c91bcced1622934e7a
Auto-merging fs/btrfs/volumes.c
[detached HEAD 572410288a1070c1] btrfs: fix uninitialized variable warnings
Author: Genjian Zhang <zhanggenjian(a)kylinos.cn>
Date: Fri Mar 24 10:08:38 2023 +0800
2 files changed, 2 insertions(+), 2 deletions(-)
Thanks,
--
Ammar Faizi
[ Upstream commit 59f5ede3bc0f00eb856425f636dab0c10feb06d8 ]
The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().
But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:
!in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()
The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interrupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.
Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.
It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.
This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.
Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Cc: stable(a)vger.kernel.org
Signed-off-by: Can Sun <cansun(a)arista.com>
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 571220ac8bea..835b948095cd 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -25,17 +25,7 @@
*/
union fpregs_state init_fpstate __read_mostly;
-/*
- * Track whether the kernel is using the FPU state
- * currently.
- *
- * This flag is used:
- *
- * - by IRQ context code to potentially use the FPU
- * if it's unused.
- *
- * - to debug kernel_fpu_begin()/end() correctness
- */
+/* Track in-kernel FPU usage */
static DEFINE_PER_CPU(bool, in_kernel_fpu);
/*
@@ -43,42 +33,37 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu);
*/
DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
-static bool kernel_fpu_disabled(void)
-{
- return this_cpu_read(in_kernel_fpu);
-}
-
-static bool interrupted_kernel_fpu_idle(void)
-{
- return !kernel_fpu_disabled();
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static bool interrupted_user_mode(void)
-{
- struct pt_regs *regs = get_irq_regs();
- return regs && user_mode(regs);
-}
-
/*
* Can we use the FPU in kernel mode with the
* whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
*/
bool irq_fpu_usable(void)
{
- return !in_interrupt() ||
- interrupted_user_mode() ||
- interrupted_kernel_fpu_idle();
+ if (WARN_ON_ONCE(in_nmi()))
+ return false;
+
+ /* In kernel FPU usage already active? */
+ if (this_cpu_read(in_kernel_fpu))
+ return false;
+
+ /*
+ * When not in NMI or hard interrupt context, FPU can be used in:
+ *
+ * - Task context except from within fpregs_lock()'ed critical
+ * regions.
+ *
+ * - Soft interrupt processing context which cannot happen
+ * while in a fpregs_lock()'ed critical region.
+ */
+ if (!in_irq())
+ return true;
+
+ /*
+ * In hard interrupt context it's safe when soft interrupts
+ * are enabled, which means the interrupt did not hit in
+ * a fpregs_lock()'ed critical region.
+ */
+ return !softirq_count();
}
EXPORT_SYMBOL(irq_fpu_usable);
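The difference between the old and new irq_fpu_usable() conditions in the patch above can be seen in a truth-table style model. This is a userspace sketch, not kernel code: `struct fpu_ctx` and its fields are hypothetical flags standing in for in_nmi(), in_irq(), a non-zero softirq_count(), interrupted_user_mode() and the in_kernel_fpu per-CPU variable, and the WARN_ON_ONCE() is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the context in which the check runs. */
struct fpu_ctx {
	bool in_nmi;			/* models in_nmi() */
	bool in_hardirq;		/* models in_irq() */
	bool bh_disabled_or_softirq;	/* models softirq_count() != 0 */
	bool interrupted_user_mode;	/* models interrupted_user_mode() */
	bool in_kernel_fpu;		/* models the in_kernel_fpu flag */
};

/* The condition removed by the patch. */
bool irq_fpu_usable_old(const struct fpu_ctx *c)
{
	bool in_interrupt = c->in_nmi || c->in_hardirq ||
			    c->bh_disabled_or_softirq;

	return !in_interrupt || c->interrupted_user_mode || !c->in_kernel_fpu;
}

/* The condition added by the patch. */
bool irq_fpu_usable_new(const struct fpu_ctx *c)
{
	if (c->in_nmi)
		return false;
	if (c->in_kernel_fpu)
		return false;	/* in-kernel FPU usage already active */
	if (!c->in_hardirq)
		return true;	/* task or softirq context */
	/* Hard interrupt: safe only if it did not hit a BH-disabled region. */
	return !c->bh_disabled_or_softirq;
}
```

For a hard interrupt that lands inside a bottom-half-disabled region with no in-kernel FPU user, the old model returns true and the new one returns false, which is the corruption window the commit message describes.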
We used to map the dtb differently between early_pg_dir and
swapper_pg_dir which caused issues when we referenced addresses from
the early mapping with swapper_pg_dir (reserved_mem): move the dtb mapping
to the fixmap region in patch 1, which allows simplifying dtb handling in
patch 2.
base-commit-tag: v6.2.11
Alexandre Ghiti (3):
riscv: Move early dtb mapping into the fixmap region
riscv: Do not set initial_boot_params to the linear address of the dtb
riscv: No need to relocate the dtb as it lies in the fixmap region
Documentation/riscv/vm-layout.rst | 6 +--
arch/riscv/include/asm/fixmap.h | 8 +++
arch/riscv/include/asm/pgtable.h | 8 ++-
arch/riscv/kernel/setup.c | 6 +--
arch/riscv/mm/init.c | 82 ++++++++++++++-----------------
5 files changed, 54 insertions(+), 56 deletions(-)
--
2.37.2
Most exciting stuff this time around has to do with performance.
The following changes since commit 6a8f57ae2eb07ab39a6f0ccad60c760743051026:
Linux 6.3-rc7 (2023-04-16 15:23:53 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to c82729e06644f4e087f5ff0f91b8fb15e03b8890:
vhost_vdpa: fix unmap process in no-batch mode (2023-04-21 03:02:36 -0400)
----------------------------------------------------------------
virtio,vhost,vdpa: features, fixes, cleanups
reduction in interrupt rate in virtio
perf improvement for VDUSE
scalability for vhost-scsi
non power of 2 ring support for packed rings
better management for mlx5 vdpa
suspend for snet
VIRTIO_F_NOTIFICATION_DATA
shared backend with vdpa-sim-blk
user VA support in vdpa-sim
better struct packing for virtio
fixes, cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Albert Huang (1):
virtio_ring: don't update event idx on get_buf
Alvaro Karsz (5):
vdpa/snet: support getting and setting VQ state
vdpa/snet: support the suspend vDPA callback
virtio-vdpa: add VIRTIO_F_NOTIFICATION_DATA feature support
vdpa/snet: implement kick_vq_with_data callback
vdpa/snet: use likely/unlikely macros in hot functions
Christophe JAILLET (1):
virtio: Reorder fields in 'struct virtqueue'
Cindy Lu (1):
vhost_vdpa: fix unmap process in no-batch mode
Eli Cohen (3):
vdpa/mlx5: Avoid losing link state updates
vdpa/mlx5: Make VIRTIO_NET_F_MRG_RXBUF off by default
vdpa/mlx5: Extend driver support for new features
Feng Liu (3):
virtio_ring: Avoid using inline for small functions
virtio_ring: Use const to annotate read-only pointer params
virtio_ring: Allow non power of 2 sizes for packed virtqueue
Jacob Keller (1):
vhost: use struct_size and size_add to compute flex array sizes
Mike Christie (5):
vhost-scsi: Delay releasing our refcount on the tpg
vhost-scsi: Drop device mutex use in vhost_scsi_do_plug
vhost-scsi: Check for a cleared backend before queueing an event
vhost-scsi: Drop vhost_scsi_mutex use in port callouts
vhost-scsi: Reduce vhost_scsi_mutex use
Rong Tao (2):
tools/virtio: virtio_test: Fix indentation
tools/virtio: virtio_test -h,--help should return directly
Shunsuke Mie (2):
virtio_ring: add a struct device forward declaration
tools/virtio: fix build caused by virtio_ring changes
Simon Horman (3):
vdpa: address kdoc warnings
vringh: address kdoc warnings
MAINTAINERS: add vringh.h to Virtio Core and Net Drivers
Stefano Garzarella (12):
vringh: fix typos in the vringh_init_* documentation
vdpa: add bind_mm/unbind_mm callbacks
vhost-vdpa: use bind_mm/unbind_mm device callbacks
vringh: replace kmap_atomic() with kmap_local_page()
vringh: define the stride used for translation
vringh: support VA with iotlb
vdpa_sim: make devices agnostic for work management
vdpa_sim: use kthread worker
vdpa_sim: replace the spinlock with a mutex to protect the state
vdpa_sim: add support for user VA
vdpa_sim: move buffer allocation in the devices
vdpa_sim_blk: support shared backend
Viktor Prutyanov (1):
virtio: add VIRTIO_F_NOTIFICATION_DATA feature support
Xie Yongji (11):
lib/group_cpus: Export group_cpus_evenly()
vdpa: Add set/get_vq_affinity callbacks in vdpa_config_ops
virtio-vdpa: Support interrupt affinity spreading mechanism
vduse: Refactor allocation for vduse virtqueues
vduse: Support set_vq_affinity callback
vduse: Support get_vq_affinity callback
vduse: Add sysfs interface for irq callback affinity
vdpa: Add eventfd for the vdpa callback
vduse: Signal vq trigger eventfd directly if possible
vduse: Delay iova domain creation
vduse: Support specifying bounce buffer size via sysfs
Xuan Zhuo (1):
MAINTAINERS: make me a reviewer of VIRTIO CORE AND NET DRIVERS
MAINTAINERS | 2 +
drivers/s390/virtio/virtio_ccw.c | 22 +-
drivers/vdpa/mlx5/net/mlx5_vnet.c | 261 +++++++++++++---------
drivers/vdpa/solidrun/Makefile | 1 +
drivers/vdpa/solidrun/snet_ctrl.c | 330 ++++++++++++++++++++++++++++
drivers/vdpa/solidrun/snet_hwmon.c | 2 +-
drivers/vdpa/solidrun/snet_main.c | 146 ++++++------
drivers/vdpa/solidrun/snet_vdpa.h | 20 +-
drivers/vdpa/vdpa_sim/vdpa_sim.c | 166 +++++++++++---
drivers/vdpa/vdpa_sim/vdpa_sim.h | 14 +-
drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 93 ++++++--
drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 38 ++--
drivers/vdpa/vdpa_user/vduse_dev.c | 414 +++++++++++++++++++++++++++++------
drivers/vhost/scsi.c | 102 +++++----
drivers/vhost/vdpa.c | 44 +++-
drivers/vhost/vhost.c | 6 +-
drivers/vhost/vringh.c | 191 ++++++++++++----
drivers/virtio/virtio_mmio.c | 18 +-
drivers/virtio/virtio_pci_modern.c | 22 +-
drivers/virtio/virtio_ring.c | 89 +++++---
drivers/virtio/virtio_vdpa.c | 120 +++++++++-
include/linux/vdpa.h | 52 ++++-
include/linux/virtio.h | 16 +-
include/linux/virtio_ring.h | 3 +
include/linux/vringh.h | 26 ++-
include/uapi/linux/virtio_config.h | 6 +
lib/group_cpus.c | 1 +
tools/include/linux/types.h | 5 +
tools/virtio/linux/compiler.h | 2 +
tools/virtio/linux/kernel.h | 5 +-
tools/virtio/linux/uaccess.h | 11 +-
tools/virtio/virtio_test.c | 12 +-
32 files changed, 1760 insertions(+), 480 deletions(-)
create mode 100644 drivers/vdpa/solidrun/snet_ctrl.c
[BUG]
With the block-group-tree feature enabled, mounting the filesystem with
clear_cache causes the following transaction abort at mount or remount:
BTRFS info (device dm-4): force clearing of disk cache
BTRFS info (device dm-4): using free space tree
BTRFS info (device dm-4): auto enabling async discard
BTRFS info (device dm-4): clearing free space tree
BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes
BTRFS error (device dm-4): super block corruption detected before writing it to disk
BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction.
[CAUSE]
For block-group-tree feature, we have an artificial dependency on
free-space-tree.
This means that if we detect block-group-tree without v2 cache, we
consider it a corruption, which causes the problem.
The clear_cache mount option temporarily disables v2 cache, then
re-enables it.
But unfortunately, while v2 cache is temporarily disabled, we refuse
to write a superblock with only the bg tree flag set, which leads to the
above transaction abort.
[FIX]
For now, just reject clear_cache and v1 cache mount option for block
group tree.
So now we get a graceful rejection rather than a transaction abort:
BTRFS info (device dm-4): force clearing of disk cache
BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature
BTRFS error (device dm-4): open_ctree failed
Cc: stable(a)vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
For a proper fix, we need to change the behavior of clear_cache and the
v1 cache switch.
For a pure clear_cache without switching the cache version, we should
allow rebuilding v2 cache without fully disabling it.
---
fs/btrfs/super.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 581845bc206a..eefae0318d4f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -826,7 +826,12 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
!btrfs_test_opt(info, CLEAR_CACHE)) {
btrfs_err(info, "cannot disable free space tree");
ret = -EINVAL;
-
+ }
+ if (btrfs_fs_compat_ro(info, BLOCK_GROUP_TREE) &&
+ (btrfs_test_opt(info, CLEAR_CACHE) ||
+ !btrfs_test_opt(info, FREE_SPACE_TREE))) {
+ btrfs_err(info, "cannot disable free space tree with block-group-tree feature");
+ ret = -EINVAL;
}
if (!ret)
ret = btrfs_check_mountopts_zoned(info);
--
2.39.2
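The new rejection rule in the btrfs_parse_options() hunk above reduces to a single boolean condition. The following userspace sketch models it; `struct mount_model` and `check_bg_tree_cache_options()` are hypothetical names, with the three flags standing in for the BLOCK_GROUP_TREE compat-ro bit and the CLEAR_CACHE and FREE_SPACE_TREE mount options.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced view of the state the check looks at. */
struct mount_model {
	bool has_block_group_tree;	/* compat-ro BLOCK_GROUP_TREE set */
	bool opt_clear_cache;		/* clear_cache mount option */
	bool opt_free_space_tree;	/* v2 cache (free space tree) in use */
};

/* Returns 0 if the options are acceptable, -EINVAL otherwise. */
int check_bg_tree_cache_options(const struct mount_model *m)
{
	if (m->has_block_group_tree &&
	    (m->opt_clear_cache || !m->opt_free_space_tree))
		return -EINVAL;	/* would disable v2 cache under bg tree */
	return 0;
}
```

In this model, a filesystem without the block group tree still accepts clear_cache, while one with it rejects both clear_cache and any configuration without the free space tree.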
Before sending a TPM command, CLKRUN protocol must be disabled. This is not
done in the case of tpm1_do_selftest() call site inside tpm_tis_resume().
Address this by decorating the calls with tpm_chip_{start,stop}, which arm
and disarm the TPM chip for transmission, and take care of disabling and
re-enabling CLKRUN, among other things.
Cc: stable(a)vger.kernel.org
Reported-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Link: https://lore.kernel.org/linux-integrity/CS68AWILHXS4.3M36M1EKZLUMS@suppilov…
Fixes: a3fbfae82b4c ("tpm: take TPM chip power gating out of tpm_transmit()")
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
drivers/char/tpm/tpm_tis_core.c | 43 +++++++++++++++------------------
1 file changed, 19 insertions(+), 24 deletions(-)
diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index c2421162cf34..73707026e358 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -1209,25 +1209,20 @@ static void tpm_tis_reenable_interrupts(struct tpm_chip *chip)
u32 intmask;
int rc;
- if (chip->ops->clk_enable != NULL)
- chip->ops->clk_enable(chip, true);
-
- /* reenable interrupts that device may have lost or
- * BIOS/firmware may have disabled
+ /*
+ * Re-enable interrupts that device may have lost or BIOS/firmware may
+ * have disabled.
*/
rc = tpm_tis_write8(priv, TPM_INT_VECTOR(priv->locality), priv->irq);
- if (rc < 0)
- goto out;
+ if (rc < 0) {
+ dev_err(&chip->dev, "Setting IRQ failed.\n");
+ return;
+ }
intmask = priv->int_mask | TPM_GLOBAL_INT_ENABLE;
-
- tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask);
-
-out:
- if (chip->ops->clk_enable != NULL)
- chip->ops->clk_enable(chip, false);
-
- return;
+ rc = tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask);
+ if (rc < 0)
+ dev_err(&chip->dev, "Enabling interrupts failed.\n");
}
int tpm_tis_resume(struct device *dev)
@@ -1235,27 +1230,27 @@ int tpm_tis_resume(struct device *dev)
struct tpm_chip *chip = dev_get_drvdata(dev);
int ret;
- ret = tpm_tis_request_locality(chip, 0);
- if (ret < 0)
+ ret = tpm_chip_start(chip);
+ if (ret)
return ret;
if (chip->flags & TPM_CHIP_FLAG_IRQ)
tpm_tis_reenable_interrupts(chip);
- ret = tpm_pm_resume(dev);
- if (ret)
- goto out;
-
/*
* TPM 1.2 requires self-test on resume. This function actually returns
* an error code but for unknown reason it isn't handled.
*/
if (!(chip->flags & TPM_CHIP_FLAG_TPM2))
tpm1_do_selftest(chip);
-out:
- tpm_tis_relinquish_locality(chip, 0);
- return ret;
+ tpm_chip_stop(chip);
+
+ ret = tpm_pm_resume(dev);
+ if (ret)
+ return ret;
+
+ return 0;
}
EXPORT_SYMBOL_GPL(tpm_tis_resume);
#endif
--
2.39.2
The ftrace-direct-too sample traces the handle_mm_fault function whose
signature changed since the introduction of the sample. Since:
commit bce617edecad ("mm: do page fault accounting in handle_mm_fault")
handle_mm_fault now has 4 arguments. Therefore, the sample trampoline
should save 4 argument registers.
s390 saves all argument registers already so it does not need a change
but x86_64 needs an extra push and pop.
This also evolves the signature of the tracing function to make it
mirror the signature of the traced function.
Cc: stable(a)vger.kernel.org
Fixes: bce617edecad ("mm: do page fault accounting in handle_mm_fault")
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Reviewed-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Florent Revest <revest(a)chromium.org>
---
samples/ftrace/ftrace-direct-too.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/samples/ftrace/ftrace-direct-too.c b/samples/ftrace/ftrace-direct-too.c
index f28e7b99840f..71ed4ee8cb4a 100644
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -5,14 +5,14 @@
#include <linux/ftrace.h>
#include <asm/asm-offsets.h>
-extern void my_direct_func(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+extern void my_direct_func(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags, struct pt_regs *regs);
-void my_direct_func(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+void my_direct_func(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags, struct pt_regs *regs)
{
- trace_printk("handle mm fault vma=%p address=%lx flags=%x\n",
- vma, address, flags);
+ trace_printk("handle mm fault vma=%p address=%lx flags=%x regs=%p\n",
+ vma, address, flags, regs);
}
extern void my_tramp(void *);
@@ -34,7 +34,9 @@ asm (
" pushq %rdi\n"
" pushq %rsi\n"
" pushq %rdx\n"
+" pushq %rcx\n"
" call my_direct_func\n"
+" popq %rcx\n"
" popq %rdx\n"
" popq %rsi\n"
" popq %rdi\n"
--
2.40.1.495.gc816e09b53d-goog
Hi! I noticed a report about a regression (hang due to a deadlock with
mt76x02u_pre_tbtt_work when using a MT7610U chip as AP) that according
to the reporter started with 6.1.21; 6.2 and 6.3 work, but lockdep
warnings occur there.
There thus apparently is at least one bug in a stable tree that might or
might not be caused by a backported change that leads to the lockdep
warnings in later series.
But the reporter apparently doesn't care about 6.1.y anymore and plans
to move to 6.3. Hence the reporter afaics has no interest in bisecting
the problem on 6.1.y. But maybe some of you care or even have an idea
what's causing this. For details see:
https://bugzilla.kernel.org/show_bug.cgi?id=217341
Ciao, Thorsten
Hi Greg, Sasha,
Recently, 2 patches related to MPTCP have not been backported to the v6.1
tree due to conflicts:
- 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") [1]
- 63740448a32e ("mptcp: fix accept vs worker race") [2]
I resolved the conflicts here, documented what I did in each patch
and ran our test suite. Everything seems OK.
These patches are based on top of the latest linux-stable-rc/linux-6.1.y
version.
Do you mind adding these two patches to the v6.1 queue please?
[1] https://lore.kernel.org/r/2023042259-gravity-hate-a9a3@gregkh
[2] https://lore.kernel.org/r/2023042215-chastise-scuba-8478@gregkh
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Paolo Abeni (2):
mptcp: stops worker on unaccepted sockets at listener close
mptcp: fix accept vs worker race
net/mptcp/protocol.c | 74 +++++++++++++++++++++++++++++++++---------------
net/mptcp/protocol.h | 2 ++
net/mptcp/subflow.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 130 insertions(+), 26 deletions(-)
---
base-commit: e4ff6ff54dea67f94036a357201b0f9807405cc6
change-id: 20230424-upstream-stable-20230424-conflicts-6-1-f325fe76c540
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
The patch below does not apply to the 6.2-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.2.y
git checkout FETCH_HEAD
git cherry-pick -x f4e9e0e69468583c2c6d9d5c7bfc975e292bf188
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023042253-speed-jolliness-682f@gregkh' --subject-prefix 'PATCH 6.2.y' HEAD^..
Possible dependencies:
f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
9760ebffbf55 ("mm: switch vma_merge(), split_vma(), and __split_vma to vma iterator")
47d9644de92c ("nommu: convert nommu to using the vma iterator")
a27a11f92fe2 ("mm/mremap: use vmi version of vma_merge()")
076f16bf7698 ("mmap: use vmi version of vma_merge()")
0c0c5bffd0a2 ("mmap: pass through vmi iterator to __split_vma()")
178e22ac2078 ("madvise: use vmi iterator for __split_vma() and vma_merge()")
f10c2abcdac4 ("mempolicy: convert to vma iterator")
37598f5a9d8b ("mlock: convert mlock to vma iterator")
2286a6914c77 ("mm: change mprotect_fixup to vma iterator")
11a9b90274f6 ("userfaultfd: use vma iterator")
f2ebfe43ba6c ("mm: add temporary vma iterator versions of vma_merge(), split_vma(), and __split_vma()")
183654ce26a5 ("mmap: change do_mas_munmap and do_mas_aligned_munmap() to use vma iterator")
0378c0a0e9e4 ("mm/mmap: remove preallocation from do_mas_align_munmap()")
92fed82047d7 ("mm/mmap: convert brk to use vma iterator")
baabcfc93d3b ("mm/mmap: fix typo in comment")
c5d5546ea065 ("maple_tree: remove the parameter entry of mas_preallocate")
5ab0fc155dc0 ("Sync mm-stable with mm-hotfixes-stable to pick up dependent patches")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f4e9e0e69468583c2c6d9d5c7bfc975e292bf188 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Mon, 10 Apr 2023 11:22:05 -0400
Subject: [PATCH] mm/mempolicy: fix use-after-free of VMA iterator
set_mempolicy_home_node() iterates over a list of VMAs and calls
mbind_range() on each VMA, which also iterates over the singular list of
the VMA passed in and potentially splits the VMA. Since the VMA iterator
is not passed through, set_mempolicy_home_node() may now point to a stale
node in the VMA tree. This can result in a UAF as reported by syzbot.
Avoid the stale maple tree node by passing the VMA iterator through to the
underlying call to split_vma().
mbind_range() is also overly complicated, since there are two calling
functions and one already handles iterating over the VMAs. Simplify
mbind_range() to only handle merging and splitting of the VMAs.
Align the new loop in do_mbind() and existing loop in
set_mempolicy_home_node() to use the reduced mbind_range() function. This
allows for a single location of the range calculation and avoids
constantly looking up the previous VMA (since this is a loop over the
VMAs).
Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reported-by: syzbot+a7c1ec5b1d71ceaa5186(a)syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
Tested-by: syzbot+a7c1ec5b1d71ceaa5186(a)syzkaller.appspotmail.com
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a256a241fd1d..2068b594dc88 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -790,61 +790,50 @@ static int vma_replace_policy(struct vm_area_struct *vma,
return err;
}
-/* Step 2: apply policy to a range and do splits. */
-static int mbind_range(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct mempolicy *new_pol)
+/* Split or merge the VMA (if required) and apply the new policy */
+static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct vm_area_struct **prev, unsigned long start,
+ unsigned long end, struct mempolicy *new_pol)
{
- VMA_ITERATOR(vmi, mm, start);
- struct vm_area_struct *prev;
- struct vm_area_struct *vma;
- int err = 0;
+ struct vm_area_struct *merged;
+ unsigned long vmstart, vmend;
pgoff_t pgoff;
+ int err;
- prev = vma_prev(&vmi);
- vma = vma_find(&vmi, end);
- if (WARN_ON(!vma))
+ vmend = min(end, vma->vm_end);
+ if (start > vma->vm_start) {
+ *prev = vma;
+ vmstart = start;
+ } else {
+ vmstart = vma->vm_start;
+ }
+
+ if (mpol_equal(vma_policy(vma), new_pol))
return 0;
- if (start > vma->vm_start)
- prev = vma;
-
- do {
- unsigned long vmstart = max(start, vma->vm_start);
- unsigned long vmend = min(end, vma->vm_end);
-
- if (mpol_equal(vma_policy(vma), new_pol))
- goto next;
-
- pgoff = vma->vm_pgoff +
- ((vmstart - vma->vm_start) >> PAGE_SHIFT);
- prev = vma_merge(&vmi, mm, prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff,
- new_pol, vma->vm_userfaultfd_ctx,
- anon_vma_name(vma));
- if (prev) {
- vma = prev;
- goto replace;
- }
- if (vma->vm_start != vmstart) {
- err = split_vma(&vmi, vma, vmstart, 1);
- if (err)
- goto out;
- }
- if (vma->vm_end != vmend) {
- err = split_vma(&vmi, vma, vmend, 0);
- if (err)
- goto out;
- }
-replace:
- err = vma_replace_policy(vma, new_pol);
+ pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
+ merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, pgoff, new_pol,
+ vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+ if (merged) {
+ *prev = merged;
+ return vma_replace_policy(merged, new_pol);
+ }
+
+ if (vma->vm_start != vmstart) {
+ err = split_vma(vmi, vma, vmstart, 1);
if (err)
- goto out;
-next:
- prev = vma;
- } for_each_vma_range(vmi, vma, end);
+ return err;
+ }
-out:
- return err;
+ if (vma->vm_end != vmend) {
+ err = split_vma(vmi, vma, vmend, 0);
+ if (err)
+ return err;
+ }
+
+ *prev = vma;
+ return vma_replace_policy(vma, new_pol);
}
/* Set the process memory policy */
@@ -1259,6 +1248,8 @@ static long do_mbind(unsigned long start, unsigned long len,
nodemask_t *nmask, unsigned long flags)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma, *prev;
+ struct vma_iterator vmi;
struct mempolicy *new;
unsigned long end;
int err;
@@ -1328,7 +1319,13 @@ static long do_mbind(unsigned long start, unsigned long len,
goto up_out;
}
- err = mbind_range(mm, start, end, new);
+ vma_iter_init(&vmi, mm, start);
+ prev = vma_prev(&vmi);
+ for_each_vma_range(vmi, vma, end) {
+ err = mbind_range(&vmi, vma, &prev, start, end, new);
+ if (err)
+ break;
+ }
if (!err) {
int nr_failed = 0;
@@ -1489,10 +1486,8 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
unsigned long, home_node, unsigned long, flags)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma, *prev;
struct mempolicy *new, *old;
- unsigned long vmstart;
- unsigned long vmend;
unsigned long end;
int err = -ENOENT;
VMA_ITERATOR(vmi, mm, start);
@@ -1521,6 +1516,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
if (end == start)
return 0;
mmap_write_lock(mm);
+ prev = vma_prev(&vmi);
for_each_vma_range(vmi, vma, end) {
/*
* If any vma in the range got policy other than MPOL_BIND
@@ -1541,9 +1537,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
}
new->home_node = home_node;
- vmstart = max(start, vma->vm_start);
- vmend = min(end, vma->vm_end);
- err = mbind_range(mm, vmstart, vmend, new);
+ err = mbind_range(&vmi, vma, &prev, start, end, new);
mpol_put(new);
if (err)
break;
From: David Matlack <dmatlack(a)google.com>
[ Upstream commit 13ec9308a85702af7c31f3638a2720863848a7f2 ]
Read mmu_invalidate_seq before dropping the mmap_lock so that KVM can
detect if the results of vma_lookup() (e.g. vma_shift) become stale
before it acquires kvm->mmu_lock. This fixes a theoretical bug where a
VMA could be changed by userspace after vma_lookup() and before KVM
reads the mmu_invalidate_seq, causing KVM to install page table entries
based on a (possibly) no-longer-valid vma_shift.
Re-order the MMU cache top-up to earlier in user_mem_abort() so that it
is not done after KVM has read mmu_invalidate_seq (i.e. so as to avoid
inducing spurious fault retries).
This bug has existed since KVM/ARM's inception. It's unlikely that any
sane userspace currently modifies VMAs in such a way as to trigger this
race. And even with directed testing I was unable to reproduce it. But a
sufficiently motivated host userspace might be able to exploit this
race.
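The ordering requirement can be sketched with a small single-threaded model. This is an illustration under stated assumptions, not the KVM implementation: struct mmu_model, struct fault_snapshot and all three helpers are hypothetical, and the locks are only implied by the comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; these are not the KVM structures. */
struct mmu_model {
	unsigned long invalidate_seq;	/* bumped by every invalidation */
	int vma_shift;			/* value derived from the VMA lookup */
};

struct fault_snapshot {
	unsigned long seq;
	int vma_shift;
};

/* Step 1: read both values while the modelled mmap_lock is still held. */
static struct fault_snapshot snapshot_under_lock(const struct mmu_model *m)
{
	struct fault_snapshot s;

	assert(m);
	s.vma_shift = m->vma_shift;
	s.seq = m->invalidate_seq;
	return s;
}

/* Models userspace changing the VMA, which triggers an invalidation. */
static void change_vma(struct mmu_model *m, int new_shift)
{
	m->vma_shift = new_shift;
	m->invalidate_seq++;
}

/* Step 2: after taking the modelled mmu_lock, decide whether to retry. */
static bool snapshot_is_stale(const struct mmu_model *m,
			      const struct fault_snapshot *s)
{
	return m->invalidate_seq != s->seq;
}
```

Because the sequence number is sampled together with vma_shift, any later VMA change is guaranteed to show up as a sequence mismatch. Sampling the sequence number after the lock is dropped would leave a window in which the change goes unnoticed.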
Fixes: 94f8e6418d39 ("KVM: ARM: Handle guest faults in KVM")
Cc: stable(a)vger.kernel.org # 6.1 only
Reported-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: David Matlack <dmatlack(a)google.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20230313235454.2964067-1-dmatlack@google.com
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
[will: Use FSC_PERM instead of ESR_ELx_FSC_PERM]
Signed-off-by: Will Deacon <will(a)kernel.org>
---
arch/arm64/kvm/mmu.c | 47 ++++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 019472dd98ff..54ccdcc2dbdf 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1178,6 +1178,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
+ /*
+ * Permission faults just need to update the existing leaf entry,
+ * and so normally don't require allocations from the memcache. The
+ * only exception to this is when dirty logging is enabled at runtime
+ * and a write fault needs to collapse a block entry into a table.
+ */
+ if (fault_status != FSC_PERM ||
+ (logging_active && write_fault)) {
+ ret = kvm_mmu_topup_memory_cache(memcache,
+ kvm_mmu_cache_min_pages(kvm));
+ if (ret)
+ return ret;
+ }
+
/*
* Let's check if we will get back a huge page backed by hugetlbfs, or
* get block mapping for device MMIO region.
@@ -1234,36 +1248,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
fault_ipa &= ~(vma_pagesize - 1);
gfn = fault_ipa >> PAGE_SHIFT;
- mmap_read_unlock(current->mm);
-
- /*
- * Permission faults just need to update the existing leaf entry,
- * and so normally don't require allocations from the memcache. The
- * only exception to this is when dirty logging is enabled at runtime
- * and a write fault needs to collapse a block entry into a table.
- */
- if (fault_status != FSC_PERM || (logging_active && write_fault)) {
- ret = kvm_mmu_topup_memory_cache(memcache,
- kvm_mmu_cache_min_pages(kvm));
- if (ret)
- return ret;
- }
- mmu_seq = vcpu->kvm->mmu_invalidate_seq;
/*
- * Ensure the read of mmu_invalidate_seq happens before we call
- * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
- * the page we just got a reference to gets unmapped before we have a
- * chance to grab the mmu_lock, which ensure that if the page gets
- * unmapped afterwards, the call to kvm_unmap_gfn will take it away
- * from us again properly. This smp_rmb() interacts with the smp_wmb()
- * in kvm_mmu_notifier_invalidate_<page|range_end>.
+ * Read mmu_invalidate_seq so that KVM can detect if the results of
+ * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+ * acquiring kvm->mmu_lock.
*
- * Besides, __gfn_to_pfn_memslot() instead of gfn_to_pfn_prot() is
- * used to avoid unnecessary overhead introduced to locate the memory
- * slot because it's always fixed even @gfn is adjusted for huge pages.
+ * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
+ * with the smp_wmb() in kvm_mmu_invalidate_end().
*/
- smp_rmb();
+ mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+ mmap_read_unlock(current->mm);
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
write_fault, &writable, NULL);
--
2.40.0.634.g4ca3ef3211-goog
From: Ziwei Dai <ziwei.dai(a)unisoc.com>
commit 5da7cb193db32da783a3f3e77d8b639989321d48 upstream.
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with a kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0                                   CPU 1
----------------------                  ----------------------
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// css_set_ptr is the "from_cset" mentioned on CPU 1
 |css_set_ptr = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.
                                        cgroup_attach_task()
                                        |cgroup_migrate()
                                        |cgroup_migrate_execute()
                                        |css_set_move_task(task, from_cset, to_cset, true)
                                        |cgroup_move_task(task, to_cset)
                                        |rcu_assign_pointer(.., to_cset)
                                        |...
                                        |cgroup_migrate_finish()
                                        |put_css_set_locked(from_cset)
                                        |from_cset->refcount return 0
                                        |kfree_rcu(cset, rcu_head) // free from_cset after new gp
                                        |add_ptr_to_bulk_krc_lock()
                                        |schedule_delayed_work(&krcp->monitor_work, ..)
                                        kfree_rcu_monitor()
                                        |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
                                        |queue_rcu_work(system_wq, &krwp->rcu_work)
                                        |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
                                        |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
                                        // There is a previous call_rcu(.., rcu_work_rcufn)
                                        // gp end, rcu_work_rcufn() is called.
                                        rcu_work_rcufn()
                                        |__queue_work(.., rwork->wq, &rwork->work);
                                        |kfree_rcu_work()
                                        |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
                                        |The "from_cset" is freed before new gp end.
// the task resumes some time later.
 |css_set_ptr->subsys[(subsys_id)] <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
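The difference between checking channels one at a time and checking them as a unit can be illustrated with a simplified userspace model. Everything here is an assumption made for illustration: struct rcu_batch, N_CHANNELS, old_can_attach() and new_can_attach() are hypothetical names, and the old check is reduced to its per-channel essence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N_CHANNELS 2	/* modelled bulk channels: kfree() and kvfree() */

/* Hypothetical model of one batch that was already handed to RCU. */
struct rcu_batch {
	void *bulk_free[N_CHANNELS];	/* cache-friendly bulk pointers */
	void *head_free;		/* linked-list fallback channel */
};

/* Simplified old check: reuse the batch if this one channel is empty. */
static bool old_can_attach(const struct rcu_batch *b, int channel)
{
	assert(channel >= 0 && channel < N_CHANNELS);
	return b->bulk_free[channel] == NULL;
}

/* New check: reuse the batch only when every channel is empty. */
static bool new_can_attach(const struct rcu_batch *b)
{
	for (int i = 0; i < N_CHANNELS; i++)
		if (b->bulk_free[i])
			return false;
	return b->head_free == NULL;
}
```

With category-1 memory still pending in the batch, old_can_attach(&b, 1) accepts new category-2 memory, which then rides on the grace period that was already requested. new_can_attach() refuses until the whole batch has drained.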
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
[UR: backport to 6.2-stable]
Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai(a)unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
---
kernel/rcu/tree.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cf34a961821a..522129193771 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3131,6 +3131,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
return !!krcp->head;
}
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+ int i;
+
+ for (i = 0; i < FREE_N_CHANNELS; i++)
+ if (krwp->bkvhead_free[i])
+ return true;
+
+ return !!krwp->head_free;
+}
+
static void
schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
{
@@ -3162,14 +3174,13 @@ static void kfree_rcu_monitor(struct work_struct *work)
for (i = 0; i < KFREE_N_BATCHES; i++) {
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
- // Try to detach bkvhead or head and attach it over any
- // available corresponding free channel. It can be that
- // a previous RCU batch is in progress, it means that
- // immediately to queue another one is not possible so
- // in that case the monitor work is rearmed.
- if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
- (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
- (krcp->head && !krwp->head_free)) {
+ // Try to detach bulk_head or head and attach it, only when
+ // all channels are free. Any channel is not free means at krwp
+ // there is on-going rcu work to handle krwp's free business.
+ if (need_wait_for_krwp_work(krwp))
+ continue;
+
+ if (need_offload_krc(krcp)) {
// Channel 1 corresponds to the SLAB-pointer bulk path.
// Channel 2 corresponds to vmalloc-pointer bulk path.
for (j = 0; j < FREE_N_CHANNELS; j++) {
--
2.30.2
From: Tobias Schramm <t.schramm(a)manjaro.org>
[ Upstream commit eca5bd666b0aa7dc0bca63292e4778968241134e ]
This commit fixes a race between completion of stop command and start of a
new command.
Previously the command ready interrupt was enabled before stop command
was written to the command register. This caused the command ready
interrupt to fire immediately since the CMDRDY flag is asserted constantly
while there is no command in progress.
Consequently the command state machine will immediately advance to the
next state when the tasklet function is executed again, regardless of the
actual completion state of the stop command.
Thus a new command can then be dispatched immediately, interrupting and
corrupting the stop command on the CMD line.
Fix that by dropping the command ready interrupt enable before calling
atmci_send_stop_cmd. atmci_send_stop_cmd already enables the command
ready interrupt, so no further writes to ATMCI_IER are necessary.
Signed-off-by: Tobias Schramm <t.schramm(a)manjaro.org>
Acked-by: Ludovic Desroches <ludovic.desroches(a)microchip.com>
Link: https://lore.kernel.org/r/20221230194315.809903-2-t.schramm@manjaro.org
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/mmc/host/atmel-mci.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/mmc/host/atmel-mci.c b/drivers/mmc/host/atmel-mci.c
index bb9bbf1c927b6..dd18440a90c58 100644
--- a/drivers/mmc/host/atmel-mci.c
+++ b/drivers/mmc/host/atmel-mci.c
@@ -1817,7 +1817,6 @@ static void atmci_tasklet_func(struct tasklet_struct *t)
atmci_writel(host, ATMCI_IER, ATMCI_NOTBUSY);
state = STATE_WAITING_NOTBUSY;
} else if (host->mrq->stop) {
- atmci_writel(host, ATMCI_IER, ATMCI_CMDRDY);
atmci_send_stop_cmd(host, data);
state = STATE_SENDING_STOP;
} else {
@@ -1850,8 +1849,6 @@ static void atmci_tasklet_func(struct tasklet_struct *t)
* command to send.
*/
if (host->mrq->stop) {
- atmci_writel(host, ATMCI_IER,
- ATMCI_CMDRDY);
atmci_send_stop_cmd(host, data);
state = STATE_SENDING_STOP;
} else {
--
2.39.2
Hi Greg and Sasha,
On Tue, 10 Aug 2021 16:45:34 +0000 SeongJae Park <sj38.park(a)gmail.com> wrote:
> From: SeongJae Park <sjpark(a)amazon.de>
>
> When running a test program, 'run_one()' checks if the program has the
> execution permission and fails if it doesn't. However, it's easy to
> mistakenly miss the permission, as some common tools like 'diff'
> don't support the permission change well[1]. Compared to that, making
> mistakes in the test program's path would be rare, as those are
> explicitly listed in 'TEST_PROGS'. Therefore, it might make more sense
> to resolve the situation on our own and run the program.
>
> For this reason, this commit makes the test program runner function
> still print the warning message but try parsing the interpreter of the
> program and explicitly run it with the interpreter in that case.
>
> [1] https://lore.kernel.org/mm-commits/YRJisBs9AunccCD4@kroah.com/
>
> Suggested-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> Signed-off-by: SeongJae Park <sjpark(a)amazon.de>
This patch was merged into mainline as commit 303f8e2d0200
("selftests/kselftest/runner/run_one(): allow running non-executable files").
However, this patch has not been added to v5.15.y, while some selftests,
including the one for DAMON, have no execution permission. As a result, those
selftests always fail unless this patch is manually applied. Could you please
add this patch to v5.15.y? I confirmed that this patch can be cleanly
cherry-picked onto the latest v5.15.y.
Thanks,
SJ
[ Upstream commit 59f5ede3bc0f00eb856425f636dab0c10feb06d8 ]
The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().
But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:
!in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()
The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interrupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.
Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.
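The old and new conditions can be compared with a small truth-table model. This is a sketch under stated assumptions: struct fpu_ctx and both functions are hypothetical, and the booleans stand in for state that the kernel reads from preempt_count and a per-CPU flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the calling context (not a kernel structure). */
struct fpu_ctx {
	bool in_nmi;		/* running in NMI context */
	bool in_hardirq;	/* running in hard interrupt context */
	bool bh_disabled;	/* BH disabled or softirq being processed */
	bool interrupted_user;	/* the interrupt hit user mode */
	bool kernel_fpu_active;	/* another context already uses the FPU */
};

/* The old condition, as quoted in the commit message. */
static bool old_fpu_usable(const struct fpu_ctx *c)
{
	bool in_interrupt = c->in_nmi || c->in_hardirq || c->bh_disabled;

	return !in_interrupt || c->interrupted_user || !c->kernel_fpu_active;
}

/* The new condition, following the rule stated above. */
static bool new_fpu_usable(const struct fpu_ctx *c)
{
	if (c->in_nmi || c->kernel_fpu_active)
		return false;
	if (!c->in_hardirq)
		return true;
	return !c->bh_disabled;
}

/* A hard interrupt that lands inside a fpregs_lock()'ed region. */
static void fpu_model_selfcheck(void)
{
	struct fpu_ctx c = { .in_hardirq = true, .bh_disabled = true };

	assert(old_fpu_usable(&c));	/* old check wrongly allows FPU use */
	assert(!new_fpu_usable(&c));	/* new check refuses */
}
```

The self-check captures the failure case: no other kernel FPU user exists, so the old expression evaluates to true even though the interrupted context holds fpregs_lock().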
It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.
This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.
Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Cc: stable(a)vger.kernel.org
Cc: Can Sun <cansun(a)arista.com>
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 571220ac8bea..835b948095cd 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -25,17 +25,7 @@
*/
union fpregs_state init_fpstate __read_mostly;
-/*
- * Track whether the kernel is using the FPU state
- * currently.
- *
- * This flag is used:
- *
- * - by IRQ context code to potentially use the FPU
- * if it's unused.
- *
- * - to debug kernel_fpu_begin()/end() correctness
- */
+/* Track in-kernel FPU usage */
static DEFINE_PER_CPU(bool, in_kernel_fpu);
/*
@@ -43,42 +33,37 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu);
*/
DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
-static bool kernel_fpu_disabled(void)
-{
- return this_cpu_read(in_kernel_fpu);
-}
-
-static bool interrupted_kernel_fpu_idle(void)
-{
- return !kernel_fpu_disabled();
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static bool interrupted_user_mode(void)
-{
- struct pt_regs *regs = get_irq_regs();
- return regs && user_mode(regs);
-}
-
/*
* Can we use the FPU in kernel mode with the
* whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
*/
bool irq_fpu_usable(void)
{
- return !in_interrupt() ||
- interrupted_user_mode() ||
- interrupted_kernel_fpu_idle();
+ if (WARN_ON_ONCE(in_nmi()))
+ return false;
+
+ /* In kernel FPU usage already active? */
+ if (this_cpu_read(in_kernel_fpu))
+ return false;
+
+ /*
+ * When not in NMI or hard interrupt context, FPU can be used in:
+ *
+ * - Task context except from within fpregs_lock()'ed critical
+ * regions.
+ *
+ * - Soft interrupt processing context which cannot happen
+ * while in a fpregs_lock()'ed critical region.
+ */
+ if (!in_irq())
+ return true;
+
+ /*
+ * In hard interrupt context it's safe when soft interrupts
+ * are enabled, which means the interrupt did not hit in
+ * a fpregs_lock()'ed critical region.
+ */
+ return !softirq_count();
}
EXPORT_SYMBOL(irq_fpu_usable);
From: Tze-nan Wu <Tze-nan.Wu(a)mediatek.com>
In ring_buffer_reset_online_cpus, the buffer_size_kb write operation
may permanently fail if the cpu_online_mask changes between two
for_each_online_buffer_cpu loops. The number of increases and decreases
on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be
inconsistent, causing some CPUs to have non-zero values for these atomic
variables after the function returns.
This issue can be reproduced by running "echo 0 > trace" while hotplugging
a CPU. Once the issue is reproduced, buffer_size_kb is no longer functional.
To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after
ring_buffer_reset_online_cpus returns, we ensure that each atomic variable
was set up by the first loop before the second loop decrements it.
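The marking scheme can be modelled in userspace with C11 atomics. The sketch below is illustrative only: MODEL_NR_CPUS, the resize_disabled array and both helpers are hypothetical, and the online mask is passed in by hand rather than read from the system.

```c
#include <assert.h>
#include <stdatomic.h>

#define RESET_BIT	(1 << 30)	/* flag bit carried in the counter */
#define MODEL_NR_CPUS	4		/* hypothetical CPU count */

static atomic_int resize_disabled[MODEL_NR_CPUS];

/* First pass: mark only the CPUs that are online right now. */
static void mark_online(const int *online)
{
	assert(online);
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (online[cpu])
			atomic_fetch_add(&resize_disabled[cpu], RESET_BIT);
}

/*
 * Second pass: visit every CPU regardless of its current online state
 * and undo only those carrying the mark. Returns the number undone.
 */
static int unmark_all(void)
{
	int undone = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(atomic_load(&resize_disabled[cpu]) & RESET_BIT))
			continue;
		atomic_fetch_sub(&resize_disabled[cpu], RESET_BIT);
		undone++;
	}
	return undone;
}
```

Because the second pass keys off the mark rather than the online mask, a CPU that went offline after the first pass is still cleaned up, and a CPU that came online in between is skipped.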
Link: https://lore.kernel.org/linux-trace-kernel/20230426062027.17451-1-Tze-nan.W…
Cc: stable(a)vger.kernel.org
Cc: <mhiramat(a)kernel.org>
Cc: npiggin(a)gmail.com
Fixes: b23d7a5f4a07 ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reviewed-by: Cheng-Jui Wang <cheng-jui.wang(a)mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu(a)mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ring_buffer.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 58be5b409f72..9a0cb94c3972 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5326,6 +5326,9 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu)
}
EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
+/* Flag to ensure proper resetting of atomic variables */
+#define RESET_BIT (1 << 30)
+
/**
* ring_buffer_reset_online_cpus - reset a ring buffer per CPU buffer
* @buffer: The ring buffer to reset a per cpu buffer of
@@ -5342,20 +5345,27 @@ void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
for_each_online_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
- atomic_inc(&cpu_buffer->resize_disabled);
+ atomic_add(RESET_BIT, &cpu_buffer->resize_disabled);
atomic_inc(&cpu_buffer->record_disabled);
}
/* Make sure all commits have finished */
synchronize_rcu();
- for_each_online_buffer_cpu(buffer, cpu) {
+ for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
+ /*
+ * If a CPU came online during the synchronize_rcu(), then
+ * ignore it.
+ */
+ if (!(atomic_read(&cpu_buffer->resize_disabled) & RESET_BIT))
+ continue;
+
reset_disabled_cpu_buffer(cpu_buffer);
atomic_dec(&cpu_buffer->record_disabled);
- atomic_dec(&cpu_buffer->resize_disabled);
+ atomic_sub(RESET_BIT, &cpu_buffer->resize_disabled);
}
mutex_unlock(&buffer->mutex);
--
2.39.2
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both a locked
context (config_num_requests_store()) and an unlocked context
(config_read_fw_idx_store()). Both callers are driver methods and can be
invoked asynchronously, while test_dev_config_update_u8() and its siblings
modify the value pointed to by u8 *cfg or a similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
This wrapper does the locking and calls the unlocked primitive, which
provides both locked and unlocked versions without duplicating code.
A similar approach was applied to all functions called from both locked
and unlocked contexts, which safely mitigates both deadlocks and race
conditions in the driver.
The unlocked functions __test_dev_config_update_bool(),
__test_dev_config_update_u8() and __test_dev_config_update_size_t()
were introduced to be called from locked contexts, so that the main
driver lock no longer has to be released, which is what caused the race
condition.
The locked functions test_dev_config_update_bool(),
test_dev_config_update_u8() and test_dev_config_update_size_t()
are called from driver methods, which avoids repeating the locking and
unlocking code in each method and complicating the code by saving the
return value across the lock.
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 05ed84c2fc4c..35417e0af3f4 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -353,16 +353,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -373,7 +383,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -384,9 +395,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -402,7 +411,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -411,14 +420,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -471,10 +489,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -518,10 +536,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -548,10 +566,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.30.2
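The locking change in the patch above follows a common pattern: an unlocked helper that assumes the caller holds the mutex, plus a thin wrapper that takes the mutex for callers that do not. A minimal userspace sketch of that pattern, assuming POSIX threads; all names here (cfg_mutex, cfg_update_u8_locked, batch_in_flight, ...) are hypothetical and are not the test_firmware symbols:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_mutex_t cfg_mutex = PTHREAD_MUTEX_INITIALIZER;
static uint8_t num_requests = 1;	/* guarded by cfg_mutex */
static int batch_in_flight;		/* hypothetical "busy" flag, guarded by cfg_mutex */

/* Unlocked helper: the caller must already hold cfg_mutex. */
int cfg_update_u8_locked(const char *buf, uint8_t *cfg)
{
	char *end;
	unsigned long val;

	assert(buf && cfg);
	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || val > UINT8_MAX)
		return -EINVAL;
	*cfg = (uint8_t)val;
	return 0;
}

/* Locked wrapper for callers that do not hold cfg_mutex. */
int cfg_update_u8(const char *buf, uint8_t *cfg)
{
	int ret;

	pthread_mutex_lock(&cfg_mutex);
	ret = cfg_update_u8_locked(buf, cfg);
	pthread_mutex_unlock(&cfg_mutex);
	return ret;
}

/* The busy check and the update share one critical section. */
int num_requests_store(const char *buf)
{
	int ret;

	pthread_mutex_lock(&cfg_mutex);
	if (batch_in_flight)
		ret = -EBUSY;
	else
		ret = cfg_update_u8_locked(buf, &num_requests);
	pthread_mutex_unlock(&cfg_mutex);
	return ret;
}
```

Dropping the mutex between the busy check and the update, as the old code did, would leave a window in which another thread could start a batch before the new value is written.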
Hi there,
I was evaluating CVE-2022-3567 and CVE-2022-3566, which both
revolve around load tearing and reference an ancient kernel commit:
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
I am not sure whether they are applicable to the v5.4.y branch as well.
Could you advise?
Best Regards,
Kristof Havasi
commit 7041101ff6c3073fd8f2e99920f535b111c929cb upstream.
If sch_fq is configured with "initial quantum" having values greater than
INT_MAX, the first assignment of "credit" does signed integer overflow to
a very negative value.
In this situation, the syzkaller script provided by Christoph triggers the
CPU soft-lockup warning even with few sockets. It's not an infinite loop,
but "credit" was probably not meant to be minus 2Gb for each new flow.
Capping "initial quantum" to INT_MAX proved to fix the issue.
This patch doesn't use netlink validation helpers, since they might not be
available on all stable branches.
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/377
Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: Davide Caratti <dcaratti(a)redhat.com>
---
net/sched/sch_fq.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 48d14fb90ba0..12efbcfc2938 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -842,8 +842,16 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
}
}
- if (tb[TCA_FQ_INITIAL_QUANTUM])
- q->initial_quantum = nla_get_u32(tb[TCA_FQ_INITIAL_QUANTUM]);
+ if (tb[TCA_FQ_INITIAL_QUANTUM]) {
+ u32 initial_quantum = nla_get_u32(tb[TCA_FQ_INITIAL_QUANTUM]);
+
+ if (initial_quantum <= INT_MAX) {
+ q->initial_quantum = initial_quantum;
+ } else {
+ NL_SET_ERR_MSG_MOD(extack, "invalid initial quantum");
+ err = -EINVAL;
+ }
+ }
if (tb[TCA_FQ_FLOW_DEFAULT_RATE])
pr_warn_ratelimited("sch_fq: defrate %u ignored.\n",
--
2.39.2
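To illustrate the sch_fq fix above: a u32 value larger than INT_MAX turns negative when it is later stored in a signed int (on the usual two's-complement targets), so the patch rejects such values up front. A minimal userspace sketch, assuming a hypothetical fq_params structure and helper names that are not part of sch_fq:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the qdisc's private data. */
struct fq_params {
	uint32_t initial_quantum;
};

/* Reject values that cannot be represented in a signed int. */
int fq_set_initial_quantum(struct fq_params *q, uint32_t requested)
{
	assert(q);
	if (requested > INT_MAX)
		return -EINVAL;
	q->initial_quantum = requested;
	return 0;
}

/* "credit" is signed; the conversion is safe only because of the cap. */
int fq_new_flow_credit(const struct fq_params *q)
{
	return (int)q->initial_quantum;
}
```

As in the patch, a rejected value leaves the previously configured quantum untouched.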
For some reason, this email did not make it to
linux-trace-kernel(a)vger.kernel.org, and therefore did not make it into
patchwork?
John?
-- Steve
On Wed, 26 Apr 2023 09:04:44 +0800
Tze-nan.Wu <Tze-nan.Wu(a)mediatek.com> wrote:
> From: "Tze-nan Wu" <Tze-nan.Wu(a)mediatek.com>
>
> In ring_buffer_reset_online_cpus, the buffer_size_kb write operation
> may permanently fail if the cpu_online_mask changes between two
> for_each_online_buffer_cpu loops. The number of increases and decreases
> on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be
> inconsistent, causing some CPUs to have non-zero values for these atomic
> variables after the function returns.
>
> This issue can be reproduced by "echo 0 > trace" while hotplugging a CPU.
> Once the issue has been reproduced, buffer_size_kb is no longer
> functional.
>
> To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after
> ring_buffer_reset_online_cpus returns, we ensure that each atomic variable
> has been set up before atomic_sub() to it.
>
> Cc: stable(a)vger.kernel.org
> Cc: npiggin(a)gmail.com
> Fixes: b23d7a5f4a07 ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
> Reviewed-by: Cheng-Jui Wang <cheng-jui.wang(a)mediatek.com>
> Signed-off-by: Tze-nan Wu <Tze-nan.Wu(a)mediatek.com>
> ---
> Changes from v4 to v5: https://lore.kernel.org/lkml/20230412112401.25081-1-Tze-nan.Wu@mediatek.com/
> - Move the define before the function
> ---
> kernel/trace/ring_buffer.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 76a2d91eecad..253ef85a9ec3 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5345,6 +5345,9 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu)
> }
> EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
>
> +/* Flag to ensure proper resetting of atomic variables */
> +#define RESET_BIT (1 << 30)
> +
> /**
> * ring_buffer_reset_online_cpus - reset a ring buffer per CPU buffer
> * @buffer: The ring buffer to reset a per cpu buffer of
> @@ -5361,20 +5364,27 @@ void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
> for_each_online_buffer_cpu(buffer, cpu) {
> cpu_buffer = buffer->buffers[cpu];
>
> - atomic_inc(&cpu_buffer->resize_disabled);
> + atomic_add(RESET_BIT, &cpu_buffer->resize_disabled);
> atomic_inc(&cpu_buffer->record_disabled);
> }
>
> /* Make sure all commits have finished */
> synchronize_rcu();
>
> - for_each_online_buffer_cpu(buffer, cpu) {
> + for_each_buffer_cpu(buffer, cpu) {
> cpu_buffer = buffer->buffers[cpu];
>
> + /*
> + * If a CPU came online during the synchronize_rcu(), then
> + * ignore it.
> + */
> + if (!(atomic_read(&cpu_buffer->resize_disabled) & RESET_BIT))
> + continue;
> +
> reset_disabled_cpu_buffer(cpu_buffer);
>
> atomic_dec(&cpu_buffer->record_disabled);
> - atomic_dec(&cpu_buffer->resize_disabled);
> + atomic_sub(RESET_BIT, &cpu_buffer->resize_disabled);
> }
>
> mutex_unlock(&buffer->mutex);
In ring_buffer_reset_online_cpus, the buffer_size_kb write operation
may permanently fail if the cpu_online_mask changes between two
for_each_online_buffer_cpu loops. The number of increases and decreases
on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be
inconsistent, causing some CPUs to have non-zero values for these atomic
variables after the function returns.
This issue can be reproduced by "echo 0 > trace" while hotplugging a CPU.
Once the issue has been reproduced, buffer_size_kb is no longer
functional.
To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after
ring_buffer_reset_online_cpus returns, we ensure that each atomic variable
has been set up before atomic_sub() to it.
Cc: stable(a)vger.kernel.org
Cc: npiggin(a)gmail.com
Fixes: b23d7a5f4a07 ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reviewed-by: Cheng-Jui Wang <cheng-jui.wang(a)mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu(a)mediatek.com>
---
Changes from v4 to v5: https://lore.kernel.org/lkml/20230412112401.25081-1-Tze-nan.Wu@mediatek.com/
- Move the define before the function
---
kernel/trace/ring_buffer.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 76a2d91eecad..253ef85a9ec3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5345,6 +5345,9 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu)
}
EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
+/* Flag to ensure proper resetting of atomic variables */
+#define RESET_BIT (1 << 30)
+
/**
* ring_buffer_reset_online_cpus - reset a ring buffer per CPU buffer
* @buffer: The ring buffer to reset a per cpu buffer of
@@ -5361,20 +5364,27 @@ void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
for_each_online_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
- atomic_inc(&cpu_buffer->resize_disabled);
+ atomic_add(RESET_BIT, &cpu_buffer->resize_disabled);
atomic_inc(&cpu_buffer->record_disabled);
}
/* Make sure all commits have finished */
synchronize_rcu();
- for_each_online_buffer_cpu(buffer, cpu) {
+ for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
+ /*
+ * If a CPU came online during the synchronize_rcu(), then
+ * ignore it.
+ */
+ if (!(atomic_read(&cpu_buffer->resize_disabled) & RESET_BIT))
+ continue;
+
reset_disabled_cpu_buffer(cpu_buffer);
atomic_dec(&cpu_buffer->record_disabled);
- atomic_dec(&cpu_buffer->resize_disabled);
+ atomic_sub(RESET_BIT, &cpu_buffer->resize_disabled);
}
mutex_unlock(&buffer->mutex);
--
2.18.0
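The RESET_BIT approach in the ring-buffer patch above can be sketched in userspace with C11 atomics. The first pass marks only the buffers whose CPU is online at that moment; the second pass walks every buffer and undoes the mark only where it finds it, so a change in the online mask between the passes cannot unbalance the counter. NR_CPUS, struct cpu_buf and the function names are assumptions made for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS   4
#define RESET_BIT (1 << 30)

/* Hypothetical stand-in for the per-CPU ring buffer. */
struct cpu_buf {
	atomic_int resize_disabled;
	int resets;			/* how often this buffer was reset */
};

struct cpu_buf bufs[NR_CPUS];

/* Pass 1: mark the buffers of the CPUs that are online right now. */
void mark_online(const bool *online)
{
	assert(online);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (online[cpu])
			atomic_fetch_add(&bufs[cpu].resize_disabled, RESET_BIT);
}

/* Pass 2: visit all buffers, but only touch those marked in pass 1. */
void reset_marked(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(atomic_load(&bufs[cpu].resize_disabled) & RESET_BIT))
			continue;	/* came online after pass 1: ignore */
		bufs[cpu].resets++;
		atomic_fetch_sub(&bufs[cpu].resize_disabled, RESET_BIT);
	}
}
```

Because pass 2 never consults the online mask, every atomic_fetch_sub() is paired with an earlier atomic_fetch_add() on the same buffer.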
From: Kai-Heng Feng <kai.heng.feng(a)canonical.com>
commit 08d0cc5f34265d1a1e3031f319f594bd1970976c upstream.
pcie_aspm_pm_state_change() was introduced at the inception of PCIe ASPM
code, but it can cause some issues. For instance, when ASPM config is
changed via sysfs, those changes won't persist across power state change
because pcie_aspm_pm_state_change() overwrites them.
Also, if the driver restores L1SS [1] after system resume, the restored
state will also be overwritten by pcie_aspm_pm_state_change().
Remove pcie_aspm_pm_state_change(). If there's any hardware that really
needs it to function, a quirk can be used instead.
[1] https://lore.kernel.org/linux-pci/20220201123536.12962-1-vidyas@nvidia.com/
Link: https://lore.kernel.org/r/20220509073639.2048236-1-kai.heng.feng@canonical.…
[bhelgaas: remove additional pcie_aspm_pm_state_change() call in
pci_set_low_power_state(), added by
10aa5377fc8a ("PCI/PM: Split pci_raw_set_power_state()") and moved by
7957d201456f ("PCI/PM: Relocate pci_set_low_power_state()")]
Signed-off-by: Kai-Heng Feng <kai.heng.feng(a)canonical.com>
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
[manual backport: pci_set_low_power_state does not exist in v5.15]
Signed-off-by: Mark Hasemeyer <markhas(a)chromium.org>
---
This change is intended for, and has been tested against, 5.15.y.
It is desired because, without it, re-applying ASPM settings has been
observed to crash the system with certain PCI devices
(e.g. Genesys GL9755).
A manual backport was required as `pci_set_low_power_state` does not exist in
v5.15.
Tested by issuing 100 suspend/resume cycles on a symptomatic system running
5.15.107.
Test command:
```
echo +5 > /sys/class/rtc/rtc0/wakealarm && echo freeze > /sys/power/state
```
L1 settings looked identical before and after:
```
localhost ~ # lspci -vvv -d 0x17a0: | grep L1Sub
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
L1SubCtl2: T_PwrOn=3100us
```
drivers/pci/pci.c | 3 ---
drivers/pci/pci.h | 2 --
drivers/pci/pcie/aspm.c | 19 -------------------
3 files changed, 24 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 649df298869c..4aa2e655398c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1140,9 +1140,6 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
if (need_restore)
pci_restore_bars(dev);
- if (dev->bus->self)
- pcie_aspm_pm_state_change(dev->bus->self);
-
return 0;
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 72280e9b23b2..e6ea6e950428 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -595,12 +595,10 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
#ifdef CONFIG_PCIEASPM
void pcie_aspm_init_link_state(struct pci_dev *pdev);
void pcie_aspm_exit_link_state(struct pci_dev *pdev);
-void pcie_aspm_pm_state_change(struct pci_dev *pdev);
void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
#else
static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
-static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
#endif
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 013a47f587ce..b3ad316418f1 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1020,25 +1020,6 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
up_read(&pci_bus_sem);
}
-/* @pdev: the root port or switch downstream port */
-void pcie_aspm_pm_state_change(struct pci_dev *pdev)
-{
- struct pcie_link_state *link = pdev->link_state;
-
- if (aspm_disabled || !link)
- return;
- /*
- * Devices changed PM state, we should recheck if latency
- * meets all functions' requirement
- */
- down_read(&pci_bus_sem);
- mutex_lock(&aspm_lock);
- pcie_update_aspm_capable(link->root);
- pcie_config_aspm_path(link);
- mutex_unlock(&aspm_lock);
- up_read(&pci_bus_sem);
-}
-
void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
{
struct pcie_link_state *link = pdev->link_state;
--
2.40.0.634.g4ca3ef3211-goog
commit 08d0cc5f34265d1a1e3031f319f594bd1970976c upstream.
This change is desired because, without it, re-applying ASPM settings has
been observed to crash the system with certain PCI devices
(e.g. Genesys GL9755).
Tested by issuing 100 suspend/resume cycles on a symptomatic system running
5.15.107.
L1 settings looked identical before and after:
```
localhost ~ # lspci -vvv -d 0x17a0: | grep L1Sub
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
L1SubCtl2: T_PwrOn=3100us
```
Cc: <stable(a)vger.kernel.org> # 5.15.y
This is a resend to add stable list to cc as well as linux-i2c list
which fell off somehow.
On 09:56-20230425, Reid Tonking wrote:
> Hi Andi,
>
> On 14:45-20230425, Andi Shyti wrote:
> > Hi Reid,
> >
> > On Mon, Apr 24, 2023 at 02:53:44PM -0500, Reid Tonking wrote:
> > > Using standard mode, rare false ACK responses were appearing with
> > > i2cdetect tool. This was happening due to NACK interrupt triggering
> > > ISR thread before register access interrupt was ready. Removing the
> > > NACK interrupt's ability to trigger ISR thread lets register access
> > > ready interrupt do this instead.
> > >
> > > Fixes: 3b2f8f82dad7 ("i2c: omap: switch to threaded IRQ support")
> > >
> > > Signed-off-by: Reid Tonking <reidt(a)ti.com>
> >
> > please don't leave any space between Fixes and SoB.
> >
> > Add also:
> >
> > Cc: <stable(a)vger.kernel.org> # v3.7+
> >
> > and Cc the stable list.
> >
> > Andi
> >
>
> Thanks for the feedback, I'll make that change going forward.
>
> -Reid
-Reid
From: David Matlack <dmatlack(a)google.com>
[ Upstream commit 13ec9308a85702af7c31f3638a2720863848a7f2 ]
Read mmu_invalidate_seq before dropping the mmap_lock so that KVM can
detect if the results of vma_lookup() (e.g. vma_shift) become stale
before it acquires kvm->mmu_lock. This fixes a theoretical bug where a
VMA could be changed by userspace after vma_lookup() and before KVM
reads the mmu_invalidate_seq, causing KVM to install page table entries
based on a (possibly) no-longer-valid vma_shift.
Re-order the MMU cache top-up to earlier in user_mem_abort() so that it
is not done after KVM has read mmu_invalidate_seq (i.e. so as to avoid
inducing spurious fault retries).
This bug has existed since KVM/ARM's inception. It's unlikely that any
sane userspace currently modifies VMAs in such a way as to trigger this
race. And even with directed testing I was unable to reproduce it. But a
sufficiently motivated host userspace might be able to exploit this
race.
Fixes: 94f8e6418d39 ("KVM: ARM: Handle guest faults in KVM")
Cc: stable(a)vger.kernel.org # 5.15 only
Reported-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: David Matlack <dmatlack(a)google.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20230313235454.2964067-1-dmatlack@google.com
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
[will: Use FSC_PERM instead of ESR_ELx_FSC_PERM. Read 'mmu_notifier_seq'
instead of 'mmu_invalidate_seq'. Fix up function references in comment.]
Signed-off-by: Will Deacon <will(a)kernel.org>
---
arch/arm64/kvm/mmu.c | 47 ++++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9b465cd55a8d..38a8095744a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -997,6 +997,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
+ /*
+ * Permission faults just need to update the existing leaf entry,
+ * and so normally don't require allocations from the memcache. The
+ * only exception to this is when dirty logging is enabled at runtime
+ * and a write fault needs to collapse a block entry into a table.
+ */
+ if (fault_status != FSC_PERM ||
+ (logging_active && write_fault)) {
+ ret = kvm_mmu_topup_memory_cache(memcache,
+ kvm_mmu_cache_min_pages(kvm));
+ if (ret)
+ return ret;
+ }
+
/*
* Let's check if we will get back a huge page backed by hugetlbfs, or
* get block mapping for device MMIO region.
@@ -1051,36 +1065,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
fault_ipa &= ~(vma_pagesize - 1);
gfn = fault_ipa >> PAGE_SHIFT;
- mmap_read_unlock(current->mm);
-
- /*
- * Permission faults just need to update the existing leaf entry,
- * and so normally don't require allocations from the memcache. The
- * only exception to this is when dirty logging is enabled at runtime
- * and a write fault needs to collapse a block entry into a table.
- */
- if (fault_status != FSC_PERM || (logging_active && write_fault)) {
- ret = kvm_mmu_topup_memory_cache(memcache,
- kvm_mmu_cache_min_pages(kvm));
- if (ret)
- return ret;
- }
- mmu_seq = vcpu->kvm->mmu_notifier_seq;
/*
- * Ensure the read of mmu_notifier_seq happens before we call
- * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
- * the page we just got a reference to gets unmapped before we have a
- * chance to grab the mmu_lock, which ensure that if the page gets
- * unmapped afterwards, the call to kvm_unmap_gfn will take it away
- * from us again properly. This smp_rmb() interacts with the smp_wmb()
- * in kvm_mmu_notifier_invalidate_<page|range_end>.
+ * Read mmu_notifier_seq so that KVM can detect if the results of
+ * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+ * acquiring kvm->mmu_lock.
*
- * Besides, __gfn_to_pfn_memslot() instead of gfn_to_pfn_prot() is
- * used to avoid unnecessary overhead introduced to locate the memory
- * slot because it's always fixed even @gfn is adjusted for huge pages.
+ * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
+ * with the smp_wmb() in kvm_dec_notifier_count().
*/
- smp_rmb();
+ mmu_seq = vcpu->kvm->mmu_notifier_seq;
+ mmap_read_unlock(current->mm);
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
write_fault, &writable, NULL);
--
2.40.0.634.g4ca3ef3211-goog
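The ordering fixed by the KVM patch above is an instance of the "snapshot a sequence count, then revalidate under the lock" pattern. The sketch below is a simplified userspace model and not the KVM code: struct vm, its two pthread mutexes and all function names are hypothetical, and the sequence count is bumped directly by change_vma() rather than from MMU notifier callbacks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct vm {
	pthread_mutex_t mmap_lock;	/* protects vma_shift */
	pthread_mutex_t mmu_lock;	/* protects mapped_shift */
	atomic_ulong seq;		/* bumped on every VMA change */
	int vma_shift;
	int mapped_shift;
};

void vm_init(struct vm *vm, int shift)
{
	pthread_mutex_init(&vm->mmap_lock, NULL);
	pthread_mutex_init(&vm->mmu_lock, NULL);
	atomic_init(&vm->seq, 0);
	vm->vma_shift = shift;
	vm->mapped_shift = 0;
}

/* "Userspace" changes the VMA and invalidates earlier lookups. */
void change_vma(struct vm *vm, int new_shift)
{
	pthread_mutex_lock(&vm->mmap_lock);
	vm->vma_shift = new_shift;
	atomic_fetch_add(&vm->seq, 1);
	pthread_mutex_unlock(&vm->mmap_lock);
}

/* Look up the VMA and snapshot seq BEFORE dropping mmap_lock. */
unsigned long lookup_vma(struct vm *vm, int *shift)
{
	unsigned long seq;

	pthread_mutex_lock(&vm->mmap_lock);
	*shift = vm->vma_shift;
	seq = atomic_load(&vm->seq);
	pthread_mutex_unlock(&vm->mmap_lock);
	return seq;
}

/* Install the mapping only if nothing changed since the lookup. */
bool install_mapping(struct vm *vm, int shift, unsigned long seq)
{
	bool ok;

	pthread_mutex_lock(&vm->mmu_lock);
	ok = atomic_load(&vm->seq) == seq;
	if (ok)
		vm->mapped_shift = shift;
	pthread_mutex_unlock(&vm->mmu_lock);
	return ok;
}
```

If the snapshot were taken after dropping mmap_lock, a change landing in between would go unnoticed, and the stale shift would be installed; a caller seeing false here simply retries the fault.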
From: Dan Carpenter <dan.carpenter(a)linaro.org>
[ Upstream commit a25bc8486f9c01c1af6b6c5657234b2eee2c39d6 ]
The KVM_REG_SIZE() comes from the ioctl and it can be a power of two
between 0-32768 but if it is more than sizeof(long) this will corrupt
memory.
Fixes: 99adb567632b ("KVM: arm/arm64: Add save/restore support for firmware workaround state")
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Reviewed-by: Steven Price <steven.price(a)arm.com>
Reviewed-by: Eric Auger <eric.auger(a)redhat.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/4efbab8c-640f-43b2-8ac6-6d68e08280fe@kili.mountain
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
Cc: stable(a)vger.kernel.org # 5.10 and 5.15
[will: kvm_arm_set_fw_reg() lives in psci.c not hypercalls.c]
Signed-off-by: Will Deacon <will(a)kernel.org>
---
arch/arm64/kvm/psci.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 20ba5136ac3d..32bb26be8a9b 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -499,6 +499,8 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
u64 val;
int wa_level;
+ if (KVM_REG_SIZE(reg->id) != sizeof(val))
+ return -ENOENT;
if (copy_from_user(&val, uaddr, KVM_REG_SIZE(reg->id)))
return -EFAULT;
--
2.40.0.634.g4ca3ef3211-goog
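The check added by the patch above makes sure that a caller-supplied size matches the destination before anything is copied. A minimal userspace sketch of the same idea, with memcpy() standing in for copy_from_user() and a hypothetical set_fw_reg() helper:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a caller-described register value into *reg.
 * claimed_size comes from the (untrusted) caller.
 */
int set_fw_reg(uint64_t *reg, const void *src, size_t claimed_size)
{
	uint64_t val;

	assert(reg && src);
	/* Without this check, a larger claimed_size would overrun val. */
	if (claimed_size != sizeof(val))
		return -ENOENT;
	memcpy(&val, src, claimed_size);
	*reg = val;
	return 0;
}
```

Requiring an exact match, rather than only an upper bound, also rules out short copies that would leave part of val uninitialized.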
When inotify_freeing_mark() races with inotify_handle_inode_event() it
can happen that inotify_handle_inode_event() sees that i_mark->wd got
already reset to -1 and reports this value to userspace which can
confuse the inotify listener. Avoid the problem by validating that wd is
sensible (and pretend the mark got removed before the event got
generated otherwise).
CC: stable(a)vger.kernel.org
Fixes: 7e790dd5fc93 ("inotify: fix error paths in inotify_update_watch")
Reported-by: syzbot+4a06d4373fd52f0b2f9c(a)syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/notify/inotify/inotify_fsnotify.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
I plan to merge this fix through my tree.
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 49cfe2ae6d23..f86d12790cb1 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -65,7 +65,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
struct fsnotify_event *fsn_event;
struct fsnotify_group *group = inode_mark->group;
int ret;
- int len = 0;
+ int len = 0, wd;
int alloc_len = sizeof(struct inotify_event_info);
struct mem_cgroup *old_memcg;
@@ -80,6 +80,13 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
i_mark = container_of(inode_mark, struct inotify_inode_mark,
fsn_mark);
+ /*
+ * We can be racing with mark being detached. Don't report event with
+ * invalid wd.
+ */
+ wd = READ_ONCE(i_mark->wd);
+ if (wd == -1)
+ return 0;
/*
* Whoever is interested in the event, pays for the allocation. Do not
* trigger OOM killer in the target monitoring memcg as it may have
@@ -110,7 +117,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
fsn_event = &event->fse;
fsnotify_init_event(fsn_event);
event->mask = mask;
- event->wd = i_mark->wd;
+ event->wd = wd;
event->sync_cookie = cookie;
event->name_len = len;
if (len)
--
2.35.3
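The inotify fix above reads the racy field exactly once, validates the local copy, and then uses only that copy. A userspace sketch of the pattern, with a relaxed C11 atomic load standing in for READ_ONCE(); struct mark, struct event and fill_event() are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>

struct mark {
	atomic_int wd;		/* set to -1 when the mark is detached */
};

struct event {
	int wd;
};

/* Returns 1 if the event was filled in, 0 if the mark was already gone. */
int fill_event(struct mark *m, struct event *ev)
{
	int wd;

	assert(m && ev);
	wd = atomic_load_explicit(&m->wd, memory_order_relaxed);
	if (wd == -1)
		return 0;	/* pretend the mark was removed first */
	ev->wd = wd;		/* use the snapshot, never re-read m->wd */
	return 1;
}
```

Checking m->wd and then reading it a second time for the assignment would reopen the race, since the value could become -1 between the two reads.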
From: Ziwei Dai <ziwei.dai(a)unisoc.com>
commit 5da7cb193db32da783a3f3e77d8b639989321d48 upstream.
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with a kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)
kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
// There is a previous call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);
|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id)] <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
[UR: backport to 5.10-stable]
[UR: Added missing need_offload_krc() function]
Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai(a)unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
---
kernel/rcu/tree.c | 49 +++++++++++++++++++++++++++++++++--------------
1 file changed, 35 insertions(+), 14 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9cce4e13af41..ab045cad105f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3280,6 +3280,30 @@ static void kfree_rcu_work(struct work_struct *work)
}
}
+static bool
+need_offload_krc(struct kfree_rcu_cpu *krcp)
+{
+ int i;
+
+ for (i = 0; i < FREE_N_CHANNELS; i++)
+ if (krcp->bkvhead[i])
+ return true;
+
+ return !!krcp->head;
+}
+
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+ int i;
+
+ for (i = 0; i < FREE_N_CHANNELS; i++)
+ if (krwp->bkvhead_free[i])
+ return true;
+
+ return !!krwp->head_free;
+}
+
/*
* Schedule the kfree batch RCU work to run in workqueue context after a GP.
*
@@ -3297,16 +3321,13 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
for (i = 0; i < KFREE_N_BATCHES; i++) {
krwp = &(krcp->krw_arr[i]);
- /*
- * Try to detach bkvhead or head and attach it over any
- * available corresponding free channel. It can be that
- * a previous RCU batch is in progress, it means that
- * immediately to queue another one is not possible so
- * return false to tell caller to retry.
- */
- if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
- (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
- (krcp->head && !krwp->head_free)) {
+ // Try to detach bulk_head or head and attach it, only when
+ // all channels are free. Any channel is not free means at krwp
+ // there is on-going rcu work to handle krwp's free business.
+ if (need_wait_for_krwp_work(krwp))
+ continue;
+
+ if (need_offload_krc(krcp)) {
// Channel 1 corresponds to SLAB ptrs.
// Channel 2 corresponds to vmalloc ptrs.
for (j = 0; j < FREE_N_CHANNELS; j++) {
@@ -3333,12 +3354,12 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
*/
queue_rcu_work(system_wq, &krwp->rcu_work);
}
-
- // Repeat if any "free" corresponding channel is still busy.
- if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
- repeat = true;
}
+ // Repeat if any "free" corresponding channel is still busy.
+ if (need_offload_krc(krcp))
+ repeat = true;
+
return !repeat;
}
--
2.30.2
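The rule introduced by the kfree_rcu patch above is "hand over new memory only when every channel of the previous batch is empty". A simplified userspace sketch of that rule; struct pending and struct in_flight loosely mirror krcp and krwp, but the types and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FREE_N_CHANNELS 2

/* Memory not yet handed to RCU (loosely: krcp). */
struct pending {
	void *bulk[FREE_N_CHANNELS];
	void *head;
};

/* Memory already handed to RCU, waiting for its grace period (loosely: krwp). */
struct in_flight {
	void *bulk_free[FREE_N_CHANNELS];
	void *head_free;
};

bool any_pending(const struct pending *p)
{
	for (int i = 0; i < FREE_N_CHANNELS; i++)
		if (p->bulk[i])
			return true;
	return p->head != NULL;
}

bool any_in_flight(const struct in_flight *w)
{
	for (int i = 0; i < FREE_N_CHANNELS; i++)
		if (w->bulk_free[i])
			return true;
	return w->head_free != NULL;
}

/* Move all channels as one unit, and only if the old batch is fully done. */
bool try_offload(struct pending *p, struct in_flight *w)
{
	assert(p && w);
	if (any_in_flight(w) || !any_pending(p))
		return false;
	for (int i = 0; i < FREE_N_CHANNELS; i++) {
		w->bulk_free[i] = p->bulk[i];
		p->bulk[i] = NULL;
	}
	w->head_free = p->head;
	p->head = NULL;
	return true;
}
```

With the old per-channel check, pending memory in channel 1 would have been attached while channel 0 of the same work structure was still waiting, which is the case described in the commit message.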
From: Arınç ÜNAL <arinc.unal(a)arinc9.com>
The multi-chip module MT7530 switch with a 40 MHz oscillator on the
MT7621AT, MT7621DAT, and MT7621ST SoCs forwards corrupt frames using
trgmii.
This is caused by the assumption that MT7621 SoCs have a 150 MHz PLL,
hence using the ncpo1 value, 0x0780.
My testing shows this value works on Unielec U7621-06, but Bartel's testing
shows it won't work on Hi-Link HLK-MT7621A and Netgear WAC104. All devices
tested have 40 MHz oscillators.
Using the value for a 125 MHz PLL, 0x0640, works on all boards at hand. The
definitions for a 125 MHz PLL exist in the Banana Pi BPI-R2 BSP source code
whilst those for a 150 MHz PLL don't.
Forwarding frames using trgmii on the MCM MT7530 switch with a 25 MHz
oscillator on the said MT7621 SoCs works fine because the ncpo1 value
defined for it is for 125 MHz PLL.
Change the 150 MHz PLL comment to 125 MHz PLL, and use the 125 MHz PLL
ncpo1 values for both oscillator frequencies.
Link: https://github.com/BPI-SINOVOIP/BPI-R2-bsp/blob/81d24bbce7d99524d0771a8bdb2…
Fixes: 7ef6f6f8d237 ("net: dsa: mt7530: Add MT7621 TRGMII mode support")
Cc: stable(a)vger.kernel.org
Tested-by: Bartel Eerdekens <bartel.eerdekens(a)constell8.be>
Tested-by: Arınç ÜNAL <arinc.unal(a)arinc9.com>
Signed-off-by: Arınç ÜNAL <arinc.unal(a)arinc9.com>
---
drivers/net/dsa/mt7530.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index c680873819b0..7d9f9563dbda 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -426,9 +426,9 @@ mt7530_pad_clk_setup(struct dsa_switch *ds, phy_interface_t interface)
else
ssc_delta = 0x87;
if (priv->id == ID_MT7621) {
- /* PLL frequency: 150MHz: 1.2GBit */
+ /* PLL frequency: 125MHz: 1.0GBit */
if (xtal == HWTRAP_XTAL_40MHZ)
- ncpo1 = 0x0780;
+ ncpo1 = 0x0640;
if (xtal == HWTRAP_XTAL_25MHZ)
ncpo1 = 0x0a00;
} else { /* PLL frequency: 250MHz: 2.0Gbit */
--
2.37.2
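A quick arithmetic check on the constants in the patch above. This is only an observation about the three values that appear in the diff, not a formula taken from MediaTek documentation: the new 40 MHz value and the existing 25 MHz value give the same ncpo1 * xtal product, while the old 40 MHz value is larger by exactly 150/125. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
	NCPO1_40MHZ_OLD = 0x0780,	/* removed by the patch */
	NCPO1_40MHZ_NEW = 0x0640,	/* added by the patch */
	NCPO1_25MHZ     = 0x0a00,	/* unchanged */
};

/* Product of the ncpo1 setting and the oscillator frequency in MHz. */
uint32_t ncpo1_xtal_product(uint32_t ncpo1, uint32_t xtal_mhz)
{
	assert(xtal_mhz > 0);
	return ncpo1 * xtal_mhz;
}
```

Equal products for the 25 MHz and the new 40 MHz setting are consistent with the commit message's statement that both now target the same 125 MHz PLL.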