This is the start of the stable review cycle for the 4.9.316 release.
There are 25 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 25 May 2022 16:56:55 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.316-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.316-rc1
Yang Yingliang <yangyingliang(a)huawei.com>
net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
Yang Yingliang <yangyingliang(a)huawei.com>
ethernet: tulip: fix missing pci_disable_device() on error in tulip_init_one()
Felix Fietkau <nbd(a)nbd.name>
mac80211: fix rx reordering with non explicit / psmp ack policy
Gleb Chesnokov <Chesnokov.G(a)raidix.com>
scsi: qla2xxx: Fix missed DMA unmap for aborted commands
Thomas Richter <tmricht(a)linux.ibm.com>
perf bench numa: Address compiler error on s390
Kevin Mitchell <kevmitch(a)arista.com>
igb: skip phy status check where unavailable
Ard Biesheuvel <ardb(a)kernel.org>
ARM: 9197/1: spectre-bhb: fix loop8 sequence for Thumb2
Ard Biesheuvel <ardb(a)kernel.org>
ARM: 9196/1: spectre-bhb: enable for Cortex-A15
Jiasheng Jiang <jiasheng(a)iscas.ac.cn>
net: af_key: add check for pfkey_broadcast in function pfkey_process
Duoming Zhou <duoming(a)zju.edu.cn>
NFC: nci: fix sleep in atomic context bugs caused by nci_skb_alloc
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
net/qla3xxx: Fix a test in ql_reset_work()
Zixuan Fu <r33s3n6(a)gmail.com>
net: vmxnet3: fix possible NULL pointer dereference in vmxnet3_rq_cleanup()
Zixuan Fu <r33s3n6(a)gmail.com>
net: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf()
Hangyu Hua <hbh25y(a)gmail.com>
drm/dp/mst: fix a possible memory leak in fetch_monitor_name()
Peter Zijlstra <peterz(a)infradead.org>
perf: Fix sys_perf_event_open() race against self
Takashi Iwai <tiwai(a)suse.de>
ALSA: wavefront: Proper check of get_user() error
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
linyujun <linyujun809(a)huawei.com>
ARM: 9191/1: arm/stacktrace, kasan: Silence KASAN warnings in unwind_frame()
Jakob Koschel <jakobkoschel(a)gmail.com>
drbd: remove usage of list iterator variable after loop
Xiaoke Wang <xkernel.wang(a)foxmail.com>
MIPS: lantiq: check the return value of kzalloc()
Jeff LaBundy <jeff(a)labundy.com>
Input: add bounds checking to input_set_capability()
David Gow <davidgow(a)google.com>
um: Cleanup syscall_handler_t definition/cast, fix warning
Willy Tarreau <w(a)1wt.eu>
floppy: use a statically allocated error counter
-------------
Diffstat:
Makefile | 4 +--
arch/arm/kernel/entry-armv.S | 2 +-
arch/arm/kernel/stacktrace.c | 10 +++---
arch/arm/mm/proc-v7-bugs.c | 1 +
arch/mips/lantiq/falcon/sysctrl.c | 2 ++
arch/mips/lantiq/xway/gptu.c | 2 ++
arch/mips/lantiq/xway/sysctrl.c | 46 +++++++++++++++---------
arch/x86/um/shared/sysdep/syscalls_64.h | 5 ++-
drivers/block/drbd/drbd_main.c | 7 ++--
drivers/block/floppy.c | 19 +++++-----
drivers/gpu/drm/drm_dp_mst_topology.c | 1 +
drivers/input/input.c | 19 ++++++++++
drivers/mmc/card/block.c | 6 ++--
drivers/mmc/core/core.c | 5 ++-
drivers/mmc/core/mmc_ops.c | 9 ++---
drivers/net/ethernet/dec/tulip/tulip_core.c | 5 ++-
drivers/net/ethernet/intel/igb/igb_main.c | 3 +-
drivers/net/ethernet/qlogic/qla3xxx.c | 3 +-
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 4 +--
drivers/net/vmxnet3/vmxnet3_drv.c | 6 ++++
drivers/scsi/qla2xxx/qla_target.c | 3 ++
kernel/events/core.c | 14 ++++++++
net/key/af_key.c | 6 ++--
net/mac80211/rx.c | 3 +-
net/nfc/nci/data.c | 2 +-
net/nfc/nci/hci.c | 4 +--
sound/isa/wavefront/wavefront_synth.c | 3 +-
tools/perf/bench/numa.c | 2 +-
28 files changed, 134 insertions(+), 62 deletions(-)
#regzbot introduced v5.17.3..v5.17.4
#regzbot introduced: 001828fb3084379f3c3e228b905223c50bc237f9
Hello
Since 5.17.4 my laptop no longer resumes from suspend. At resume, the
symptoms vary:
- either the laptop freezes;
- or the screen stays blank;
- or the screen is OK but the mouse is frozen;
- or the display lags, with several logs in dmesg:
[ 228.275492] [drm] Fence fallback timer expired on ring gfx
[ 228.395466] [drm:amdgpu_dm_atomic_commit_tail] *ERROR* Waiting for fences
timed out!
[ 228.779490] [drm] Fence fallback timer expired on ring gfx
[ 229.283484] [drm] Fence fallback timer expired on ring sdma0
[ 229.283485] [drm] Fence fallback timer expired on ring gfx
[ 229.787487] [drm] Fence fallback timer expired on ring gfx
...
I've bisected the problem.
Please note this laptop has always behaved strangely on suspend: the first
suspend request always fails (this has never been fixed, and it got in our
way in the past when diagnosing another regression where the touchpad did
not resume). The screen goes blank, and pressing the power button brings it
back; this seems to reset things. After that, all suspend/resume cycles
work fine.
Since 5.17.4, it is no longer possible to get the laptop working again after
the first suspend failure.
HW: HP Pavilion / Ryzen 4600H with integrated AMD graphics + NVidia 1650Ti
(turned off with an ACPI call to save battery; I'm not using the NVidia
driver).
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From eadb2f47a3ced5c64b23b90fd2a3463f63726066 Mon Sep 17 00:00:00 2001
From: Daniel Thompson <daniel.thompson(a)linaro.org>
Date: Mon, 23 May 2022 19:11:02 +0100
Subject: [PATCH] lockdown: also lock down previous kgdb use
KGDB and KDB allow read and write access to kernel memory, and thus
should be restricted during lockdown. An attacker with access to a
serial port (for example, via a hypervisor console, which some cloud
vendors provide over the network) could trigger the debugger so it is
important that the debugger respect the lockdown mode when/if it is
triggered.
Fix this by integrating lockdown into kdb's existing permissions
mechanism. Unfortunately kgdb does not have any permissions mechanism
(although it certainly could be added later) so, for now, kgdb is simply
and brutally disabled by immediately exiting the gdb stub without taking
any action.
For lockdowns established early in boot (e.g. the normal case) this
should be fine, but on systems where kgdb has set breakpoints before
the lockdown is enacted, "bad things" will happen.
CVE: CVE-2022-21499
Co-developed-by: Stephen Brennan <stephen.s.brennan(a)oracle.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan(a)oracle.com>
Reviewed-by: Douglas Anderson <dianders(a)chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson(a)linaro.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/security.h b/include/linux/security.h
index 25b3ef71f495..7fc4e9f49f54 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -121,10 +121,12 @@ enum lockdown_reason {
LOCKDOWN_DEBUGFS,
LOCKDOWN_XMON_WR,
LOCKDOWN_BPF_WRITE_USER,
+ LOCKDOWN_DBG_WRITE_KERNEL,
LOCKDOWN_INTEGRITY_MAX,
LOCKDOWN_KCORE,
LOCKDOWN_KPROBES,
LOCKDOWN_BPF_READ_KERNEL,
+ LOCKDOWN_DBG_READ_KERNEL,
LOCKDOWN_PERF,
LOCKDOWN_TRACEFS,
LOCKDOWN_XMON_RW,
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index da06a5553835..7beceb447211 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -53,6 +53,7 @@
#include <linux/vmacache.h>
#include <linux/rcupdate.h>
#include <linux/irq.h>
+#include <linux/security.h>
#include <asm/cacheflush.h>
#include <asm/byteorder.h>
@@ -752,6 +753,29 @@ cpu_master_loop:
continue;
kgdb_connected = 0;
} else {
+ /*
+ * This is a brutal way to interfere with the debugger
+ * and prevent gdb being used to poke at kernel memory.
+ * This could cause trouble if lockdown is applied when
+ * there is already an active gdb session. For now the
+ * answer is simply "don't do that". Typically lockdown
+ * *will* be applied before the debug core gets started
+ * so only developers using kgdb for fairly advanced
+ * early kernel debug can be bitten by this. Hopefully
+ * they are sophisticated enough to take care of
+ * themselves, especially with help from the lockdown
+ * message printed on the console!
+ */
+ if (security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL)) {
+ if (IS_ENABLED(CONFIG_KGDB_KDB)) {
+ /* Switch back to kdb if possible... */
+ dbg_kdb_mode = 1;
+ continue;
+ } else {
+ /* ... otherwise just bail */
+ break;
+ }
+ }
error = gdb_serial_stub(ks);
}
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 0852a537dad4..ead4da947127 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -45,6 +45,7 @@
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
+#include <linux/security.h>
#include "kdb_private.h"
#undef MODULE_PARAM_PREFIX
@@ -166,10 +167,62 @@ struct task_struct *kdb_curr_task(int cpu)
}
/*
- * Check whether the flags of the current command and the permissions
- * of the kdb console has allow a command to be run.
+ * Update the permissions flags (kdb_cmd_enabled) to match the
+ * current lockdown state.
+ *
+ * Within this function the calls to security_locked_down() are "lazy". We
+ * avoid calling them if the current value of kdb_cmd_enabled already excludes
+ * flags that might be subject to lockdown. Additionally we deliberately check
+ * the lockdown flags independently (even though read lockdown implies write
+ * lockdown) since that results in both simpler code and clearer messages to
+ * the user on first-time debugger entry.
+ *
+ * The permission masks during a read+write lockdown permit the following
+ * flags: INSPECT, SIGNAL, REBOOT (and ALWAYS_SAFE).
+ *
+ * The INSPECT commands are not blocked during lockdown because they are
+ * not arbitrary memory reads. INSPECT covers the backtrace family (sometimes
+ * forcing them to have no arguments) and lsmod. These commands do expose
+ * some kernel state but do not allow the developer seated at the console to
+ * choose what state is reported. SIGNAL and REBOOT should not be controversial,
+ * given these are allowed for root during lockdown already.
+ */
+static void kdb_check_for_lockdown(void)
+{
+ const int write_flags = KDB_ENABLE_MEM_WRITE |
+ KDB_ENABLE_REG_WRITE |
+ KDB_ENABLE_FLOW_CTRL;
+ const int read_flags = KDB_ENABLE_MEM_READ |
+ KDB_ENABLE_REG_READ;
+
+ bool need_to_lockdown_write = false;
+ bool need_to_lockdown_read = false;
+
+ if (kdb_cmd_enabled & (KDB_ENABLE_ALL | write_flags))
+ need_to_lockdown_write =
+ security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL);
+
+ if (kdb_cmd_enabled & (KDB_ENABLE_ALL | read_flags))
+ need_to_lockdown_read =
+ security_locked_down(LOCKDOWN_DBG_READ_KERNEL);
+
+ /* De-compose KDB_ENABLE_ALL if required */
+ if (need_to_lockdown_write || need_to_lockdown_read)
+ if (kdb_cmd_enabled & KDB_ENABLE_ALL)
+ kdb_cmd_enabled = KDB_ENABLE_MASK & ~KDB_ENABLE_ALL;
+
+ if (need_to_lockdown_write)
+ kdb_cmd_enabled &= ~write_flags;
+
+ if (need_to_lockdown_read)
+ kdb_cmd_enabled &= ~read_flags;
+}
+
+/*
+ * Check whether the flags of the current command, the permissions of the kdb
+ * console and the lockdown state allow a command to be run.
*/
-static inline bool kdb_check_flags(kdb_cmdflags_t flags, int permissions,
+static bool kdb_check_flags(kdb_cmdflags_t flags, int permissions,
bool no_args)
{
/* permissions comes from userspace so needs massaging slightly */
@@ -1180,6 +1233,9 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs,
kdb_curr_task(raw_smp_processor_id());
KDB_DEBUG_STATE("kdb_local 1", reason);
+
+ kdb_check_for_lockdown();
+
kdb_go_count = 0;
if (reason == KDB_REASON_DEBUG) {
/* special case below */
diff --git a/security/security.c b/security/security.c
index b7cf5cbfdc67..aaf6566deb9f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -59,10 +59,12 @@ const char *const lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_DEBUGFS] = "debugfs access",
[LOCKDOWN_XMON_WR] = "xmon write access",
[LOCKDOWN_BPF_WRITE_USER] = "use of bpf to write user RAM",
+ [LOCKDOWN_DBG_WRITE_KERNEL] = "use of kgdb/kdb to write kernel RAM",
[LOCKDOWN_INTEGRITY_MAX] = "integrity",
[LOCKDOWN_KCORE] = "/proc/kcore access",
[LOCKDOWN_KPROBES] = "use of kprobes",
[LOCKDOWN_BPF_READ_KERNEL] = "use of bpf to read kernel RAM",
+ [LOCKDOWN_DBG_READ_KERNEL] = "use of kgdb/kdb to read kernel RAM",
[LOCKDOWN_PERF] = "unsafe use of perf",
[LOCKDOWN_TRACEFS] = "use of tracefs",
[LOCKDOWN_XMON_RW] = "xmon read and write access",
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From eadb2f47a3ced5c64b23b90fd2a3463f63726066 Mon Sep 17 00:00:00 2001
From: Daniel Thompson <daniel.thompson(a)linaro.org>
Date: Mon, 23 May 2022 19:11:02 +0100
Subject: [PATCH] lockdown: also lock down previous kgdb use
KGDB and KDB allow read and write access to kernel memory, and thus
should be restricted during lockdown. An attacker with access to a
serial port (for example, via a hypervisor console, which some cloud
vendors provide over the network) could trigger the debugger so it is
important that the debugger respect the lockdown mode when/if it is
triggered.
Fix this by integrating lockdown into kdb's existing permissions
mechanism. Unfortunately kgdb does not have any permissions mechanism
(although it certainly could be added later) so, for now, kgdb is simply
and brutally disabled by immediately exiting the gdb stub without taking
any action.
For lockdowns established early in boot (e.g. the normal case) this
should be fine, but on systems where kgdb has set breakpoints before
the lockdown is enacted, "bad things" will happen.
CVE: CVE-2022-21499
Co-developed-by: Stephen Brennan <stephen.s.brennan(a)oracle.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan(a)oracle.com>
Reviewed-by: Douglas Anderson <dianders(a)chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson(a)linaro.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/security.h b/include/linux/security.h
index 25b3ef71f495..7fc4e9f49f54 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -121,10 +121,12 @@ enum lockdown_reason {
LOCKDOWN_DEBUGFS,
LOCKDOWN_XMON_WR,
LOCKDOWN_BPF_WRITE_USER,
+ LOCKDOWN_DBG_WRITE_KERNEL,
LOCKDOWN_INTEGRITY_MAX,
LOCKDOWN_KCORE,
LOCKDOWN_KPROBES,
LOCKDOWN_BPF_READ_KERNEL,
+ LOCKDOWN_DBG_READ_KERNEL,
LOCKDOWN_PERF,
LOCKDOWN_TRACEFS,
LOCKDOWN_XMON_RW,
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index da06a5553835..7beceb447211 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -53,6 +53,7 @@
#include <linux/vmacache.h>
#include <linux/rcupdate.h>
#include <linux/irq.h>
+#include <linux/security.h>
#include <asm/cacheflush.h>
#include <asm/byteorder.h>
@@ -752,6 +753,29 @@ cpu_master_loop:
continue;
kgdb_connected = 0;
} else {
+ /*
+ * This is a brutal way to interfere with the debugger
+ * and prevent gdb being used to poke at kernel memory.
+ * This could cause trouble if lockdown is applied when
+ * there is already an active gdb session. For now the
+ * answer is simply "don't do that". Typically lockdown
+ * *will* be applied before the debug core gets started
+ * so only developers using kgdb for fairly advanced
+ * early kernel debug can be bitten by this. Hopefully
+ * they are sophisticated enough to take care of
+ * themselves, especially with help from the lockdown
+ * message printed on the console!
+ */
+ if (security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL)) {
+ if (IS_ENABLED(CONFIG_KGDB_KDB)) {
+ /* Switch back to kdb if possible... */
+ dbg_kdb_mode = 1;
+ continue;
+ } else {
+ /* ... otherwise just bail */
+ break;
+ }
+ }
error = gdb_serial_stub(ks);
}
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 0852a537dad4..ead4da947127 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -45,6 +45,7 @@
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
+#include <linux/security.h>
#include "kdb_private.h"
#undef MODULE_PARAM_PREFIX
@@ -166,10 +167,62 @@ struct task_struct *kdb_curr_task(int cpu)
}
/*
- * Check whether the flags of the current command and the permissions
- * of the kdb console has allow a command to be run.
+ * Update the permissions flags (kdb_cmd_enabled) to match the
+ * current lockdown state.
+ *
+ * Within this function the calls to security_locked_down() are "lazy". We
+ * avoid calling them if the current value of kdb_cmd_enabled already excludes
+ * flags that might be subject to lockdown. Additionally we deliberately check
+ * the lockdown flags independently (even though read lockdown implies write
+ * lockdown) since that results in both simpler code and clearer messages to
+ * the user on first-time debugger entry.
+ *
+ * The permission masks during a read+write lockdown permit the following
+ * flags: INSPECT, SIGNAL, REBOOT (and ALWAYS_SAFE).
+ *
+ * The INSPECT commands are not blocked during lockdown because they are
+ * not arbitrary memory reads. INSPECT covers the backtrace family (sometimes
+ * forcing them to have no arguments) and lsmod. These commands do expose
+ * some kernel state but do not allow the developer seated at the console to
+ * choose what state is reported. SIGNAL and REBOOT should not be controversial,
+ * given these are allowed for root during lockdown already.
+ */
+static void kdb_check_for_lockdown(void)
+{
+ const int write_flags = KDB_ENABLE_MEM_WRITE |
+ KDB_ENABLE_REG_WRITE |
+ KDB_ENABLE_FLOW_CTRL;
+ const int read_flags = KDB_ENABLE_MEM_READ |
+ KDB_ENABLE_REG_READ;
+
+ bool need_to_lockdown_write = false;
+ bool need_to_lockdown_read = false;
+
+ if (kdb_cmd_enabled & (KDB_ENABLE_ALL | write_flags))
+ need_to_lockdown_write =
+ security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL);
+
+ if (kdb_cmd_enabled & (KDB_ENABLE_ALL | read_flags))
+ need_to_lockdown_read =
+ security_locked_down(LOCKDOWN_DBG_READ_KERNEL);
+
+ /* De-compose KDB_ENABLE_ALL if required */
+ if (need_to_lockdown_write || need_to_lockdown_read)
+ if (kdb_cmd_enabled & KDB_ENABLE_ALL)
+ kdb_cmd_enabled = KDB_ENABLE_MASK & ~KDB_ENABLE_ALL;
+
+ if (need_to_lockdown_write)
+ kdb_cmd_enabled &= ~write_flags;
+
+ if (need_to_lockdown_read)
+ kdb_cmd_enabled &= ~read_flags;
+}
+
+/*
+ * Check whether the flags of the current command, the permissions of the kdb
+ * console and the lockdown state allow a command to be run.
*/
-static inline bool kdb_check_flags(kdb_cmdflags_t flags, int permissions,
+static bool kdb_check_flags(kdb_cmdflags_t flags, int permissions,
bool no_args)
{
/* permissions comes from userspace so needs massaging slightly */
@@ -1180,6 +1233,9 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs,
kdb_curr_task(raw_smp_processor_id());
KDB_DEBUG_STATE("kdb_local 1", reason);
+
+ kdb_check_for_lockdown();
+
kdb_go_count = 0;
if (reason == KDB_REASON_DEBUG) {
/* special case below */
diff --git a/security/security.c b/security/security.c
index b7cf5cbfdc67..aaf6566deb9f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -59,10 +59,12 @@ const char *const lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_DEBUGFS] = "debugfs access",
[LOCKDOWN_XMON_WR] = "xmon write access",
[LOCKDOWN_BPF_WRITE_USER] = "use of bpf to write user RAM",
+ [LOCKDOWN_DBG_WRITE_KERNEL] = "use of kgdb/kdb to write kernel RAM",
[LOCKDOWN_INTEGRITY_MAX] = "integrity",
[LOCKDOWN_KCORE] = "/proc/kcore access",
[LOCKDOWN_KPROBES] = "use of kprobes",
[LOCKDOWN_BPF_READ_KERNEL] = "use of bpf to read kernel RAM",
+ [LOCKDOWN_DBG_READ_KERNEL] = "use of kgdb/kdb to read kernel RAM",
[LOCKDOWN_PERF] = "unsafe use of perf",
[LOCKDOWN_TRACEFS] = "use of tracefs",
[LOCKDOWN_XMON_RW] = "xmon read and write access",
commit 7e0815b3e09986d2fe651199363e135b9358132a upstream.
When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then
PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to
XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI
layer.
This can lead to a situation where the PCI/MSI layer masks an MSI[-X]
interrupt and the hypervisor grants the write despite the fact that it
already requested the interrupt. As a consequence, interrupt delivery on
the affected device never happens.
Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests
already.
Fixes: 809f9267bbab ("xen: map MSIs into pirqs")
Reported-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Reported-by: Dusty Mabe <dustymabe(a)redhat.com>
Reported-by: Salvatore Bonaccorso <carnil(a)debian.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Noah Meyerhans <noahm(a)debian.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglx
[nmeyerha(a)amazon.com: backported to 4.14]
Signed-off-by: Noah Meyerhans <nmeyerha(a)amazon.com>
---
arch/x86/pci/xen.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index c4b3646bd04c..7135f35f9de7 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -442,6 +442,11 @@ void __init xen_msi_init(void)
x86_msi.setup_msi_irqs = xen_hvm_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
+ /*
+ * With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely
+ * controlled by the hypervisor.
+ */
+ pci_msi_ignore_mask = 1;
}
#endif
--
2.25.1
From: Jitao Shi <jitao.shi(a)mediatek.com>
To comply with the panel sequence, hold the MIPI signal at LP00 before the
DCS command transmission, and only pull it high from LP00 to LP11 right
before the DCS command transmission starts.
The normal panel timing is:
(1) pp1800 DC pull up
(2) avdd & avee AC pull high
(3) lcm_reset pull high -> pull low -> pull high
(4) Pull MIPI signal high (LP11) -> initial code -> send video data (HS mode)
The power-off sequence is reversed.
If DSI is not in command mode, the MIPI signal is pulled high in the
mtk_output_dsi_enable function.
The delay in mtk_dsi_lane_ready() is the reaction time dsi_rx needs after
the MIPI signal is pulled up.
Fixes: 2dd8075d2185 ("drm/mediatek: mtk_dsi: Use the drm_panel_bridge API")
Cc: <stable(a)vger.kernel.org> # 5.10.x: b255d51e3967: drm/mediatek: Modify dsi funcs to atomic operations
Cc: <stable(a)vger.kernel.org> # 5.10.x: 72c69c977502: drm/mediatek: Separate poweron/poweroff from enable/disable and define new funcs
Cc: <stable(a)vger.kernel.org> # 5.10.x
Signed-off-by: Jitao Shi <jitao.shi(a)mediatek.com>
Signed-off-by: Xinlei Lee <xinlei.lee(a)mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
---
drivers/gpu/drm/mediatek/mtk_dsi.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/mediatek/mtk_dsi.c b/drivers/gpu/drm/mediatek/mtk_dsi.c
index d9a6b928dba8..25e84d9426bf 100644
--- a/drivers/gpu/drm/mediatek/mtk_dsi.c
+++ b/drivers/gpu/drm/mediatek/mtk_dsi.c
@@ -203,6 +203,7 @@ struct mtk_dsi {
struct mtk_phy_timing phy_timing;
int refcount;
bool enabled;
+ bool lanes_ready;
u32 irq_data;
wait_queue_head_t irq_wait_queue;
const struct mtk_dsi_driver_data *driver_data;
@@ -661,18 +662,11 @@ static int mtk_dsi_poweron(struct mtk_dsi *dsi)
mtk_dsi_reset_engine(dsi);
mtk_dsi_phy_timconfig(dsi);
- mtk_dsi_rxtx_control(dsi);
- usleep_range(30, 100);
- mtk_dsi_reset_dphy(dsi);
mtk_dsi_ps_control_vact(dsi);
mtk_dsi_set_vm_cmd(dsi);
mtk_dsi_config_vdo_timing(dsi);
mtk_dsi_set_interrupt_enable(dsi);
- mtk_dsi_clk_ulp_mode_leave(dsi);
- mtk_dsi_lane0_ulp_mode_leave(dsi);
- mtk_dsi_clk_hs_mode(dsi, 0);
-
return 0;
err_disable_engine_clk:
clk_disable_unprepare(dsi->engine_clk);
@@ -701,6 +695,23 @@ static void mtk_dsi_poweroff(struct mtk_dsi *dsi)
clk_disable_unprepare(dsi->digital_clk);
phy_power_off(dsi->phy);
+
+ dsi->lanes_ready = false;
+}
+
+static void mtk_dsi_lane_ready(struct mtk_dsi *dsi)
+{
+ if (!dsi->lanes_ready) {
+ dsi->lanes_ready = true;
+ mtk_dsi_rxtx_control(dsi);
+ usleep_range(30, 100);
+ mtk_dsi_reset_dphy(dsi);
+ mtk_dsi_clk_ulp_mode_leave(dsi);
+ mtk_dsi_lane0_ulp_mode_leave(dsi);
+ mtk_dsi_clk_hs_mode(dsi, 0);
+ msleep(20);
+ /* The reaction time after pulling up the mipi signal for dsi_rx */
+ }
}
static void mtk_output_dsi_enable(struct mtk_dsi *dsi)
@@ -708,6 +719,7 @@ static void mtk_output_dsi_enable(struct mtk_dsi *dsi)
if (dsi->enabled)
return;
+ mtk_dsi_lane_ready(dsi);
mtk_dsi_set_mode(dsi);
mtk_dsi_clk_hs_mode(dsi, 1);
@@ -1017,6 +1029,8 @@ static ssize_t mtk_dsi_host_transfer(struct mipi_dsi_host *host,
if (MTK_DSI_HOST_IS_READ(msg->type))
irq_flag |= LPRX_RD_RDY_INT_FLAG;
+ mtk_dsi_lane_ready(dsi);
+
ret = mtk_dsi_host_send_cmd(dsi, msg, irq_flag);
if (ret)
goto restore_dsi_mode;
--
2.18.0
The patch titled
Subject: hugetlb: fix huge_pmd_unshare address update
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
hugetlb-fix-huge_pmd_unshare-address-update.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlb: fix huge_pmd_unshare address update
Date: Tue, 24 May 2022 13:50:03 -0700
The routine huge_pmd_unshare() is passed a pointer to an address
associated with an area which may be unshared. If unshare is successful
this address is updated to 'optimize' callers iterating over huge page
addresses. For the optimization to work correctly, address should be
updated to the last huge page in the unmapped/unshared area. However, in
the common case where the passed address is PUD_SIZE aligned, the address
is incorrectly updated to the address of the preceding huge page. That
wastes CPU cycles as the unmapped/unshared range is scanned twice.
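To make the off-by-one concrete, here is a worked example (my numbers, not
part of the patch; I assume x86-64's PMD_SIZE = 2 MiB and PUD_SIZE = 1 GiB,
and note that HPAGE_SIZE * PTRS_PER_PTE == PUD_SIZE in this context):

  *addr = 0x40000000;  /* PUD aligned; pud_clear() unmaps 0x40000000..0x7fffffff */

  /* old: ALIGN(*addr, PUD_SIZE) - PMD_SIZE
   *      = 0x40000000 - 0x200000   = 0x3fe00000  (huge page *before* the area)
   * new: *addr |= PUD_SIZE - PMD_SIZE
   *      = 0x40000000 | 0x3fe00000 = 0x7fe00000  (*last* huge page in the area)
   */

With the new value, the caller's next 'addr += PMD_SIZE' step lands on
0x80000000, the first page past the cleared area, instead of rescanning it.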
Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/hugetlb.c~hugetlb-fix-huge_pmd_unshare-address-update
+++ a/mm/hugetlb.c
@@ -6562,7 +6562,14 @@ int huge_pmd_unshare(struct mm_struct *m
pud_clear(pud);
put_page(virt_to_page(ptep));
mm_dec_nr_pmds(mm);
- *addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
+ /*
+ * This update of passed address optimizes loops sequentially
+ * processing addresses in increments of huge page size (PMD_SIZE
+ * in this case). By clearing the pud, a PUD_SIZE area is unmapped.
+ * Update address to the 'last page' in the cleared area so that
+ * calling loop can move to first page past this area.
+ */
+ *addr |= PUD_SIZE - PMD_SIZE;
return 1;
}
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlb-fix-huge_pmd_unshare-address-update.patch
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9f46c187e2e680ecd9de7983e4d081c3391acc76 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Fri, 20 May 2022 13:48:11 -0400
Subject: [PATCH] KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.
There are other possibilities:
- check for CR0.PG, because KVM (like all Intel processors after P5)
flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
nop with paging disabled
- check for EFER.LMA, because KVM syncs and flushes when switching
MMU contexts outside of 64-bit mode
All of these are tricky; go for the simple solution. This is CVE-2022-1789.
Reported-by: Yongkang Jia <kangel(a)zju.edu.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 56ebc4fb7f91..45e1573f8f1d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5470,14 +5470,16 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
uint i;
if (pcid == kvm_get_active_pcid(vcpu)) {
- mmu->invlpg(vcpu, gva, mmu->root.hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->root.hpa);
tlb_flush = true;
}
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd)) {
- mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
tlb_flush = true;
}
}
This is the start of the stable review cycle for the 4.19.245 release.
There are 44 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 25 May 2022 16:56:55 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.245-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.245-rc1
David Howells <dhowells(a)redhat.com>
afs: Fix afs_getattr() to refetch file status if callback break occurred
Linus Torvalds <torvalds(a)linux-foundation.org>
Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""
Halil Pasic <pasic(a)linux.ibm.com>
swiotlb: fix info leak with DMA_FROM_DEVICE
Grant Grundler <grundler(a)chromium.org>
net: atlantic: verify hw_head_ lies within TX buffer ring
Yang Yingliang <yangyingliang(a)huawei.com>
net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
Yang Yingliang <yangyingliang(a)huawei.com>
ethernet: tulip: fix missing pci_disable_device() on error in tulip_init_one()
Felix Fietkau <nbd(a)nbd.name>
mac80211: fix rx reordering with non explicit / psmp ack policy
Gleb Chesnokov <Chesnokov.G(a)raidix.com>
scsi: qla2xxx: Fix missed DMA unmap for aborted commands
Thomas Richter <tmricht(a)linux.ibm.com>
perf bench numa: Address compiler error on s390
Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
gpio: mvebu/pwm: Refuse requests with inverted polarity
Haibo Chen <haibo.chen(a)nxp.com>
gpio: gpio-vf610: do not touch other bits when set the target bit
Andrew Lunn <andrew(a)lunn.ch>
net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
Kevin Mitchell <kevmitch(a)arista.com>
igb: skip phy status check where unavailable
Ard Biesheuvel <ardb(a)kernel.org>
ARM: 9197/1: spectre-bhb: fix loop8 sequence for Thumb2
Ard Biesheuvel <ardb(a)kernel.org>
ARM: 9196/1: spectre-bhb: enable for Cortex-A15
Jiasheng Jiang <jiasheng(a)iscas.ac.cn>
net: af_key: add check for pfkey_broadcast in function pfkey_process
Maxim Mikityanskiy <maximmi(a)nvidia.com>
net/mlx5e: Properly block LRO when XDP is enabled
Duoming Zhou <duoming(a)zju.edu.cn>
NFC: nci: fix sleep in atomic context bugs caused by nci_skb_alloc
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
net/qla3xxx: Fix a test in ql_reset_work()
Codrin Ciubotariu <codrin.ciubotariu(a)microchip.com>
clk: at91: generated: consider range when calculating best rate
Zixuan Fu <r33s3n6(a)gmail.com>
net: vmxnet3: fix possible NULL pointer dereference in vmxnet3_rq_cleanup()
Zixuan Fu <r33s3n6(a)gmail.com>
net: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf()
Paolo Abeni <pabeni(a)redhat.com>
net/sched: act_pedit: sanitize shift argument before usage
Harini Katakam <harini.katakam(a)xilinx.com>
net: macb: Increment rx bd head after allocating skb and buffer
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: block: Use generic_cmd6_time when modifying INAND_CMD38_ARG_EXT_CSD
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Cleanup BKOPS support
Hangyu Hua <hbh25y(a)gmail.com>
drm/dp/mst: fix a possible memory leak in fetch_monitor_name()
Ondrej Mosnacek <omosnace(a)redhat.com>
crypto: qcom-rng - fix infinite loop on requests not multiple of WORD_SZ
Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold
Al Viro <viro(a)zeniv.linux.org.uk>
Fix double fget() in vhost_net_set_backend()
Peter Zijlstra <peterz(a)infradead.org>
perf: Fix sys_perf_event_open() race against self
Takashi Iwai <tiwai(a)suse.de>
ALSA: wavefront: Proper check of get_user() error
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix lockdep warnings during disk space reclamation
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix lockdep warnings in page operations for btree nodes
linyujun <linyujun809(a)huawei.com>
ARM: 9191/1: arm/stacktrace, kasan: Silence KASAN warnings in unwind_frame()
Jakob Koschel <jakobkoschel(a)gmail.com>
drbd: remove usage of list iterator variable after loop
Xiaoke Wang <xkernel.wang(a)foxmail.com>
MIPS: lantiq: check the return value of kzalloc()
Zheng Yongjun <zhengyongjun3(a)huawei.com>
crypto: stm32 - fix reference leak in stm32_crc_remove
Zheng Yongjun <zhengyongjun3(a)huawei.com>
Input: stmfts - fix reference leak in stmfts_input_open
Jeff LaBundy <jeff(a)labundy.com>
Input: add bounds checking to input_set_capability()
David Gow <davidgow(a)google.com>
um: Cleanup syscall_handler_t definition/cast, fix warning
Willy Tarreau <w(a)1wt.eu>
floppy: use a statically allocated error counter
-------------
Diffstat:
Makefile | 4 +-
arch/arm/kernel/entry-armv.S | 2 +-
arch/arm/kernel/stacktrace.c | 10 +-
arch/arm/mm/proc-v7-bugs.c | 1 +
arch/mips/lantiq/falcon/sysctrl.c | 2 +
arch/mips/lantiq/xway/gptu.c | 2 +
arch/mips/lantiq/xway/sysctrl.c | 46 +++---
arch/x86/um/shared/sysdep/syscalls_64.h | 5 +-
drivers/block/drbd/drbd_main.c | 7 +-
drivers/block/floppy.c | 17 +--
drivers/clk/at91/clk-generated.c | 4 +
drivers/crypto/qcom-rng.c | 1 +
drivers/crypto/stm32/stm32_crc32.c | 4 +-
drivers/gpio/gpio-mvebu.c | 3 +
drivers/gpio/gpio-vf610.c | 8 +-
drivers/gpu/drm/drm_dp_mst_topology.c | 1 +
drivers/input/input.c | 19 +++
drivers/input/touchscreen/stmfts.c | 8 +-
drivers/mmc/core/block.c | 8 +-
drivers/mmc/core/card.h | 6 +-
drivers/mmc/core/mmc.c | 6 -
drivers/mmc/core/mmc_ops.c | 110 ++++----------
drivers/mmc/core/mmc_ops.h | 3 +-
.../ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 7 +
drivers/net/ethernet/cadence/macb_main.c | 2 +-
drivers/net/ethernet/dec/tulip/tulip_core.c | 5 +-
drivers/net/ethernet/intel/igb/igb_main.c | 3 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +
drivers/net/ethernet/qlogic/qla3xxx.c | 3 +-
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 4 +-
drivers/net/vmxnet3/vmxnet3_drv.c | 6 +
drivers/pci/pci.c | 10 ++
drivers/scsi/qla2xxx/qla_target.c | 3 +
drivers/vhost/net.c | 15 +-
fs/afs/inode.c | 14 +-
fs/nilfs2/btnode.c | 23 ++-
fs/nilfs2/btnode.h | 1 +
fs/nilfs2/btree.c | 27 ++--
fs/nilfs2/dat.c | 4 +-
fs/nilfs2/gcinode.c | 7 +-
fs/nilfs2/inode.c | 159 +++++++++++++++++++--
fs/nilfs2/mdt.c | 43 ++++--
fs/nilfs2/mdt.h | 6 +-
fs/nilfs2/nilfs.h | 16 +--
fs/nilfs2/page.c | 7 +-
fs/nilfs2/segment.c | 9 +-
fs/nilfs2/super.c | 5 +-
kernel/dma/swiotlb.c | 12 +-
kernel/events/core.c | 14 ++
net/bridge/br_input.c | 7 +
net/key/af_key.c | 6 +-
net/mac80211/rx.c | 3 +-
net/nfc/nci/data.c | 2 +-
net/nfc/nci/hci.c | 4 +-
net/sched/act_pedit.c | 4 +
sound/isa/wavefront/wavefront_synth.c | 3 +-
tools/perf/bench/numa.c | 2 +-
57 files changed, 483 insertions(+), 237 deletions(-)
commit 7e0815b3e09986d2fe651199363e135b9358132a upstream.
When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then
PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to
XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI
layer.
This can lead to a situation where the PCI/MSI layer masks an MSI[-X]
interrupt and the hypervisor grants the write despite the fact that it
already requested the interrupt. As a consequence, interrupt delivery on
the affected device never happens.
Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests
already.
Fixes: 809f9267bbab ("xen: map MSIs into pirqs")
Reported-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Reported-by: Dusty Mabe <dustymabe(a)redhat.com>
Reported-by: Salvatore Bonaccorso <carnil(a)debian.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Noah Meyerhans <noahm(a)debian.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglx
[nmeyerha(a)amazon.com: backported to 5.4]
Signed-off-by: Noah Meyerhans <nmeyerha(a)amazon.com>
---
arch/x86/pci/xen.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 5c11ae66b5d8..9cf8f5417e7f 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -442,6 +442,11 @@ void __init xen_msi_init(void)
x86_msi.setup_msi_irqs = xen_hvm_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
+ /*
+ * With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely
+ * controlled by the hypervisor.
+ */
+ pci_msi_ignore_mask = 1;
}
#endif
--
2.25.1
commit 7e0815b3e09986d2fe651199363e135b9358132a upstream.
When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then
PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to
XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI
layer.
This can lead to a situation where the PCI/MSI layer masks an MSI[-X]
interrupt and the hypervisor grants the write despite the fact that it
already requested the interrupt. As a consequence, interrupt delivery on
the affected device never happens.
Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests
already.
Fixes: 809f9267bbab ("xen: map MSIs into pirqs")
Reported-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Reported-by: Dusty Mabe <dustymabe(a)redhat.com>
Reported-by: Salvatore Bonaccorso <carnil(a)debian.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Noah Meyerhans <noahm(a)debian.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglx
[nmeyerha(a)amazon.com: backported to 4.19]
Signed-off-by: Noah Meyerhans <nmeyerha(a)amazon.com>
---
arch/x86/pci/xen.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 22da9bfd8a45..bacf8d988f65 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -441,6 +441,11 @@ void __init xen_msi_init(void)
x86_msi.setup_msi_irqs = xen_hvm_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
+ /*
+ * With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely
+ * controlled by the hypervisor.
+ */
+ pci_msi_ignore_mask = 1;
}
#endif
--
2.25.1
From: Eric Dumazet <edumazet(a)google.com>
commit 190cc82489f46f9d88e73c81a47e14f80a791e1a upstream
RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
provides a good summary of why source port selection needs extra care.
David Dworken reminded us that Linux implements Algorithm 3,
as described in RFC 6056 section 3.3.3.
Quoting David:
In the context of the web, this creates an interesting info leak where
websites can count how many TCP connections a user's computer is
establishing over time. For example, this allows a website to count
exactly how many subresources a third party website loaded.
This also allows:
- Distinguishing between different users behind a VPN based on
distinct source port ranges.
- Tracking users over time across multiple networks.
- Covert communication channels between different browsers/browser
profiles running on the same computer
- Tracking what applications are running on a computer based on
the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement that reduces an
attacker's ability to use the basic information currently
stored in the shared 'u32 hint'.
This change also decreases the collision rate when
multiple applications need to connect() to
different destinations.
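As a rough sketch (mine, not part of the patch; names match the diff below,
and hash_32() refers to the kernel's <linux/hash.h> helper), the
per-destination perturbation replaces the single shared 'hint' like this:

  /* 256 independent u32 counters (1 KB total) instead of one shared hint */
  #define INET_TABLE_PERTURB_SHIFT 8
  static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];

  /* port_offset is a keyed hash of (saddr, daddr, dport), so different
   * destinations usually land in different buckets: */
  index  = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
  offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining;
  /* ...and on success only that bucket's counter advances:
   *    table_perturb[index] += i + 2;
   */

A third party timing its own connections thus no longer observes a single
global counter shared across all destinations.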
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: David Dworken <ddworken(a)google.com>
Cc: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
[SG: Adjusted context]
Signed-off-by: Stefan Ghinea <stefan.ghinea(a)windriver.com>
---
net/ipv4/inet_hashtables.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c96a5871b49d..da9537ab3b98 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -714,6 +714,17 @@ void inet_unhash(struct sock *sk)
}
EXPORT_SYMBOL_GPL(inet_unhash);
+/* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm
+ * Note that we use 32bit integers (vs RFC 'short integers')
+ * because 2^16 is not a multiple of num_ephemeral and this
+ * property might be used by clever attacker.
+ * RFC claims using TABLE_LENGTH=10 buckets gives an improvement,
+ * we use 256 instead to really give more isolation and
+ * privacy, this only consumes 1 KB of kernel memory.
+ */
+#define INET_TABLE_PERTURB_SHIFT 8
+static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
+
int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk, u32 port_offset,
int (*check_established)(struct inet_timewait_death_row *,
@@ -727,7 +738,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct inet_bind_bucket *tb;
u32 remaining, offset;
int ret, i, low, high;
- static u32 hint;
+ u32 index;
if (port) {
head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -752,7 +763,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
if (likely(remaining > 1))
remaining &= ~1U;
- offset = (hint + port_offset) % remaining;
+ net_get_random_once(table_perturb, sizeof(table_perturb));
+ index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
+
+ offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining;
/* In first pass we try ports of @low parity.
* inet_csk_get_port() does the opposite choice.
*/
@@ -805,7 +819,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
return -EADDRNOTAVAIL;
ok:
- hint += i + 2;
+ WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, port);
--
2.36.1
From: Eric Dumazet <edumazet(a)google.com>
commit 190cc82489f46f9d88e73c81a47e14f80a791e1a upstream
RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
provides a good summary of why source port selection needs extra care.
David Dworken reminded us that Linux implements Algorithm 3,
as described in RFC 6056 section 3.3.3.
Quoting David:
In the context of the web, this creates an interesting info leak where
websites can count how many TCP connections a user's computer is
establishing over time. For example, this allows a website to count
exactly how many subresources a third party website loaded.
This also allows:
- Distinguishing between different users behind a VPN based on
distinct source port ranges.
- Tracking users over time across multiple networks.
- Covert communication channels between different browsers/browser
profiles running on the same computer
- Tracking what applications are running on a computer based on
the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement that reduces an
attacker's ability to use the basic information currently
stored in the shared 'u32 hint'.
This change also decreases the collision rate when
multiple applications need to connect() to
different destinations.
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: David Dworken <ddworken(a)google.com>
Cc: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Stefan Ghinea <stefan.ghinea(a)windriver.com>
---
net/ipv4/inet_hashtables.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index cbbeb0eea0c3..dbfcefc264d6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -671,6 +671,17 @@ void inet_unhash(struct sock *sk)
}
EXPORT_SYMBOL_GPL(inet_unhash);
+/* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm
+ * Note that we use 32bit integers (vs RFC 'short integers')
+ * because 2^16 is not a multiple of num_ephemeral and this
+ * property might be used by clever attacker.
+ * RFC claims using TABLE_LENGTH=10 buckets gives an improvement,
+ * we use 256 instead to really give more isolation and
+ * privacy, this only consumes 1 KB of kernel memory.
+ */
+#define INET_TABLE_PERTURB_SHIFT 8
+static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
+
int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk, u32 port_offset,
int (*check_established)(struct inet_timewait_death_row *,
@@ -684,8 +695,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct inet_bind_bucket *tb;
u32 remaining, offset;
int ret, i, low, high;
- static u32 hint;
int l3mdev;
+ u32 index;
if (port) {
head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -712,7 +723,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
if (likely(remaining > 1))
remaining &= ~1U;
- offset = (hint + port_offset) % remaining;
+ net_get_random_once(table_perturb, sizeof(table_perturb));
+ index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
+
+ offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining;
/* In first pass we try ports of @low parity.
* inet_csk_get_port() does the opposite choice.
*/
@@ -766,7 +780,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
return -EADDRNOTAVAIL;
ok:
- hint += i + 2;
+ WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, port);
--
2.36.1
From: Eric Dumazet <edumazet(a)google.com>
commit 190cc82489f46f9d88e73c81a47e14f80a791e1a upstream
RFC 6056 (Recommendations for Transport-Protocol Port Randomization)
provides a good summary of why source port selection needs extra care.
David Dworken reminded us that Linux implements Algorithm 3,
as described in RFC 6056 section 3.3.3.
Quoting David:
In the context of the web, this creates an interesting info leak where
websites can count how many TCP connections a user's computer is
establishing over time. For example, this allows a website to count
exactly how many subresources a third party website loaded.
This also allows:
- Distinguishing between different users behind a VPN based on
distinct source port ranges.
- Tracking users over time across multiple networks.
- Covert communication channels between different browsers/browser
profiles running on the same computer
- Tracking what applications are running on a computer based on
the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement that reduces an
attacker's ability to use the basic information currently
stored in the shared 'u32 hint'.
This change also decreases the collision rate when
multiple applications need to connect() to
different destinations.
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: David Dworken <ddworken(a)google.com>
Cc: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Stefan Ghinea <stefan.ghinea(a)windriver.com>
---
net/ipv4/inet_hashtables.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 915b8e1bd9ef..3beaf9e84cf2 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -722,6 +722,17 @@ void inet_unhash(struct sock *sk)
}
EXPORT_SYMBOL_GPL(inet_unhash);
+/* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm
+ * Note that we use 32bit integers (vs RFC 'short integers')
+ * because 2^16 is not a multiple of num_ephemeral and this
+ * property might be used by clever attacker.
+ * RFC claims using TABLE_LENGTH=10 buckets gives an improvement,
+ * we use 256 instead to really give more isolation and
+ * privacy, this only consumes 1 KB of kernel memory.
+ */
+#define INET_TABLE_PERTURB_SHIFT 8
+static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
+
int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk, u32 port_offset,
int (*check_established)(struct inet_timewait_death_row *,
@@ -735,8 +746,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct inet_bind_bucket *tb;
u32 remaining, offset;
int ret, i, low, high;
- static u32 hint;
int l3mdev;
+ u32 index;
if (port) {
head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -763,7 +774,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
if (likely(remaining > 1))
remaining &= ~1U;
- offset = (hint + port_offset) % remaining;
+ net_get_random_once(table_perturb, sizeof(table_perturb));
+ index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
+
+ offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining;
/* In first pass we try ports of @low parity.
* inet_csk_get_port() does the opposite choice.
*/
@@ -817,7 +831,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
return -EADDRNOTAVAIL;
ok:
- hint += i + 2;
+ WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, port);
--
2.36.1
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9f46c187e2e680ecd9de7983e4d081c3391acc76 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Fri, 20 May 2022 13:48:11 -0400
Subject: [PATCH] KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.
There are other possibilities:
- check for CR0.PG, because KVM (like all Intel processors after P5)
flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
nop with paging disabled
- check for EFER.LMA, because KVM syncs and flushes when switching
MMU contexts outside of 64-bit mode
All of these are tricky; go for the simple solution. This is CVE-2022-1789.
Reported-by: Yongkang Jia <kangel(a)zju.edu.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 56ebc4fb7f91..45e1573f8f1d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5470,14 +5470,16 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
uint i;
if (pcid == kvm_get_active_pcid(vcpu)) {
- mmu->invlpg(vcpu, gva, mmu->root.hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->root.hpa);
tlb_flush = true;
}
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd)) {
- mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
tlb_flush = true;
}
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9f46c187e2e680ecd9de7983e4d081c3391acc76 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Fri, 20 May 2022 13:48:11 -0400
Subject: [PATCH] KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.
There are other possibilities:
- check for CR0.PG, because KVM (like all Intel processors after P5)
flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
nop with paging disabled
- check for EFER.LMA, because KVM syncs and flushes when switching
MMU contexts outside of 64-bit mode
All of these are tricky; go for the simple solution. This is CVE-2022-1789.
Reported-by: Yongkang Jia <kangel(a)zju.edu.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 56ebc4fb7f91..45e1573f8f1d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5470,14 +5470,16 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
uint i;
if (pcid == kvm_get_active_pcid(vcpu)) {
- mmu->invlpg(vcpu, gva, mmu->root.hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->root.hpa);
tlb_flush = true;
}
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd)) {
- mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
+ if (mmu->invlpg)
+ mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa);
tlb_flush = true;
}
}
The bug is here:
if (!p)
return ret;
The list iterator value 'p' will *always* be set and non-NULL by
list_for_each_entry(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element is found.
To fix the bug, use a new variable 'iter' as the list iterator, and use
the old variable 'p' as a dedicated pointer to the found element.
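For illustration, a minimal standalone program showing why the iterator
is never NULL after the loop (list_for_each_entry() reduced to its
container_of() core; GNU C typeof is assumed):

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* the kernel macro, reduced to its essentials */
#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, typeof(*pos), member))

struct item { int val; struct list_head list; };

int main(void)
{
	struct list_head head = { &head, &head };	/* empty list */
	struct item *p;

	list_for_each_entry(p, &head, list)
		if (p->val == 42)
			break;

	/* The loop body never ran, yet p is not NULL: it holds an
	 * address computed back from &head itself, so "if (!p)" can
	 * never detect the not-found case. */
	printf("p = %p (bogus but non-NULL)\n", (void *)p);
	return 0;
}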
Cc: stable(a)vger.kernel.org
Fixes: dfaa973ae9605 ("KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839f..0cb20ee6a632 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -360,13 +360,15 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot *memslot,
struct kvm *kvm, unsigned long *gfn)
{
- struct kvmppc_uvmem_slot *p;
+ struct kvmppc_uvmem_slot *p = NULL, *iter;
bool ret = false;
unsigned long i;
- list_for_each_entry(p, &kvm->arch.uvmem_pfns, list)
- if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns)
+ list_for_each_entry(iter, &kvm->arch.uvmem_pfns, list)
+ if (*gfn >= iter->base_pfn && *gfn < iter->base_pfn + iter->nr_pfns) {
+ p = iter;
break;
+ }
if (!p)
return ret;
/*
--
2.17.1
The journal no-space deadlock has been reported from time to time. Such
a deadlock can happen in the following situation.
When all journal buckets are fully filled by active jsets under heavy
write I/O load, cache set registration (after a reboot) will load all
active jsets and insert them into the btree again (this is called
journal replay). If a journaled bkey is inserted into a btree node and
results in a btree node split, a new journal request might be
triggered. For example, if the btree grows one more level after the
node split, the root node record in the cache device super block will
be updated by bch_journal_meta() from bch_btree_set_root(). But since
there is no space left in the journal buckets, the journal replay has
to wait for a new journal bucket to be reclaimed, which in turn
requires at least one journal bucket to be replayed first. This is one
example of how the journal no-space deadlock happens.
The solution is to reserve 1 journal bucket at run time, and only
permit the reserved journal bucket to be used during the cache set
registration procedure, for things like journal replay. Then the
journal space will never be fully filled, and the journal no-space
deadlock can no longer happen.
This patch adds a new member "bool do_reserve" to struct journal. It is
initialized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() once all initialization is done
in run_cache_set(). At run time, when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket left and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() returns 0 to indicate
there is no free journal bucket. journal_reclaim() then gives up and
retries later to see whether a free journal bucket is available. By
this method, there is always 1 journal bucket reserved at run time.
During cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.
Reported-by: Nikhil Kshirsagar <nkshirsagar(a)gmail.com>
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
---
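As an aside, a standalone sketch of the ring arithmetic behind
free_journal_buckets() (simplified names and illustrative values, not
the driver code):

#include <stdbool.h>
#include <stdio.h>

/* Simplified model: cur_idx chases discard_idx around a ring of
 * nbuckets journal buckets; one bucket is always in use, and one
 * more is held back once do_reserve is set. */
static unsigned int free_buckets(unsigned int nbuckets,
				 unsigned int cur_idx,
				 unsigned int discard_idx,
				 bool do_reserve)
{
	unsigned int n;

	if (cur_idx >= discard_idx)	/* wrapped around the ring */
		n = nbuckets + discard_idx - cur_idx;
	else
		n = discard_idx - cur_idx;

	if (n > (1u + do_reserve))
		return n - (1u + do_reserve);
	return 0;
}

int main(void)
{
	/* 8 buckets, writer at 6, discard at 2: distance 4, minus the
	 * in-use bucket and the reserved one -> 2 usable */
	printf("%u\n", free_buckets(8, 6, 2, true));

	/* distance 2: only the in-use and the reserved bucket remain,
	 * so run time sees 0 and journal_reclaim() backs off */
	printf("%u\n", free_buckets(8, 1, 3, true));

	/* during registration do_reserve is false, so the reserved
	 * bucket is still usable for journal replay */
	printf("%u\n", free_buckets(8, 1, 3, false));
	return 0;
}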
drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
drivers/md/bcache/journal.h | 2 ++
drivers/md/bcache/super.c | 1 +
3 files changed, 29 insertions(+), 5 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
return ret;
}
+void bch_journal_space_reserve(struct journal *j)
+{
+ j->do_reserve = true;
+}
+
/* Journalling */
static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
}
}
+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+ struct journal *j = &c->journal;
+ struct cache *ca = c->cache;
+ struct journal_device *ja = &c->cache->journal;
+ unsigned int n;
+
+ /* In case njournal_buckets is not power of 2 */
+ if (ja->cur_idx >= ja->discard_idx)
+ n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+ else
+ n = ja->discard_idx - ja->cur_idx;
+
+ if (n > (1 + j->do_reserve))
+ return n - (1 + j->do_reserve);
+
+ return 0;
+}
+
static void journal_reclaim(struct cache_set *c)
{
struct bkey *k = &c->journal.key;
struct cache *ca = c->cache;
uint64_t last_seq;
- unsigned int next;
struct journal_device *ja = &ca->journal;
atomic_t p __maybe_unused;
@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
if (c->journal.blocks_free)
goto out;
- next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
- /* No space available on this device */
- if (next == ja->discard_idx)
+ if (!free_journal_buckets(c))
goto out;
- ja->cur_idx = next;
+ ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
k->ptr[0] = MAKE_PTR(0,
bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
spinlock_t lock;
spinlock_t flush_write_lock;
bool btree_flushing;
+ bool do_reserve;
/* used when waiting because the journal was full */
struct closure_waitlist wait;
struct closure io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);
void bch_journal_free(struct cache_set *c);
int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);
#endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)
flash_devs_run(c);
+ bch_journal_space_reserve(&c->journal);
set_bit(CACHE_SET_RUNNING, &c->flags);
return 0;
err:
--
2.35.3
After making bch_sectors_dirty_init() multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() doesn't
release its btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
bkeys. Because a read lock is taken on the btree root node to prevent
the btree from being split during the dirty sector counting, other I/O
requesters have no chance to gain the write lock, even when
bcache_btree() restarts.
That is to say, the incremental dirty sector counting is incompatible
with the multithreaded bch_sectors_dirty_init(). We have to choose one
and drop the other.
In my testing, with 512-byte random writes, I generated 1.2T of dirty
data and a btree with 400K nodes. With a single thread and incremental
dirty sector counting, it takes 30+ minutes to register the backing
device. With multithreaded dirty sector counting, the backing device
registration can be accomplished within 2 minutes.
The 30+ minutes vs. 2 minutes difference made me decide to keep the
multithreaded bch_sectors_dirty_init() and drop the incremental dirty
sector counting. This is what this patch does.
But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU is
released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This avoids the watchdog reporting a bogus soft lockup
warning.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
---
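As an aside, a trivial standalone sketch of the yield cadence described
above (illustrative only; yield_cpu() stands in for the kernel's
cond_resched()):

#include <stdio.h>

#define INIT_KEYS_EACH_TIME 500000

/* stand-in for cond_resched(): just report when we would yield */
static void yield_cpu(size_t count)
{
	printf("yield after %zu keys\n", count);
}

int main(void)
{
	size_t count = 0, nkeys = 1600000;

	while (count < nkeys) {
		count++;			/* "iterate one bkey" */
		if (!(count % INIT_KEYS_EACH_TIME))
			yield_cpu(count);	/* kernel: cond_resched() */
	}
	return 0;
}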
drivers/md/bcache/writeback.c | 41 +++++++++++------------------------
1 file changed, 13 insertions(+), 28 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index d24c09490f8e..75b71199800d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -805,13 +805,11 @@ static int bch_writeback_thread(void *arg)
/* Init */
#define INIT_KEYS_EACH_TIME 500000
-#define INIT_KEYS_SLEEP_MS 100
struct sectors_dirty_init {
struct btree_op op;
unsigned int inode;
size_t count;
- struct bkey start;
};
static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -827,11 +825,8 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
KEY_START(k), KEY_SIZE(k));
op->count++;
- if (atomic_read(&b->c->search_inflight) &&
- !(op->count % INIT_KEYS_EACH_TIME)) {
- bkey_copy_key(&op->start, k);
- return -EAGAIN;
- }
+ if (!(op->count % INIT_KEYS_EACH_TIME))
+ cond_resched();
return MAP_CONTINUE;
}
@@ -846,24 +841,16 @@ static int bch_root_node_dirty_init(struct cache_set *c,
bch_btree_op_init(&op.op, -1);
op.inode = d->id;
op.count = 0;
- op.start = KEY(op.inode, 0, 0);
-
- do {
- ret = bcache_btree(map_keys_recurse,
- k,
- c->root,
- &op.op,
- &op.start,
- sectors_dirty_init_fn,
- 0);
- if (ret == -EAGAIN)
- schedule_timeout_interruptible(
- msecs_to_jiffies(INIT_KEYS_SLEEP_MS));
- else if (ret < 0) {
- pr_warn("sectors dirty init failed, ret=%d!\n", ret);
- break;
- }
- } while (ret == -EAGAIN);
+
+ ret = bcache_btree(map_keys_recurse,
+ k,
+ c->root,
+ &op.op,
+ &KEY(op.inode, 0, 0),
+ sectors_dirty_init_fn,
+ 0);
+ if (ret < 0)
+ pr_warn("sectors dirty init failed, ret=%d!\n", ret);
return ret;
}
@@ -907,7 +894,6 @@ static int bch_dirty_init_thread(void *arg)
goto out;
}
skip_nr--;
- cond_resched();
}
if (p) {
@@ -917,7 +903,6 @@ static int bch_dirty_init_thread(void *arg)
p = NULL;
prev_idx = cur_idx;
- cond_resched();
}
out:
@@ -956,11 +941,11 @@ void bch_sectors_dirty_init(struct bcache_device *d)
bch_btree_op_init(&op.op, -1);
op.inode = d->id;
op.count = 0;
- op.start = KEY(op.inode, 0, 0);
for_each_key_filter(&c->root->keys,
k, &iter, bch_ptr_invalid)
sectors_dirty_init_fn(&op.op, c->root, k);
+
rw_unlock(0, c->root);
return;
}
--
2.35.3
Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be
multithreaded") made bch_sectors_dirty_init() much faster when counting
dirty sectors by iterating all dirty keys in the btree. But it isn't in
ideal shape yet and can still be improved.
This patch makes the following changes to improve the current parallel
dirty key iteration on the btree:
- Add a read lock on the root node when multiple threads iterate the
  btree, to prevent the root node from being split by I/Os from other
  registered bcache devices.
- Remove the local variable "char name[32]" and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on the stack
  and avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough in
  practice.
- Increment &state->started to count a created kernel thread only after
  it has been created successfully.
- When waiting for all dirty key counting threads to finish, use
  wait_event() instead of wait_event_interruptible() (a sketch after
  the "---" marker below illustrates why).
With the above changes, the code is clearer, and some potential error
conditions are avoided.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
---
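As an aside, a userspace analogue of the lifetime rule behind the
wait_event() change (illustrative, not driver code): once the state
moves to the stack, the function must not return, not even on a signal,
while workers can still touch that stack frame.

#include <pthread.h>
#include <stdio.h>

struct state {
	pthread_t	thr;
	int		done;
	pthread_mutex_t	m;
	pthread_cond_t	c;
};

static void *worker(void *arg)
{
	struct state *s = arg;	/* points into the creator's stack */

	pthread_mutex_lock(&s->m);
	s->done = 1;
	pthread_cond_signal(&s->c);
	pthread_mutex_unlock(&s->m);
	return NULL;
}

int main(void)
{
	struct state s = {
		.done = 0,
		.m = PTHREAD_MUTEX_INITIALIZER,
		.c = PTHREAD_COND_INITIALIZER,
	};

	pthread_create(&s.thr, NULL, worker, &s);

	/* Must not return early: 's' lives on this stack frame. An
	 * interruptible wait that bails out on a signal would let the
	 * frame die while the worker still dereferences it. */
	pthread_mutex_lock(&s.m);
	while (!s.done)
		pthread_cond_wait(&s.c, &s.m);
	pthread_mutex_unlock(&s.m);

	pthread_join(s.thr, NULL);
	puts("ok");
	return 0;
}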
drivers/md/bcache/writeback.c | 62 ++++++++++++++---------------------
drivers/md/bcache/writeback.h | 2 +-
2 files changed, 26 insertions(+), 38 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 9ee0005874cd..d24c09490f8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -948,10 +948,10 @@ void bch_sectors_dirty_init(struct bcache_device *d)
struct btree_iter iter;
struct sectors_dirty_init op;
struct cache_set *c = d->c;
- struct bch_dirty_init_state *state;
- char name[32];
+ struct bch_dirty_init_state state;
/* Just count root keys if no leaf node */
+ rw_lock(0, c->root, c->root->level);
if (c->root->level == 0) {
bch_btree_op_init(&op.op, -1);
op.inode = d->id;
@@ -961,54 +961,42 @@ void bch_sectors_dirty_init(struct bcache_device *d)
for_each_key_filter(&c->root->keys,
k, &iter, bch_ptr_invalid)
sectors_dirty_init_fn(&op.op, c->root, k);
+ rw_unlock(0, c->root);
return;
}
- state = kzalloc(sizeof(struct bch_dirty_init_state), GFP_KERNEL);
- if (!state) {
- pr_warn("sectors dirty init failed: cannot allocate memory\n");
- return;
- }
-
- state->c = c;
- state->d = d;
- state->total_threads = bch_btre_dirty_init_thread_nr();
- state->key_idx = 0;
- spin_lock_init(&state->idx_lock);
- atomic_set(&state->started, 0);
- atomic_set(&state->enough, 0);
- init_waitqueue_head(&state->wait);
-
- for (i = 0; i < state->total_threads; i++) {
- /* Fetch latest state->enough earlier */
+ state.c = c;
+ state.d = d;
+ state.total_threads = bch_btre_dirty_init_thread_nr();
+ state.key_idx = 0;
+ spin_lock_init(&state.idx_lock);
+ atomic_set(&state.started, 0);
+ atomic_set(&state.enough, 0);
+ init_waitqueue_head(&state.wait);
+
+ for (i = 0; i < state.total_threads; i++) {
+ /* Fetch latest state.enough earlier */
smp_mb__before_atomic();
- if (atomic_read(&state->enough))
+ if (atomic_read(&state.enough))
break;
- state->infos[i].state = state;
- atomic_inc(&state->started);
- snprintf(name, sizeof(name), "bch_dirty_init[%d]", i);
-
- state->infos[i].thread =
- kthread_run(bch_dirty_init_thread,
- &state->infos[i],
- name);
- if (IS_ERR(state->infos[i].thread)) {
+ state.infos[i].state = &state;
+ state.infos[i].thread =
+ kthread_run(bch_dirty_init_thread, &state.infos[i],
+ "bch_dirtcnt[%d]", i);
+ if (IS_ERR(state.infos[i].thread)) {
pr_err("fails to run thread bch_dirty_init[%d]\n", i);
for (--i; i >= 0; i--)
- kthread_stop(state->infos[i].thread);
+ kthread_stop(state.infos[i].thread);
goto out;
}
+ atomic_inc(&state.started);
}
- /*
- * Must wait for all threads to stop.
- */
- wait_event_interruptible(state->wait,
- atomic_read(&state->started) == 0);
-
out:
- kfree(state);
+ /* Must wait for all threads to stop. */
+ wait_event(state.wait, atomic_read(&state.started) == 0);
+ rw_unlock(0, c->root);
}
void bch_cached_dev_writeback_init(struct cached_dev *dc)
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 02b2f9df73f6..31df716951f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -20,7 +20,7 @@
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
-#define BCH_DIRTY_INIT_THRD_MAX 64
+#define BCH_DIRTY_INIT_THRD_MAX 12
/*
* 14 (16384ths) is chosen here as something that each backing device
* should be a reasonable fraction of the share, and not to blow up
--
2.35.3
Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") made bch_btree_check() much faster when checking all
btree nodes during cache device registration. But it isn't in ideal
shape yet and can still be improved.
This patch makes the following changes to improve the current parallel
btree node checking by multiple threads in bch_btree_check():
- Add a read lock on the root node while checking all the btree nodes
  with multiple threads. Although it is not currently mandatory, it is
  good to have a read lock in the code logic.
- Remove the local variable 'char name[32]', and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state" on
  the stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough in
  practice.
- Increment check_state->started to count a created kernel thread only
  after it has been created successfully (a sketch after the "---"
  marker below illustrates why the ordering matters).
- When waiting for all checking kernel threads to finish, use
  wait_event() instead of wait_event_interruptible().
With this change, the code is clearer, and some potential error
conditions are avoided.
Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
---
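As an aside, a small standalone illustration of the started-counter
ordering (simplified, not driver code): counting before a failed
creation would leave a count that never drains back to zero, so the
final wait_event() would hang.

#include <stdio.h>

static int create_worker(int i)
{
	return i == 2 ? -1 : 0;		/* pretend thread 2 fails */
}

int main(void)
{
	int started = 0, i;

	for (i = 0; i < 4; i++) {
		/* The wrong order would be started++ before
		 * create_worker(): the failed slot would be counted
		 * but never run, so the count could never drain. */
		if (create_worker(i) < 0)
			break;
		started++;		/* count only live workers */
	}

	/* each live worker decrements 'started' when it exits; the
	 * creator then waits for started == 0 */
	printf("waiting for %d workers\n", started);
	return 0;
}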
drivers/md/bcache/btree.c | 58 ++++++++++++++++++---------------------
drivers/md/bcache/btree.h | 2 +-
2 files changed, 27 insertions(+), 33 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ad9f16689419..2362bb8ef6d1 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c)
int i;
struct bkey *k = NULL;
struct btree_iter iter;
- struct btree_check_state *check_state;
- char name[32];
+ struct btree_check_state check_state;
/* check and mark root node keys */
for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid)
@@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c)
if (c->root->level == 0)
return 0;
- check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL);
- if (!check_state)
- return -ENOMEM;
-
- check_state->c = c;
- check_state->total_threads = bch_btree_chkthread_nr();
- check_state->key_idx = 0;
- spin_lock_init(&check_state->idx_lock);
- atomic_set(&check_state->started, 0);
- atomic_set(&check_state->enough, 0);
- init_waitqueue_head(&check_state->wait);
+ check_state.c = c;
+ check_state.total_threads = bch_btree_chkthread_nr();
+ check_state.key_idx = 0;
+ spin_lock_init(&check_state.idx_lock);
+ atomic_set(&check_state.started, 0);
+ atomic_set(&check_state.enough, 0);
+ init_waitqueue_head(&check_state.wait);
+ rw_lock(0, c->root, c->root->level);
/*
* Run multiple threads to check btree nodes in parallel,
- * if check_state->enough is non-zero, it means current
+ * if check_state.enough is non-zero, it means current
* running check threads are enough, unncessary to create
* more.
*/
- for (i = 0; i < check_state->total_threads; i++) {
- /* fetch latest check_state->enough earlier */
+ for (i = 0; i < check_state.total_threads; i++) {
+ /* fetch latest check_state.enough earlier */
smp_mb__before_atomic();
- if (atomic_read(&check_state->enough))
+ if (atomic_read(&check_state.enough))
break;
- check_state->infos[i].result = 0;
- check_state->infos[i].state = check_state;
- snprintf(name, sizeof(name), "bch_btrchk[%u]", i);
- atomic_inc(&check_state->started);
+ check_state.infos[i].result = 0;
+ check_state.infos[i].state = &check_state;
- check_state->infos[i].thread =
+ check_state.infos[i].thread =
kthread_run(bch_btree_check_thread,
- &check_state->infos[i],
- name);
- if (IS_ERR(check_state->infos[i].thread)) {
+ &check_state.infos[i],
+ "bch_btrchk[%d]", i);
+ if (IS_ERR(check_state.infos[i].thread)) {
pr_err("fails to run thread bch_btrchk[%d]\n", i);
for (--i; i >= 0; i--)
- kthread_stop(check_state->infos[i].thread);
+ kthread_stop(check_state.infos[i].thread);
ret = -ENOMEM;
goto out;
}
+ atomic_inc(&check_state.started);
}
/*
* Must wait for all threads to stop.
*/
- wait_event_interruptible(check_state->wait,
- atomic_read(&check_state->started) == 0);
+ wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
- for (i = 0; i < check_state->total_threads; i++) {
- if (check_state->infos[i].result) {
- ret = check_state->infos[i].result;
+ for (i = 0; i < check_state.total_threads; i++) {
+ if (check_state.infos[i].result) {
+ ret = check_state.infos[i].result;
goto out;
}
}
out:
- kfree(check_state);
+ rw_unlock(0, c->root);
return ret;
}
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 50482107134f..1b5fdbc0d83e 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -226,7 +226,7 @@ struct btree_check_info {
int result;
};
-#define BCH_BTR_CHKTHREAD_MAX 64
+#define BCH_BTR_CHKTHREAD_MAX 12
struct btree_check_state {
struct cache_set *c;
int total_threads;
--
2.35.3
Syzbot found a corrupted-list bug scenario that can be triggered from
cgroup css_create(). The reproducer writes to the cgroup.subtree_control
file, which invokes cgroup_apply_control_enable(), css_create(), and
css_populate_dir(), which then randomly fails with a fault-injected
-ENOMEM.
In such a scenario the css_create() error path RCU-enqueues
css_free_rwork_fn work for a css->refcnt initialized with the
css_release() destructor, and there is a chance that css_release()
will be invoked for a cgroup_subsys_state for which a destroy_work has
already been queued via the css_create() error path. This causes
list_add corruption, as can be seen in the syzkaller report [1].
This can be avoided by adding a check to css_release() that verifies
whether the work has already been enqueued.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465d…
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Zefan Li <lizefan.x(a)bytedance.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Alexei Starovoitov <ast(a)kernel.org>
Cc: Daniel Borkmann <daniel(a)iogearbox.net>
Cc: Andrii Nakryiko <andrii(a)kernel.org>
Cc: Martin KaFai Lau <kafai(a)fb.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Yonghong Song <yhs(a)fb.com>
Cc: John Fastabend <john.fastabend(a)gmail.com>
Cc: KP Singh <kpsingh(a)kernel.org>
Cc: <cgroups(a)vger.kernel.org>
Cc: <netdev(a)vger.kernel.org>
Cc: <bpf(a)vger.kernel.org>
Cc: <stable(a)vger.kernel.org>
Cc: <linux-kernel(a)vger.kernel.org>
Reported-by: syzbot+e42ae441c3b10acf9e9d(a)syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk(a)linaro.org>
---
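As an aside, a standalone C11 analogue of the idempotent-queueing idea
(illustrative only; the patch itself keys off the work item's
WORK_STRUCT_PENDING_BIT rather than a separate flag):

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag pending = ATOMIC_FLAG_INIT;

/* Whichever path gets here first queues the work; the second path
 * sees the flag already set and backs off, so the same work item is
 * never put on the list twice. */
static void queue_destroy_work(const char *path)
{
	if (!atomic_flag_test_and_set(&pending))
		printf("%s: queued\n", path);
	else
		printf("%s: already pending, skipped\n", path);
}

int main(void)
{
	queue_destroy_work("css_create() error path");
	queue_destroy_work("css_release()");
	return 0;
}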
kernel/cgroup/cgroup.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..9ae2de29f8c9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,11 @@ static void css_release(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);
- INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+ work_data_bits(&css->destroy_work))) {
+ INIT_WORK(&css->destroy_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
+ }
}
static void init_and_link_css(struct cgroup_subsys_state *css,
--
2.35.1
From: Halil Pasic <pasic(a)linux.ibm.com>
commit ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e upstream.
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.
A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
interface with: dxfer_len == 524288, dxdfer_dir == SG_DXFER_FROM_DEV
and a corresponding dxferp. The peculiar thing about this is that TUR
is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
bounces the user-space buffer. As if the device was to transfer into
it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") we make sure this first bounce buffer is
allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
device won't touch the buffer we prepare as if we had a
DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
and the buffer allocated by SG is mapped by the function
virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
scatter-gather and not scsi generics). This mapping involves bouncing
via the swiotlb (we need swiotlb to do virtio in protected guests like
s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
(that is swiotlb) bounce buffer (which most likely contains some
previous IO data), to the first bounce buffer, which contains all
zeros. Then we copy back the content of the first bounce buffer to
the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
isn't all zeros, and fails.
One can argue that this is a swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in a sense that
it does not affect the outcome (if all other participants are well
behaved).
Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
[OP: backport to 4.14: apply swiotlb_tbl_map_single() changes in lib/swiotlb.c]
Signed-off-by: Ovidiu Panait <ovidiu.panait(a)windriver.com>
---
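As an aside, a hedged sketch of how a driver would opt in to the new
attribute (illustrative fragment, not part of this patch; the function
name map_rx_buffer is made up):

#include <linux/dma-mapping.h>

/* Illustrative only: a driver that knows its device overwrites the
 * whole buffer can pass DMA_ATTR_OVERWRITE so the swiotlb may skip
 * the defensive bounce on map for DMA_FROM_DEVICE. */
static int map_rx_buffer(struct device *dev, void *buf, size_t len,
			 dma_addr_t *handle)
{
	*handle = dma_map_single_attrs(dev, buf, len, DMA_FROM_DEVICE,
				       DMA_ATTR_OVERWRITE);
	if (dma_mapping_error(dev, *handle))
		return -ENOMEM;
	return 0;
}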
Documentation/DMA-attributes.txt | 8 ++++++++
include/linux/dma-mapping.h | 8 ++++++++
lib/swiotlb.c | 3 ++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 8f8d97f65d73..7193505a98ca 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -156,3 +156,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 9aee5f345e29..93fa253f2a37 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -70,6 +70,14 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform.
* It can be given to a device to use as a DMA source or target. A CPU cannot
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index e73617b11af1..5796aa1e5cbd 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -601,7 +601,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+ (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+ dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(orig_addr, tlb_addr, size, DMA_TO_DEVICE);
return tlb_addr;
--
2.36.1
From: Linus Torvalds <torvalds(a)linux-foundation.org>
commit 901c7280ca0d5e2b4a8929fbe0bfb007ac2a6544 upstream.
Halil Pasic points out [1] that the full revert of that commit (revert
in bddac7c1e02b), and that a partial revert that only reverts the
problematic case, but still keeps some of the cleanups is probably
better.
And that partial revert [2] had already been verified by Oleksandr
Natalenko to also fix the issue, I had just missed that in the long
discussion.
So let's reinstate the cleanups from commit aa6f8dcbab47 ("swiotlb:
rework "fix info leak with DMA_FROM_DEVICE""), and effectively only
revert the part that caused problems.
Link: https://lore.kernel.org/all/20220328013731.017ae3e3.pasic@linux.ibm.com/ [1]
Link: https://lore.kernel.org/all/20220324055732.GB12078@lst.de/ [2]
Link: https://lore.kernel.org/all/4386660.LvFx2qVVIh@natalenko.name/ [3]
Suggested-by: Halil Pasic <pasic(a)linux.ibm.com>
Tested-by: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Cc: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
[OP: backport to 5.10: adjusted context]
Signed-off-by: Ovidiu Panait <ovidiu.panait(a)windriver.com>
---
This is part of CVE-2022-0854 patchset:
[1] ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE")
[2] 901c7280ca0d ("Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""")
[1] is already present in 5.10-stable.
[2] is present in 5.17/5.16/5.15, but not in the 5.10 and 5.4 branches.
Documentation/core-api/dma-attributes.rst | 8 --------
include/linux/dma-mapping.h | 8 --------
kernel/dma/swiotlb.c | 12 ++++++++----
3 files changed, 8 insertions(+), 20 deletions(-)
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 17706dc91ec9..1887d92e8e92 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,11 +130,3 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
-
-DMA_ATTR_OVERWRITE
-------------------
-
-This is a hint to the DMA-mapping subsystem that the device is expected to
-overwrite the entire mapped size, thus the caller does not require any of the
-previous buffer contents to be preserved. This allows bounce-buffering
-implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a9361178c5db..a7d70cdee25e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,14 +61,6 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
-/*
- * This is a hint to the DMA-mapping subsystem that the device is expected
- * to overwrite the entire mapped size, thus the caller does not require any
- * of the previous buffer contents to be preserved. This allows
- * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
- */
-#define DMA_ATTR_OVERWRITE (1UL << 10)
-
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 62b1e5fa8673..0882ebf9fcd3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -597,10 +597,14 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
io_tlb_orig_addr[index + i] = slot_addr(orig_addr, i);
tlb_addr = slot_addr(io_tlb_start, index) + offset;
- if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
- dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_TO_DEVICE);
+ /*
+ * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
+ * to the tlb buffer, if we knew for sure the device will
+ * overwirte the entire current content. But we don't. Thus
+ * unconditional bounce may prevent leaking swiotlb content (i.e.
+ * kernel memory) to user-space.
+ */
+ swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_TO_DEVICE);
return tlb_addr;
}
--
2.36.1
Hi,
this mainline patch 33121347fb1c359bd6e3e680b9f2c6ced5734a8 should be
applied to 5.15 as well.
Without it, loading of some modules fails if:
1. MODULE_UNLOAD=n
2. Architecture is aarch64 (maybe others as well)
3. KASLR is active
Without this patch the symbol .exit.text is not relocated, and when the
linker generates a relative 32-bit relocation (PREL32) and the module is
loaded far enough away from the default loading address, it triggers a
relocation overflow like this:
module algif_hash: overflow in relocation type 261 val ffff800010051c20
This happens to all modules that use BUG in the exit section, or when
the compiler generates a jump table in the exit section.
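For intuition, a small standalone check of the limited reach of a PREL32
entry (the addresses below are illustrative; one is borrowed from the
error message above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* a PREL32 slot stores the signed 32-bit delta target - place */
	uint64_t place  = 0xffff800010051c20ULL;  /* from the report */
	uint64_t target = 0xffffffc010000000ULL;  /* hypothetical module */
	int64_t delta   = (int64_t)(target - place);

	/* the delta must round-trip through 32 bits, i.e. fit +/-2 GiB */
	if (delta != (int64_t)(int32_t)delta)
		printf("overflow: delta 0x%llx exceeds +/-2 GiB\n",
		       (unsigned long long)delta);
	return 0;
}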
Thanks,
Joerg