This is the start of the stable review cycle for the 5.4.197 release. There are 34 patches in this series; all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at:
	https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.197-rc1...
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.4.197-rc1
Liu Jian liujian56@huawei.com bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes
Chuck Lever chuck.lever@oracle.com NFSD: Fix possible sleep during nfsd4_release_lockowner()
Trond Myklebust trond.myklebust@hammerspace.com NFS: Memory allocation failures are not server fatal errors
Akira Yokosawa akiyks@gmail.com docs: submitting-patches: Fix crossref to 'The canonical patch format'
Xiu Jianfeng xiujianfeng@huawei.com tpm: ibmvtpm: Correct the return value in tpm_ibmvtpm_probe()
Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com tpm: Fix buffer access in tpm2_get_tpm_pt()
Marek Maślanka mm@semihalf.com HID: multitouch: Add support for Google Whiskers Touchpad
Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com raid5: introduce MD_BROKEN
Sarthak Kukreti sarthakkukreti@google.com dm verity: set DM_TARGET_IMMUTABLE feature flag
Mikulas Patocka mpatocka@redhat.com dm stats: add cond_resched when looping over entries
Mikulas Patocka mpatocka@redhat.com dm crypt: make printing of the key constant-time
Dan Carpenter dan.carpenter@oracle.com dm integrity: fix error code in dm_integrity_ctr()
Sultan Alsawaf sultan@kerneltoast.com zsmalloc: fix races between asynchronous zspage free and page migration
Vitaly Chikunov vt@altlinux.org crypto: ecrdsa - Fix incorrect use of vli_cmp
Florian Westphal fw@strlen.de netfilter: conntrack: re-fetch conntrack after insertion
Kees Cook keescook@chromium.org exec: Force single empty string when argv is empty
Gustavo A. R. Silva gustavoars@kernel.org drm/i915: Fix -Wstringop-overflow warning in call to intel_read_wm_latency()
Miri Korenblit miriam.rachel.korenblit@intel.com cfg80211: set custom regdomain after wiphy registration
Stephen Brennan stephen.s.brennan@oracle.com assoc_array: Fix BUG_ON during garbage collect
Piyush Malgujar pmalgujar@marvell.com drivers: i2c: thunderx: Allow driver to work with ACPI defined TWSI controllers
Mika Westerberg mika.westerberg@linux.intel.com i2c: ismt: Provide a DMA buffer for Interrupt Cause Logging
Joel Stanley joel@jms.id.au net: ftgmac100: Disable hardware checksum on AST2600
Thomas Bartschies thomas.bartschies@cvk.de net: af_key: check encryption module availability consistency
IotaHydrae writeforever@foxmail.com pinctrl: sunxi: fix f1c100s uart2 function
Lorenzo Pieralisi lorenzo.pieralisi@arm.com ACPI: sysfs: Fix BERT error region memory mapping
Andy Shevchenko andriy.shevchenko@linux.intel.com ACPI: sysfs: Make sparse happy about address space in use
Hans Verkuil hverkuil-cisco@xs4all.nl media: vim2m: initialize the media device earlier
Sakari Ailus sakari.ailus@linux.intel.com media: vim2m: Register video device after setting up internals
Willy Tarreau w@1wt.eu secure_seq: use the 64 bits of the siphash for port offset calculation
Eric Dumazet edumazet@google.com tcp: change source port randomizarion at connect() time
Dmitry Mastykin dmastykin@astralinux.ru Input: goodix - fix spurious key release events
Denis Efremov (Oracle) efremov@linux.com staging: rtl8723bs: prevent ->Ssid overflow in rtw_wx_set_scan()
Thomas Gleixner tglx@linutronix.de x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
Daniel Thompson daniel.thompson@linaro.org lockdown: also lock down previous kgdb use
-------------
Diffstat:
 Documentation/process/submitting-patches.rst   |  2 +-
 Makefile                                       |  4 +-
 arch/x86/pci/xen.c                             |  5 +++
 crypto/ecrdsa.c                                |  8 ++--
 drivers/acpi/sysfs.c                           | 23 +++++++---
 drivers/char/tpm/tpm2-cmd.c                    | 11 ++++-
 drivers/char/tpm/tpm_ibmvtpm.c                 |  1 +
 drivers/gpu/drm/i915/intel_pm.c                |  2 +-
 drivers/hid/hid-multitouch.c                   |  3 ++
 drivers/i2c/busses/i2c-ismt.c                  | 14 ++++++
 drivers/i2c/busses/i2c-thunderx-pcidrv.c       |  1 +
 drivers/input/touchscreen/goodix.c             |  2 +-
 drivers/md/dm-crypt.c                          | 14 ++++--
 drivers/md/dm-integrity.c                      |  2 -
 drivers/md/dm-stats.c                          |  8 ++++
 drivers/md/dm-verity-target.c                  |  1 +
 drivers/md/raid5.c                             | 47 +++++++++----------
 drivers/media/platform/vim2m.c                 | 22 +++++----
 drivers/net/ethernet/faraday/ftgmac100.c       |  5 +++
 drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c  |  2 +-
 drivers/staging/rtl8723bs/os_dep/ioctl_linux.c |  6 ++-
 fs/exec.c                                      | 25 ++++++++++-
 fs/nfs/internal.h                              |  1 +
 fs/nfsd/nfs4state.c                            | 12 ++---
 include/linux/security.h                       |  2 +
 include/net/inet_hashtables.h                  |  2 +-
 include/net/netfilter/nf_conntrack_core.h      |  7 ++-
 include/net/secure_seq.h                       |  4 +-
 kernel/debug/debug_core.c                      | 24 ++++++++++
 kernel/debug/kdb/kdb_main.c                    | 62 ++++++++++++++++++++++++--
 lib/assoc_array.c                              |  8 ++++
 mm/zsmalloc.c                                  | 37 +++++++++++++--
 net/core/filter.c                              |  4 +-
 net/core/secure_seq.c                          |  4 +-
 net/ipv4/inet_hashtables.c                     | 28 +++++++++---
 net/ipv6/inet6_hashtables.c                    |  4 +-
 net/key/af_key.c                               |  6 +--
 net/wireless/core.c                            |  8 ++--
 net/wireless/reg.c                             |  1 +
 security/lockdown/lockdown.c                   |  2 +
 40 files changed, 327 insertions(+), 97 deletions(-)
From: Daniel Thompson daniel.thompson@linaro.org
commit eadb2f47a3ced5c64b23b90fd2a3463f63726066 upstream.
KGDB and KDB allow read and write access to kernel memory, and thus should be restricted during lockdown. An attacker with access to a serial port (for example, via a hypervisor console, which some cloud vendors provide over the network) could trigger the debugger so it is important that the debugger respect the lockdown mode when/if it is triggered.
Fix this by integrating lockdown into kdb's existing permissions mechanism. Unfortunately kgdb does not have any permissions mechanism (although it certainly could be added later) so, for now, kgdb is simply and brutally disabled by immediately exiting the gdb stub without taking any action.
For lockdowns established early in the boot (i.e. the normal case) this should be fine, but on systems where kgdb has set breakpoints before the lockdown is enacted, "bad things" will happen.
CVE: CVE-2022-21499
Co-developed-by: Stephen Brennan stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan stephen.s.brennan@oracle.com
Reviewed-by: Douglas Anderson dianders@chromium.org
Signed-off-by: Daniel Thompson daniel.thompson@linaro.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 include/linux/security.h     |  2 +
 kernel/debug/debug_core.c    | 24 ++++++++++++++++
 kernel/debug/kdb/kdb_main.c  | 62 ++++++++++++++++++++++++++++++++++++++++---
 security/lockdown/lockdown.c |  2 +
 4 files changed, 87 insertions(+), 3 deletions(-)
--- a/include/linux/security.h +++ b/include/linux/security.h @@ -118,10 +118,12 @@ enum lockdown_reason { LOCKDOWN_MMIOTRACE, LOCKDOWN_DEBUGFS, LOCKDOWN_XMON_WR, + LOCKDOWN_DBG_WRITE_KERNEL, LOCKDOWN_INTEGRITY_MAX, LOCKDOWN_KCORE, LOCKDOWN_KPROBES, LOCKDOWN_BPF_READ, + LOCKDOWN_DBG_READ_KERNEL, LOCKDOWN_PERF, LOCKDOWN_TRACEFS, LOCKDOWN_XMON_RW, --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -56,6 +56,7 @@ #include <linux/vmacache.h> #include <linux/rcupdate.h> #include <linux/irq.h> +#include <linux/security.h>
#include <asm/cacheflush.h> #include <asm/byteorder.h> @@ -685,6 +686,29 @@ cpu_master_loop: continue; kgdb_connected = 0; } else { + /* + * This is a brutal way to interfere with the debugger + * and prevent gdb being used to poke at kernel memory. + * This could cause trouble if lockdown is applied when + * there is already an active gdb session. For now the + * answer is simply "don't do that". Typically lockdown + * *will* be applied before the debug core gets started + * so only developers using kgdb for fairly advanced + * early kernel debug can be biten by this. Hopefully + * they are sophisticated enough to take care of + * themselves, especially with help from the lockdown + * message printed on the console! + */ + if (security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL)) { + if (IS_ENABLED(CONFIG_KGDB_KDB)) { + /* Switch back to kdb if possible... */ + dbg_kdb_mode = 1; + continue; + } else { + /* ... otherwise just bail */ + break; + } + } error = gdb_serial_stub(ks); }
--- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -45,6 +45,7 @@ #include <linux/proc_fs.h> #include <linux/uaccess.h> #include <linux/slab.h> +#include <linux/security.h> #include "kdb_private.h"
#undef MODULE_PARAM_PREFIX @@ -198,10 +199,62 @@ struct task_struct *kdb_curr_task(int cp }
/* - * Check whether the flags of the current command and the permissions - * of the kdb console has allow a command to be run. + * Update the permissions flags (kdb_cmd_enabled) to match the + * current lockdown state. + * + * Within this function the calls to security_locked_down() are "lazy". We + * avoid calling them if the current value of kdb_cmd_enabled already excludes + * flags that might be subject to lockdown. Additionally we deliberately check + * the lockdown flags independently (even though read lockdown implies write + * lockdown) since that results in both simpler code and clearer messages to + * the user on first-time debugger entry. + * + * The permission masks during a read+write lockdown permits the following + * flags: INSPECT, SIGNAL, REBOOT (and ALWAYS_SAFE). + * + * The INSPECT commands are not blocked during lockdown because they are + * not arbitrary memory reads. INSPECT covers the backtrace family (sometimes + * forcing them to have no arguments) and lsmod. These commands do expose + * some kernel state but do not allow the developer seated at the console to + * choose what state is reported. SIGNAL and REBOOT should not be controversial, + * given these are allowed for root during lockdown already. 
+ */ +static void kdb_check_for_lockdown(void) +{ + const int write_flags = KDB_ENABLE_MEM_WRITE | + KDB_ENABLE_REG_WRITE | + KDB_ENABLE_FLOW_CTRL; + const int read_flags = KDB_ENABLE_MEM_READ | + KDB_ENABLE_REG_READ; + + bool need_to_lockdown_write = false; + bool need_to_lockdown_read = false; + + if (kdb_cmd_enabled & (KDB_ENABLE_ALL | write_flags)) + need_to_lockdown_write = + security_locked_down(LOCKDOWN_DBG_WRITE_KERNEL); + + if (kdb_cmd_enabled & (KDB_ENABLE_ALL | read_flags)) + need_to_lockdown_read = + security_locked_down(LOCKDOWN_DBG_READ_KERNEL); + + /* De-compose KDB_ENABLE_ALL if required */ + if (need_to_lockdown_write || need_to_lockdown_read) + if (kdb_cmd_enabled & KDB_ENABLE_ALL) + kdb_cmd_enabled = KDB_ENABLE_MASK & ~KDB_ENABLE_ALL; + + if (need_to_lockdown_write) + kdb_cmd_enabled &= ~write_flags; + + if (need_to_lockdown_read) + kdb_cmd_enabled &= ~read_flags; +} + +/* + * Check whether the flags of the current command, the permissions of the kdb + * console and the lockdown state allow a command to be run. */ -static inline bool kdb_check_flags(kdb_cmdflags_t flags, int permissions, +static bool kdb_check_flags(kdb_cmdflags_t flags, int permissions, bool no_args) { /* permissions comes from userspace so needs massaging slightly */ @@ -1188,6 +1241,9 @@ static int kdb_local(kdb_reason_t reason kdb_curr_task(raw_smp_processor_id());
KDB_DEBUG_STATE("kdb_local 1", reason); + + kdb_check_for_lockdown(); + kdb_go_count = 0; if (reason == KDB_REASON_DEBUG) { /* special case below */ --- a/security/lockdown/lockdown.c +++ b/security/lockdown/lockdown.c @@ -33,10 +33,12 @@ static const char *const lockdown_reason [LOCKDOWN_MMIOTRACE] = "unsafe mmio", [LOCKDOWN_DEBUGFS] = "debugfs access", [LOCKDOWN_XMON_WR] = "xmon write access", + [LOCKDOWN_DBG_WRITE_KERNEL] = "use of kgdb/kdb to write kernel RAM", [LOCKDOWN_INTEGRITY_MAX] = "integrity", [LOCKDOWN_KCORE] = "/proc/kcore access", [LOCKDOWN_KPROBES] = "use of kprobes", [LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM", + [LOCKDOWN_DBG_READ_KERNEL] = "use of kgdb/kdb to read kernel RAM", [LOCKDOWN_PERF] = "unsafe use of perf", [LOCKDOWN_TRACEFS] = "use of tracefs", [LOCKDOWN_XMON_RW] = "xmon read and write access",
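The permission masking done by kdb_check_for_lockdown() above can be tried out in userspace. What follows is only a minimal sketch under stated assumptions: every DEMO_* name and bit value is hypothetical (the kernel's KDB_ENABLE_* constants are laid out differently), and the two lockdown states are passed in as booleans instead of being queried lazily through security_locked_down().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag layout for illustration; not the kernel's KDB_ENABLE_* values. */
enum {
	DEMO_ENABLE_ALL       = 1 << 0,
	DEMO_ENABLE_MEM_READ  = 1 << 1,
	DEMO_ENABLE_MEM_WRITE = 1 << 2,
	DEMO_ENABLE_REG_READ  = 1 << 3,
	DEMO_ENABLE_REG_WRITE = 1 << 4,
	DEMO_ENABLE_INSPECT   = 1 << 5,
	DEMO_ENABLE_FLOW_CTRL = 1 << 6,
	DEMO_ENABLE_SIGNAL    = 1 << 7,
	DEMO_ENABLE_REBOOT    = 1 << 8,
	DEMO_ENABLE_MASK      = (1 << 9) - 1,
};

/* Return the permission mask that remains after applying the lockdown state. */
int demo_apply_lockdown(int enabled, bool lock_write, bool lock_read)
{
	const int write_flags = DEMO_ENABLE_MEM_WRITE |
				DEMO_ENABLE_REG_WRITE |
				DEMO_ENABLE_FLOW_CTRL;
	const int read_flags = DEMO_ENABLE_MEM_READ |
			       DEMO_ENABLE_REG_READ;

	/* Only bits known to this sketch may be passed in. */
	assert((enabled & ~DEMO_ENABLE_MASK) == 0);

	/* De-compose the ALL shorthand so that individual bits can be cleared. */
	if ((lock_write || lock_read) && (enabled & DEMO_ENABLE_ALL))
		enabled = DEMO_ENABLE_MASK & ~DEMO_ENABLE_ALL;

	if (lock_write)
		enabled &= ~write_flags;

	if (lock_read)
		enabled &= ~read_flags;

	return enabled;
}
```

With both lockdowns active and everything enabled beforehand, only the INSPECT, SIGNAL and REBOOT bits of this sketch survive, which mirrors the set named in the patch comment (ALWAYS_SAFE is not modelled here).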
From: Thomas Gleixner tglx@linutronix.de
commit 7e0815b3e09986d2fe651199363e135b9358132a upstream.
When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI layer.
This can lead to a situation where the PCI/MSI layer masks an MSI[-X] interrupt and the hypervisor grants the write despite the fact that it already requested the interrupt. As a consequence, interrupts on the affected device are never delivered.
Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests already.
Fixes: 809f9267bbab ("xen: map MSIs into pirqs")
Reported-by: Jeremi Piotrowski jpiotrowski@linux.microsoft.com
Reported-by: Dusty Mabe dustymabe@redhat.com
Reported-by: Salvatore Bonaccorso carnil@debian.org
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Tested-by: Noah Meyerhans noahm@debian.org
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglx
[nmeyerha@amazon.com: backported to 5.4]
Signed-off-by: Noah Meyerhans nmeyerha@amazon.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/pci/xen.c | 5 +++++
 1 file changed, 5 insertions(+)
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -442,6 +442,11 @@ void __init xen_msi_init(void)

 	x86_msi.setup_msi_irqs = xen_hvm_setup_msi_irqs;
 	x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
+	/*
+	 * With XEN PIRQ/Eventchannels in use PCI/MSI[-X] masking is solely
+	 * controlled by the hypervisor.
+	 */
+	pci_msi_ignore_mask = 1;
 }
 #endif
From: "Denis Efremov (Oracle)" efremov@linux.com
This code has a check to prevent read overflow but it needs another check to prevent writing beyond the end of the ->Ssid[] array.
Fixes: 554c0a3abf21 ("staging: Add rtl8723bs sdio wifi driver")
Cc: stable stable@vger.kernel.org
Signed-off-by: Denis Efremov (Oracle) efremov@linux.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/staging/rtl8723bs/os_dep/ioctl_linux.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
--- a/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c
+++ b/drivers/staging/rtl8723bs/os_dep/ioctl_linux.c
@@ -1351,9 +1351,11 @@ static int rtw_wx_set_scan(struct net_de

 					sec_len = *(pos++); len-= 1;

-					if (sec_len>0 && sec_len<=len) {
+					if (sec_len > 0 &&
+					    sec_len <= len &&
+					    sec_len <= 32) {
 						ssid[ssid_index].SsidLength = sec_len;
-						memcpy(ssid[ssid_index].Ssid, pos, ssid[ssid_index].SsidLength);
+						memcpy(ssid[ssid_index].Ssid, pos, sec_len);
 						/* DBG_871X("%s COMBO_SCAN with specific ssid:%s, %d\n", __func__ */
 						/* , ssid[ssid_index].Ssid, ssid[ssid_index].SsidLength); */
 						ssid_index++;
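The two checks in this hunk guard different things: `sec_len <= len` bounds the read from the input buffer, and `sec_len <= 32` bounds the write into the destination array. A minimal userspace sketch of the same pattern follows. The names demo_ssid and demo_copy_ssid are hypothetical, and the destination size is taken with sizeof() rather than the literal 32 used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SSID_MAX 32	/* same value as the literal in the patch */

/* Hypothetical stand-in for the driver's SSID structure. */
struct demo_ssid {
	unsigned int length;
	unsigned char ssid[DEMO_SSID_MAX];
};

/* Copy one SSID section; returns 1 when copied, 0 when rejected. */
int demo_copy_ssid(struct demo_ssid *dst, const unsigned char *pos,
		   size_t len, size_t sec_len)
{
	assert(dst != NULL && pos != NULL);

	if (sec_len == 0)
		return 0;		/* nothing to copy */
	if (sec_len > len)
		return 0;		/* would read past the input buffer */
	if (sec_len > sizeof(dst->ssid))
		return 0;		/* would write past the destination */

	dst->length = (unsigned int)sec_len;
	memcpy(dst->ssid, pos, sec_len);
	return 1;
}
```

A 40-byte section taken from a 64-byte input passes the read check but is rejected by the write check, which is the case the patch closes.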
From: Dmitry Mastykin dmastykin@astralinux.ru
commit 24ef83f6e31d20fc121a7cd732b04b498475fca3 upstream.
The goodix panel sends spurious interrupts after a 'finger up' event, which always cause a timeout. We were exiting the interrupt handler by reporting touch_num == 0, but this was still processed as valid and caused the code to use the uninitialised point_data, creating spurious key release events.
Report an error from the interrupt handler so as to avoid processing invalid point_data further.
Signed-off-by: Dmitry Mastykin dmastykin@astralinux.ru
Reviewed-by: Bastien Nocera hadess@hadess.net
Link: https://lore.kernel.org/r/20200316075302.3759-2-dmastykin@astralinux.ru
Signed-off-by: Dmitry Torokhov dmitry.torokhov@gmail.com
Cc: Fabio Estevam festevam@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/input/touchscreen/goodix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/input/touchscreen/goodix.c
+++ b/drivers/input/touchscreen/goodix.c
@@ -335,7 +335,7 @@ static int goodix_ts_read_input_report(s
 	 * The Goodix panel will send spurious interrupts after a
 	 * 'finger up' event, which will always cause a timeout.
 	 */
-	return 0;
+	return -ENOMSG;
 }

 static void goodix_ts_report_touch_8b(struct goodix_ts_data *ts, u8 *coor_data)
From: Eric Dumazet edumazet@google.com
commit 190cc82489f46f9d88e73c81a47e14f80a791e1a upstream.
RFC 6056 (Recommendations for Transport-Protocol Port Randomization) provides a good summary of why source port selection needs extra care.
David Dworken reminded us that Linux implements Algorithm 3 as described in RFC 6056 3.3.3.
Quoting David:
   In the context of the web, this creates an interesting info leak where websites can count how many TCP connections a user's computer is establishing over time. For example, this allows a website to count exactly how many subresources a third party website loaded.
   This also allows:
   - Distinguishing between different users behind a VPN based on distinct source port ranges.
   - Tracking users over time across multiple networks.
   - Covert communication channels between different browsers/browser profiles running on the same computer
   - Tracking what applications are running on a computer based on the pattern of how fast source ports are getting incremented.
Section 3.3.4 describes an enhancement that reduces an attacker's ability to use the basic information currently stored in the shared 'u32 hint'.
This change also decreases the collision rate when multiple applications need to connect() to different destinations.
Signed-off-by: Eric Dumazet edumazet@google.com
Reported-by: David Dworken ddworken@google.com
Cc: Willem de Bruijn willemb@google.com
Signed-off-by: David S. Miller davem@davemloft.net
Signed-off-by: Stefan Ghinea stefan.ghinea@windriver.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/ipv4/inet_hashtables.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
--- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -671,6 +671,17 @@ unlock: } EXPORT_SYMBOL_GPL(inet_unhash);
+/* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm + * Note that we use 32bit integers (vs RFC 'short integers') + * because 2^16 is not a multiple of num_ephemeral and this + * property might be used by clever attacker. + * RFC claims using TABLE_LENGTH=10 buckets gives an improvement, + * we use 256 instead to really give more isolation and + * privacy, this only consumes 1 KB of kernel memory. + */ +#define INET_TABLE_PERTURB_SHIFT 8 +static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT]; + int __inet_hash_connect(struct inet_timewait_death_row *death_row, struct sock *sk, u32 port_offset, int (*check_established)(struct inet_timewait_death_row *, @@ -684,8 +695,8 @@ int __inet_hash_connect(struct inet_time struct inet_bind_bucket *tb; u32 remaining, offset; int ret, i, low, high; - static u32 hint; int l3mdev; + u32 index;
if (port) { head = &hinfo->bhash[inet_bhashfn(net, port, @@ -712,7 +723,10 @@ int __inet_hash_connect(struct inet_time if (likely(remaining > 1)) remaining &= ~1U;
- offset = (hint + port_offset) % remaining; + net_get_random_once(table_perturb, sizeof(table_perturb)); + index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT); + + offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining; /* In first pass we try ports of @low parity. * inet_csk_get_port() does the opposite choice. */ @@ -766,7 +780,7 @@ next_port: return -EADDRNOTAVAIL;
ok: - hint += i + 2; + WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
/* Head lock still held and bh's disabled */ inet_bind_hash(sk, tb, port);
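The core idea of the change above is that the single shared counter is replaced by a table of counters indexed by a hash of the per-destination offset, so connections to one destination only advance one bucket. A minimal userspace sketch follows, with several assumptions: all demo_* names are hypothetical, the table starts zeroed instead of being seeded once with random bytes as net_get_random_once() does in the patch, and demo_hash32() is a multiplicative hash written in the spirit of hash_32() rather than a copy of it.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PERTURB_SHIFT 8

/* Zero-initialised here; the patch fills its table with random data once. */
static uint32_t demo_perturb[1u << DEMO_PERTURB_SHIFT];

/* Multiplicative hash keeping the top 'bits' bits of the product. */
static uint32_t demo_hash32(uint32_t val, unsigned int bits)
{
	return (val * 0x61C88647u) >> (32 - bits);
}

/* Pick the starting offset for the port search and report the bucket used. */
uint32_t demo_pick_offset(uint32_t port_offset, uint32_t remaining,
			  uint32_t *index_out)
{
	uint32_t index;

	assert(remaining > 0 && index_out != 0);

	index = demo_hash32(port_offset, DEMO_PERTURB_SHIFT);
	*index_out = index;
	return (demo_perturb[index] + port_offset) % remaining;
}

/* After a successful bind that needed i probes, advance only that bucket. */
void demo_commit(uint32_t index, uint32_t i)
{
	demo_perturb[index] += i + 2;
}
```

A second connect() to the same destination lands in the same bucket and therefore starts i + 2 ports further on, while destinations that hash to other buckets see no change.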
From: Willy Tarreau w@1wt.eu
commit b2d057560b8107c633b39aabe517ff9d93f285e3 upstream.
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit 7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output remained truncated to 32-bit only. In order to exploit more bits from the hash, let's make the functions return the full 64-bit of siphash_3u32(). We also make sure the port offset calculation in __inet_hash_connect() remains done on 32-bit to avoid the need for div_u64_rem() and an extra cost on 32-bit systems.
Cc: Jason A. Donenfeld Jason@zx2c4.com
Cc: Moshe Kol moshe.kol@mail.huji.ac.il
Cc: Yossi Gilad yossi.gilad@mail.huji.ac.il
Cc: Amit Klein aksecurity@gmail.com
Reviewed-by: Eric Dumazet edumazet@google.com
Signed-off-by: Willy Tarreau w@1wt.eu
Signed-off-by: Jakub Kicinski kuba@kernel.org
[SG: Adjusted context]
Signed-off-by: Stefan Ghinea stefan.ghinea@windriver.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 include/net/inet_hashtables.h |  2 +-
 include/net/secure_seq.h      |  4 ++--
 net/core/secure_seq.c         |  4 ++--
 net/ipv4/inet_hashtables.c    | 10 ++++++----
 net/ipv6/inet6_hashtables.c   |  4 ++--
 5 files changed, 13 insertions(+), 11 deletions(-)
--- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -420,7 +420,7 @@ static inline void sk_rcv_saddr_set(stru }
int __inet_hash_connect(struct inet_timewait_death_row *death_row, - struct sock *sk, u32 port_offset, + struct sock *sk, u64 port_offset, int (*check_established)(struct inet_timewait_death_row *, struct sock *, __u16, struct inet_timewait_sock **)); --- a/include/net/secure_seq.h +++ b/include/net/secure_seq.h @@ -4,8 +4,8 @@
#include <linux/types.h>
-u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport); -u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, +u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport); +u64 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport); u32 secure_tcp_seq(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport); --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -97,7 +97,7 @@ u32 secure_tcpv6_seq(const __be32 *saddr } EXPORT_SYMBOL(secure_tcpv6_seq);
-u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, +u64 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport) { const struct { @@ -147,7 +147,7 @@ u32 secure_tcp_seq(__be32 saddr, __be32 } EXPORT_SYMBOL_GPL(secure_tcp_seq);
-u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) +u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) { net_secret_init(); return siphash_4u32((__force u32)saddr, (__force u32)daddr, --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -464,7 +464,7 @@ not_unique: return -EADDRNOTAVAIL; }
-static u32 inet_sk_port_offset(const struct sock *sk) +static u64 inet_sk_port_offset(const struct sock *sk) { const struct inet_sock *inet = inet_sk(sk);
@@ -683,7 +683,7 @@ EXPORT_SYMBOL_GPL(inet_unhash); static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
int __inet_hash_connect(struct inet_timewait_death_row *death_row, - struct sock *sk, u32 port_offset, + struct sock *sk, u64 port_offset, int (*check_established)(struct inet_timewait_death_row *, struct sock *, __u16, struct inet_timewait_sock **)) { @@ -726,7 +726,9 @@ int __inet_hash_connect(struct inet_time net_get_random_once(table_perturb, sizeof(table_perturb)); index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
- offset = (READ_ONCE(table_perturb[index]) + port_offset) % remaining; + offset = READ_ONCE(table_perturb[index]) + port_offset; + offset %= remaining; + /* In first pass we try ports of @low parity. * inet_csk_get_port() does the opposite choice. */ @@ -803,7 +805,7 @@ ok: int inet_hash_connect(struct inet_timewait_death_row *death_row, struct sock *sk) { - u32 port_offset = 0; + u64 port_offset = 0;
if (!inet_sk(sk)->inet_num) port_offset = inet_sk_port_offset(sk); --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -262,7 +262,7 @@ not_unique: return -EADDRNOTAVAIL; }
-static u32 inet6_sk_port_offset(const struct sock *sk) +static u64 inet6_sk_port_offset(const struct sock *sk) { const struct inet_sock *inet = inet_sk(sk);
@@ -274,7 +274,7 @@ static u32 inet6_sk_port_offset(const st int inet6_hash_connect(struct inet_timewait_death_row *death_row, struct sock *sk) { - u32 port_offset = 0; + u64 port_offset = 0;
if (!inet_sk(sk)->inet_num) port_offset = inet6_sk_port_offset(sk);
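Splitting the offset computation into two statements matters because `offset` is a u32: the 64-bit sum is truncated on assignment and the modulo is then a plain 32-bit operation, which is what avoids div_u64_rem(). A minimal userspace sketch of that arithmetic follows; demo_offset64 is a hypothetical name and the perturbation value is passed in directly.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 64-bit port offset using 32-bit arithmetic only. */
uint32_t demo_offset64(uint32_t perturb, uint64_t port_offset,
		       uint32_t remaining)
{
	uint32_t offset;

	assert(remaining > 0);

	/* The 64-bit sum is truncated to its low 32 bits by the assignment. */
	offset = (uint32_t)(perturb + port_offset);
	/* 32-bit modulo; no 64-bit division is needed. */
	offset %= remaining;
	return offset;
}
```

The result differs from a full 64-bit modulo (for perturb 5, port_offset 0x100000003 and remaining 10 this yields 8, not 4). That is harmless in this setting as far as I can tell, since the value only selects where the port search starts.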
From: Sakari Ailus sakari.ailus@linux.intel.com
commit cf7f34777a5b4100a3a44ff95f3d949c62892bdd upstream.
Prevent NULL (or close to NULL) pointer dereference in various places by registering the video device only when the V4L2 m2m framework has been set up.
Fixes: commit 96d8eab5d0a1 ("V4L/DVB: [v5,2/2] v4l: Add a mem-to-mem videobuf framework test device")
Signed-off-by: Sakari Ailus sakari.ailus@linux.intel.com
Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl
Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org
Signed-off-by: Mark-PK Tsai mark-pk.tsai@mediatek.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/media/platform/vim2m.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)
--- a/drivers/media/platform/vim2m.c +++ b/drivers/media/platform/vim2m.c @@ -1333,12 +1333,6 @@ static int vim2m_probe(struct platform_d vfd->lock = &dev->dev_mutex; vfd->v4l2_dev = &dev->v4l2_dev;
- ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); - if (ret) { - v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); - goto error_v4l2; - } - video_set_drvdata(vfd, dev); v4l2_info(&dev->v4l2_dev, "Device registered as /dev/video%d\n", vfd->num); @@ -1353,6 +1347,12 @@ static int vim2m_probe(struct platform_d goto error_dev; }
+ ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); + if (ret) { + v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); + goto error_m2m; + } + #ifdef CONFIG_MEDIA_CONTROLLER dev->mdev.dev = &pdev->dev; strscpy(dev->mdev.model, "vim2m", sizeof(dev->mdev.model)); @@ -1366,7 +1366,7 @@ static int vim2m_probe(struct platform_d MEDIA_ENT_F_PROC_VIDEO_SCALER); if (ret) { v4l2_err(&dev->v4l2_dev, "Failed to init mem2mem media controller\n"); - goto error_dev; + goto error_v4l2; }
ret = media_device_register(&dev->mdev); @@ -1381,11 +1381,13 @@ static int vim2m_probe(struct platform_d error_m2m_mc: v4l2_m2m_unregister_media_controller(dev->m2m_dev); #endif -error_dev: +error_v4l2: video_unregister_device(&dev->vfd); /* vim2m_device_release called by video_unregister_device to release various objects */ return ret; -error_v4l2: +error_m2m: + v4l2_m2m_release(dev->m2m_dev); +error_dev: v4l2_device_unregister(&dev->v4l2_dev); error_free: kfree(dev);
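The label shuffle in this patch follows the usual probe convention: once a setup step moves later in the function, its error label has to move so that the unwind still runs in reverse order of setup. A minimal userspace sketch of that convention follows. It is not the vim2m code; demo_probe, demo_note and the step names are hypothetical, and each step is reduced to a log entry.

```c
#include <assert.h>
#include <string.h>

/* Records the order of setup ("+x") and teardown ("-x") steps. */
static char demo_log[64];

static void demo_note(const char *s)
{
	assert(strlen(demo_log) + strlen(s) < sizeof(demo_log));
	strcat(demo_log, s);
}

/*
 * Setup order: v4l2 device, m2m context, video device (registered last).
 * fail_step names the step that fails; 0 means every step succeeds.
 */
int demo_probe(int fail_step)
{
	demo_log[0] = '\0';

	if (fail_step == 1)
		return -1;		/* nothing to undo yet */
	demo_note("+dev ");

	if (fail_step == 2)
		goto error_dev;
	demo_note("+m2m ");

	if (fail_step == 3)
		goto error_m2m;		/* video registration failed */
	demo_note("+video ");
	return 0;

error_m2m:
	demo_note("-m2m ");
error_dev:
	demo_note("-dev ");
	return -1;
}
```

When the last step fails, the log reads "+dev +m2m -m2m -dev ", i.e. the m2m context is released before the v4l2 device is unregistered, matching the error_m2m/error_dev order introduced above.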
From: Hans Verkuil hverkuil-cisco@xs4all.nl
commit 1a28dce222a6ece725689ad58c0cf4a1b48894f4 upstream.
Before the video device node is registered, the v4l2_dev.mdev pointer must be set in order to correctly associate the video device with the media device. Move the initialization of the media device up.
Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl
Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org
Signed-off-by: Mark-PK Tsai mark-pk.tsai@mediatek.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/media/platform/vim2m.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
--- a/drivers/media/platform/vim2m.c +++ b/drivers/media/platform/vim2m.c @@ -1347,12 +1347,6 @@ static int vim2m_probe(struct platform_d goto error_dev; }
- ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); - if (ret) { - v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); - goto error_m2m; - } - #ifdef CONFIG_MEDIA_CONTROLLER dev->mdev.dev = &pdev->dev; strscpy(dev->mdev.model, "vim2m", sizeof(dev->mdev.model)); @@ -1361,7 +1355,15 @@ static int vim2m_probe(struct platform_d media_device_init(&dev->mdev); dev->mdev.ops = &m2m_media_ops; dev->v4l2_dev.mdev = &dev->mdev; +#endif
+ ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); + if (ret) { + v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); + goto error_m2m; + } + +#ifdef CONFIG_MEDIA_CONTROLLER ret = v4l2_m2m_register_media_controller(dev->m2m_dev, vfd, MEDIA_ENT_F_PROC_VIDEO_SCALER); if (ret) {
From: Andy Shevchenko andriy.shevchenko@linux.intel.com
commit bdd56d7d8931e842775d2e5b93d426a8d1940e33 upstream.
Sparse is not happy about address space in use in acpi_data_show():
drivers/acpi/sysfs.c:428:14: warning: incorrect type in assignment (different address spaces) drivers/acpi/sysfs.c:428:14: expected void [noderef] __iomem *base drivers/acpi/sysfs.c:428:14: got void * drivers/acpi/sysfs.c:431:59: warning: incorrect type in argument 4 (different address spaces) drivers/acpi/sysfs.c:431:59: expected void const *from drivers/acpi/sysfs.c:431:59: got void [noderef] __iomem *base drivers/acpi/sysfs.c:433:30: warning: incorrect type in argument 1 (different address spaces) drivers/acpi/sysfs.c:433:30: expected void *logical_address drivers/acpi/sysfs.c:433:30: got void [noderef] __iomem *base
Indeed, acpi_os_map_memory() returns a void pointer with dropped specific address space. Hence, we don't need to carry out __iomem in acpi_data_show().
Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Cc: dann frazier dann.frazier@canonical.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/acpi/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/acpi/sysfs.c
+++ b/drivers/acpi/sysfs.c
@@ -438,7 +438,7 @@ static ssize_t acpi_data_show(struct fil
 			      loff_t offset, size_t count)
 {
 	struct acpi_data_attr *data_attr;
-	void __iomem *base;
+	void *base;
 	ssize_t rc;

 	data_attr = container_of(bin_attr, struct acpi_data_attr, attr);
From: Lorenzo Pieralisi lorenzo.pieralisi@arm.com
commit 1bbc21785b7336619fb6a67f1fff5afdaf229acc upstream.
Currently the sysfs interface maps the BERT error region as "memory" (through acpi_os_map_memory()) in order to copy the error records into memory buffers through memory operations (eg memory_read_from_buffer()).
The OS cannot detect whether the BERT error region is part of system RAM or is "device memory" (e.g. BMC memory); it therefore cannot determine which memory attributes the bus to that memory supports (and hence the corresponding kernel mapping), unless firmware provides the required information.
The acpi_os_map_memory() arch backend implementation determines the mapping attributes. On arm64, if the BERT error region is not present in the EFI memory map, the error region is mapped as device-nGnRnE; this triggers alignment faults since memcpy unaligned accesses are not allowed in device-nGnRnE regions.
The ACPI sysfs code therefore cannot map the BERT error region with memory semantics by default and should use a safer default.
Change the sysfs code to map the BERT error region as MMIO (through acpi_os_map_iomem()) and use the memcpy_fromio() interface to read the error region into the kernel buffer.
Link: https://lore.kernel.org/linux-arm-kernel/31ffe8fc-f5ee-2858-26c5-0fd8bdd6870...
Link: https://lore.kernel.org/linux-acpi/CAJZ5v0g+OVbhuUUDrLUCfX_mVqY_e8ubgLTU98=j...
Signed-off-by: Lorenzo Pieralisi lorenzo.pieralisi@arm.com
Tested-by: Veronika Kabatova vkabatov@redhat.com
Tested-by: Aristeu Rozanski aris@redhat.com
Acked-by: Ard Biesheuvel ardb@kernel.org
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Cc: dann frazier dann.frazier@canonical.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/acpi/sysfs.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)
--- a/drivers/acpi/sysfs.c +++ b/drivers/acpi/sysfs.c @@ -438,19 +438,30 @@ static ssize_t acpi_data_show(struct fil loff_t offset, size_t count) { struct acpi_data_attr *data_attr; - void *base; - ssize_t rc; + void __iomem *base; + ssize_t size;
data_attr = container_of(bin_attr, struct acpi_data_attr, attr); + size = data_attr->attr.size;
- base = acpi_os_map_memory(data_attr->addr, data_attr->attr.size); + if (offset < 0) + return -EINVAL; + + if (offset >= size) + return 0; + + if (count > size - offset) + count = size - offset; + + base = acpi_os_map_iomem(data_attr->addr, size); if (!base) return -ENOMEM; - rc = memory_read_from_buffer(buf, count, &offset, base, - data_attr->attr.size); - acpi_os_unmap_memory(base, data_attr->attr.size);
- return rc; + memcpy_fromio(buf, base + offset, count); + + acpi_os_unmap_iomem(base, size); + + return count; }
static int acpi_bert_data_init(void *th, struct acpi_data_attr *data_attr)
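For illustration, the bounds handling that this patch adds to acpi_data_show() can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the kernel code: read_window() is a hypothetical name, the types are simplified to long and size_t, and plain memcpy() stands in for memcpy_fromio(), which is only meaningful on an __iomem mapping inside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the clamping done in acpi_data_show():
 * reject negative offsets, report EOF at or past the end of the region, and
 * trim the request so it never runs past the end of the region.
 */
static long read_window(char *dst, const char *region, long size,
			long offset, size_t count)
{
	assert(dst != NULL && region != NULL && size >= 0);

	if (offset < 0)
		return -EINVAL;

	if (offset >= size)
		return 0;

	if (count > (size_t)(size - offset))
		count = (size_t)(size - offset);

	/* The kernel patch uses memcpy_fromio() on the MMIO mapping here. */
	memcpy(dst, region + offset, count);

	return (long)count;
}
```

With a 6-byte region, a request for 8 bytes at offset 4 is trimmed to the 2 bytes that remain, and a request at offset 6 returns 0.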
From: IotaHydrae writeforever@foxmail.com
[ Upstream commit fa8785e5931367e2b43f2c507f26bcf3e281c0ca ]
Change suniv f1c100s pinctrl,PD14 multiplexing function lvds1 to uart2
When pins PD13 and PD14 are set up for the uart2 function in the dts, the following error occurs: 1c20800.pinctrl: unsupported function uart2 on pin PD14
This happens because 'uart2' is not one of the multiplexing options of PD14, so pinctrl doesn't know how to configure it.
So change the pin PD14 lvds1 function to uart2.
Signed-off-by: IotaHydrae writeforever@foxmail.com Reviewed-by: Andre Przywara andre.przywara@arm.com Link: https://lore.kernel.org/r/tencent_70C1308DDA794C81CAEF389049055BACEC09@qq.co... Signed-off-by: Linus Walleij linus.walleij@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c b/drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c index 2801ca706273..68a5b627fb9b 100644 --- a/drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c +++ b/drivers/pinctrl/sunxi/pinctrl-suniv-f1c100s.c @@ -204,7 +204,7 @@ static const struct sunxi_desc_pin suniv_f1c100s_pins[] = { SUNXI_FUNCTION(0x0, "gpio_in"), SUNXI_FUNCTION(0x1, "gpio_out"), SUNXI_FUNCTION(0x2, "lcd"), /* D20 */ - SUNXI_FUNCTION(0x3, "lvds1"), /* RX */ + SUNXI_FUNCTION(0x3, "uart2"), /* RX */ SUNXI_FUNCTION_IRQ_BANK(0x6, 0, 14)), SUNXI_PIN(SUNXI_PINCTRL_PIN(D, 15), SUNXI_FUNCTION(0x0, "gpio_in"),
From: Thomas Bartschies thomas.bartschies@cvk.de
[ Upstream commit 015c44d7bff3f44d569716117becd570c179ca32 ]
Since the recent introduction of support for the SM3 and SM4 algos for IPsec, the kernel produces invalid pfkey acquire messages when the corresponding crypto modules are disabled. This happens because the availability of the algos wasn't checked in all necessary functions. This patch adds these checks.
Signed-off-by: Thomas Bartschies thomas.bartschies@cvk.de Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/key/af_key.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/key/af_key.c b/net/key/af_key.c index f67d3ba72c49..dd064d5eff6e 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -2904,7 +2904,7 @@ static int count_ah_combs(const struct xfrm_tmpl *t) break; if (!aalg->pfkey_supported) continue; - if (aalg_tmpl_set(t, aalg)) + if (aalg_tmpl_set(t, aalg) && aalg->available) sz += sizeof(struct sadb_comb); } return sz + sizeof(struct sadb_prop); @@ -2922,7 +2922,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t) if (!ealg->pfkey_supported) continue;
- if (!(ealg_tmpl_set(t, ealg))) + if (!(ealg_tmpl_set(t, ealg) && ealg->available)) continue;
for (k = 1; ; k++) { @@ -2933,7 +2933,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t) if (!aalg->pfkey_supported) continue;
- if (aalg_tmpl_set(t, aalg)) + if (aalg_tmpl_set(t, aalg) && aalg->available) sz += sizeof(struct sadb_comb); } }
From: Joel Stanley joel@jms.id.au
[ Upstream commit 6fd45e79e8b93b8d22fb8fe22c32fbad7e9190bd ]
The AST2600 when using the i210 NIC over NC-SI has been observed to produce incorrect checksum results with specific MTU values. This was first observed when sending data across a long distance set of networks.
On a local network, the following test was performed using a 1MB file of random data.
On the receiver run this script:
#!/bin/bash
while [ 1 ]; do
	# Zero the stats
	nstat -r > /dev/null
	nc -l 9899 > test-file
	# Check for checksum errors
	TcpInCsumErrors=$(nstat | grep TcpInCsumErrors)
	if [ -z "$TcpInCsumErrors" ]; then
		echo No TcpInCsumErrors
	else
		echo TcpInCsumErrors = $TcpInCsumErrors
	fi
done
On an AST2600 system:
# nc <IP of receiver host> 9899 < test-file
The test was repeated with various MTU values:
# ip link set mtu 1410 dev eth0
The observed results:
1500 - good 1434 - bad 1400 - good 1410 - bad 1420 - good
The test was repeated after disabling tx checksumming:
# ethtool -K eth0 tx-checksumming off
And all MTU values tested resulted in transfers without error.
An issue with the driver cannot be ruled out, however there has been no bug discovered so far.
David has done the work to take the original bug report of slow data transfer between long distance connections and triaged it down to this test case.
The vendor suspects that this is a hardware issue when using NC-SI. The fixes line refers to the patch that introduced AST2600 support.
Reported-by: David Wilder wilder@us.ibm.com Reviewed-by: Dylan Hung dylan_hung@aspeedtech.com Signed-off-by: Joel Stanley joel@jms.id.au Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/faraday/ftgmac100.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c index 2c06cdcd3e75..d7478d332820 100644 --- a/drivers/net/ethernet/faraday/ftgmac100.c +++ b/drivers/net/ethernet/faraday/ftgmac100.c @@ -1880,6 +1880,11 @@ static int ftgmac100_probe(struct platform_device *pdev) /* AST2400 doesn't have working HW checksum generation */ if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac"))) netdev->hw_features &= ~NETIF_F_HW_CSUM; + + /* AST2600 tx checksum with NCSI is broken */ + if (priv->use_ncsi && of_device_is_compatible(np, "aspeed,ast2600-mac")) + netdev->hw_features &= ~NETIF_F_HW_CSUM; + if (np && of_get_property(np, "no-hw-checksum", NULL)) netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM); netdev->features |= netdev->hw_features;
From: Mika Westerberg mika.westerberg@linux.intel.com
[ Upstream commit 17a0f3acdc6ec8b89ad40f6e22165a4beee25663 ]
Before sending an MSI the hardware writes information pertinent to the interrupt cause to a memory location pointed to by the SMTICL register. This memory holds three double words where the least significant bit tells whether the interrupt cause of master/target/error is valid. The driver does not use this, but we need to set it up because otherwise the hardware will perform a DMA write to the default address (0), which causes an IOMMU fault such as the one below:
DMAR: DRHD: handling fault status reg 2 DMAR: [DMA Write] Request device [00:12.0] PASID ffffffff fault addr 0 [fault reason 05] PTE Write access is not set
To prevent this from happening, provide a proper DMA buffer for this that then gets mapped by the IOMMU accordingly.
Signed-off-by: Mika Westerberg mika.westerberg@linux.intel.com Reviewed-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Wolfram Sang wsa@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/i2c/busses/i2c-ismt.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/i2c/busses/i2c-ismt.c b/drivers/i2c/busses/i2c-ismt.c index 2f95e25a10f7..53325419ec13 100644 --- a/drivers/i2c/busses/i2c-ismt.c +++ b/drivers/i2c/busses/i2c-ismt.c @@ -81,6 +81,7 @@
#define ISMT_DESC_ENTRIES 2 /* number of descriptor entries */ #define ISMT_MAX_RETRIES 3 /* number of SMBus retries to attempt */ +#define ISMT_LOG_ENTRIES 3 /* number of interrupt cause log entries */
/* Hardware Descriptor Constants - Control Field */ #define ISMT_DESC_CWRL 0x01 /* Command/Write Length */ @@ -174,6 +175,8 @@ struct ismt_priv { u8 head; /* ring buffer head pointer */ struct completion cmp; /* interrupt completion */ u8 buffer[I2C_SMBUS_BLOCK_MAX + 16]; /* temp R/W data buffer */ + dma_addr_t log_dma; + u32 *log; };
/** @@ -408,6 +411,9 @@ static int ismt_access(struct i2c_adapter *adap, u16 addr, memset(desc, 0, sizeof(struct ismt_desc)); desc->tgtaddr_rw = ISMT_DESC_ADDR_RW(addr, read_write);
+ /* Always clear the log entries */ + memset(priv->log, 0, ISMT_LOG_ENTRIES * sizeof(u32)); + /* Initialize common control bits */ if (likely(pci_dev_msi_enabled(priv->pci_dev))) desc->control = ISMT_DESC_INT | ISMT_DESC_FAIR; @@ -697,6 +703,8 @@ static void ismt_hw_init(struct ismt_priv *priv) /* initialize the Master Descriptor Base Address (MDBA) */ writeq(priv->io_rng_dma, priv->smba + ISMT_MSTR_MDBA);
+ writeq(priv->log_dma, priv->smba + ISMT_GR_SMTICL); + /* initialize the Master Control Register (MCTRL) */ writel(ISMT_MCTRL_MEIE, priv->smba + ISMT_MSTR_MCTRL);
@@ -784,6 +792,12 @@ static int ismt_dev_init(struct ismt_priv *priv) priv->head = 0; init_completion(&priv->cmp);
+ priv->log = dmam_alloc_coherent(&priv->pci_dev->dev, + ISMT_LOG_ENTRIES * sizeof(u32), + &priv->log_dma, GFP_KERNEL); + if (!priv->log) + return -ENOMEM; + return 0; }
From: Piyush Malgujar pmalgujar@marvell.com
[ Upstream commit 03a35bc856ddc09f2cc1f4701adecfbf3b464cb3 ]
Due to i2c->adap.dev.fwnode not being set, ACPI_COMPANION() wasn't properly found for TWSI controllers.
Signed-off-by: Szymon Balcerak sbalcerak@marvell.com Signed-off-by: Piyush Malgujar pmalgujar@marvell.com Signed-off-by: Wolfram Sang wsa@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/i2c/busses/i2c-thunderx-pcidrv.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/i2c/busses/i2c-thunderx-pcidrv.c b/drivers/i2c/busses/i2c-thunderx-pcidrv.c index 19f8eec38717..107aeb8b54da 100644 --- a/drivers/i2c/busses/i2c-thunderx-pcidrv.c +++ b/drivers/i2c/busses/i2c-thunderx-pcidrv.c @@ -208,6 +208,7 @@ static int thunder_i2c_probe_pci(struct pci_dev *pdev, i2c->adap.bus_recovery_info = &octeon_i2c_recovery_info; i2c->adap.dev.parent = dev; i2c->adap.dev.of_node = pdev->dev.of_node; + i2c->adap.dev.fwnode = dev->fwnode; snprintf(i2c->adap.name, sizeof(i2c->adap.name), "Cavium ThunderX i2c adapter at %s", dev_name(dev)); i2c_set_adapdata(&i2c->adap, i2c);
From: Stephen Brennan stephen.s.brennan@oracle.com
commit d1dc87763f406d4e67caf16dbe438a5647692395 upstream.
A rare BUG_ON triggered in assoc_array_gc:
[3430308.818153] kernel BUG at lib/assoc_array.c:1609!
Which corresponded to the statement currently at line 1593 upstream:
BUG_ON(assoc_array_ptr_is_meta(p));
Using the data from the core dump, I was able to generate a userspace reproducer[1] and determine the cause of the bug.
[1]: https://github.com/brenns10/kernel_stuff/tree/master/assoc_array_gc
After running the iterator on the entire branch, an internal tree node looked like the following:
NODE (nr_leaves_on_branch: 3) SLOT [0] NODE (2 leaves) SLOT [1] NODE (1 leaf) SLOT [2..f] NODE (empty)
In the userspace reproducer, the pr_devel output when compressing this node was:
-- compress node 0x5607cc089380 -- free=0, leaves=0 [0] retain node 2/1 [nx 0] [1] fold node 1/1 [nx 0] [2] fold node 0/1 [nx 2] [3] fold node 0/2 [nx 2] [4] fold node 0/3 [nx 2] [5] fold node 0/4 [nx 2] [6] fold node 0/5 [nx 2] [7] fold node 0/6 [nx 2] [8] fold node 0/7 [nx 2] [9] fold node 0/8 [nx 2] [10] fold node 0/9 [nx 2] [11] fold node 0/10 [nx 2] [12] fold node 0/11 [nx 2] [13] fold node 0/12 [nx 2] [14] fold node 0/13 [nx 2] [15] fold node 0/14 [nx 2] after: 3
At slot 0, an internal node with 2 leaves could not be folded into the node, because there was only one available slot (slot 0). Thus, the internal node was retained. At slot 1, the node had one leaf, and was able to be folded in successfully. The remaining nodes had no leaves, and so were removed. By the end of the compression stage, there were 14 free slots, and only 3 leaf nodes. The tree was ascended and then its parent node was compressed. When this node was seen, it could not be folded, due to the internal node it contained.
The invariant for compression in this function is: whenever nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT, the node should contain all leaf nodes. The compression step currently cannot guarantee this, given the corner case shown above.
To fix this issue, retry compression whenever we have retained a node, and yet nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT. This second compression will then allow the node in slot 1 to be folded in, satisfying the invariant. Below is the output of the reproducer once the fix is applied:
-- compress node 0x560e9c562380 -- free=0, leaves=0 [0] retain node 2/1 [nx 0] [1] fold node 1/1 [nx 0] [2] fold node 0/1 [nx 2] [3] fold node 0/2 [nx 2] [4] fold node 0/3 [nx 2] [5] fold node 0/4 [nx 2] [6] fold node 0/5 [nx 2] [7] fold node 0/6 [nx 2] [8] fold node 0/7 [nx 2] [9] fold node 0/8 [nx 2] [10] fold node 0/9 [nx 2] [11] fold node 0/10 [nx 2] [12] fold node 0/11 [nx 2] [13] fold node 0/12 [nx 2] [14] fold node 0/13 [nx 2] [15] fold node 0/14 [nx 2] internal nodes remain despite enough space, retrying -- compress node 0x560e9c562380 -- free=14, leaves=1 [0] fold node 2/15 [nx 0] after: 3
Changes ======= DH: - Use false instead of 0. - Reorder the inserted lines in a couple of places to put retained before next_slot.
ver #2) - Fix typo in pr_devel, correct comparison to "<="
Fixes: 3cb989501c26 ("Add a generic associative array implementation.") Cc: stable@vger.kernel.org Signed-off-by: Stephen Brennan stephen.s.brennan@oracle.com Signed-off-by: David Howells dhowells@redhat.com cc: Andrew Morton akpm@linux-foundation.org cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/20220511225517.407935-1-stephen.s.brennan@oracle.c... # v1 Link: https://lore.kernel.org/r/20220512215045.489140-1-stephen.s.brennan@oracle.c... # v2 Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- lib/assoc_array.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/lib/assoc_array.c +++ b/lib/assoc_array.c @@ -1462,6 +1462,7 @@ int assoc_array_gc(struct assoc_array *a struct assoc_array_ptr *cursor, *ptr; struct assoc_array_ptr *new_root, *new_parent, **new_ptr_pp; unsigned long nr_leaves_on_tree; + bool retained; int keylen, slot, nr_free, next_slot, i;
pr_devel("-->%s()\n", __func__); @@ -1538,6 +1539,7 @@ continue_node: goto descend; }
+retry_compress: pr_devel("-- compress node %p --\n", new_n);
/* Count up the number of empty slots in this node and work out the @@ -1555,6 +1557,7 @@ continue_node: pr_devel("free=%d, leaves=%lu\n", nr_free, new_n->nr_leaves_on_branch);
/* See what we can fold in */ + retained = false; next_slot = 0; for (slot = 0; slot < ASSOC_ARRAY_FAN_OUT; slot++) { struct assoc_array_shortcut *s; @@ -1604,9 +1607,14 @@ continue_node: pr_devel("[%d] retain node %lu/%d [nx %d]\n", slot, child->nr_leaves_on_branch, nr_free + 1, next_slot); + retained = true; } }
+ if (retained && new_n->nr_leaves_on_branch <= ASSOC_ARRAY_FAN_OUT) { + pr_devel("internal nodes remain despite enough space, retrying\n"); + goto retry_compress; + } pr_devel("after: %lu\n", new_n->nr_leaves_on_branch);
nr_leaves_on_tree = new_n->nr_leaves_on_branch;
From: Miri Korenblit miriam.rachel.korenblit@intel.com
commit 1b7b3ac8ff3317cdcf07a1c413de9bdb68019c2b upstream.
We used to set regulatory info before the registration of the device and then the regulatory info didn't get set, because the device isn't registered so there isn't a device to set the regulatory info for. So set the regulatory info after the device registration. Call reg_process_self_managed_hints() once again after the device registration because it does nothing before it.
Signed-off-by: Miri Korenblit miriam.rachel.korenblit@intel.com Signed-off-by: Luca Coelho luciano.coelho@intel.com Link: https://lore.kernel.org/r/iwlwifi.20210618133832.c96eadcffe80.I86799c2c866b5... Signed-off-by: Johannes Berg johannes.berg@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/wireless/core.c | 8 ++++---- net/wireless/reg.c | 1 + 2 files changed, 5 insertions(+), 4 deletions(-)
--- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -5,7 +5,7 @@ * Copyright 2006-2010 Johannes Berg johannes@sipsolutions.net * Copyright 2013-2014 Intel Mobile Communications GmbH * Copyright 2015-2017 Intel Deutschland GmbH - * Copyright (C) 2018-2019 Intel Corporation + * Copyright (C) 2018-2021 Intel Corporation */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt @@ -891,9 +891,6 @@ int wiphy_register(struct wiphy *wiphy) return res; }
- /* set up regulatory info */ - wiphy_regulatory_register(wiphy); - list_add_rcu(&rdev->list, &cfg80211_rdev_list); cfg80211_rdev_list_generation++;
@@ -904,6 +901,9 @@ int wiphy_register(struct wiphy *wiphy) cfg80211_debugfs_rdev_add(rdev); nl80211_notify_wiphy(rdev, NL80211_CMD_NEW_WIPHY);
+ /* set up regulatory info */ + wiphy_regulatory_register(wiphy); + if (wiphy->regulatory_flags & REGULATORY_CUSTOM_REG) { struct regulatory_request request;
--- a/net/wireless/reg.c +++ b/net/wireless/reg.c @@ -3790,6 +3790,7 @@ void wiphy_regulatory_register(struct wi
wiphy_update_regulatory(wiphy, lr->initiator); wiphy_all_share_dfs_chan_state(wiphy); + reg_process_self_managed_hints(); }
void wiphy_regulatory_deregister(struct wiphy *wiphy)
From: Gustavo A. R. Silva gustavoars@kernel.org
commit 336feb502a715909a8136eb6a62a83d7268a353b upstream.
Fix the following -Wstringop-overflow warnings when building with GCC-11:
drivers/gpu/drm/i915/intel_pm.c:3106:9: warning: ‘intel_read_wm_latency’ accessing 16 bytes in a region of size 10 [-Wstringop-overflow=] 3106 | intel_read_wm_latency(dev_priv, dev_priv->wm.pri_latency); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/gpu/drm/i915/intel_pm.c:3106:9: note: referencing argument 2 of type ‘u16 *’ {aka ‘short unsigned int *’} drivers/gpu/drm/i915/intel_pm.c:2861:13: note: in a call to function ‘intel_read_wm_latency’ 2861 | static void intel_read_wm_latency(struct drm_i915_private *dev_priv, | ^~~~~~~~~~~~~~~~~~~~~
by removing the over-specified array size from the argument declarations.
It seems that this code is actually safe because the size of the array depends on the hardware generation, and the function checks for that.
Notice that wm can be an array of 5 elements: drivers/gpu/drm/i915/intel_pm.c:3109: intel_read_wm_latency(dev_priv, dev_priv->wm.pri_latency);
or an array of 8 elements: drivers/gpu/drm/i915/intel_pm.c:3131: intel_read_wm_latency(dev_priv, dev_priv->wm.skl_latency);
and the compiler legitimately complains about that.
This helps with the ongoing efforts to globally enable -Wstringop-overflow.
Link: https://github.com/KSPP/linux/issues/181 Signed-off-by: Gustavo A. R. Silva gustavoars@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/gpu/drm/i915/intel_pm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/gpu/drm/i915/intel_pm.c +++ b/drivers/gpu/drm/i915/intel_pm.c @@ -2822,7 +2822,7 @@ hsw_compute_linetime_wm(const struct int }
static void intel_read_wm_latency(struct drm_i915_private *dev_priv, - u16 wm[8]) + u16 wm[]) { struct intel_uncore *uncore = &dev_priv->uncore;
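As a side note on the change above: in C, a parameter declared as `u16 wm[8]` is adjusted to a plain pointer, so the bound is not enforced, yet the warning quoted in the commit message shows GCC 11 using it when checking callers. A minimal userspace sketch of the unsized form follows. The function name fill_latency() and the explicit count parameter n are hypothetical; the kernel function takes no count and instead derives the number of entries from the hardware generation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: the parameter is declared without a bound, so
 * callers may pass arrays of different sizes and the caller-supplied
 * count decides how many entries are written.
 */
static void fill_latency(uint16_t wm[], size_t n)
{
	assert(wm != NULL);

	for (size_t i = 0; i < n; i++)
		wm[i] = (uint16_t)(i * 10);
}
```

The same function can then be called with a 5-element array and n = 5, or with an 8-element array and n = 8, without the declaration claiming a size that one of the callers does not have.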
From: Kees Cook keescook@chromium.org
commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.
Quoting[1] Ariadne Conill:
"In several other operating systems, it is a hard requirement that the second argument to execve(2) be the name of a program, thus prohibiting a scenario where argc < 1. POSIX 2017 also recommends this behaviour, but it is not an explicit requirement[2]:
The argument arg0 should point to a filename string that is associated with the process being started by one of the exec functions. ... Interestingly, Michael Kerrisk opened an issue about this in 2008[3], but there was no consensus to support fixing this issue then. Hopefully now that CVE-2021-4034 shows practical exploitative use[4] of this bug in a shellcode, we can reconsider.
This issue is being tracked in the KSPP issue tracker[5]."
While the initial code searches[6][7] turned up what appeared to be mostly corner case tests, trying to just reject argv == NULL (or an immediately terminated pointer list) quickly started tripping[8] existing userspace programs.
The next best approach is forcing a single empty string into argv and adjusting argc to match. The number of programs depending on argc == 0 seems a smaller set than those calling execve with a NULL argv.
Account for the additional stack space in bprm_stack_limits(). Inject an empty string when argc == 0 (and set argc = 1). Warn about the case so userspace has some notice about the change:
process './argc0' launched './argc0' with NULL argv: empty string added
Additionally WARN() and reject NULL argv usage for kernel threads.
[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org... [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408 [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt [5] https://github.com/KSPP/linux/issues/176 [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*... [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2... [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
Reported-by: Ariadne Conill ariadne@dereferenced.org Reported-by: Michael Kerrisk mtk.manpages@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: Christian Brauner brauner@kernel.org Cc: Rich Felker dalias@libc.org Cc: Eric Biederman ebiederm@xmission.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Kees Cook keescook@chromium.org Acked-by: Christian Brauner brauner@kernel.org Acked-by: Ariadne Conill ariadne@dereferenced.org Acked-by: Andy Lutomirski luto@kernel.org Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org [vegard: fixed conflicts due to missing 886d7de631da71e30909980fdbf318f7caade262^- and 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^-] Signed-off-by: Vegard Nossum vegard.nossum@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/exec.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-)
This has been tested in both argc == 0 and argc >= 1 cases, but I would still appreciate a review given the differences with mainline. If it's considered too risky I'm also fine with dropping it -- just wanted to make sure this didn't fall through the cracks, as it does block a real (albeit old by now) exploit.
--- a/fs/exec.c +++ b/fs/exec.c @@ -454,6 +454,9 @@ static int prepare_arg_pages(struct linu unsigned long limit, ptr_size;
bprm->argc = count(argv, MAX_ARG_STRINGS); + if (bprm->argc == 0) + pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n", + current->comm, bprm->filename); if (bprm->argc < 0) return bprm->argc;
@@ -482,8 +485,14 @@ static int prepare_arg_pages(struct linu * the stack. They aren't stored until much later when we can't * signal to the parent that the child has run out of stack space. * Instead, calculate it here so it's possible to fail gracefully. + * + * In the case of argc = 0, make sure there is space for adding a + * empty string (which will bump argc to 1), to ensure confused + * userspace programs don't start processing from argv[1], thinking + * argc can never be 0, to keep them from walking envp by accident. + * See do_execveat_common(). */ - ptr_size = (bprm->argc + bprm->envc) * sizeof(void *); + ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *); if (limit <= ptr_size) return -E2BIG; limit -= ptr_size; @@ -1848,6 +1857,20 @@ static int __do_execve_file(int fd, stru if (retval < 0) goto out;
+ /* + * When argv is empty, add an empty string ("") as argv[0] to + * ensure confused userspace programs that start processing + * from argv[1] won't end up walking envp. See also + * bprm_stack_limits(). + */ + if (bprm->argc == 0) { + const char *argv[] = { "", NULL }; + retval = copy_strings_kernel(1, argv, bprm); + if (retval < 0) + goto out; + bprm->argc = 1; + } + retval = exec_binprm(bprm); if (retval < 0) goto out;
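Independently of this kernel change, a userspace program can avoid relying on argv[0] being present. The following is a minimal sketch, not taken from the patch: program_name() and the "(unknown)" fallback string are hypothetical. It treats argc < 1, a NULL argv[0], and the empty string that this patch injects as the same "no name" case.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a printable program name without assuming
 * that argv[0] exists or is non-empty.
 */
static const char *program_name(int argc, char **argv)
{
	assert(argv != NULL);

	if (argc < 1 || argv[0] == NULL || argv[0][0] == '\0')
		return "(unknown)";

	return argv[0];
}
```

A program written this way also never starts option parsing at argv[1] when argc is 0, which is the pattern the comment in the patch is guarding against.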
From: Florian Westphal fw@strlen.de
commit 56b14ecec97f39118bf85c9ac2438c5a949509ed upstream.
In case the conntrack is clashing, insertion can free skb->_nfct and set skb->_nfct to the already-confirmed entry.
This wasn't found before because the conntrack entry and the extension space used to be freed after an RCU grace period, plus the race needs events enabled to trigger.
Reported-by: syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Fixes: 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/netfilter/nf_conntrack_core.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
--- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -59,8 +59,13 @@ static inline int nf_conntrack_confirm(s int ret = NF_ACCEPT;
if (ct) { - if (!nf_ct_is_confirmed(ct)) + if (!nf_ct_is_confirmed(ct)) { ret = __nf_conntrack_confirm(skb); + + if (ret == NF_ACCEPT) + ct = (struct nf_conn *)skb_nfct(skb); + } + if (likely(ret == NF_ACCEPT)) nf_ct_deliver_cached_events(ct); }
From: Vitaly Chikunov vt@altlinux.org
commit 7cc7ab73f83ee6d50dc9536bc3355495d8600fad upstream.
Correctly compare values that shall be greater-or-equal and not just greater.
Fixes: 0d7a78643f69 ("crypto: ecrdsa - add EC-RDSA (GOST 34.10) algorithm") Cc: stable@vger.kernel.org Signed-off-by: Vitaly Chikunov vt@altlinux.org Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- crypto/ecrdsa.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/crypto/ecrdsa.c +++ b/crypto/ecrdsa.c @@ -112,15 +112,15 @@ static int ecrdsa_verify(struct akcipher
/* Step 1: verify that 0 < r < q, 0 < s < q */ if (vli_is_zero(r, ndigits) || - vli_cmp(r, ctx->curve->n, ndigits) == 1 || + vli_cmp(r, ctx->curve->n, ndigits) >= 0 || vli_is_zero(s, ndigits) || - vli_cmp(s, ctx->curve->n, ndigits) == 1) + vli_cmp(s, ctx->curve->n, ndigits) >= 0) return -EKEYREJECTED;
/* Step 2: calculate hash (h) of the message (passed as input) */ /* Step 3: calculate e = h \mod q */ vli_from_le64(e, digest, ndigits); - if (vli_cmp(e, ctx->curve->n, ndigits) == 1) + if (vli_cmp(e, ctx->curve->n, ndigits) >= 0) vli_sub(e, e, ctx->curve->n, ndigits); if (vli_is_zero(e, ndigits)) e[0] = 1; @@ -136,7 +136,7 @@ static int ecrdsa_verify(struct akcipher /* Step 6: calculate point C = z_1P + z_2Q, and R = x_c \mod q */ ecc_point_mult_shamir(&cc, z1, &ctx->curve->g, z2, &ctx->pub_key, ctx->curve); - if (vli_cmp(cc.x, ctx->curve->n, ndigits) == 1) + if (vli_cmp(cc.x, ctx->curve->n, ndigits) >= 0) vli_sub(cc.x, cc.x, ctx->curve->n, ndigits);
/* Step 7: if R == r signature is valid */
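To make the difference between "== 1" and ">= 0" concrete, here is a small userspace sketch of the Step 1 range check 0 < r < q. It is only an illustration under stated assumptions: cmp_digits(), is_zero() and scalar_in_range() are hypothetical helpers, modelled on a three-way comparison that returns 1, 0 or -1 over little-endian 64-bit digits. They are not the kernel's vli_* functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Three-way compare of multi-digit numbers, most significant digit last. */
static int cmp_digits(const uint64_t *left, const uint64_t *right,
		      unsigned int ndigits)
{
	for (int i = (int)ndigits - 1; i >= 0; i--) {
		if (left[i] > right[i])
			return 1;
		if (left[i] < right[i])
			return -1;
	}
	return 0;
}

static bool is_zero(const uint64_t *v, unsigned int ndigits)
{
	for (unsigned int i = 0; i < ndigits; i++)
		if (v[i])
			return false;
	return true;
}

/* True only for 0 < r < q; r == q must be rejected as well. */
static bool scalar_in_range(const uint64_t *r, const uint64_t *q,
			    unsigned int ndigits)
{
	assert(ndigits > 0);
	return !is_zero(r, ndigits) && cmp_digits(r, q, ndigits) < 0;
}
```

A check written as "reject if cmp == 1" would let r == q through, because the comparison returns 0 in that case; rejecting on ">= 0" (equivalently, accepting only "< 0") closes that gap.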
From: Sultan Alsawaf sultan@kerneltoast.com
commit 2505a981114dcb715f8977b8433f7540854851d8 upstream.
The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races.
It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process).
Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration.
Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf sultan@kerneltoast.com Acked-by: Minchan Kim minchan@kernel.org Cc: Nitin Gupta ngupta@vflare.org Cc: Sergey Senozhatsky senozhatsky@chromium.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/zsmalloc.c | 37 +++++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-)
--- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1748,11 +1748,40 @@ static enum fullness_group putback_zspag */ static void lock_zspage(struct zspage *zspage) { - struct page *page = get_first_page(zspage); + struct page *curr_page, *page;
- do { - lock_page(page); - } while ((page = get_next_page(page)) != NULL); + /* + * Pages we haven't locked yet can be migrated off the list while we're + * trying to lock them, so we need to be careful and only attempt to + * lock each page under migrate_read_lock(). Otherwise, the page we lock + * may no longer belong to the zspage. This means that we may wait for + * the wrong page to unlock, so we must take a reference to the page + * prior to waiting for it to unlock outside migrate_read_lock(). + */ + while (1) { + migrate_read_lock(zspage); + page = get_first_page(zspage); + if (trylock_page(page)) + break; + get_page(page); + migrate_read_unlock(zspage); + wait_on_page_locked(page); + put_page(page); + } + + curr_page = page; + while ((page = get_next_page(curr_page))) { + if (trylock_page(page)) { + curr_page = page; + } else { + get_page(page); + migrate_read_unlock(zspage); + wait_on_page_locked(page); + put_page(page); + migrate_read_lock(zspage); + } + } + migrate_read_unlock(zspage); }
static int zs_init_fs_context(struct fs_context *fc)
From: Dan Carpenter dan.carpenter@oracle.com
commit d3f2a14b8906df913cb04a706367b012db94a6e8 upstream.
The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this:
drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r'
Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Reviewed-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-integrity.c | 2 -- 1 file changed, 2 deletions(-)
--- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -4149,8 +4149,6 @@ try_smaller_buffer: }
if (should_write_sb) { - int r; - init_journal(ic, 0, ic->journal_sections, 0); r = dm_integrity_failed(ic); if (unlikely(r)) {
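The shadowing problem described in the commit message can be reproduced in a few lines of userspace C. This is a simplified sketch, not the dm-integrity code: failing_step(), ctr_shadowed() and ctr_fixed() are hypothetical names, and the error value -EIO is chosen only for the example.

```c
#include <assert.h>	/* assert() is only used by callers checking the results */
#include <errno.h>

/* Hypothetical step that always fails. */
static int failing_step(void)
{
	return -EIO;
}

/* Buggy variant: the inner 'r' hides the function-scope 'r'. */
static int ctr_shadowed(int should_write_sb)
{
	int r = 0;

	if (should_write_sb) {
		int r;	/* shadows the outer r */

		r = failing_step();
		if (r)
			goto bad;
	}
	return 0;
bad:
	return r;	/* outer r, which is still 0 */
}

/* Fixed variant: the inner declaration is removed, as in the patch. */
static int ctr_fixed(int should_write_sb)
{
	int r = 0;

	if (should_write_sb) {
		r = failing_step();
		if (r)
			goto bad;
	}
	return 0;
bad:
	return r;
}
```

In this sketch ctr_shadowed(1) returns 0 even though the step failed, while ctr_fixed(1) returns -EIO. Building with -Wshadow is one way to have the compiler flag the inner declaration.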
From: Mikulas Patocka mpatocka@redhat.com
commit 567dd8f34560fa221a6343729474536aa7ede4fd upstream.
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor.
Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character.
Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Tested-by: Milan Broz gmazyland@gmail.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-crypt.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
--- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -2817,6 +2817,11 @@ static int crypt_map(struct dm_target *t return DM_MAPIO_SUBMITTED; }
+static char hex2asc(unsigned char c) +{ + return c + '0' + ((unsigned)(9 - c) >> 4 & 0x27); +} + static void crypt_status(struct dm_target *ti, status_type_t type, unsigned status_flags, char *result, unsigned maxlen) { @@ -2835,9 +2840,12 @@ static void crypt_status(struct dm_targe if (cc->key_size > 0) { if (cc->key_string) DMEMIT(":%u:%s", cc->key_size, cc->key_string); - else - for (i = 0; i < cc->key_size; i++) - DMEMIT("%02x", cc->key[i]); + else { + for (i = 0; i < cc->key_size; i++) { + DMEMIT("%c%c", hex2asc(cc->key[i] >> 4), + hex2asc(cc->key[i] & 0xf)); + } + } } else DMEMIT("-");
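The arithmetic in hex2asc() can be checked in userspace. For c in 0..9, (9 - c) is non-negative and smaller than 16, so the shift yields 0 and the result is c + '0'. For c in 10..15, (9 - c) is negative, the cast to unsigned sets the high bits, the shift and mask leave 0x27 (39), and c + '0' + 39 lands on 'a'..'f'. The sketch below copies hex2asc() from the patch; key_to_hex() is a hypothetical helper added only for this illustration and stands in for the DMEMIT("%c%c", ...) loop.

```c
#include <assert.h>
#include <stddef.h>

/* Copied from the patch: branch-free, table-free nibble to ASCII. */
static char hex2asc(unsigned char c)
{
	return c + '0' + ((unsigned)(9 - c) >> 4 & 0x27);
}

/*
 * Hypothetical helper: write 2 * len lowercase hex characters plus a
 * terminating NUL into out, which must hold at least 2 * len + 1 bytes.
 */
static void key_to_hex(const unsigned char *key, size_t len, char *out)
{
	assert(key != NULL && out != NULL);

	for (size_t i = 0; i < len; i++) {
		out[2 * i] = hex2asc(key[i] >> 4);
		out[2 * i + 1] = hex2asc(key[i] & 0xf);
	}
	out[2 * len] = '\0';
}
```

For the three-byte key { 0x00, 0x9a, 0xff } this produces the string "009aff", the same text that "%02x" would have printed.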
From: Mikulas Patocka mpatocka@redhat.com
commit bfe2b0146c4d0230b68f5c71a64380ff8d361f8b upstream.
dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka mpatocka@redhat.com
Signed-off-by: Mike Snitzer snitzer@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/md/dm-stats.c | 8 ++++++++
 1 file changed, 8 insertions(+)

--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -224,6 +224,7 @@ void dm_stats_cleanup(struct dm_stats *s
 				       atomic_read(&shared->in_flight[READ]),
 				       atomic_read(&shared->in_flight[WRITE]));
 			}
+			cond_resched();
 		}
 		dm_stat_free(&s->rcu_head);
 	}
@@ -313,6 +314,7 @@ static int dm_stats_create(struct dm_sta
 	for (ni = 0; ni < n_entries; ni++) {
 		atomic_set(&s->stat_shared[ni].in_flight[READ], 0);
 		atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0);
+		cond_resched();
 	}
 
 	if (s->n_histogram_entries) {
@@ -325,6 +327,7 @@ static int dm_stats_create(struct dm_sta
 		for (ni = 0; ni < n_entries; ni++) {
 			s->stat_shared[ni].tmp.histogram = hi;
 			hi += s->n_histogram_entries + 1;
+			cond_resched();
 		}
 	}
 
@@ -345,6 +348,7 @@ static int dm_stats_create(struct dm_sta
 			for (ni = 0; ni < n_entries; ni++) {
 				p[ni].histogram = hi;
 				hi += s->n_histogram_entries + 1;
+				cond_resched();
 			}
 		}
 	}
@@ -474,6 +478,7 @@ static int dm_stats_list(struct dm_stats
 			}
 			DMEMIT("\n");
 		}
+		cond_resched();
 	}
 	mutex_unlock(&stats->mutex);
 
@@ -750,6 +755,7 @@ static void __dm_stat_clear(struct dm_st
 				local_irq_enable();
 			}
 		}
+		cond_resched();
 	}
 }
 
@@ -865,6 +871,8 @@ static int dm_stats_print(struct dm_stat
 
 		if (unlikely(sz + 1 >= maxlen))
 			goto buffer_overflow;
+
+		cond_resched();
 	}
 
 	if (clear)
From: Sarthak Kukreti sarthakkukreti@google.com
commit 4caae58406f8ceb741603eee460d79bacca9b1b5 upstream.
The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type.
Fixes: a4ffc152198e ("dm: add verity target")
Cc: stable@vger.kernel.org
Signed-off-by: Sarthak Kukreti sarthakkukreti@google.com
Reviewed-by: Kees Cook keescook@chromium.org
Signed-off-by: Mike Snitzer snitzer@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/md/dm-verity-target.c | 1 +
 1 file changed, 1 insertion(+)

--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -1217,6 +1217,7 @@ bad:
 
 static struct target_type verity_target = {
 	.name		= "verity",
+	.features	= DM_TARGET_IMMUTABLE,
 	.version	= {1, 5, 0},
 	.module		= THIS_MODULE,
 	.ctr		= verity_ctr,
I believe this commit introduced a regression in dm verity on systems where data device is an NVME one. Loading table fails with the following diagnostics:
device-mapper: table: table load rejected: including non-request-stackable devices
The same kernel works with the same data drive on the SCSI interface. NVME-backed dm verity works with just this commit reverted.
I believe the presence of the immutable partition is used as an indicator of special case NVME configuration and if the data device's name starts with "nvme" the code tries to switch the target type to DM_TYPE_NVME_BIO_BASED (drivers/md/dm-table.c lines 1003-1010).
The special NVME optimization case was removed in 5.10 by commit 9c37de297f6590937f95a28bec1b7ac68a38618f, so only 5.4 is affected.
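The name-based test described in this report can be illustrated in user space. This is a sketch only: name_is_nvme() and all_names_nvme() are hypothetical names, and plain strings stand in for block device names; the actual 5.4 logic lives in drivers/md/dm-table.c and operates on struct dm_dev.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when a device name begins with the four characters "nvme". */
static bool name_is_nvme(const char *name)
{
	return strncmp(name, "nvme", 4) == 0;
}

/*
 * True only when every name in the array passes the prefix test.
 * An empty array yields true.
 */
static bool all_names_nvme(const char *const *names, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!name_is_nvme(names[i]))
			return false;
	}
	return true;
}
```

If the reading in this report is right, a check of this kind would explain why the same drive behaves differently when it shows up as "sda" rather than "nvme0n1".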
On Fri, Jun 10, 2022 at 04:22:00AM +0000, Oleksandr Tymoshenko wrote:
> The special NVME optimization case was removed in 5.10 by commit 9c37de297f6590937f95a28bec1b7ac68a38618f, so only 5.4 is affected.
Why wouldn't 4.9, 4.14, and 4.19 also be affected here? Should I also just queue up 9c37de297f65 ("dm: remove special-casing of bio-based immutable singleton target on NVMe") to those older kernels? If so, have you tested this and verified that it worked?
thanks,
greg k-h
On Thu, Jun 9, 2022 at 10:15 PM Greg KH gregkh@linuxfoundation.org wrote:
> Why wouldn't 4.9, 4.14, and 4.19 also be affected here?
Just a bad choice of words on my side: we use only 5.x branches and it slipped my mind to verify all actively supported branches. 4.19 is likely to be affected, since it has the same code for the NVME optimization as 5.4. 4.9 and 4.14 don't have this code, so they are probably not affected.
> Should I also just queue up 9c37de297f65 ("dm: remove special-casing of bio-based immutable singleton target on NVMe") to those older kernels? If so, have you tested this and verified that it worked?
I don't have enough expertise in this domain to recommend a solution, that's why I reported the problem instead of sending a patch. I did take a quick look though: it doesn't apply cleanly, and it seems that the code removed by 9c37de297f65 was removed as a result of some other refactoring, so I think it's more complex than backporting a single commit.
> thanks,
> greg k-h
On Fri, Jun 10 2022 at 1:15P -0400, Greg KH gregkh@linuxfoundation.org wrote:
> Why wouldn't 4.9, 4.14, and 4.19 also be affected here? Should I also just queue up 9c37de297f65 ("dm: remove special-casing of bio-based immutable singleton target on NVMe") to those older kernels? If so, have you tested this and verified that it worked?
Sorry for the unforeseen stable@ troubles here!
In general we'd be fine to apply commit 9c37de297f65 but to do it properly would require also making sure commits that remove "DM_TYPE_NVME_BIO_BASED", like 8d47e65948dd ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks") are applied -- basically any lingering references to DM_TYPE_NVME_BIO_BASED need to be removed.
The commit header for 8d47e65948dd documents what DM_TYPE_NVME_BIO_BASED was used for.. it was dm-mpath specific and "nvme" mode really never got used by any userspace that I'm aware of.
Sadly I currently don't have the time to do this backport for all N stable kernels... :(
But if that backport gets out of control: A simpler, albeit stable@ unicorn, way to resolve this is to simply revert 9c37de297f65 and make it so that DM-mpath and DM core just used bio-based if "nvme" is requested by dm-mpath, so also in drivers/md/dm-mpath.c e.g.:
@@ -1091,8 +1088,6 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)

 		if (!strcasecmp(queue_mode_name, "bio"))
 			m->queue_mode = DM_TYPE_BIO_BASED;
 		else if (!strcasecmp(queue_mode_name, "nvme"))
-			m->queue_mode = DM_TYPE_NVME_BIO_BASED;
+			m->queue_mode = DM_TYPE_BIO_BASED;
 		else if (!strcasecmp(queue_mode_name, "rq"))
 			m->queue_mode = DM_TYPE_REQUEST_BASED;
 		else if (!strcasecmp(queue_mode_name, "mq"))
Mike
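The effect of the dm-mpath change suggested in this message can be sketched as a small user-space parser. Everything here is illustrative: enum queue_mode and parse_queue_mode() are hypothetical stand-ins for the kernel's enum dm_queue_mode and parse_features(), and the "mq" case is left out because its body is not shown in the hunk.

```c
#include <strings.h>	/* strcasecmp() is POSIX and declared here */

/* Hypothetical stand-ins for the kernel's enum dm_queue_mode values. */
enum queue_mode {
	QUEUE_MODE_INVALID = -1,
	QUEUE_MODE_BIO,
	QUEUE_MODE_RQ,
};

/* Case-insensitive mapping of a queue mode name to a mode. */
static enum queue_mode parse_queue_mode(const char *name)
{
	if (!strcasecmp(name, "bio"))
		return QUEUE_MODE_BIO;
	else if (!strcasecmp(name, "nvme"))
		return QUEUE_MODE_BIO;	/* "nvme" is folded into plain bio-based */
	else if (!strcasecmp(name, "rq"))
		return QUEUE_MODE_RQ;
	return QUEUE_MODE_INVALID;
}
```

With this mapping an existing table that still says "nvme" keeps loading, while nothing downstream ever sees a separate NVMe mode.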
On Fri, Jun 10, 2022 at 11:11:00AM -0400, Mike Snitzer wrote: [ ... ]
Ok, please submit a working patch for the kernels that need it so that we can review and apply it to solve this regression.
thanks,
greg k-h
On Mon, Jun 13, 2022 at 11:13:21AM +0200, Greg KH wrote:
On Fri, Jun 10, 2022 at 11:11:00AM -0400, Mike Snitzer wrote:
> > But if that backport gets out of control: A simpler, albeit stable@ unicorn, way to resolve this is to simply revert 9c37de297f65 and make
9c37de297f65 can not be reverted in 5.4 and older because it isn't there, and trying to apply it results in conflicts which at least I can not resolve.
> > it so that DM-mpath and DM core just used bio-based if "nvme" is requested by dm-mpath, so also in drivers/md/dm-mpath.c e.g.:
> > @@ -1091,8 +1088,6 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
> >
> >  		if (!strcasecmp(queue_mode_name, "bio"))
> >  			m->queue_mode = DM_TYPE_BIO_BASED;
> >  		else if (!strcasecmp(queue_mode_name, "nvme"))
> > -			m->queue_mode = DM_TYPE_NVME_BIO_BASED;
> > +			m->queue_mode = DM_TYPE_BIO_BASED;
> >  		else if (!strcasecmp(queue_mode_name, "rq"))
> >  			m->queue_mode = DM_TYPE_REQUEST_BASED;
> >  		else if (!strcasecmp(queue_mode_name, "mq"))
> > Mike
> Ok, please submit a working patch for the kernels that need it so that we can review and apply it to solve this regression.
So, effectively, v5.4.y and older are broken right now for use cases with dm on NVME drives.
Given that the regression does affect older branches, and given that we have to revert this patch to avoid regressions in ChromeOS, would it be possible to revert it from v5.4.y and older until a fix is found ?
Thanks, Guenter
On Wed, Jun 15 2022 at 10:36P -0400, Guenter Roeck linux@roeck-us.net wrote:
> So, effectively, v5.4.y and older are broken right now for use cases with dm on NVME drives.
> Given that the regression does affect older branches, and given that we have to revert this patch to avoid regressions in ChromeOS, would it be possible to revert it from v5.4.y and older until a fix is found ?
I obviously would prefer to not have this false-start.
I'll look at latest 5.4.y _now_ and see what can be done.
Should hopefully be pretty straight-forward.
Mike
On 6/15/22 08:29, Mike Snitzer wrote:
> I obviously would prefer to not have this false-start.
The false start has already happened since we had to revert the patch from chromeos-5.4 and older branches.
Guenter
On Wed, Jun 15 2022 at 1:50P -0400, Guenter Roeck linux@roeck-us.net wrote:
> > I obviously would prefer to not have this false-start.
> The false start has already happened since we had to revert the patch from chromeos-5.4 and older branches.
OK, well this is pretty easy to fix in general. If there are slight differences across older trees they are easily resolved. Fact that stable@ couldn't cope with backporting 9c37de297f65 is.. what it is.
But this will fix the issue on 5.4.y:
From: Mike Snitzer snitzer@kernel.org
Date: Wed, 15 Jun 2022 14:07:09 -0400
Subject: [5.4.y PATCH] dm: remove special-casing of bio-based immutable singleton target on NVMe
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer snitzer@kernel.org
---
 drivers/md/dm-table.c         | 32 ++----------------
 drivers/md/dm.c               | 64 +++--------------------------------
 include/linux/device-mapper.h |  1 -
 3 files changed, 7 insertions(+), 90 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 06b382304d92..81bc36a43b32 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -872,8 +872,7 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
 	return (table_type == DM_TYPE_BIO_BASED ||
-		table_type == DM_TYPE_DAX_BIO_BASED ||
-		table_type == DM_TYPE_NVME_BIO_BASED);
+		table_type == DM_TYPE_DAX_BIO_BASED);
 }
 
 static bool __table_type_request_based(enum dm_queue_mode table_type)
@@ -929,8 +928,6 @@ bool dm_table_supports_dax(struct dm_table *t,
 	return true;
 }
 
-static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
-
 static int device_is_rq_stackable(struct dm_target *ti, struct dm_dev *dev,
 				  sector_t start, sector_t len, void *data)
 {
@@ -960,7 +957,6 @@ static int dm_table_determine_type(struct dm_table *t)
 			goto verify_bio_based;
 		}
 		BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
-		BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED);
 		goto verify_rq_based;
 	}
 
@@ -999,15 +995,6 @@ static int dm_table_determine_type(struct dm_table *t)
 		if (dm_table_supports_dax(t, device_not_dax_capable, &page_size) ||
 		    (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
 			t->type = DM_TYPE_DAX_BIO_BASED;
-		} else {
-			/* Check if upgrading to NVMe bio-based is valid or required */
-			tgt = dm_table_get_immutable_target(t);
-			if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-				goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
-			} else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-			}
 		}
 		return 0;
 	}
@@ -1024,8 +1011,7 @@ static int dm_table_determine_type(struct dm_table *t)
 	 * (e.g. request completion process for partial completion.)
 	 */
 	if (t->num_targets > 1) {
-		DMERR("%s DM doesn't support multiple targets",
-		      t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
+		DMERR("request-based DM doesn't support multiple targets");
 		return -EINVAL;
 	}
 
@@ -1714,20 +1700,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
 	return q && !blk_queue_add_random(q);
 }
 
-static int device_is_partial_completion(struct dm_target *ti, struct dm_dev *dev,
-					sector_t start, sector_t len, void *data)
-{
-	char b[BDEVNAME_SIZE];
-
-	/* For now, NVMe devices are the only devices of this class */
-	return (strncmp(bdevname(dev->bdev, b), "nvme", 4) != 0);
-}
-
-static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
-{
-	return !dm_table_any_dev_attr(t, device_is_partial_completion, NULL);
-}
-
 static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
 					 sector_t start, sector_t len, void *data)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 37b8bb4d80f0..3c45c389ded9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1000,7 +1000,7 @@ static void clone_endio(struct bio *bio)
 	struct mapped_device *md = tio->io->md;
 	dm_endio_fn endio = tio->ti->type->end_io;
 
-	if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
+	if (unlikely(error == BLK_STS_TARGET)) {
 		if (bio_op(bio) == REQ_OP_DISCARD &&
 		    !bio->bi_disk->queue->limits.max_discard_sectors)
 			disable_discard(md);
@@ -1340,10 +1340,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone->bi_disk->queue, clone,
 				      bio_dev(io->orig_bio), sector);
-		if (md->type == DM_TYPE_NVME_BIO_BASED)
-			ret = direct_make_request(clone);
-		else
-			ret = generic_make_request(clone);
+		ret = generic_make_request(clone);
 		break;
 	case DM_MAPIO_KILL:
 		if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1732,51 +1729,6 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 	return ret;
 }
 
-/*
- * Optimized variant of __split_and_process_bio that leverages the
- * fact that targets that use it do _not_ have a need to split bios.
- */
-static blk_qc_t __process_bio(struct mapped_device *md, struct dm_table *map,
-			      struct bio *bio, struct dm_target *ti)
-{
-	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
-	int error = 0;
-
-	init_clone_info(&ci, md, map, bio);
-
-	if (bio->bi_opf & REQ_PREFLUSH) {
-		struct bio flush_bio;
-
-		/*
-		 * Use an on-stack bio for this, it's safe since we don't
-		 * need to reference it after submit. It's just used as
-		 * the basis for the clone(s).
-		 */
-		bio_init(&flush_bio, NULL, 0);
-		flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
-		ci.bio = &flush_bio;
-		ci.sector_count = 0;
-		error = __send_empty_flush(&ci);
-		bio_uninit(ci.bio);
-		/* dec_pending submits any data associated with flush */
-	} else {
-		struct dm_target_io *tio;
-
-		ci.bio = bio;
-		ci.sector_count = bio_sectors(bio);
-		if (__process_abnormal_io(&ci, ti, &error))
-			goto out;
-
-		tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
-		ret = __clone_and_map_simple_bio(&ci, tio, NULL);
-	}
-out:
-	/* drop the extra reference count */
-	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
-}
-
 static blk_qc_t dm_process_bio(struct mapped_device *md,
 			       struct dm_table *map, struct bio *bio)
 {
@@ -1807,8 +1759,6 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 		/* regular IO is split by __split_and_process_bio */
 	}
 
-	if (dm_get_md_type(md) == DM_TYPE_NVME_BIO_BASED)
-		return __process_bio(md, map, bio, ti);
 	return __split_and_process_bio(md, map, bio);
 }
 
@@ -2200,12 +2150,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	if (request_based)
 		dm_stop_queue(q);
 
-	if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
+	if (request_based) {
 		/*
-		 * Leverage the fact that request-based DM targets and
-		 * NVMe bio based targets are immutable singletons
-		 * - used to optimize both dm_request_fn and dm_mq_queue_rq;
-		 * and __process_bio.
+		 * Leverage the fact that request-based DM targets are
+		 * immutable singletons - used to optimize dm_mq_queue_rq.
 		 */
 		md->immutable_target = dm_table_get_immutable_target(t);
 	}
@@ -2334,7 +2282,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 		break;
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		dm_init_congested_fn(md);
 		break;
 	case DM_TYPE_NONE:
@@ -3070,7 +3017,6 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 	switch (type) {
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
 		front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
 		io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a53d7d2c2d95..60631f3abddb 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -28,7 +28,6 @@ enum dm_queue_mode {
 	DM_TYPE_BIO_BASED	 = 1,
 	DM_TYPE_REQUEST_BASED	 = 2,
 	DM_TYPE_DAX_BIO_BASED	 = 3,
-	DM_TYPE_NVME_BIO_BASED	 = 4,
 };
 
 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
On Wed, Jun 15, 2022 at 04:02:36PM -0400, Mike Snitzer wrote: [ ... ]
OK, well this is pretty easy to fix in general. If there are slight differences across older trees they are easily resolved. Fact that stable@ couldn't cope with backporting 9c37de297f65 is.. what it is.
But this will fix the issue on 5.4.y:
From: Mike Snitzer snitzer@kernel.org Date: Wed, 15 Jun 2022 14:07:09 -0400 Subject: [5.4.y PATCH] dm: remove special-casing of bio-based immutable singleton target on NVMe
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer snitzer@kernel.org
I'll give it a try.
Thanks, Guenter
 drivers/md/dm-table.c         | 32 ++----------------
 drivers/md/dm.c               | 64 +++--------------------------------
 include/linux/device-mapper.h |  1 -
 3 files changed, 7 insertions(+), 90 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 06b382304d92..81bc36a43b32 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -872,8 +872,7 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
 	return (table_type == DM_TYPE_BIO_BASED ||
-		table_type == DM_TYPE_DAX_BIO_BASED ||
-		table_type == DM_TYPE_NVME_BIO_BASED);
+		table_type == DM_TYPE_DAX_BIO_BASED);
 }
 
 static bool __table_type_request_based(enum dm_queue_mode table_type)
@@ -929,8 +928,6 @@ bool dm_table_supports_dax(struct dm_table *t,
 	return true;
 }
 
-static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
-
 static int device_is_rq_stackable(struct dm_target *ti, struct dm_dev *dev,
 				  sector_t start, sector_t len, void *data)
 {
@@ -960,7 +957,6 @@ static int dm_table_determine_type(struct dm_table *t)
 			goto verify_bio_based;
 		}
 		BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
-		BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED);
 		goto verify_rq_based;
 	}
 
@@ -999,15 +995,6 @@ static int dm_table_determine_type(struct dm_table *t)
 		if (dm_table_supports_dax(t, device_not_dax_capable, &page_size) ||
 		    (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
 			t->type = DM_TYPE_DAX_BIO_BASED;
-		} else {
-			/* Check if upgrading to NVMe bio-based is valid or required */
-			tgt = dm_table_get_immutable_target(t);
-			if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-				goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
-			} else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-			}
 		}
 		return 0;
 	}
@@ -1024,8 +1011,7 @@ static int dm_table_determine_type(struct dm_table *t)
 	 * (e.g. request completion process for partial completion.)
 	 */
 	if (t->num_targets > 1) {
-		DMERR("%s DM doesn't support multiple targets",
-		      t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
+		DMERR("request-based DM doesn't support multiple targets");
 		return -EINVAL;
 	}
 
@@ -1714,20 +1700,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
 	return q && !blk_queue_add_random(q);
 }
 
-static int device_is_partial_completion(struct dm_target *ti, struct dm_dev *dev,
-					sector_t start, sector_t len, void *data)
-{
-	char b[BDEVNAME_SIZE];
-
-	/* For now, NVMe devices are the only devices of this class */
-	return (strncmp(bdevname(dev->bdev, b), "nvme", 4) != 0);
-}
-
-static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
-{
-	return !dm_table_any_dev_attr(t, device_is_partial_completion, NULL);
-}
-
 static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
 					 sector_t start, sector_t len, void *data)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 37b8bb4d80f0..3c45c389ded9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1000,7 +1000,7 @@ static void clone_endio(struct bio *bio)
 	struct mapped_device *md = tio->io->md;
 	dm_endio_fn endio = tio->ti->type->end_io;
 
-	if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
+	if (unlikely(error == BLK_STS_TARGET)) {
 		if (bio_op(bio) == REQ_OP_DISCARD &&
 		    !bio->bi_disk->queue->limits.max_discard_sectors)
 			disable_discard(md);
@@ -1340,10 +1340,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone->bi_disk->queue, clone,
 				      bio_dev(io->orig_bio), sector);
-		if (md->type == DM_TYPE_NVME_BIO_BASED)
-			ret = direct_make_request(clone);
-		else
-			ret = generic_make_request(clone);
+		ret = generic_make_request(clone);
 		break;
 	case DM_MAPIO_KILL:
 		if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1732,51 +1729,6 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 	return ret;
 }
 
-/*
- * Optimized variant of __split_and_process_bio that leverages the
- * fact that targets that use it do _not_ have a need to split bios.
- */
-static blk_qc_t __process_bio(struct mapped_device *md, struct dm_table *map,
-			      struct bio *bio, struct dm_target *ti)
-{
-	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
-	int error = 0;
-
-	init_clone_info(&ci, md, map, bio);
-
-	if (bio->bi_opf & REQ_PREFLUSH) {
-		struct bio flush_bio;
-
-		/*
-		 * Use an on-stack bio for this, it's safe since we don't
-		 * need to reference it after submit. It's just used as
-		 * the basis for the clone(s).
-		 */
-		bio_init(&flush_bio, NULL, 0);
-		flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
-		ci.bio = &flush_bio;
-		ci.sector_count = 0;
-		error = __send_empty_flush(&ci);
-		bio_uninit(ci.bio);
-		/* dec_pending submits any data associated with flush */
-	} else {
-		struct dm_target_io *tio;
-
-		ci.bio = bio;
-		ci.sector_count = bio_sectors(bio);
-		if (__process_abnormal_io(&ci, ti, &error))
-			goto out;
-
-		tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
-		ret = __clone_and_map_simple_bio(&ci, tio, NULL);
-	}
-out:
-	/* drop the extra reference count */
-	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
-}
-
 static blk_qc_t dm_process_bio(struct mapped_device *md,
 			       struct dm_table *map, struct bio *bio)
 {
@@ -1807,8 +1759,6 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 		/* regular IO is split by __split_and_process_bio */
 	}
 
-	if (dm_get_md_type(md) == DM_TYPE_NVME_BIO_BASED)
-		return __process_bio(md, map, bio, ti);
 	return __split_and_process_bio(md, map, bio);
 }
 
@@ -2200,12 +2150,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	if (request_based)
 		dm_stop_queue(q);
 
-	if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
+	if (request_based) {
 		/*
-		 * Leverage the fact that request-based DM targets and
-		 * NVMe bio based targets are immutable singletons
-		 * - used to optimize both dm_request_fn and dm_mq_queue_rq;
-		 * and __process_bio.
+		 * Leverage the fact that request-based DM targets are
+		 * immutable singletons - used to optimize dm_mq_queue_rq.
 		 */
 		md->immutable_target = dm_table_get_immutable_target(t);
 	}
@@ -2334,7 +2282,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 		break;
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		dm_init_congested_fn(md);
 		break;
 	case DM_TYPE_NONE:
@@ -3070,7 +3017,6 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 	switch (type) {
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
 		front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
 		io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a53d7d2c2d95..60631f3abddb 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -28,7 +28,6 @@ enum dm_queue_mode {
 	DM_TYPE_BIO_BASED = 1,
 	DM_TYPE_REQUEST_BASED = 2,
 	DM_TYPE_DAX_BIO_BASED = 3,
-	DM_TYPE_NVME_BIO_BASED = 4,
 };
 
 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
-- 
2.30.0
On Wed, Jun 15, 2022 at 04:02:36PM -0400, Mike Snitzer wrote:
On Wed, Jun 15 2022 at 1:50P -0400, Guenter Roeck linux@roeck-us.net wrote:
On 6/15/22 08:29, Mike Snitzer wrote:
On Wed, Jun 15 2022 at 10:36P -0400, Guenter Roeck linux@roeck-us.net wrote:
On Mon, Jun 13, 2022 at 11:13:21AM +0200, Greg KH wrote:
On Fri, Jun 10, 2022 at 11:11:00AM -0400, Mike Snitzer wrote:
On Fri, Jun 10 2022 at 1:15P -0400, Greg KH gregkh@linuxfoundation.org wrote:
> On Fri, Jun 10, 2022 at 04:22:00AM +0000, Oleksandr Tymoshenko wrote:
> > I believe this commit introduced a regression in dm verity on systems
> > where data device is an NVME one. Loading table fails with the
> > following diagnostics:
> >
> > device-mapper: table: table load rejected: including non-request-stackable devices
> >
> > The same kernel works with the same data drive on the SCSI interface.
> > NVME-backed dm verity works with just this commit reverted.
> >
> > I believe the presence of the immutable partition is used as an indicator
> > of special case NVME configuration and if the data device's name starts
> > with "nvme" the code tries to switch the target type to
> > DM_TYPE_NVME_BIO_BASED (drivers/md/dm-table.c lines 1003-1010).
> >
> > The special NVME optimization case was removed in
> > 5.10 by commit 9c37de297f6590937f95a28bec1b7ac68a38618f, so only 5.4 is
> > affected.
> >
>
> Why wouldn't 4.9, 4.14, and 4.19 also be affected here? Should I also
> just queue up 9c37de297f65 ("dm: remove special-casing of bio-based
> immutable singleton target on NVMe") to those older kernels? If so,
> have you tested this and verified that it worked?
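For readers following along, the type selection Oleksandr describes can be modelled in a few lines of userspace C. This is only a rough sketch under stated assumptions: `queue_mode`, `name_looks_like_nvme()` and `pick_bio_mode()` are hypothetical names invented for illustration, the DAX branch is ignored, and the real logic lives in dm_table_determine_type() with more conditions than shown here.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the 5.4 enum dm_queue_mode values. */
enum queue_mode {
	MODE_BIO_BASED = 1,
	MODE_REQUEST_BASED = 2,
	MODE_DAX_BIO_BASED = 3,
	MODE_NVME_BIO_BASED = 4,
};

/* True when the block device name starts with "nvme" (e.g. "nvme0n1"). */
static bool name_looks_like_nvme(const char *bdev_name)
{
	return strncmp(bdev_name, "nvme", 4) == 0;
}

/*
 * Simplified model of the decision discussed in this thread: a table with
 * an immutable target, no max_io_len, and only NVMe-named devices is
 * upgraded to the NVMe bio-based type. Anything else stays bio-based.
 */
static enum queue_mode pick_bio_mode(bool has_immutable_target,
				     unsigned int max_io_len,
				     const char *const *names, size_t n)
{
	size_t i;

	if (!has_immutable_target || max_io_len != 0)
		return MODE_BIO_BASED;

	for (i = 0; i < n; i++)
		if (!name_looks_like_nvme(names[i]))
			return MODE_BIO_BASED;

	return MODE_NVME_BIO_BASED;
}
```

In this model the same immutable target ends up with a different type depending only on the name of the backing device, which would match the report that the table loads on a SCSI disk but not on NVMe.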
Sorry for the unforeseen stable@ troubles here!
In general we'd be fine to apply commit 9c37de297f65 but to do it properly would require also making sure commits that remove "DM_TYPE_NVME_BIO_BASED", like 8d47e65948dd ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks") are applied -- basically any lingering references to DM_TYPE_NVME_BIO_BASED need to be removed.
The commit header for 8d47e65948dd documents what DM_TYPE_NVME_BIO_BASED was used for.. it was dm-mpath specific and "nvme" mode really never got used by any userspace that I'm aware of.
Sadly I currently don't have the time to do this backport for all N stable kernels... :(
But if that backport gets out of control: A simpler, albeit stable@ unicorn, way to resolve this is to simply revert 9c37de297f65 and make
9c37de297f65 can not be reverted in 5.4 and older because it isn't there, and trying to apply it results in conflicts which at least I can not resolve.
it so that DM-mpath and DM core just used bio-based if "nvme" is requested by dm-mpath, so also in drivers/md/dm-mpath.c e.g.:
@@ -1091,8 +1088,6 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)

 			if (!strcasecmp(queue_mode_name, "bio"))
 				m->queue_mode = DM_TYPE_BIO_BASED;
 			else if (!strcasecmp(queue_mode_name, "nvme"))
-				m->queue_mode = DM_TYPE_NVME_BIO_BASED;
+				m->queue_mode = DM_TYPE_BIO_BASED;
 			else if (!strcasecmp(queue_mode_name, "rq"))
 				m->queue_mode = DM_TYPE_REQUEST_BASED;
 			else if (!strcasecmp(queue_mode_name, "mq"))
Mike
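The dm-mpath fallback suggested above amounts to keeping "nvme" as an accepted queue_mode name while treating it as an alias for plain bio-based. A minimal userspace sketch of that idea follows. The names `queue_mode` and `parse_queue_mode()` are hypothetical and this is not the kernel's parse_features(). Only the three names whose mapping is visible in the hunk above are handled, and anything else is rejected in this sketch.

```c
#include <strings.h>

/* Hypothetical stand-ins for the queue mode constants. */
enum queue_mode {
	MODE_BIO_BASED = 1,
	MODE_REQUEST_BASED = 2,
};

/*
 * Map a queue_mode name to a mode. "nvme" is still accepted, so existing
 * tables keep loading, but it no longer selects a dedicated type.
 * Returns 0 on success, -1 for a name this sketch does not know.
 */
static int parse_queue_mode(const char *name, enum queue_mode *mode)
{
	if (!strcasecmp(name, "bio") || !strcasecmp(name, "nvme"))
		*mode = MODE_BIO_BASED;
	else if (!strcasecmp(name, "rq"))
		*mode = MODE_REQUEST_BASED;
	else
		return -1;

	return 0;
}
```

The point of the alias is compatibility: userspace that passes "nvme" would see the same behaviour as "bio" rather than a table load failure.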
Ok, please submit a working patch for the kernels that need it so that we can review and apply it to solve this regression.
So, effectively, v5.4.y and older are broken right now for use cases with dm on NVME drives.
Given that the regression does affect older branches, and given that we have to revert this patch to avoid regressions in ChromeOS, would it be possible to revert it from v5.4.y and older until a fix is found ?
I obviously would prefer to not have this false-start.
The false start has already happened since we had to revert the patch from chromeos-5.4 and older branches.
OK, well this is pretty easy to fix in general. If there are slight differences across older trees they are easily resolved. Fact that stable@ couldn't cope with backporting 9c37de297f65 is.. what it is.
But this will fix the issue on 5.4.y:
From: Mike Snitzer snitzer@kernel.org Date: Wed, 15 Jun 2022 14:07:09 -0400 Subject: [5.4.y PATCH] dm: remove special-casing of bio-based immutable singleton target on NVMe
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer snitzer@kernel.org
[ ... ]
@@ -1340,10 +1340,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone->bi_disk->queue, clone,
 				      bio_dev(io->orig_bio), sector);
-		if (md->type == DM_TYPE_NVME_BIO_BASED)
-			ret = direct_make_request(clone);
-		else
-			ret = generic_make_request(clone);
+		ret = generic_make_request(clone);

drivers/md/dm.c:1340:24: error: unused variable 'md'

I'll try again with this fixed.

Guenter

 		break;
 	case DM_MAPIO_KILL:
 		if (unlikely(swap_bios_limit(ti, clone))) {
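The build failure reported above is the usual side effect of deleting the last reader of a local variable. The sketch below reproduces the pattern outside the kernel. All types and functions in it (`md_stub`, `io_stub`, `map_bio_before()`, `map_bio_after()`) are hypothetical stand-ins, and it assumes a build where -Wunused-variable is treated as an error.

```c
/* Hypothetical stand-ins; none of these are the kernel's real types. */
enum { TYPE_BIO = 1, TYPE_NVME_BIO = 4 };

struct md_stub { int type; };
struct io_stub { struct md_stub *md; };

static int submit_direct(int clone)  { return clone; }
static int submit_generic(int clone) { return clone; }

/* Before the patch: md is read exactly once, to choose the submit path. */
static int map_bio_before(struct io_stub *io, int clone)
{
	struct md_stub *md = io->md;

	if (md->type == TYPE_NVME_BIO)
		return submit_direct(clone);
	return submit_generic(clone);
}

/*
 * After the patch the branch is gone. Keeping the declaration
 * "struct md_stub *md = io->md;" here would leave md with no reader,
 * which is what -Wunused-variable reports, so the declaration has to be
 * dropped together with the branch.
 */
static int map_bio_after(struct io_stub *io, int clone)
{
	(void)io; /* unused in this sketch only */
	return submit_generic(clone);
}
```

Both variants return the same result for a non-NVMe device, so the only change needed on top of the posted backport is removing the now-dead local.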
[ ... ]
On 6/15/22 13:02, Mike Snitzer wrote:
[ ... ]
OK, well this is pretty easy to fix in general. If there are slight differences across older trees they are easily resolved. Fact that stable@ couldn't cope with backporting 9c37de297f65 is.. what it is.
But this will fix the issue on 5.4.y:
From: Mike Snitzer snitzer@kernel.org Date: Wed, 15 Jun 2022 14:07:09 -0400 Subject: [5.4.y PATCH] dm: remove special-casing of bio-based immutable singleton target on NVMe
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer snitzer@kernel.org
This patch passes our tests after I removed the unused variable.
Tested-by: Guenter Roeck linux@roeck-us.net
Thanks a lot for the backport!
Guenter
On Wed, Jun 15, 2022 at 04:02:36PM -0400, Mike Snitzer wrote:
On Wed, Jun 15 2022 at 1:50P -0400, Guenter Roeck linux@roeck-us.net wrote:
On 6/15/22 08:29, Mike Snitzer wrote:
On Wed, Jun 15 2022 at 10:36P -0400, Guenter Roeck linux@roeck-us.net wrote:
On Mon, Jun 13, 2022 at 11:13:21AM +0200, Greg KH wrote:
On Fri, Jun 10, 2022 at 11:11:00AM -0400, Mike Snitzer wrote:
On Fri, Jun 10 2022 at 1:15P -0400, Greg KH gregkh@linuxfoundation.org wrote:
> On Fri, Jun 10, 2022 at 04:22:00AM +0000, Oleksandr Tymoshenko wrote: > > I believe this commit introduced a regression in dm verity on systems > > where data device is an NVME one. Loading table fails with the > > following diagnostics: > > > > device-mapper: table: table load rejected: including non-request-stackable devices > > > > The same kernel works with the same data drive on the SCSI interface. > > NVME-backed dm verity works with just this commit reverted. > > > > I believe the presence of the immutable partition is used as an indicator > > of special case NVME configuration and if the data device's name starts > > with "nvme" the code tries to switch the target type to > > DM_TYPE_NVME_BIO_BASED (drivers/md/dm-table.c lines 1003-1010). > > > > The special NVME optimization case was removed in > > 5.10 by commit 9c37de297f6590937f95a28bec1b7ac68a38618f, so only 5.4 is > > affected. > > > > Why wouldn't 4.9, 4.14, and 4.19 also be affected here? Should I also > just queue up 9c37de297f65 ("dm: remove special-casing of bio-based > immutable singleton target on NVMe") to those older kernels? If so, > have you tested this and verified that it worked?
Sorry for the unforeseen stable@ troubles here!
In general we'd be fine to apply commit 9c37de297f65 but to do it properly would require also making sure commits that remove "DM_TYPE_NVME_BIO_BASED", like 8d47e65948dd ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks") are applied -- basically any lingering references to DM_TYPE_NVME_BIO_BASED need to be removed.
The commit header for 8d47e65948dd documents what DM_TYPE_NVME_BIO_BASED was used for.. it was dm-mpath specific and "nvme" mode really never got used by any userspace that I'm aware of.
Sadly I currently don't have the time to do this backport for all N stable kernels... :(
But if that backport gets out of control: A simpler, albeit stable@ unicorn, way to resolve this is to simply revert 9c37de297f65 and make
9c37de297f65 can not be reverted in 5.4 and older because it isn't there, and trying to apply it results in conflicts which at least I can not resolve.
it so that DM-mpath and DM core just used bio-based if "nvme" is requested by dm-mpath, so also in drivers/md/dm-mpath.c e.g.:
@@ -1091,8 +1088,6 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 			if (!strcasecmp(queue_mode_name, "bio"))
 				m->queue_mode = DM_TYPE_BIO_BASED;
 			else if (!strcasecmp(queue_mode_name, "nvme"))
-				m->queue_mode = DM_TYPE_NVME_BIO_BASED;
+				m->queue_mode = DM_TYPE_BIO_BASED;
 			else if (!strcasecmp(queue_mode_name, "rq"))
 				m->queue_mode = DM_TYPE_REQUEST_BASED;
 			else if (!strcasecmp(queue_mode_name, "mq"))
Mike
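The fallback suggested above (treat a requested "nvme" queue mode as plain bio-based) can be sketched in userspace C as follows. This is only an illustration under stated assumptions: parse_queue_mode() is a hypothetical helper, not the dm-mpath parse_features() function, the value of DM_TYPE_NONE is assumed, and only the "bio", "nvme" and "rq" names whose mappings are visible in the fragment are handled.

```c
#include <assert.h>
#include <strings.h>

/* Mirrors the enum values shown in the device-mapper.h hunk after the
 * patch; DM_TYPE_NONE = 0 is an assumption for this sketch. */
enum dm_queue_mode {
	DM_TYPE_NONE = 0,
	DM_TYPE_BIO_BASED = 1,
	DM_TYPE_REQUEST_BASED = 2,
	DM_TYPE_DAX_BIO_BASED = 3,
};

/* Hypothetical helper: returns 0 and sets *mode, or -1 for an unknown name. */
static int parse_queue_mode(const char *name, enum dm_queue_mode *mode)
{
	if (!strcasecmp(name, "bio") || !strcasecmp(name, "nvme"))
		*mode = DM_TYPE_BIO_BASED;	/* "nvme" becomes an alias for "bio" */
	else if (!strcasecmp(name, "rq"))
		*mode = DM_TYPE_REQUEST_BASED;
	else
		return -1;
	return 0;
}
```

With this mapping, existing tables that still pass "nvme" keep loading, while no code path needs a separate NVMe type any more.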
Ok, please submit a working patch for the kernels that need it so that we can review and apply it to solve this regression.
So, effectively, v5.4.y and older are broken right now for use cases with dm on NVME drives.
Given that the regression does affect older branches, and given that we have to revert this patch to avoid regressions in ChromeOS, would it be possible to revert it from v5.4.y and older until a fix is found ?
I obviously would prefer to not have this false-start.
The false start has already happened since we had to revert the patch from chromeos-5.4 and older branches.
OK, well this is pretty easy to fix in general. If there are slight differences across older trees they are easily resolved. Fact that stable@ couldn't cope with backporting 9c37de297f65 is.. what it is.
But this will fix the issue on 5.4.y:
From: Mike Snitzer snitzer@kernel.org
Date: Wed, 15 Jun 2022 14:07:09 -0400
Subject: [5.4.y PATCH] dm: remove special-casing of bio-based immutable singleton target on NVMe
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer snitzer@kernel.org
 drivers/md/dm-table.c         | 32 ++----------------
 drivers/md/dm.c               | 64 +++--------------------------------
 include/linux/device-mapper.h |  1 -
 3 files changed, 7 insertions(+), 90 deletions(-)
Can someone resend this in the proper format (and fixed up), with Guenter's tested-by so that I can queue it up?
thanks,
greg k-h
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Also, remove 3 'struct mapped_device *md' variables in __map_bio() which masked the same variable that is available within __map_bio()'s scope.
Tested-by: Guenter Roeck linux@roeck-us.net
Signed-off-by: Mike Snitzer snitzer@kernel.org
---
 drivers/md/dm-table.c         | 32 +--------------
 drivers/md/dm.c               | 73 ++++-------------------------------
 include/linux/device-mapper.h |  1 -
 3 files changed, 9 insertions(+), 97 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 06b382304d92..81bc36a43b32 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -872,8 +872,7 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
 	return (table_type == DM_TYPE_BIO_BASED ||
-		table_type == DM_TYPE_DAX_BIO_BASED ||
-		table_type == DM_TYPE_NVME_BIO_BASED);
+		table_type == DM_TYPE_DAX_BIO_BASED);
 }

 static bool __table_type_request_based(enum dm_queue_mode table_type)
@@ -929,8 +928,6 @@ bool dm_table_supports_dax(struct dm_table *t,
 	return true;
 }

-static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
-
 static int device_is_rq_stackable(struct dm_target *ti, struct dm_dev *dev,
 				  sector_t start, sector_t len, void *data)
 {
@@ -960,7 +957,6 @@ static int dm_table_determine_type(struct dm_table *t)
 			goto verify_bio_based;
 		}
 		BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
-		BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED);
 		goto verify_rq_based;
 	}

@@ -999,15 +995,6 @@ static int dm_table_determine_type(struct dm_table *t)
 		if (dm_table_supports_dax(t, device_not_dax_capable, &page_size) ||
 		    (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
 			t->type = DM_TYPE_DAX_BIO_BASED;
-		} else {
-			/* Check if upgrading to NVMe bio-based is valid or required */
-			tgt = dm_table_get_immutable_target(t);
-			if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-				goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
-			} else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
-				t->type = DM_TYPE_NVME_BIO_BASED;
-			}
 		}
 		return 0;
 	}
@@ -1024,8 +1011,7 @@ static int dm_table_determine_type(struct dm_table *t)
 	 * (e.g. request completion process for partial completion.)
 	 */
 	if (t->num_targets > 1) {
-		DMERR("%s DM doesn't support multiple targets",
-		      t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
+		DMERR("request-based DM doesn't support multiple targets");
 		return -EINVAL;
 	}

@@ -1714,20 +1700,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
 	return q && !blk_queue_add_random(q);
 }

-static int device_is_partial_completion(struct dm_target *ti, struct dm_dev *dev,
-					sector_t start, sector_t len, void *data)
-{
-	char b[BDEVNAME_SIZE];
-
-	/* For now, NVMe devices are the only devices of this class */
-	return (strncmp(bdevname(dev->bdev, b), "nvme", 4) != 0);
-}
-
-static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
-{
-	return !dm_table_any_dev_attr(t, device_is_partial_completion, NULL);
-}
-
 static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
 					 sector_t start, sector_t len, void *data)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 37b8bb4d80f0..77e28f77c59f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1000,7 +1000,7 @@ static void clone_endio(struct bio *bio)
 	struct mapped_device *md = tio->io->md;
 	dm_endio_fn endio = tio->ti->type->end_io;

-	if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
+	if (unlikely(error == BLK_STS_TARGET)) {
 		if (bio_op(bio) == REQ_OP_DISCARD &&
 		    !bio->bi_disk->queue->limits.max_discard_sectors)
 			disable_discard(md);
@@ -1325,7 +1325,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 	sector = clone->bi_iter.bi_sector;

 	if (unlikely(swap_bios_limit(ti, clone))) {
-		struct mapped_device *md = io->md;
 		int latch = get_swap_bios();
 		if (unlikely(latch != md->swap_bios))
 			__set_swap_bios_limit(md, latch);
@@ -1340,24 +1339,17 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone->bi_disk->queue, clone,
 				      bio_dev(io->orig_bio), sector);
-		if (md->type == DM_TYPE_NVME_BIO_BASED)
-			ret = direct_make_request(clone);
-		else
-			ret = generic_make_request(clone);
+		ret = generic_make_request(clone);
 		break;
 	case DM_MAPIO_KILL:
-		if (unlikely(swap_bios_limit(ti, clone))) {
-			struct mapped_device *md = io->md;
+		if (unlikely(swap_bios_limit(ti, clone)))
 			up(&md->swap_bios_semaphore);
-		}
 		free_tio(tio);
 		dec_pending(io, BLK_STS_IOERR);
 		break;
 	case DM_MAPIO_REQUEUE:
-		if (unlikely(swap_bios_limit(ti, clone))) {
-			struct mapped_device *md = io->md;
+		if (unlikely(swap_bios_limit(ti, clone)))
 			up(&md->swap_bios_semaphore);
-		}
 		free_tio(tio);
 		dec_pending(io, BLK_STS_DM_REQUEUE);
 		break;
@@ -1732,51 +1724,6 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 	return ret;
 }

-/*
- * Optimized variant of __split_and_process_bio that leverages the
- * fact that targets that use it do _not_ have a need to split bios.
- */
-static blk_qc_t __process_bio(struct mapped_device *md, struct dm_table *map,
-			      struct bio *bio, struct dm_target *ti)
-{
-	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
-	int error = 0;
-
-	init_clone_info(&ci, md, map, bio);
-
-	if (bio->bi_opf & REQ_PREFLUSH) {
-		struct bio flush_bio;
-
-		/*
-		 * Use an on-stack bio for this, it's safe since we don't
-		 * need to reference it after submit. It's just used as
-		 * the basis for the clone(s).
-		 */
-		bio_init(&flush_bio, NULL, 0);
-		flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
-		ci.bio = &flush_bio;
-		ci.sector_count = 0;
-		error = __send_empty_flush(&ci);
-		bio_uninit(ci.bio);
-		/* dec_pending submits any data associated with flush */
-	} else {
-		struct dm_target_io *tio;
-
-		ci.bio = bio;
-		ci.sector_count = bio_sectors(bio);
-		if (__process_abnormal_io(&ci, ti, &error))
-			goto out;
-
-		tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
-		ret = __clone_and_map_simple_bio(&ci, tio, NULL);
-	}
-out:
-	/* drop the extra reference count */
-	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
-}
-
 static blk_qc_t dm_process_bio(struct mapped_device *md,
 			       struct dm_table *map, struct bio *bio)
 {
@@ -1807,8 +1754,6 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 		/* regular IO is split by __split_and_process_bio */
 	}

-	if (dm_get_md_type(md) == DM_TYPE_NVME_BIO_BASED)
-		return __process_bio(md, map, bio, ti);
 	return __split_and_process_bio(md, map, bio);
 }

@@ -2200,12 +2145,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	if (request_based)
 		dm_stop_queue(q);

-	if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
+	if (request_based) {
 		/*
-		 * Leverage the fact that request-based DM targets and
-		 * NVMe bio based targets are immutable singletons
-		 * - used to optimize both dm_request_fn and dm_mq_queue_rq;
-		 *   and __process_bio.
+		 * Leverage the fact that request-based DM targets are
+		 * immutable singletons - used to optimize dm_mq_queue_rq.
 		 */
 		md->immutable_target = dm_table_get_immutable_target(t);
 	}
@@ -2334,7 +2277,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 		break;
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		dm_init_congested_fn(md);
 		break;
 	case DM_TYPE_NONE:
@@ -3070,7 +3012,6 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 	switch (type) {
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-	case DM_TYPE_NVME_BIO_BASED:
 		pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
 		front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
 		io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a53d7d2c2d95..60631f3abddb 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -28,7 +28,6 @@ enum dm_queue_mode {
 	DM_TYPE_BIO_BASED = 1,
 	DM_TYPE_REQUEST_BASED = 2,
 	DM_TYPE_DAX_BIO_BASED = 3,
-	DM_TYPE_NVME_BIO_BASED = 4,
 };

 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
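For readers trying to see why only NVMe-backed tables were affected, the removed heuristic can be sketched in userspace C. The diff shows device_is_partial_completion() comparing the first four characters of the block device name with "nvme", and the table being switched to DM_TYPE_NVME_BIO_BASED (followed by a jump to verify_rq_based) only when no device fails that test. The helper names and the plain string array below are assumptions made for illustration, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same comparison as the removed device_is_partial_completion(): true for
 * any device whose name does not start with "nvme". */
static bool name_is_not_nvme(const char *bdev_name)
{
	return strncmp(bdev_name, "nvme", 4) != 0;
}

/* Hypothetical stand-in for the removed
 * dm_table_does_not_support_partial_completion(): true only when no device
 * in the list fails the "nvme" prefix test. */
static bool all_devices_named_nvme(const char *const *names, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (name_is_not_nvme(names[i]))
			return false;
	return true;
}
```

A list containing only nvme-named devices returns true here; in the pre-patch code that result, combined with an immutable target that has no max_io_len, is what selected the NVMe bio-based type. A SCSI-named device such as "sda" makes the helper return false, which is consistent with the report that the same drive worked on the SCSI interface.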
On Tue, Jun 21, 2022 at 12:35:04PM -0400, Mike Snitzer wrote:
Commit 9c37de297f6590937f95a28bec1b7ac68a38618f upstream.
There is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED.
Also, remove 3 'struct mapped_device *md' variables in __map_bio() which masked the same variable that is available within __map_bio()'s scope.
Tested-by: Guenter Roeck linux@roeck-us.net
Signed-off-by: Mike Snitzer snitzer@kernel.org

 drivers/md/dm-table.c         | 32 +--------------
 drivers/md/dm.c               | 73 ++++-------------------------------
 include/linux/device-mapper.h |  1 -
 3 files changed, 9 insertions(+), 97 deletions(-)
Now queued up, thanks.
greg k-h
From: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com
commit 57668f0a4cc4083a120cc8c517ca0055c4543b59 upstream.
The raid456 module used to allow an array to reach the failed state. That was changed by fb73b357fb9 ("raid5: block failing device if raid will be failed"). That fix introduced a bug: if raid5 fails during IO, the result may be a hung task that never completes. The Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow Faulty to be set on the drive again, and set MD_BROKEN if the array has failed.

As a result, this level is allowed to reach the failed state again, but communication with userspace (via the -EBUSY status) is preserved.

This restores the possibility of failing the array via the "mdadm --set-faulty" command; that will be addressed by additional verification on the mdadm side.
Reproduction steps:
 mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
 mkfs.xfs /dev/md126 -f
 mount /dev/md126 /mnt/root/

 fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 &

 echo 1 > /sys/block/nvme2n1/device/device/remove
 echo 1 > /sys/block/nvme1n1/device/device/remove
 [ 1475.787779] Call Trace:
 [ 1475.793111] __schedule+0x2a6/0x700
 [ 1475.799460] schedule+0x38/0xa0
 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
 [ 1475.813856] ? finish_wait+0x80/0x80
 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
 [ 1475.828281] ? finish_wait+0x80/0x80
 [ 1475.834727] ? finish_wait+0x80/0x80
 [ 1475.841127] ? finish_wait+0x80/0x80
 [ 1475.847480] md_handle_request+0x119/0x190
 [ 1475.854390] md_make_request+0x8a/0x190
 [ 1475.861041] generic_make_request+0xcf/0x310
 [ 1475.868145] submit_bio+0x3c/0x160
 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
 [ 1475.889149] iomap_apply+0xff/0x310
 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.909974] iomap_dio_rw+0x2f2/0x490
 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.923680] ? atime_needs_update+0x77/0xe0
 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
 [ 1475.953403] aio_read+0xd5/0x180
 [ 1475.959395] ? _cond_resched+0x15/0x30
 [ 1475.965907] io_submit_one+0x20b/0x3c0
 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
 [ 1475.979335] ? do_io_getevents+0x7c/0xc0
 [ 1475.986009] do_syscall_64+0x5b/0x1a0
 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
 [ 1476.000255] RIP: 0033:0x7f11fc27978d
 [ 1476.006631] Code: Bad RIP value.
 [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
Cc: stable@vger.kernel.org
Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Reviewed-by: Xiao Ni xni@redhat.com
Signed-off-by: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com
Signed-off-by: Song Liu song@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/md/raid5.c | 47 ++++++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 25 deletions(-)
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -609,17 +609,17 @@ int raid5_calc_degraded(struct r5conf *c
 	return degraded;
 }

-static int has_failed(struct r5conf *conf)
+static bool has_failed(struct r5conf *conf)
 {
-	int degraded;
+	int degraded = conf->mddev->degraded;

-	if (conf->mddev->reshape_position == MaxSector)
-		return conf->mddev->degraded > conf->max_degraded;
+	if (test_bit(MD_BROKEN, &conf->mddev->flags))
+		return true;

-	degraded = raid5_calc_degraded(conf);
-	if (degraded > conf->max_degraded)
-		return 1;
-	return 0;
+	if (conf->mddev->reshape_position != MaxSector)
+		degraded = raid5_calc_degraded(conf);
+
+	return degraded > conf->max_degraded;
 }

 struct stripe_head *
@@ -2679,34 +2679,31 @@ static void raid5_error(struct mddev *md
 	unsigned long flags;
 	pr_debug("raid456: error called\n");

+	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
+		mdname(mddev), bdevname(rdev->bdev, b));
+
 	spin_lock_irqsave(&conf->device_lock, flags);
+	set_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	mddev->degraded = raid5_calc_degraded(conf);

-	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
-		/*
-		 * Don't allow to achieve failed state
-		 * Don't try to recover this device
-		 */
+	if (has_failed(conf)) {
+		set_bit(MD_BROKEN, &conf->mddev->flags);
 		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
+
+		pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), mddev->degraded, conf->raid_disks);
+	} else {
+		pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	}

-	set_bit(Faulty, &rdev->flags);
-	clear_bit(In_sync, &rdev->flags);
-	mddev->degraded = raid5_calc_degraded(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);

 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
-		"md/raid:%s: Operation continuing on %d devices.\n",
-		mdname(mddev),
-		bdevname(rdev->bdev, b),
-		mdname(mddev),
-		conf->raid_disks - mddev->degraded);
 	r5c_update_on_rdev_error(mddev, rdev);
 }
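The decision made by the reworked has_failed() can be modelled in a few lines of userspace C. This is a sketch only: struct array_state and its fields are hypothetical stand-ins for the mddev and r5conf fields the real function reads, and the precomputed "recalculated" field replaces the call to raid5_calc_degraded().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for the state has_failed() reads. */
struct array_state {
	bool broken;		/* models the MD_BROKEN bit in mddev->flags */
	bool reshaping;		/* models reshape_position != MaxSector */
	int degraded;		/* models mddev->degraded */
	int recalculated;	/* models the result of raid5_calc_degraded() */
	int max_degraded;	/* failures the level tolerates */
};

static bool has_failed(const struct array_state *s)
{
	int degraded = s->degraded;

	/* Once the array is marked broken it stays failed. */
	if (s->broken)
		return true;

	/* During a reshape the cached count is not trusted. */
	if (s->reshaping)
		degraded = s->recalculated;

	return degraded > s->max_degraded;
}
```

In the patch, raid5_error() marks the device Faulty first and only then asks has_failed(), so the second lost disk of a single-parity array is still recorded before MD_BROKEN is set.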
From: Marek Maślanka mm@semihalf.com
commit 1d07cef7fd7599450b3d03e1915efc2a96e1f03f upstream.
The Google Whiskers touchpad does not work properly with the default multitouch configuration. Instead, use the same configuration as Google Rose.
Signed-off-by: Marek Maslanka mm@semihalf.com
Acked-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Signed-off-by: Jiri Kosina jkosina@suse.cz
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/hid-multitouch.c | 3 +++
 1 file changed, 3 insertions(+)

--- a/drivers/hid/hid-multitouch.c
+++ b/drivers/hid/hid-multitouch.c
@@ -2158,6 +2158,9 @@ static const struct hid_device_id mt_dev
 	{ .driver_data = MT_CLS_GOOGLE,
 		HID_DEVICE(HID_BUS_ANY, HID_GROUP_ANY, USB_VENDOR_ID_GOOGLE,
 			USB_DEVICE_ID_GOOGLE_TOUCH_ROSE) },
+	{ .driver_data = MT_CLS_GOOGLE,
+		HID_DEVICE(BUS_USB, HID_GROUP_MULTITOUCH_WIN_8, USB_VENDOR_ID_GOOGLE,
+			USB_DEVICE_ID_GOOGLE_WHISKERS) },

 	/* Generic MT device */
 	{ HID_DEVICE(HID_BUS_ANY, HID_GROUP_MULTITOUCH, HID_ANY_ID, HID_ANY_ID) },
From: Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com
commit e57b2523bd37e6434f4e64c7a685e3715ad21e9a upstream.
Under certain conditions uninitialized memory will be accessed. As described by TCG Trusted Platform Module Library Specification, rev. 1.59 (Part 3: Commands), if a TPM2_GetCapability is received, requesting a capability, the TPM in field upgrade mode may return a zero length list. Check the property count in tpm2_get_tpm_pt().
Fixes: 2ab3241161b3 ("tpm: migrate tpm2_get_tpm_pt() to use struct tpm_buf")
Cc: stable@vger.kernel.org
Signed-off-by: Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com
Reviewed-by: Jarkko Sakkinen jarkko@kernel.org
Signed-off-by: Jarkko Sakkinen jarkko@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/char/tpm/tpm2-cmd.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -706,7 +706,16 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip
 	if (!rc) {
 		out = (struct tpm2_get_cap_out *)
 			&buf.data[TPM_HEADER_SIZE];
-		*value = be32_to_cpu(out->value);
+		/*
+		 * To prevent failing boot up of some systems, Infineon TPM2.0
+		 * returns SUCCESS on TPM2_Startup in field upgrade mode. Also
+		 * the TPM2_Getcapability command returns a zero length list
+		 * in field upgrade mode.
+		 */
+		if (be32_to_cpu(out->property_cnt) > 0)
+			*value = be32_to_cpu(out->value);
+		else
+			rc = -ENODATA;
 	}
 	tpm_buf_destroy(&buf);
 	return rc;
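The check added above can be illustrated with a small userspace sketch: read the big-endian property count first, and only read the value when the list is non-empty. The 8-byte layout used here (count followed by value) is a simplification assumed for this example and is not the real struct tpm2_get_cap_out; read_be32() and get_property() are hypothetical helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical layout for this sketch only: a 4-byte big-endian property
 * count followed by a 4-byte big-endian value.  Returns 0 and stores the
 * value, or -ENODATA when the list is empty and *value is left untouched.
 */
static int get_property(const uint8_t *body, uint32_t *value)
{
	if (read_be32(body) == 0)
		return -ENODATA;	/* zero-length list: nothing valid to read */
	*value = read_be32(body + 4);
	return 0;
}
```

Returning an error instead of copying whatever bytes follow the count is what keeps the caller from consuming uninitialized data when the TPM is in field upgrade mode.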
From: Xiu Jianfeng xiujianfeng@huawei.com
commit d0dc1a7100f19121f6e7450f9cdda11926aa3838 upstream.
Currently it returns zero when CRQ response timed out, it should return an error code instead.
Fixes: d8d74ea3c002 ("tpm: ibmvtpm: Wait for buffer to be set before proceeding")
Signed-off-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Stefan Berger stefanb@linux.ibm.com
Acked-by: Jarkko Sakkinen jarkko@kernel.org
Signed-off-by: Jarkko Sakkinen jarkko@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/char/tpm/tpm_ibmvtpm.c | 1 +
 1 file changed, 1 insertion(+)

--- a/drivers/char/tpm/tpm_ibmvtpm.c
+++ b/drivers/char/tpm/tpm_ibmvtpm.c
@@ -685,6 +685,7 @@ static int tpm_ibmvtpm_probe(struct vio_
 	if (!wait_event_timeout(ibmvtpm->crq_queue.wq,
 				ibmvtpm->rtce_buf != NULL,
 				HZ)) {
+		rc = -ENODEV;
 		dev_err(dev, "CRQ response timed out\n");
 		goto init_irq_cleanup;
 	}
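The bug class fixed here is a goto-cleanup path that is taken while the return variable still holds 0 from an earlier successful step. A minimal userspace sketch, with a hypothetical probe_wait_step() function and a boolean standing in for the result of wait_event_timeout():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical probe step: 'ready' stands in for wait_event_timeout()
 * returning non-zero because the buffer was set in time. */
static int probe_wait_step(bool ready, int *cleanups)
{
	int rc = 0;	/* earlier steps succeeded, so rc is still 0 here */

	if (!ready) {
		rc = -ENODEV;	/* without this, the cleanup path returns 0 */
		goto cleanup;
	}
	return 0;

cleanup:
	(*cleanups)++;	/* stands in for releasing the IRQ and buffers */
	return rc;
}
```

If the assignment is dropped, the function tears everything down and still reports success, which is the behaviour the commit message describes.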
From: Akira Yokosawa akiyks@gmail.com
commit 6d5aa418b3bd42cdccc36e94ee199af423ef7c84 upstream.
The reference to `explicit_in_reply_to` is pointless as when the reference was added in the form of "#15" [1], Section 15) was "The canonical patch format". The reference of "#15" had not been properly updated in a couple of reorganizations during the plain-text SubmittingPatches era.
Fix it by using `the_canonical_patch_format`.
[1]: 2ae19acaa50a ("Documentation: Add "how to write a good patch summary" to SubmittingPatches")
Signed-off-by: Akira Yokosawa akiyks@gmail.com
Fixes: 5903019b2a5e ("Documentation/SubmittingPatches: convert it to ReST markup")
Fixes: 9b2c76777acc ("Documentation/SubmittingPatches: enrich the Sphinx output")
Cc: Jonathan Corbet corbet@lwn.net
Cc: Mauro Carvalho Chehab mchehab@kernel.org
Cc: stable@vger.kernel.org # v4.9+
Link: https://lore.kernel.org/r/64e105a5-50be-23f2-6cae-903a2ea98e18@gmail.com
Signed-off-by: Jonathan Corbet corbet@lwn.net
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 Documentation/process/submitting-patches.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -133,7 +133,7 @@ as you intend it to.

 The maintainer will thank you if you write your patch description in a
 form which can be easily pulled into Linux's source code management
-system, ``git``, as a "commit log". See :ref:`explicit_in_reply_to`.
+system, ``git``, as a "commit log". See :ref:`the_canonical_patch_format`.

 Solve only one problem per patch. If your description starts to get
 long, that's a sign that you probably need to split up your patch.
From: Trond Myklebust trond.myklebust@hammerspace.com
commit 452284407c18d8a522c3039339b1860afa0025a8 upstream.
We need to filter out ENOMEM in nfs_error_is_fatal_on_server(), because running out of memory on our client is not a server error.
Reported-by: Olga Kornievskaia aglo@umich.edu
Fixes: 2dc23afffbca ("NFS: ENOMEM should also be a fatal error.")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com
Signed-off-by: Anna Schumaker Anna.Schumaker@Netapp.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/nfs/internal.h | 1 +
 1 file changed, 1 insertion(+)

--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -775,6 +775,7 @@ static inline bool nfs_error_is_fatal_on
 	case 0:
 	case -ERESTARTSYS:
 	case -EINTR:
+	case -ENOMEM:
 		return false;
 	}
 	return nfs_error_is_fatal(err);
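The layering of the two predicates can be sketched in userspace C. The generic check below is a hypothetical stand-in for nfs_error_is_fatal() that treats only three errno values as fatal, chosen for illustration; -ERESTARTSYS is omitted because it is a kernel-internal code that userspace errno.h does not define.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for nfs_error_is_fatal(): for this sketch only a
 * few errno values count as fatal. */
static bool error_is_fatal(int err)
{
	switch (err) {
	case -EIO:
	case -ENOSPC:
	case -ENOMEM:
		return true;
	}
	return false;
}

/* Filter out conditions that originate on the client before deferring to
 * the generic check. */
static bool error_is_fatal_on_server(int err)
{
	switch (err) {
	case 0:
	case -EINTR:
	case -ENOMEM:	/* client-side allocation failure, not a server error */
		return false;
	}
	return error_is_fatal(err);
}
```

The generic predicate still reports -ENOMEM as fatal; only the server-specific wrapper filters it, which matches the one-line change in the patch.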
From: Chuck Lever chuck.lever@oracle.com
commit ce3c4ad7f4ce5db7b4f08a1e237d8dd94b39180b upstream.
nfsd4_release_lockowner() holds clp->cl_lock when it calls check_for_locks(). However, check_for_locks() calls nfsd_file_get() / nfsd_file_put() to access the backing inode's flc_posix list, and nfsd_file_put() can sleep if the inode was recently removed.
Let's instead rely on the stateowner's reference count to gate whether the release is permitted. This should be a reliable indication of locks-in-use since file lock operations and ->lm_get_owner take appropriate references, which are released appropriately when file locks are removed.
Reported-by: Dai Ngo dai.ngo@oracle.com
Signed-off-by: Chuck Lever chuck.lever@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/nfsd/nfs4state.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -6894,16 +6894,12 @@ nfsd4_release_lockowner(struct svc_rqst
 		if (sop->so_is_open_owner || !same_owner_str(sop, owner))
 			continue;

-		/* see if there are still any locks associated with it */
-		lo = lockowner(sop);
-		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
-			if (check_for_locks(stp->st_stid.sc_file, lo)) {
-				status = nfserr_locks_held;
-				spin_unlock(&clp->cl_lock);
-				return status;
-			}
+		if (atomic_read(&sop->so_count) != 1) {
+			spin_unlock(&clp->cl_lock);
+			return nfserr_locks_held;
 		}

+		lo = lockowner(sop);
 		nfs4_get_stateowner(sop);
 		break;
 	}
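The idea of gating the release on the reference count, rather than walking the lock list while holding a spinlock, can be sketched with C11 atomics. Everything here is a hypothetical userspace model: struct owner, lock_taken(), lock_released() and may_release() are illustrative names, and the real code relies on file lock operations and ->lm_get_owner to take the references.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a state owner with a reference count. */
struct owner {
	atomic_int count;	/* 1 means only the base reference remains */
};

/* Each lock taken on behalf of the owner pins it with an extra reference. */
static void lock_taken(struct owner *o)
{
	atomic_fetch_add(&o->count, 1);
}

static void lock_released(struct owner *o)
{
	atomic_fetch_sub(&o->count, 1);
}

/* Release is permitted only when nothing else holds a reference. */
static bool may_release(struct owner *o)
{
	return atomic_load(&o->count) == 1;
}
```

Reading a counter cannot sleep, so the check is safe under clp->cl_lock, which is the property the commit message is after.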
From: Liu Jian liujian56@huawei.com
commit 45969b4152c1752089351cd6836a42a566d49bcf upstream.
The data length of skb frags + frag_list may be greater than 0xffff, and skb_header_pointer can not handle negative offset. So, here INT_MAX is used to check the validity of offset. Add the same change to the related function skb_store_bytes.
Fixes: 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper")
Signed-off-by: Liu Jian liujian56@huawei.com
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Acked-by: Song Liu songliubraving@fb.com
Link: https://lore.kernel.org/bpf/20220416105801.88708-2-liujian56@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/core/filter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1668,7 +1668,7 @@ BPF_CALL_5(bpf_skb_store_bytes, struct s

 	if (unlikely(flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH)))
 		return -EINVAL;
-	if (unlikely(offset > 0xffff))
+	if (unlikely(offset > INT_MAX))
 		return -EFAULT;
 	if (unlikely(bpf_try_make_writable(skb, offset + len)))
 		return -EFAULT;
@@ -1703,7 +1703,7 @@ BPF_CALL_4(bpf_skb_load_bytes, const str
 {
 	void *ptr;

-	if (unlikely(offset > 0xffff))
+	if (unlikely(offset > INT_MAX))
 		goto err_clear;

 	ptr = skb_header_pointer(skb, offset, len, to);
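The effect of widening the bound can be shown with two small predicates. This sketch assumes the helper's offset argument is an unsigned 32-bit value (the truncated hunk headers do not show the type); offset_ok_old() and offset_ok_new() are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Old bound: rejects any offset above 0xffff. */
static bool offset_ok_old(uint32_t offset)
{
	return offset <= 0xffff;
}

/* New bound: accepts offsets up to INT_MAX and still rejects values that
 * would be negative if reinterpreted as a signed int. */
static bool offset_ok_new(uint32_t offset)
{
	return offset <= (uint32_t)INT_MAX;
}
```

An offset of 0x10000, which is reachable when frags plus frag_list hold more than 0xffff bytes, fails the old check and passes the new one, while a value such as (uint32_t)-1 is rejected by both.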
Hi Greg,
On Fri, Jun 03, 2022 at 07:42:56PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.197 release. There are 34 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
Build test:
mips (gcc version 11.3.1 20220531): 65 configs -> no failure
arm (gcc version 11.3.1 20220531): 106 configs -> no failure
arm64 (gcc version 11.3.1 20220531): 2 configs -> no failure
x86_64 (gcc version 11.3.1 20220531): 4 configs -> no failure

Boot test:
x86_64: Booted on my test laptop. No regression.
x86_64: Booted on qemu. No regression. [1]
[1]. https://openqa.qa.codethink.co.uk/tests/1264
Tested-by: Sudip Mukherjee sudip.mukherjee@codethink.co.uk
--
Regards
Sudip
On Fri, 3 Jun 2022 at 23:14, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.4.197 release. There are 34 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.197-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
## Build
* kernel: 5.4.197-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-5.4.y
* git commit: 2b69e7392fd9509c34f22e22898d4fd8de4bac19
* git describe: v5.4.196-35-g2b69e7392fd9
* test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.4.y/build/v5.4.19...

## Test Regressions (compared to v5.4.196-11-g04a2bb5e4a0b)
No test regressions found.

## Metric Regressions (compared to v5.4.196-11-g04a2bb5e4a0b)
No metric regressions found.

## Test Fixes (compared to v5.4.196-11-g04a2bb5e4a0b)
No test fixes found.

## Metric Fixes (compared to v5.4.196-11-g04a2bb5e4a0b)
No metric fixes found.

## Test result summary
total: 130079, pass: 116477, fail: 185, skip: 12140, xfail: 1277

## Build Summary
* arc: 10 total, 10 passed, 0 failed
* arm: 313 total, 313 passed, 0 failed
* arm64: 57 total, 53 passed, 4 failed
* i386: 28 total, 25 passed, 3 failed
* mips: 37 total, 37 passed, 0 failed
* parisc: 12 total, 12 passed, 0 failed
* powerpc: 54 total, 54 passed, 0 failed
* riscv: 27 total, 27 passed, 0 failed
* s390: 12 total, 12 passed, 0 failed
* sh: 24 total, 24 passed, 0 failed
* sparc: 12 total, 12 passed, 0 failed
* x86_64: 55 total, 54 passed, 1 failed

## Test suites summary
* fwts
* kunit
* kvm-unit-tests
* libgpiod
* libhugetlbfs
* log-parser-boot
* log-parser-test
* ltp-cap_bounds
* ltp-cap_bounds-tests
* ltp-commands
* ltp-commands-tests
* ltp-containers
* ltp-containers-tests
* ltp-controllers-tests
* ltp-cpuhotplug-tests
* ltp-crypto
* ltp-crypto-tests
* ltp-cve-tests
* ltp-dio-tests
* ltp-fcntl-locktests
* ltp-fcntl-locktests-tests
* ltp-filecaps
* ltp-filecaps-tests
* ltp-fs
* ltp-fs-tests
* ltp-fs_bind
* ltp-fs_bind-tests
* ltp-fs_perms_simple
* ltp-fs_perms_simple-tests
* ltp-fsx
* ltp-fsx-tests
* ltp-hugetlb
* ltp-hugetlb-tests
* ltp-io
* ltp-io-tests
* ltp-ipc
* ltp-ipc-tests
* ltp-math-tests
* ltp-mm-tests
* ltp-nptl
* ltp-nptl-tests
* ltp-open-posix-tests
* ltp-pty
* ltp-pty-tests
* ltp-sched
* ltp-sched-tests
* ltp-securebits
* ltp-securebits-tests
* ltp-syscalls-tests
* ltp-tracing-tests
* network-basic-tests
* packetdrill
* perf
* perf/Zstd-perf.data-compression
* rcutorture
* ssuite
* v4l2-compliance
* vdso
--
Linaro LKFT
https://lkft.linaro.org
On Fri, Jun 03, 2022 at 07:42:56PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.197 release. There are 34 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
Build results:
	total: 160 pass: 160 fail: 0
Qemu test results:
	total: 449 pass: 449 fail: 0
Tested-by: Guenter Roeck linux@roeck-us.net
Guenter
On 2022/6/4 1:42, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.197 release. There are 34 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.197-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Tested on arm64 and x86 for 5.4.197-rc1,
Kernel repo:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Branch: linux-5.4.y
Version: 5.4.197-rc1
Commit: 2b69e7392fd9509c34f22e22898d4fd8de4bac19
Compiler: gcc version 7.3.0 (GCC)

arm64:
--------------------------------------------------------------------
Testcase Result Summary:
total: 9030
passed: 9030
failed: 0
timeout: 0
--------------------------------------------------------------------

x86:
--------------------------------------------------------------------
Testcase Result Summary:
total: 9030
passed: 9030
failed: 0
timeout: 0
--------------------------------------------------------------------
Tested-by: Hulk Robot hulkrobot@huawei.com