This is the start of the stable review cycle for the 5.18.2 release. There are 67 patches in this series, all of which will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 05 Jun 2022 17:38:05 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at:
	https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.18.2-rc1....
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.18.2-rc1
Kumar Kartikeya Dwivedi memxor@gmail.com bpf: Do write access check for kfunc and global func
Kumar Kartikeya Dwivedi memxor@gmail.com bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access
Kumar Kartikeya Dwivedi memxor@gmail.com bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access
Yuntao Wang ytcoode@gmail.com bpf: Fix excessive memory allocation in stack_map_alloc()
KP Singh kpsingh@kernel.org bpf: Fix usage of trace RCU in local storage.
Liu Jian liujian56@huawei.com bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes
Alexei Starovoitov ast@kernel.org bpf: Fix combination of jit blinding and pointers to bpf subprogs.
Yuntao Wang ytcoode@gmail.com bpf: Fix potential array overflow in bpf_trampoline_get_progs()
Song Liu song@kernel.org bpf: Fill new bpf_prog_pack with illegal instructions
Chuck Lever chuck.lever@oracle.com NFSD: Fix possible sleep during nfsd4_release_lockowner()
Trond Myklebust trond.myklebust@hammerspace.com NFS: Memory allocation failures are not server fatal errors
Akira Yokosawa akiyks@gmail.com docs: submitting-patches: Fix crossref to 'The canonical patch format'
Xiu Jianfeng xiujianfeng@huawei.com tpm: ibmvtpm: Correct the return value in tpm_ibmvtpm_probe()
Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com tpm: Fix buffer access in tpm2_get_tpm_pt()
Bryan O'Donoghue bryan.odonoghue@linaro.org media: i2c: imx412: Fix power_off ordering
Bryan O'Donoghue bryan.odonoghue@linaro.org media: i2c: imx412: Fix reset GPIO polarity
Reinette Chatre reinette.chatre@intel.com x86/sgx: Ensure no data in PCMD page after truncate
Reinette Chatre reinette.chatre@intel.com x86/sgx: Fix race between reclaimer and page fault handler
Reinette Chatre reinette.chatre@intel.com x86/sgx: Obtain backing storage page with enclave mutex held
Reinette Chatre reinette.chatre@intel.com x86/sgx: Mark PCMD page as dirty when modifying contents
Reinette Chatre reinette.chatre@intel.com x86/sgx: Disconnect backing page references from dirty status
Tao Jin tao-j@outlook.com HID: multitouch: add quirks to enable Lenovo X12 trackpoint
Marek Maślanka mm@semihalf.com HID: multitouch: Add support for Google Whiskers Touchpad
Randy Dunlap rdunlap@infradead.org fs/ntfs3: validate BOOT sectors_per_clusters
Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com raid5: introduce MD_BROKEN
Sarthak Kukreti sarthakkukreti@google.com dm verity: set DM_TARGET_IMMUTABLE feature flag
Mikulas Patocka mpatocka@redhat.com dm stats: add cond_resched when looping over entries
Mikulas Patocka mpatocka@redhat.com dm crypt: make printing of the key constant-time
Dan Carpenter dan.carpenter@oracle.com dm integrity: fix error code in dm_integrity_ctr()
Jonathan Bakker xc-racer2@live.ca ARM: dts: s5pv210: Correct interrupt name for bluetooth in Aries
Steven Rostedt rostedt@goodmis.org Bluetooth: hci_qca: Use del_timer_sync() before freeing
Craig McLure craig@mclure.net ALSA: usb-audio: Configure sync endpoints before data
Takashi Iwai tiwai@suse.de ALSA: usb-audio: Add missing ep_idx in fixed EP quirks
Takashi Iwai tiwai@suse.de ALSA: usb-audio: Workaround for clock setup on TEAC devices
Akira Yokosawa akiyks@gmail.com tools/memory-model/README: Update klitmus7 compat table
Sultan Alsawaf sultan@kerneltoast.com zsmalloc: fix races between asynchronous zspage free and page migration
Marco Chiappero marco.chiappero@intel.com crypto: qat - rework the VF2PF interrupt handling logic
Vitaly Chikunov vt@altlinux.org crypto: ecrdsa - Fix incorrect use of vli_cmp
Fabio Estevam festevam@denx.de crypto: caam - fix i.MX6SX entropy delay value
Ashish Kalra ashish.kalra@amd.com KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
Hou Wenlong houwenlong.hwl@antgroup.com KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing is required
Sean Christopherson seanjc@google.com KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2
Yanfei Xu yanfei.xu@intel.com KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
Maxim Levitsky mlevitsk@redhat.com KVM: x86: avoid loading a vCPU after .vm_destroy was called
Sean Christopherson seanjc@google.com KVM: x86: avoid calling x86 emulator without a decoded instruction
Maxim Levitsky mlevitsk@redhat.com KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness
Sean Christopherson seanjc@google.com KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses
Sean Christopherson seanjc@google.com KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits
Peter Zijlstra peterz@infradead.org x86/uaccess: Implement macros for CMPXCHG on user addresses
Paolo Bonzini pbonzini@redhat.com x86, kvm: use correct GFP flags for preemption disabled
Sean Christopherson seanjc@google.com x86/kvm: Alloc dummy async #PF token outside of raw spinlock
Sean Christopherson seanjc@google.com x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
Xiaomeng Tong xiam0nd.tong@gmail.com KVM: PPC: Book3S HV: fix incorrect NULL check on list iterator
Florian Westphal fw@strlen.de netfilter: conntrack: re-fetch conntrack after insertion
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: double hook unregistration in netns path
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: hold mutex on netns pre_exit path
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: sanitize nft_set_desc_concat_parse()
Phil Sutter phil@nwl.cc netfilter: nft_limit: Clone packet limits' cost value
Yuezhang Mo Yuezhang.Mo@sony.com exfat: fix referencing wrong parent directory information after renaming
Tadeusz Struk tadeusz.struk@linaro.org exfat: check if cluster num is valid
Gustavo A. R. Silva gustavoars@kernel.org drm/i915: Fix -Wstringop-overflow warning in call to intel_read_wm_latency()
Alex Elder elder@linaro.org net: ipa: compute proper aggregation limit
David Howells dhowells@redhat.com pipe: Fix missing lock in pipe_resize_ring()
Kuniyuki Iwashima kuniyu@amazon.co.jp pipe: make poll_usage boolean and annotate its access
Stephen Brennan stephen.s.brennan@oracle.com assoc_array: Fix BUG_ON during garbage collect
Dan Carpenter dan.carpenter@oracle.com i2c: ismt: prevent memory corruption in ismt_access()
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: disallow non-stateful expression in sets earlier
-------------
Diffstat:
 Documentation/process/submitting-patches.rst      |   2 +-
 Makefile                                           |   4 +-
 arch/arm/boot/dts/s5pv210-aries.dtsi               |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c                 |   8 +-
 arch/x86/include/asm/uaccess.h                     | 142 +++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.c                     | 113 ++++++++++++++--
 arch/x86/kernel/cpu/sgx/encl.h                     |   2 +-
 arch/x86/kernel/cpu/sgx/main.c                     |  13 +-
 arch/x86/kernel/fpu/core.c                         |  17 ++-
 arch/x86/kernel/kvm.c                              |  41 ++++--
 arch/x86/kvm/mmu/mmu.c                             |  18 +--
 arch/x86/kvm/mmu/paging_tmpl.h                     |  38 +-----
 arch/x86/kvm/svm/nested.c                          |   3 -
 arch/x86/kvm/svm/sev.c                             |  12 +-
 arch/x86/kvm/vmx/nested.c                          |   3 -
 arch/x86/kvm/vmx/vmx.c                             |   2 +-
 arch/x86/kvm/x86.c                                 |  76 ++++++-----
 crypto/ecrdsa.c                                    |   8 +-
 drivers/bluetooth/hci_qca.c                        |   4 +-
 drivers/char/tpm/tpm2-cmd.c                        |  11 +-
 drivers/char/tpm/tpm_ibmvtpm.c                     |   1 +
 drivers/crypto/caam/ctrl.c                         |  18 +++
 drivers/crypto/qat/qat_common/adf_accel_devices.h  |   2 +-
 drivers/crypto/qat/qat_common/adf_gen2_pfvf.c      |  58 ++++++---
 drivers/crypto/qat/qat_common/adf_gen4_pfvf.c      |  44 +++++--
 drivers/crypto/qat/qat_common/adf_isr.c            |  17 +--
 .../crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c |  76 +++++++----
 drivers/gpu/drm/i915/intel_pm.c                    |   2 +-
 drivers/hid/hid-ids.h                              |   1 +
 drivers/hid/hid-multitouch.c                       |   9 ++
 drivers/i2c/busses/i2c-ismt.c                      |   3 +
 drivers/md/dm-crypt.c                              |  14 +-
 drivers/md/dm-integrity.c                          |   2 -
 drivers/md/dm-stats.c                              |   8 ++
 drivers/md/dm-verity-target.c                      |   1 +
 drivers/md/raid5.c                                 |  47 ++++---
 drivers/media/i2c/imx412.c                         |   8 +-
 drivers/net/ipa/ipa_endpoint.c                     |   9 +-
 fs/exfat/balloc.c                                  |   8 +-
 fs/exfat/exfat_fs.h                                |   6 +
 fs/exfat/fatent.c                                  |   6 -
 fs/exfat/namei.c                                   |  27 +---
 fs/nfs/internal.h                                  |   1 +
 fs/nfsd/nfs4state.c                                |  12 +-
 fs/ntfs3/super.c                                   |  10 +-
 fs/pipe.c                                          |  33 +++--
 include/linux/bpf_local_storage.h                  |   4 +-
 include/linux/pipe_fs_i.h                          |   2 +-
 include/net/netfilter/nf_conntrack_core.h          |   7 +-
 kernel/bpf/bpf_inode_storage.c                     |   4 +-
 kernel/bpf/bpf_local_storage.c                     |  29 +++--
 kernel/bpf/bpf_task_storage.c                      |   4 +-
 kernel/bpf/core.c                                  |  20 ++-
 kernel/bpf/stackmap.c                              |   1 -
 kernel/bpf/trampoline.c                            |  18 ++-
 kernel/bpf/verifier.c                              |  61 ++++++---
 lib/assoc_array.c                                  |   8 ++
 mm/zsmalloc.c                                      |  37 +++++-
 net/core/bpf_sk_storage.c                          |   6 +-
 net/core/filter.c                                  |   4 +-
 net/netfilter/nf_tables_api.c                      |  94 ++++++++++----
 net/netfilter/nft_limit.c                          |   2 +
 sound/usb/clock.c                                  |   7 +
 sound/usb/pcm.c                                    |  17 ++-
 sound/usb/quirks-table.h                           |   3 +
 tools/memory-model/README                          |   3 +-
 66 files changed, 882 insertions(+), 391 deletions(-)
From: Pablo Neira Ayuso pablo@netfilter.org
commit 520778042ccca019f3ffa136dd0ca565c486cedd upstream.
Since 3e135cd499bf ("netfilter: nft_dynset: dynamic stateful expression instantiation"), it is possible to attach stateful expressions to set elements.
cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase") introduces conditional destruction on the object to accommodate transaction semantics.
nft_expr_init() calls expr->ops->init() first, then checks for NFT_EXPR_STATEFUL. This still allows initializing a non-stateful lookup expression which points to a set, which might lead to a UAF since the set is not properly detached from set->binding in this case. In any case, this combination is nonsense from the nf_tables perspective.

This patch fixes this problem by checking for NFT_EXPR_STATEFUL before expr->ops->init() is called.
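For illustration, a small userspace sketch of that ordering (hypothetical names, not the nf_tables code): the type flags are checked before anything is allocated or initialized, so a refused expression has nothing to unwind:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define EXPR_STATEFUL 0x1	/* stands in for NFT_EXPR_STATEFUL */

struct ops { unsigned int flags; size_t size; };

/*
 * Hypothetical sketch of the fixed ordering: reject unsupported types
 * before anything is allocated or initialized, so a rejected expression
 * never leaves state (such as a set binding) behind that must be undone.
 */
static void *expr_create(const struct ops *ops, int *err)
{
	void *expr;

	if (!(ops->flags & EXPR_STATEFUL)) {
		*err = -EOPNOTSUPP;	/* refused before init */
		return NULL;
	}
	expr = calloc(1, ops->size);
	if (!expr) {
		*err = -ENOMEM;
		return NULL;
	}
	*err = 0;
	return expr;		/* ->init() would only run from here on */
}

int main(void)
{
	const struct ops lookup_ops = { .flags = 0, .size = 64 };
	int err;

	printf("%p %d\n", expr_create(&lookup_ops, &err), err); /* (nil) -95 */
	return 0;
}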
The reporter provides a KASAN splat and a PoC reproducer (similar to those autogenerated by syzbot to report use-after-free errors). It is unknown to me if they are using syzbot or a similar automated tool to locate the bug that they are reporting.
For the record, this is the KASAN splat.
[ 85.431824] ==================================================================
[ 85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20
[ 85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776
[ 85.434756]
[ 85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G W 5.18.0+ #2
[ 85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Fixes: 0b2d8a7b638b ("netfilter: nf_tables: add helper functions for expression handling") Reported-and-tested-by: Aaron Adams edg-e@nccgroup.com Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netfilter/nf_tables_api.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
--- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2873,27 +2873,31 @@ static struct nft_expr *nft_expr_init(co
err = nf_tables_expr_parse(ctx, nla, &expr_info); if (err < 0) - goto err1; + goto err_expr_parse; + + err = -EOPNOTSUPP; + if (!(expr_info.ops->type->flags & NFT_EXPR_STATEFUL)) + goto err_expr_stateful;
err = -ENOMEM; expr = kzalloc(expr_info.ops->size, GFP_KERNEL_ACCOUNT); if (expr == NULL) - goto err2; + goto err_expr_stateful;
err = nf_tables_newexpr(ctx, &expr_info, expr); if (err < 0) - goto err3; + goto err_expr_new;
return expr; -err3: +err_expr_new: kfree(expr); -err2: +err_expr_stateful: owner = expr_info.ops->type->owner; if (expr_info.ops->type->release_ops) expr_info.ops->type->release_ops(expr_info.ops);
module_put(owner); -err1: +err_expr_parse: return ERR_PTR(err); }
@@ -5413,9 +5417,6 @@ struct nft_expr *nft_set_elem_expr_alloc return expr;
err = -EOPNOTSUPP; - if (!(expr->ops->type->flags & NFT_EXPR_STATEFUL)) - goto err_set_elem_expr; - if (expr->ops->type->flags & NFT_EXPR_GC) { if (set->flags & NFT_SET_TIMEOUT) goto err_set_elem_expr;
From: Dan Carpenter dan.carpenter@oracle.com
commit 690b2549b19563ec5ad53e5c82f6a944d910086e upstream.
The "data->block[0]" variable comes from the user and is a number between 0 and 255. It needs to be capped to prevent writing beyond the end of dma_buffer[].
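As a standalone illustration (a hypothetical helper, not the driver code), capping the user-supplied block length before the copy looks like this:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SMBUS_BLOCK_MAX 32	/* stands in for I2C_SMBUS_BLOCK_MAX */

/*
 * Hypothetical stand-in for the descriptor setup: block[0] is a
 * user-controlled length (0-255), but the DMA buffer only has room for the
 * maximum SMBus block plus the length byte, so the length is capped first.
 */
static int build_block_proc_call(const uint8_t *block, uint8_t *dma_buf,
				 size_t dma_size)
{
	uint8_t len = block[0];

	if (len > SMBUS_BLOCK_MAX || (size_t)len + 1 > dma_size)
		return -EINVAL;		/* reject instead of overflowing */

	memcpy(dma_buf, block, len + 1);	/* length byte + payload */
	return 0;
}

int main(void)
{
	uint8_t block[256] = { 255 };	/* worst case coming from userspace */
	uint8_t dma_buf[SMBUS_BLOCK_MAX + 1];

	/* prints -22 (-EINVAL) instead of writing past the end of dma_buf */
	printf("%d\n", build_block_proc_call(block, dma_buf, sizeof(dma_buf)));
	return 0;
}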
Fixes: 5e9a97b1f449 ("i2c: ismt: Adding support for I2C_SMBUS_BLOCK_PROC_CALL") Reported-and-tested-by: Zheyu Ma zheyuma97@gmail.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Reviewed-by: Mika Westerberg mika.westerberg@linux.intel.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/i2c/busses/i2c-ismt.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/i2c/busses/i2c-ismt.c +++ b/drivers/i2c/busses/i2c-ismt.c @@ -528,6 +528,9 @@ static int ismt_access(struct i2c_adapte
case I2C_SMBUS_BLOCK_PROC_CALL: dev_dbg(dev, "I2C_SMBUS_BLOCK_PROC_CALL\n"); + if (data->block[0] > I2C_SMBUS_BLOCK_MAX) + return -EINVAL; + dma_size = I2C_SMBUS_BLOCK_MAX; desc->tgtaddr_rw = ISMT_DESC_ADDR_RW(addr, 1); desc->wr_len_cmd = data->block[0] + 1;
From: Stephen Brennan stephen.s.brennan@oracle.com
commit d1dc87763f406d4e67caf16dbe438a5647692395 upstream.
A rare BUG_ON triggered in assoc_array_gc:
[3430308.818153] kernel BUG at lib/assoc_array.c:1609!
Which corresponded to the statement currently at line 1593 upstream:
BUG_ON(assoc_array_ptr_is_meta(p));
Using the data from the core dump, I was able to generate a userspace reproducer[1] and determine the cause of the bug.
[1]: https://github.com/brenns10/kernel_stuff/tree/master/assoc_array_gc
After running the iterator on the entire branch, an internal tree node looked like the following:
NODE (nr_leaves_on_branch: 3)
  SLOT [0]     NODE (2 leaves)
  SLOT [1]     NODE (1 leaf)
  SLOT [2..f]  NODE (empty)
In the userspace reproducer, the pr_devel output when compressing this node was:
-- compress node 0x5607cc089380 --
free=0, leaves=0
[0] retain node 2/1 [nx 0]
[1] fold node 1/1 [nx 0]
[2] fold node 0/1 [nx 2]
[3] fold node 0/2 [nx 2]
[4] fold node 0/3 [nx 2]
[5] fold node 0/4 [nx 2]
[6] fold node 0/5 [nx 2]
[7] fold node 0/6 [nx 2]
[8] fold node 0/7 [nx 2]
[9] fold node 0/8 [nx 2]
[10] fold node 0/9 [nx 2]
[11] fold node 0/10 [nx 2]
[12] fold node 0/11 [nx 2]
[13] fold node 0/12 [nx 2]
[14] fold node 0/13 [nx 2]
[15] fold node 0/14 [nx 2]
after: 3
At slot 0, an internal node with 2 leaves could not be folded into the node, because there was only one available slot (slot 0). Thus, the internal node was retained. At slot 1, the node had one leaf, and was able to be folded in successfully. The remaining nodes had no leaves, and so were removed. By the end of the compression stage, there were 14 free slots, and only 3 leaf nodes. The tree was ascended and then its parent node was compressed. When this node was seen, it could not be folded, due to the internal node it contained.
The invariant for compression in this function is: whenever nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT, the node should contain all leaf nodes. The compression step currently cannot guarantee this, given the corner case shown above.
To fix this issue, retry compression whenever we have retained a node, and yet nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT. This second compression will then allow the node in slot 1 to be folded in, satisfying the invariant. Below is the output of the reproducer once the fix is applied:
-- compress node 0x560e9c562380 --
free=0, leaves=0
[0] retain node 2/1 [nx 0]
[1] fold node 1/1 [nx 0]
[2] fold node 0/1 [nx 2]
[3] fold node 0/2 [nx 2]
[4] fold node 0/3 [nx 2]
[5] fold node 0/4 [nx 2]
[6] fold node 0/5 [nx 2]
[7] fold node 0/6 [nx 2]
[8] fold node 0/7 [nx 2]
[9] fold node 0/8 [nx 2]
[10] fold node 0/9 [nx 2]
[11] fold node 0/10 [nx 2]
[12] fold node 0/11 [nx 2]
[13] fold node 0/12 [nx 2]
[14] fold node 0/13 [nx 2]
[15] fold node 0/14 [nx 2]
internal nodes remain despite enough space, retrying
-- compress node 0x560e9c562380 --
free=14, leaves=1
[0] fold node 2/15 [nx 0]
after: 3
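The effect of the retry can also be seen with a small toy model of the compression pass (an assumed simplification for illustration, not the real lib/assoc_array.c structures):

#include <stdbool.h>
#include <stdio.h>

#define FAN_OUT 16

/*
 * Toy model: leaves[i] is the number of leaves below slot i, is_node[i]
 * says whether the slot still holds an internal child node. A child may
 * only be folded in if its leaves fit into the free slots plus the slot
 * the child itself would give back.
 */
static bool compress(int leaves[FAN_OUT], bool is_node[FAN_OUT])
{
	int nr_free = 0, total = 0, i;
	bool retained = false;

	for (i = 0; i < FAN_OUT; i++) {
		total += leaves[i];
		if (!is_node[i] && !leaves[i])
			nr_free++;		/* empty slot */
	}

	for (i = 0; i < FAN_OUT; i++) {
		if (!is_node[i])
			continue;
		if (leaves[i] <= nr_free + 1) {	/* fold the child in */
			nr_free += 1 - leaves[i];
			is_node[i] = false;
		} else {
			retained = true;	/* no room: keep internal node */
		}
	}

	/* the fix: request another pass while internal nodes remain even
	 * though every leaf would fit directly into this node */
	return retained && total <= FAN_OUT;
}

int main(void)
{
	/* slot 0: node with 2 leaves, slot 1: node with 1 leaf, the rest
	 * empty nodes -- the shape from the reproducer above */
	int leaves[FAN_OUT] = { 2, 1 };
	bool is_node[FAN_OUT];
	int passes = 1, i;

	for (i = 0; i < FAN_OUT; i++)
		is_node[i] = true;
	while (compress(leaves, is_node))
		passes++;
	printf("compression passes: %d\n", passes);	/* 2 with the retry */
	return 0;
}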
Changes
=======
DH:
 - Use false instead of 0.
 - Reorder the inserted lines in a couple of places to put retained before next_slot.
ver #2)
 - Fix typo in pr_devel, correct comparison to "<="
Fixes: 3cb989501c26 ("Add a generic associative array implementation.") Cc: stable@vger.kernel.org Signed-off-by: Stephen Brennan stephen.s.brennan@oracle.com Signed-off-by: David Howells dhowells@redhat.com cc: Andrew Morton akpm@linux-foundation.org cc: keyrings@vger.kernel.org Link: https://lore.kernel.org/r/20220511225517.407935-1-stephen.s.brennan@oracle.c... # v1 Link: https://lore.kernel.org/r/20220512215045.489140-1-stephen.s.brennan@oracle.c... # v2 Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- lib/assoc_array.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/lib/assoc_array.c +++ b/lib/assoc_array.c @@ -1461,6 +1461,7 @@ int assoc_array_gc(struct assoc_array *a struct assoc_array_ptr *cursor, *ptr; struct assoc_array_ptr *new_root, *new_parent, **new_ptr_pp; unsigned long nr_leaves_on_tree; + bool retained; int keylen, slot, nr_free, next_slot, i;
pr_devel("-->%s()\n", __func__); @@ -1536,6 +1537,7 @@ continue_node: goto descend; }
+retry_compress: pr_devel("-- compress node %p --\n", new_n);
/* Count up the number of empty slots in this node and work out the @@ -1553,6 +1555,7 @@ continue_node: pr_devel("free=%d, leaves=%lu\n", nr_free, new_n->nr_leaves_on_branch);
/* See what we can fold in */ + retained = false; next_slot = 0; for (slot = 0; slot < ASSOC_ARRAY_FAN_OUT; slot++) { struct assoc_array_shortcut *s; @@ -1602,9 +1605,14 @@ continue_node: pr_devel("[%d] retain node %lu/%d [nx %d]\n", slot, child->nr_leaves_on_branch, nr_free + 1, next_slot); + retained = true; } }
+ if (retained && new_n->nr_leaves_on_branch <= ASSOC_ARRAY_FAN_OUT) { + pr_devel("internal nodes remain despite enough space, retrying\n"); + goto retry_compress; + } pr_devel("after: %lu\n", new_n->nr_leaves_on_branch);
nr_leaves_on_tree = new_n->nr_leaves_on_branch;
From: Kuniyuki Iwashima kuniyu@amazon.co.jp
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it is never changed elsewhere. However, concurrent plain writes, even of the same value, still trigger KCSAN, so let's make KCSAN happy.
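A userspace analogue of the change (using C11 relaxed atomics in place of the kernel's READ_ONCE/WRITE_ONCE; build with gcc -pthread; not the kernel code itself):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* A flag that is only ever set to true, read and written locklessly from
 * several threads. Plain accesses are a data race in the C memory model even
 * though every writer stores the same value; relaxed atomics make the intent
 * explicit and keep race detectors (TSan, and KCSAN in the kernel) quiet. */
static atomic_bool poll_usage;

static void *poller(void *arg)
{
	(void)arg;
	/* relaxed store: the userspace counterpart of WRITE_ONCE() */
	atomic_store_explicit(&poll_usage, true, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, poller, NULL);
	pthread_create(&b, NULL, poller, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("poll_usage=%d\n",
	       (int)atomic_load_explicit(&poll_usage, memory_order_relaxed));
	return 0;
}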
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3:
 pipe_poll (fs/pipe.c:656)
 ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
 do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
 __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
 do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1:
 pipe_poll (fs/pipe.c:656)
 ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
 do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
 __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
 do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6
Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.co.jp Cc: Alexander Duyck alexander.h.duyck@intel.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: Davidlohr Bueso dave@stgolabs.net Cc: Kuniyuki Iwashima kuni1840@gmail.com Cc: "Soheil Hassas Yeganeh" soheil@google.com Cc: "Sridhar Samudrala" sridhar.samudrala@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/pipe.c | 2 +- include/linux/pipe_fs_i.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
--- a/fs/pipe.c +++ b/fs/pipe.c @@ -653,7 +653,7 @@ pipe_poll(struct file *filp, poll_table unsigned int head, tail;
/* Epoll has some historical nasty semantics, this enables them */ - pipe->poll_usage = 1; + WRITE_ONCE(pipe->poll_usage, true);
/* * Reading pipe state only -- no need for acquiring the semaphore. --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -71,7 +71,7 @@ struct pipe_inode_info { unsigned int files; unsigned int r_counter; unsigned int w_counter; - unsigned int poll_usage; + bool poll_usage; struct page *tmp_page; struct fasync_struct *fasync_readers; struct fasync_struct *fasync_writers;
From: David Howells dhowells@redhat.com
commit 189b0ddc245139af81198d1a3637cac74f96e13a upstream.
pipe_resize_ring() needs to take the pipe->rd_wait.lock spinlock to prevent post_one_notification() from trying to insert into the ring whilst the ring is being replaced.
The occupancy check must be done after the lock is taken, and the lock must be taken after the new ring is allocated.
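A minimal userspace sketch of that ordering (hypothetical ring structure, not fs/pipe.c; copying of the existing entries is omitted):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocate the replacement ring first, because allocation may sleep and
 * must happen outside the lock; then take the lock, and only then check
 * the occupancy and swap the ring in.
 */
struct ring {
	pthread_spinlock_t lock;
	int *bufs;
	unsigned int slots, occupied;
};

static int ring_resize(struct ring *r, unsigned int nr_slots)
{
	int *bufs = calloc(nr_slots, sizeof(*bufs));
	int *old;

	if (!bufs)
		return -ENOMEM;

	pthread_spin_lock(&r->lock);
	if (nr_slots < r->occupied) {		/* checked under the lock */
		pthread_spin_unlock(&r->lock);
		free(bufs);
		return -EBUSY;
	}
	old = r->bufs;				/* install the new ring */
	r->bufs = bufs;				/* (entry copying omitted) */
	r->slots = nr_slots;
	pthread_spin_unlock(&r->lock);
	free(old);
	return 0;
}

int main(void)
{
	struct ring r = { .slots = 16, .occupied = 4 };

	pthread_spin_init(&r.lock, PTHREAD_PROCESS_PRIVATE);
	r.bufs = calloc(r.slots, sizeof(*r.bufs));
	printf("shrink to 2: %d\n", ring_resize(&r, 2));	/* -16 (-EBUSY) */
	printf("shrink to 8: %d\n", ring_resize(&r, 8));	/* 0 */
	return 0;
}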
The bug can lead to an oops looking something like:
BUG: KASAN: use-after-free in post_one_notification.isra.0+0x62e/0x840
Read of size 4 at addr ffff88801cc72a70 by task poc/27196
...
Call Trace:
 post_one_notification.isra.0+0x62e/0x840
 __post_watch_notification+0x3b7/0x650
 key_create_or_update+0xb8b/0xd20
 __do_sys_add_key+0x175/0x340
 __x64_sys_add_key+0xbe/0x140
 do_syscall_64+0x5c/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported by Selim Enes Karaduman @Enesdex working with Trend Micro Zero Day Initiative.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17291 Signed-off-by: David Howells dhowells@redhat.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/pipe.c | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-)
--- a/fs/pipe.c +++ b/fs/pipe.c @@ -1245,30 +1245,33 @@ unsigned int round_pipe_size(unsigned lo
/* * Resize the pipe ring to a number of slots. + * + * Note the pipe can be reduced in capacity, but only if the current + * occupancy doesn't exceed nr_slots; if it does, EBUSY will be + * returned instead. */ int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) { struct pipe_buffer *bufs; unsigned int head, tail, mask, n;
- /* - * We can shrink the pipe, if arg is greater than the ring occupancy. - * Since we don't expect a lot of shrink+grow operations, just free and - * allocate again like we would do for growing. If the pipe currently - * contains more buffers than arg, then return busy. - */ - mask = pipe->ring_size - 1; - head = pipe->head; - tail = pipe->tail; - n = pipe_occupancy(pipe->head, pipe->tail); - if (nr_slots < n) - return -EBUSY; - bufs = kcalloc(nr_slots, sizeof(*bufs), GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (unlikely(!bufs)) return -ENOMEM;
+ spin_lock_irq(&pipe->rd_wait.lock); + mask = pipe->ring_size - 1; + head = pipe->head; + tail = pipe->tail; + + n = pipe_occupancy(head, tail); + if (nr_slots < n) { + spin_unlock_irq(&pipe->rd_wait.lock); + kfree(bufs); + return -EBUSY; + } + /* * The pipe array wraps around, so just start the new one at zero * and adjust the indices. @@ -1300,6 +1303,8 @@ int pipe_resize_ring(struct pipe_inode_i pipe->tail = tail; pipe->head = head;
+ spin_unlock_irq(&pipe->rd_wait.lock); + /* This might have made more room for writers */ wake_up_interruptible(&pipe->wr_wait); return 0;
From: Alex Elder elder@linaro.org
commit c5794097b269f15961ed78f7f27b50e51766dec9 upstream.
The aggregation byte limit for an endpoint is currently computed based on the endpoint's receive buffer size.
However, some bytes at the front of each receive buffer are reserved on the assumption that--as with SKBs--it might be useful to insert data (such as headers) before what lands in the buffer.
The aggregation byte limit currently doesn't take into account that reserved space, and as a result, aggregation could require space past that which is available in the buffer.
Fix this by reducing the size used to compute the aggregation byte limit by the NET_SKB_PAD offset reserved for each receive buffer.
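Roughly, the arithmetic is the following (NET_SKB_PAD is taken as 64 here for illustration; the real value is config dependent, and the real ipa_aggr_size_kb() applies the hardware's own rounding rules):

#include <stdio.h>

#define SZ_1K		1024u
#define NET_SKB_PAD	64u	/* illustrative; architecture/config dependent */

/* Hypothetical helper mirroring the calculation described above: the bytes
 * reserved at the front of each receive buffer cannot hold aggregated data,
 * so they are subtracted before converting to the 1 KiB granularity
 * (plain truncation here for illustration). */
static unsigned int aggr_size_kb(unsigned int rx_buffer_size)
{
	return (rx_buffer_size - NET_SKB_PAD) / SZ_1K;
}

int main(void)
{
	unsigned int buffer_size = 8192;

	printf("usable %u bytes -> aggregation limit %u KB\n",
	       buffer_size - NET_SKB_PAD, aggr_size_kb(buffer_size));
	return 0;
}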
Signed-off-by: Alex Elder elder@linaro.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ipa/ipa_endpoint.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/drivers/net/ipa/ipa_endpoint.c +++ b/drivers/net/ipa/ipa_endpoint.c @@ -130,9 +130,10 @@ static bool ipa_endpoint_data_valid_one( */ if (data->endpoint.config.aggregation) { limit += SZ_1K * aggr_byte_limit_max(ipa->version); - if (buffer_size > limit) { + if (buffer_size - NET_SKB_PAD > limit) { dev_err(dev, "RX buffer size too large for aggregated RX endpoint %u (%u > %u)\n", - data->endpoint_id, buffer_size, limit); + data->endpoint_id, + buffer_size - NET_SKB_PAD, limit);
return false; } @@ -739,6 +740,7 @@ static void ipa_endpoint_init_aggr(struc if (endpoint->data->aggregation) { if (!endpoint->toward_ipa) { const struct ipa_endpoint_rx_data *rx_data; + u32 buffer_size; bool close_eof; u32 limit;
@@ -746,7 +748,8 @@ static void ipa_endpoint_init_aggr(struc val |= u32_encode_bits(IPA_ENABLE_AGGR, AGGR_EN_FMASK); val |= u32_encode_bits(IPA_GENERIC, AGGR_TYPE_FMASK);
- limit = ipa_aggr_size_kb(rx_data->buffer_size); + buffer_size = rx_data->buffer_size; + limit = ipa_aggr_size_kb(buffer_size - NET_SKB_PAD); val |= aggr_byte_limit_encoded(version, limit);
limit = IPA_AGGR_TIME_LIMIT;
From: Gustavo A. R. Silva gustavoars@kernel.org
commit 336feb502a715909a8136eb6a62a83d7268a353b upstream.
Fix the following -Wstringop-overflow warnings when building with GCC-11:
drivers/gpu/drm/i915/intel_pm.c:3106:9: warning: ‘intel_read_wm_latency’ accessing 16 bytes in a region of size 10 [-Wstringop-overflow=]
 3106 |         intel_read_wm_latency(dev_priv, dev_priv->wm.pri_latency);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c:3106:9: note: referencing argument 2 of type ‘u16 *’ {aka ‘short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2861:13: note: in a call to function ‘intel_read_wm_latency’
 2861 | static void intel_read_wm_latency(struct drm_i915_private *dev_priv,
      |             ^~~~~~~~~~~~~~~~~~~~~
by removing the over-specified array size from the argument declarations.
It seems that this code is actually safe because the size of the array depends on the hardware generation, and the function checks for that.
Notice that wm can be an array of 5 elements:
drivers/gpu/drm/i915/intel_pm.c:3109:
	intel_read_wm_latency(dev_priv, dev_priv->wm.pri_latency);

or an array of 8 elements:
drivers/gpu/drm/i915/intel_pm.c:3131:
	intel_read_wm_latency(dev_priv, dev_priv->wm.skl_latency);
and the compiler legitimately complains about that.
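A reduced, hypothetical example of the pattern GCC is warning about (not the i915 code; the explicit length parameter in the second variant is only for this standalone sketch, the kernel function derives the bound from the hardware generation):

#include <string.h>

/* Over-specified: the '8' is a promise GCC 11 may check callers against. */
static void read_latency_fixed(unsigned short wm[8])
{
	memset(wm, 0, 8 * sizeof(*wm));		/* writes 16 bytes */
}

/* What the patch switches to: no claimed bound in the declaration. */
static void read_latency(unsigned short wm[], unsigned int n)
{
	memset(wm, 0, n * sizeof(*wm));
}

static unsigned short pri_latency[5];		/* only 10 bytes */
static unsigned short skl_latency[8];

void setup(void)
{
	read_latency_fixed(pri_latency);	/* -Wstringop-overflow may fire here */
	read_latency(pri_latency, 5);		/* fine: no over-promised size */
	read_latency(skl_latency, 8);
}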
This helps with the ongoing efforts to globally enable -Wstringop-overflow.
Link: https://github.com/KSPP/linux/issues/181 Signed-off-by: Gustavo A. R. Silva gustavoars@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/gpu/drm/i915/intel_pm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/gpu/drm/i915/intel_pm.c +++ b/drivers/gpu/drm/i915/intel_pm.c @@ -2859,7 +2859,7 @@ static void ilk_compute_wm_level(const s }
static void intel_read_wm_latency(struct drm_i915_private *dev_priv, - u16 wm[8]) + u16 wm[]) { struct intel_uncore *uncore = &dev_priv->uncore;
From: Tadeusz Struk tadeusz.struk@linaro.org
commit 64ba4b15e5c045f8b746c6da5fc9be9a6b00b61d upstream.
Syzbot reported a slab-out-of-bounds read in exfat_clear_bitmap. This was triggered by a reproducer calling truncate with size 0, which causes the following trace:
BUG: KASAN: slab-out-of-bounds in exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174
Read of size 8 at addr ffff888115aa9508 by task syz-executor251/365
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack_lvl+0x1e2/0x24b lib/dump_stack.c:118
 print_address_description+0x81/0x3c0 mm/kasan/report.c:233
 __kasan_report mm/kasan/report.c:419 [inline]
 kasan_report+0x1a4/0x1f0 mm/kasan/report.c:436
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:309
 exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174
 exfat_free_cluster+0x25a/0x4a0 fs/exfat/fatent.c:181
 __exfat_truncate+0x99e/0xe00 fs/exfat/file.c:217
 exfat_truncate+0x11b/0x4f0 fs/exfat/file.c:243
 exfat_setattr+0xa03/0xd40 fs/exfat/file.c:339
 notify_change+0xb76/0xe10 fs/attr.c:336
 do_truncate+0x1ea/0x2d0 fs/open.c:65
Move the is_valid_cluster() helper from fatent.c to a common header to make it reusable in other *.c files. And add is_valid_cluster() to validate that the cluster number is within the valid range in exfat_clear_bitmap() and exfat_set_bitmap().
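The check itself is small; a self-contained sketch of the same range test (EXFAT_FIRST_CLUSTER is 2, clusters 0 and 1 being reserved in the on-disk format; the struct here is a stand-in for exfat_sb_info):

#include <stdbool.h>
#include <stdio.h>

#define EXFAT_FIRST_CLUSTER 2

struct sb_info { unsigned int num_clusters; };

/* Mirrors the helper the patch moves into exfat_fs.h: a cluster number is
 * only valid if it lies inside [EXFAT_FIRST_CLUSTER, num_clusters). */
static bool is_valid_cluster(const struct sb_info *sbi, unsigned int clus)
{
	return clus >= EXFAT_FIRST_CLUSTER && clus < sbi->num_clusters;
}

int main(void)
{
	struct sb_info sbi = { .num_clusters = 1000 };

	printf("%d %d %d\n",
	       is_valid_cluster(&sbi, 0),	/* 0: reserved cluster */
	       is_valid_cluster(&sbi, 2),	/* 1: ok */
	       is_valid_cluster(&sbi, 1000));	/* 0: beyond the bitmap */
	return 0;
}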
Link: https://syzkaller.appspot.com/bug?id=50381fc73821ecae743b8cf24b4c9a04776f767... Reported-by: syzbot+a4087e40b9c13aad7892@syzkaller.appspotmail.com Fixes: 1e49a94cf707 ("exfat: add bitmap operations") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Tadeusz Struk tadeusz.struk@linaro.org Reviewed-by: Sungjong Seo sj1557.seo@samsung.com Signed-off-by: Namjae Jeon linkinjeon@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/exfat/balloc.c | 8 ++++++-- fs/exfat/exfat_fs.h | 6 ++++++ fs/exfat/fatent.c | 6 ------ 3 files changed, 12 insertions(+), 8 deletions(-)
--- a/fs/exfat/balloc.c +++ b/fs/exfat/balloc.c @@ -148,7 +148,9 @@ int exfat_set_bitmap(struct inode *inode struct super_block *sb = inode->i_sb; struct exfat_sb_info *sbi = EXFAT_SB(sb);
- WARN_ON(clu < EXFAT_FIRST_CLUSTER); + if (!is_valid_cluster(sbi, clu)) + return -EINVAL; + ent_idx = CLUSTER_TO_BITMAP_ENT(clu); i = BITMAP_OFFSET_SECTOR_INDEX(sb, ent_idx); b = BITMAP_OFFSET_BIT_IN_SECTOR(sb, ent_idx); @@ -166,7 +168,9 @@ void exfat_clear_bitmap(struct inode *in struct exfat_sb_info *sbi = EXFAT_SB(sb); struct exfat_mount_options *opts = &sbi->options;
- WARN_ON(clu < EXFAT_FIRST_CLUSTER); + if (!is_valid_cluster(sbi, clu)) + return; + ent_idx = CLUSTER_TO_BITMAP_ENT(clu); i = BITMAP_OFFSET_SECTOR_INDEX(sb, ent_idx); b = BITMAP_OFFSET_BIT_IN_SECTOR(sb, ent_idx); --- a/fs/exfat/exfat_fs.h +++ b/fs/exfat/exfat_fs.h @@ -381,6 +381,12 @@ static inline int exfat_sector_to_cluste EXFAT_RESERVED_CLUSTERS; }
+static inline bool is_valid_cluster(struct exfat_sb_info *sbi, + unsigned int clus) +{ + return clus >= EXFAT_FIRST_CLUSTER && clus < sbi->num_clusters; +} + /* super.c */ int exfat_set_volume_dirty(struct super_block *sb); int exfat_clear_volume_dirty(struct super_block *sb); --- a/fs/exfat/fatent.c +++ b/fs/exfat/fatent.c @@ -81,12 +81,6 @@ int exfat_ent_set(struct super_block *sb return 0; }
-static inline bool is_valid_cluster(struct exfat_sb_info *sbi, - unsigned int clus) -{ - return clus >= EXFAT_FIRST_CLUSTER && clus < sbi->num_clusters; -} - int exfat_ent_get(struct super_block *sb, unsigned int loc, unsigned int *content) {
From: Yuezhang Mo Yuezhang.Mo@sony.com
commit d8dad2588addd1d861ce19e7df3b702330f0c7e3 upstream.
During renaming, the parent directory information may be updated. But the file/directory still references the old parent directory information.
This bug will cause 2 problems.
(1) The renamed file can not be written.
[10768.175172] exFAT-fs (sda1): error, failed to bmap (inode : 7afd50e4 iblock : 0, err : -5)
[10768.184285] exFAT-fs (sda1): Filesystem has been set read-only
ash: write error: Input/output error
(2) Some dentries of the renamed file/directory are not set to deleted after removing the file/directory.
exfat_update_parent_info() is a workaround for the wrong parent directory information being used after renaming. Now that bug is fixed, this is no longer needed, so remove it.
Fixes: 5f2aa075070c ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Yuezhang Mo Yuezhang.Mo@sony.com Reviewed-by: Andy Wu Andy.Wu@sony.com Reviewed-by: Aoyama Wataru wataru.aoyama@sony.com Reviewed-by: Daniel Palmer daniel.palmer@sony.com Reviewed-by: Sungjong Seo sj1557.seo@samsung.com Signed-off-by: Namjae Jeon linkinjeon@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/exfat/namei.c | 27 +-------------------------- 1 file changed, 1 insertion(+), 26 deletions(-)
--- a/fs/exfat/namei.c +++ b/fs/exfat/namei.c @@ -1080,6 +1080,7 @@ static int exfat_rename_file(struct inod
exfat_remove_entries(inode, p_dir, oldentry, 0, num_old_entries); + ei->dir = *p_dir; ei->entry = newentry; } else { if (exfat_get_entry_type(epold) == TYPE_FILE) { @@ -1167,28 +1168,6 @@ static int exfat_move_file(struct inode return 0; }
-static void exfat_update_parent_info(struct exfat_inode_info *ei, - struct inode *parent_inode) -{ - struct exfat_sb_info *sbi = EXFAT_SB(parent_inode->i_sb); - struct exfat_inode_info *parent_ei = EXFAT_I(parent_inode); - loff_t parent_isize = i_size_read(parent_inode); - - /* - * the problem that struct exfat_inode_info caches wrong parent info. - * - * because of flag-mismatch of ei->dir, - * there is abnormal traversing cluster chain. - */ - if (unlikely(parent_ei->flags != ei->dir.flags || - parent_isize != EXFAT_CLU_TO_B(ei->dir.size, sbi) || - parent_ei->start_clu != ei->dir.dir)) { - exfat_chain_set(&ei->dir, parent_ei->start_clu, - EXFAT_B_TO_CLU_ROUND_UP(parent_isize, sbi), - parent_ei->flags); - } -} - /* rename or move a old file into a new file */ static int __exfat_rename(struct inode *old_parent_inode, struct exfat_inode_info *ei, struct inode *new_parent_inode, @@ -1219,8 +1198,6 @@ static int __exfat_rename(struct inode * return -ENOENT; }
- exfat_update_parent_info(ei, old_parent_inode); - exfat_chain_dup(&olddir, &ei->dir); dentry = ei->entry;
@@ -1241,8 +1218,6 @@ static int __exfat_rename(struct inode * goto out; }
- exfat_update_parent_info(new_ei, new_parent_inode); - p_dir = &(new_ei->dir); new_entry = new_ei->entry; ep = exfat_get_dentry(sb, p_dir, new_entry, &new_bh);
From: Phil Sutter phil@nwl.cc
commit 558254b0b602b8605d7246a10cfeb584b1fcabfc upstream.
When cloning a packet-based limit expression, copy the cost value as well. Otherwise the new limit is not functional anymore.
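A trivial userspace sketch of what the clone callback must do (hypothetical struct, not the real nft_limit_priv_pkts layout):

#include <stdio.h>

/* 'cost' is derived from the configured rate at init time. A clone that
 * copies only the configured fields but not 'cost' produces a limit that
 * never charges tokens, i.e. one that no longer limits anything. */
struct limit_priv { unsigned long rate, burst, cost; };

static void limit_clone(struct limit_priv *dst, const struct limit_priv *src)
{
	dst->rate  = src->rate;
	dst->burst = src->burst;
	dst->cost  = src->cost;		/* the line the fix adds */
}

int main(void)
{
	struct limit_priv src = { .rate = 10, .burst = 5, .cost = 128 };
	struct limit_priv dst = { 0 };

	limit_clone(&dst, &src);
	printf("cloned cost=%lu\n", dst.cost);	/* 128: limit stays functional */
	return 0;
}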
Fixes: 3b9e2ea6c11bf ("netfilter: nft_limit: move stateful fields out of expression data") Signed-off-by: Phil Sutter phil@nwl.cc Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netfilter/nft_limit.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/net/netfilter/nft_limit.c +++ b/net/netfilter/nft_limit.c @@ -213,6 +213,8 @@ static int nft_limit_pkts_clone(struct n struct nft_limit_priv_pkts *priv_dst = nft_expr_priv(dst); struct nft_limit_priv_pkts *priv_src = nft_expr_priv(src);
+ priv_dst->cost = priv_src->cost; + return nft_limit_clone(&priv_dst->limit, &priv_src->limit); }
From: Pablo Neira Ayuso pablo@netfilter.org
commit fecf31ee395b0295f2d7260aa29946b7605f7c85 upstream.
Add several sanity checks for nft_set_desc_concat_parse():
- validate desc->field_count not larger than desc->field_len array.
- field length cannot be larger than desc->field_len (ie. U8_MAX)
- total length of the concatenation cannot be larger than register array.
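A userspace sketch of these three checks, condensed into one parse step for illustration (FIELD_MAX and the register count below are illustrative stand-ins, not the kernel's exact constants; in the kernel the total-register check runs after all fields are parsed):

#include <errno.h>
#include <stdio.h>

#define U8_MAX		255u
#define REG32_COUNT	20u	/* illustrative stand-in for NFT_REG32_COUNT */
#define FIELD_MAX	16u	/* assumed size of desc->field_len[] */

struct set_desc {
	unsigned int field_count;
	unsigned int field_len[FIELD_MAX];
};

static int desc_concat_parse(struct set_desc *desc, unsigned int len)
{
	unsigned int num_regs = 0, i;

	if (desc->field_count >= FIELD_MAX)	/* 1: no room for another field */
		return -E2BIG;
	if (!len || len > U8_MAX)		/* 2: per-field length bound */
		return -EINVAL;

	desc->field_len[desc->field_count++] = len;

	for (i = 0; i < desc->field_count; i++)	/* 3: total register usage */
		num_regs += (desc->field_len[i] + 3) / 4;	/* DIV_ROUND_UP */
	if (num_regs > REG32_COUNT)
		return -E2BIG;
	return 0;
}

int main(void)
{
	struct set_desc desc = { 0 };

	printf("%d\n", desc_concat_parse(&desc, 16));	/* 0 */
	printf("%d\n", desc_concat_parse(&desc, 300));	/* -22 (-EINVAL) */
	return 0;
}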
Joint work with Florian Westphal.
Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Reported-by: zhangziming.zzm@antgroup.com Reviewed-by: Stefano Brivio sbrivio@redhat.com Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netfilter/nf_tables_api.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)
--- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -4246,6 +4246,9 @@ static int nft_set_desc_concat_parse(con u32 len; int err;
+ if (desc->field_count >= ARRAY_SIZE(desc->field_len)) + return -E2BIG; + err = nla_parse_nested_deprecated(tb, NFTA_SET_FIELD_MAX, attr, nft_concat_policy, NULL); if (err < 0) @@ -4255,9 +4258,8 @@ static int nft_set_desc_concat_parse(con return -EINVAL;
len = ntohl(nla_get_be32(tb[NFTA_SET_FIELD_LEN])); - - if (len * BITS_PER_BYTE / 32 > NFT_REG32_COUNT) - return -E2BIG; + if (!len || len > U8_MAX) + return -EINVAL;
desc->field_len[desc->field_count++] = len;
@@ -4268,7 +4270,8 @@ static int nft_set_desc_concat(struct nf const struct nlattr *nla) { struct nlattr *attr; - int rem, err; + u32 num_regs = 0; + int rem, err, i;
nla_for_each_nested(attr, nla, rem) { if (nla_type(attr) != NFTA_LIST_ELEM) @@ -4279,6 +4282,12 @@ static int nft_set_desc_concat(struct nf return err; }
+ for (i = 0; i < desc->field_count; i++) + num_regs += DIV_ROUND_UP(desc->field_len[i], sizeof(u32)); + + if (num_regs > NFT_REG32_COUNT) + return -E2BIG; + return 0; }
From: Pablo Neira Ayuso pablo@netfilter.org
commit 3923b1e4406680d57da7e873da77b1683035d83f upstream.
clean_net() runs in a workqueue while walking over the lists, so grab the mutex.
Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netfilter/nf_tables_api.c | 4 ++++ 1 file changed, 4 insertions(+)
--- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -9892,7 +9892,11 @@ static int __net_init nf_tables_init_net
static void __net_exit nf_tables_pre_exit_net(struct net *net) { + struct nftables_pernet *nft_net = nft_pernet(net); + + mutex_lock(&nft_net->commit_mutex); __nft_release_hooks(net); + mutex_unlock(&nft_net->commit_mutex); }
static void __net_exit nf_tables_exit_net(struct net *net)
From: Pablo Neira Ayuso pablo@netfilter.org
commit f9a43007d3f7ba76d5e7f9421094f00f2ef202f8 upstream.
__nft_release_hooks() is called from pre_netns exit path which unregisters the hooks, then the NETDEV_UNREGISTER event is triggered which unregisters the hooks again.
[ 565.221461] WARNING: CPU: 18 PID: 193 at net/netfilter/core.c:495 __nf_unregister_net_hook+0x247/0x270
[...]
[ 565.246890] CPU: 18 PID: 193 Comm: kworker/u64:1 Tainted: G E 5.18.0-rc7+ #27
[ 565.253682] Workqueue: netns cleanup_net
[ 565.257059] RIP: 0010:__nf_unregister_net_hook+0x247/0x270
[...]
[ 565.297120] Call Trace:
[ 565.300900]  <TASK>
[ 565.304683]  nf_tables_flowtable_event+0x16a/0x220 [nf_tables]
[ 565.308518]  raw_notifier_call_chain+0x63/0x80
[ 565.312386]  unregister_netdevice_many+0x54f/0xb50
Unregister and destroy the netdev hook from the netns pre_exit path via kfree_rcu so the NETDEV_UNREGISTER path sees unregistered hooks.
Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/netfilter/nf_tables_api.c | 54 +++++++++++++++++++++++++++++++----------- 1 file changed, 41 insertions(+), 13 deletions(-)
--- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -222,12 +222,18 @@ err_register: }
static void nft_netdev_unregister_hooks(struct net *net, - struct list_head *hook_list) + struct list_head *hook_list, + bool release_netdev) { - struct nft_hook *hook; + struct nft_hook *hook, *next;
- list_for_each_entry(hook, hook_list, list) + list_for_each_entry_safe(hook, next, hook_list, list) { nf_unregister_net_hook(net, &hook->ops); + if (release_netdev) { + list_del(&hook->list); + kfree_rcu(hook, rcu); + } + } }
static int nf_tables_register_hook(struct net *net, @@ -253,9 +259,10 @@ static int nf_tables_register_hook(struc return nf_register_net_hook(net, &basechain->ops); }
-static void nf_tables_unregister_hook(struct net *net, - const struct nft_table *table, - struct nft_chain *chain) +static void __nf_tables_unregister_hook(struct net *net, + const struct nft_table *table, + struct nft_chain *chain, + bool release_netdev) { struct nft_base_chain *basechain; const struct nf_hook_ops *ops; @@ -270,11 +277,19 @@ static void nf_tables_unregister_hook(st return basechain->type->ops_unregister(net, ops);
if (nft_base_chain_netdev(table->family, basechain->ops.hooknum)) - nft_netdev_unregister_hooks(net, &basechain->hook_list); + nft_netdev_unregister_hooks(net, &basechain->hook_list, + release_netdev); else nf_unregister_net_hook(net, &basechain->ops); }
+static void nf_tables_unregister_hook(struct net *net, + const struct nft_table *table, + struct nft_chain *chain) +{ + return __nf_tables_unregister_hook(net, table, chain, false); +} + static void nft_trans_commit_list_add_tail(struct net *net, struct nft_trans *trans) { struct nftables_pernet *nft_net = nft_pernet(net); @@ -7301,13 +7316,25 @@ static void nft_unregister_flowtable_hoo FLOW_BLOCK_UNBIND); }
-static void nft_unregister_flowtable_net_hooks(struct net *net, - struct list_head *hook_list) +static void __nft_unregister_flowtable_net_hooks(struct net *net, + struct list_head *hook_list, + bool release_netdev) { - struct nft_hook *hook; + struct nft_hook *hook, *next;
- list_for_each_entry(hook, hook_list, list) + list_for_each_entry_safe(hook, next, hook_list, list) { nf_unregister_net_hook(net, &hook->ops); + if (release_netdev) { + list_del(&hook->list); + kfree_rcu(hook); + } + } +} + +static void nft_unregister_flowtable_net_hooks(struct net *net, + struct list_head *hook_list) +{ + __nft_unregister_flowtable_net_hooks(net, hook_list, false); }
static int nft_register_flowtable_net_hooks(struct net *net, @@ -9751,9 +9778,10 @@ static void __nft_release_hook(struct ne struct nft_chain *chain;
list_for_each_entry(chain, &table->chains, list) - nf_tables_unregister_hook(net, table, chain); + __nf_tables_unregister_hook(net, table, chain, true); list_for_each_entry(flowtable, &table->flowtables, list) - nft_unregister_flowtable_net_hooks(net, &flowtable->hook_list); + __nft_unregister_flowtable_net_hooks(net, &flowtable->hook_list, + true); }
static void __nft_release_hooks(struct net *net)
From: Florian Westphal fw@strlen.de
commit 56b14ecec97f39118bf85c9ac2438c5a949509ed upstream.
In case the conntrack is clashing, insertion can free skb->_nfct and set skb->_nfct to the already-confirmed entry.
This wasn't found before because the conntrack entry and the extension space used to be freed after an rcu grace period, plus the race needs events enabled to trigger.
Reported-by: syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Fixes: 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/net/netfilter/nf_conntrack_core.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
--- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -58,8 +58,13 @@ static inline int nf_conntrack_confirm(s int ret = NF_ACCEPT;
if (ct) { - if (!nf_ct_is_confirmed(ct)) + if (!nf_ct_is_confirmed(ct)) { ret = __nf_conntrack_confirm(skb); + + if (ret == NF_ACCEPT) + ct = (struct nf_conn *)skb_nfct(skb); + } + if (likely(ret == NF_ACCEPT)) nf_ct_deliver_cached_events(ct); }
From: Xiaomeng Tong xiam0nd.tong@gmail.com
commit 300981abddcb13f8f06ad58f52358b53a8096775 upstream.
The bug is here:

	if (!p)
		return ret;
The list iterator value 'p' will *always* be set and non-NULL by list_for_each_entry(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element is found.
To fix the bug, use a new variable 'iter' as the list iterator, and use the old variable 'p' as a dedicated pointer to the found element.
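The same pattern in a self-contained form, with plain pointers instead of list_head (the point being that with list_for_each_entry() the iterator itself is never NULL after the loop, so a dedicated "found" pointer is needed to report "not found"):

#include <stdio.h>

struct slot { unsigned long base, len; struct slot *next; };

/* 'p' is only ever set when a match is found, so a NULL return really
 * means "no slot covers this gfn". */
static struct slot *find_slot(struct slot *head, unsigned long gfn)
{
	struct slot *p = NULL, *iter;

	for (iter = head; iter; iter = iter->next) {
		if (gfn >= iter->base && gfn < iter->base + iter->len) {
			p = iter;
			break;
		}
	}
	return p;
}

int main(void)
{
	struct slot b = { .base = 100, .len = 10, .next = NULL };
	struct slot a = { .base = 0,   .len = 10, .next = &b };

	printf("%p\n", (void *)find_slot(&a, 105));	/* &b */
	printf("%p\n", (void *)find_slot(&a, 50));	/* (nil): not found */
	return 0;
}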
Fixes: dfaa973ae960 ("KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Xiaomeng Tong xiam0nd.tong@gmail.com Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20220414062103.8153-1-xiam0nd.tong@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/powerpc/kvm/book3s_hv_uvmem.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -361,13 +361,15 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsi static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot *memslot, struct kvm *kvm, unsigned long *gfn) { - struct kvmppc_uvmem_slot *p; + struct kvmppc_uvmem_slot *p = NULL, *iter; bool ret = false; unsigned long i;
- list_for_each_entry(p, &kvm->arch.uvmem_pfns, list) - if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns) + list_for_each_entry(iter, &kvm->arch.uvmem_pfns, list) + if (*gfn >= iter->base_pfn && *gfn < iter->base_pfn + iter->nr_pfns) { + p = iter; break; + } if (!p) return ret; /*
From: Sean Christopherson seanjc@google.com
commit d187ba5312307d51818beafaad87d28a7d939adf upstream.
Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave', i.e. to KVM's historical uABI size. When saving FPU state for userspace, KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if the host doesn't support XSAVE. Setting the XSAVE header allows the VM to be migrated to a host that does support XSAVE without the new host having to handle FPU state that may or may not be compatible with XSAVE.
Setting the uABI size to the host's default size results in out-of-bounds writes (setting the FP+SSE bits) and data corruption (that is thankfully caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.
WARN if the default size is larger than KVM's historical uABI size; all features that can push the FPU size beyond the historical size must be opt-in.
==================================================================
BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
Hardware name: /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x45
 print_report.cold+0x45/0x575
 kasan_report+0x9b/0xd0
 fpu_copy_uabi_to_guest_fpstate+0x86/0x130
 kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
 kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
 __x64_sys_ioctl+0x5de/0xc90
 do_syscall_64+0x31/0x50
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>
Allocated by task 0:
(stack is not available)
The buggy address belongs to the object at ffff888011e33800
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 0 bytes to the right of
 512-byte region [ffff888011e33800, ffff888011e33a00)
The buggy address belongs to the physical page:
page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                   ^
 ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: be50b2065dfa ("kvm: x86: Add support for getting/setting expanded xstate buffer") Fixes: c60427dd50ba ("x86/fpu: Add uabi_size to guest_fpu") Reported-by: Zdenek Kaspar zkaspar82@gmail.com Cc: Maciej S. Szmigiero mail@maciej.szmigiero.name Cc: Paolo Bonzini pbonzini@redhat.com Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Tested-by: Zdenek Kaspar zkaspar82@gmail.com Message-Id: 20220504001219.983513-1-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/fpu/core.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
--- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -14,6 +14,8 @@ #include <asm/traps.h> #include <asm/irq_regs.h>
+#include <uapi/asm/kvm.h> + #include <linux/hardirq.h> #include <linux/pkeys.h> #include <linux/vmalloc.h> @@ -232,7 +234,20 @@ bool fpu_alloc_guest_fpstate(struct fpu_ gfpu->fpstate = fpstate; gfpu->xfeatures = fpu_user_cfg.default_features; gfpu->perm = fpu_user_cfg.default_features; - gfpu->uabi_size = fpu_user_cfg.default_size; + + /* + * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state + * to userspace, even when XSAVE is unsupported, so that restoring FPU + * state on a different CPU that does support XSAVE can cleanly load + * the incoming state using its natural XSAVE. In other words, KVM's + * uABI size may be larger than this host's default size. Conversely, + * the default size should never be larger than KVM's base uABI size; + * all features that can expand the uABI size must be opt-in. + */ + gfpu->uabi_size = sizeof(struct kvm_xsave); + if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size)) + gfpu->uabi_size = fpu_user_cfg.default_size; + fpu_init_guest_permissions(gfpu);
return true;
From: Sean Christopherson seanjc@google.com
commit 0547758a6de3cc71a0cfdd031a3621a30db6a68b upstream.
Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the dummy async #PF token; the allocator is preemptible on PREEMPT_RT kernels and must not be called from truly atomic contexts.
Opportunistically document why it's ok to loop on allocation failure, i.e. why the function won't get stuck in an infinite loop.
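A userspace sketch of the resulting pattern (pthread spinlock and calloc standing in for the raw spinlock and kzalloc; the actual wakeup is omitted, and the types are hypothetical): drop the lock, allocate, retake the lock, and re-check the state because it may have changed in between.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int token; struct node *next; };

struct bucket {
	pthread_spinlock_t lock;
	struct node *head;
};

static struct node *find(struct bucket *b, int token)
{
	struct node *n;

	for (n = b->head; n; n = n->next)
		if (n->token == token)
			return n;
	return NULL;
}

static void task_wake(struct bucket *b, int token)
{
	struct node *n, *dummy = NULL;

again:
	pthread_spin_lock(&b->lock);
	n = find(b, token);
	if (!n) {
		if (!dummy) {
			pthread_spin_unlock(&b->lock);
			dummy = calloc(1, sizeof(*dummy)); /* may "sleep"; loop on failure */
			goto again;	/* re-check under the lock */
		}
		dummy->token = token;	/* still not handled: enqueue dummy */
		dummy->next = b->head;
		b->head = dummy;
		dummy = NULL;
	}
	pthread_spin_unlock(&b->lock);
	free(dummy);	/* allocated but unneeded after the re-check */
}

int main(void)
{
	struct bucket b = { .head = NULL };

	pthread_spin_init(&b.lock, PTHREAD_PROCESS_PRIVATE);
	task_wake(&b, 42);
	printf("queued token: %d\n", b.head->token);
	return 0;
}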
Reported-by: Yajun Deng yajun.deng@linux.dev Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/kvm.c | 41 +++++++++++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 14 deletions(-)
--- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -191,7 +191,7 @@ void kvm_async_pf_task_wake(u32 token) { u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS); struct kvm_task_sleep_head *b = &async_pf_sleepers[key]; - struct kvm_task_sleep_node *n; + struct kvm_task_sleep_node *n, *dummy = NULL;
if (token == ~0) { apf_task_wake_all(); @@ -203,28 +203,41 @@ again: n = _find_apf_task(b, token); if (!n) { /* - * async PF was not yet handled. - * Add dummy entry for the token. + * Async #PF not yet handled, add a dummy entry for the token. + * Allocating the token must be down outside of the raw lock + * as the allocator is preemptible on PREEMPT_RT kernels. */ - n = kzalloc(sizeof(*n), GFP_ATOMIC); - if (!n) { + if (!dummy) { + raw_spin_unlock(&b->lock); + dummy = kzalloc(sizeof(*dummy), GFP_KERNEL); + /* - * Allocation failed! Busy wait while other cpu - * handles async PF. + * Continue looping on allocation failure, eventually + * the async #PF will be handled and allocating a new + * node will be unnecessary. + */ + if (!dummy) + cpu_relax(); + + /* + * Recheck for async #PF completion before enqueueing + * the dummy token to avoid duplicate list entries. */ - raw_spin_unlock(&b->lock); - cpu_relax(); goto again; } - n->token = token; - n->cpu = smp_processor_id(); - init_swait_queue_head(&n->wq); - hlist_add_head(&n->link, &b->list); + dummy->token = token; + dummy->cpu = smp_processor_id(); + init_swait_queue_head(&dummy->wq); + hlist_add_head(&dummy->link, &b->list); + dummy = NULL; } else { apf_task_wake_one(n); } raw_spin_unlock(&b->lock); - return; + + /* A dummy token might be allocated and ultimately not used. */ + if (dummy) + kfree(dummy); } EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
From: Paolo Bonzini pbonzini@redhat.com
commit baec4f5a018fe2d708fc1022330dba04b38b5fe3 upstream.
Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of raw spinlock") leads to the following Smatch static checker warning:
arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake() warn: sleeping in atomic context
arch/x86/kernel/kvm.c
    202          raw_spin_lock(&b->lock);
    203          n = _find_apf_task(b, token);
    204          if (!n) {
    205                  /*
    206                   * Async #PF not yet handled, add a dummy entry for the token.
    207                   * Allocating the token must be down outside of the raw lock
    208                   * as the allocator is preemptible on PREEMPT_RT kernels.
    209                   */
    210                  if (!dummy) {
    211                          raw_spin_unlock(&b->lock);
--> 212                          dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
                                         ^^^^^^^^^^
Smatch thinks the caller has preempt disabled.

The `smdb.py preempt kvm_async_pf_task_wake` output call tree is:
sysvec_kvm_asyncpf_interrupt() <- disables preempt
-> __sysvec_kvm_asyncpf_interrupt()
   -> kvm_async_pf_task_wake()
The caller is this:
arch/x86/kernel/kvm.c
   290  DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
   291  {
   292          struct pt_regs *old_regs = set_irq_regs(regs);
   293          u32 token;
   294
   295          ack_APIC_irq();
   296
   297          inc_irq_stat(irq_hv_callback_count);
   298
   299          if (__this_cpu_read(apf_reason.enabled)) {
   300                  token = __this_cpu_read(apf_reason.token);
   301                  kvm_async_pf_task_wake(token);
   302                  __this_cpu_write(apf_reason.token, 0);
   303                  wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
   304          }
   305
   306          set_irq_regs(old_regs);
   307  }
The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function from the call_on_irqstack_cond(). It's inside the call_on_irqstack_cond() where preempt is disabled (unless it's already disabled). The irq_enter/exit_rcu() functions disable/enable preempt.
Reported-by: Dan Carpenter dan.carpenter@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -209,7 +209,7 @@ again: */ if (!dummy) { raw_spin_unlock(&b->lock); - dummy = kzalloc(sizeof(*dummy), GFP_KERNEL); + dummy = kzalloc(sizeof(*dummy), GFP_ATOMIC);
/* * Continue looping on allocation failure, eventually
From: Peter Zijlstra peterz@infradead.org
commit 989b5db215a2f22f89d730b607b071d964780f10 upstream.
Add support for CMPXCHG loops on userspace addresses. Provide both an "unsafe" version for tight loops that do their own uaccess begin/end, as well as a "safe" version for use cases where the CMPXCHG is not buried in a loop, e.g. KVM will resume the guest instead of looping when emulation of a guest atomic accesses fails the CMPXCHG.
Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on guest PAE PTEs, which are accessed via userspace addresses.
Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT, the "+m" constraint fails on some compilers that otherwise support CC_HAS_ASM_GOTO_OUTPUT.
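For reference, the two usage models in plain userspace C, using the __atomic builtins instead of the uaccess macros (illustration only, not the kernel implementation): a single "try once and report" attempt, which is how KVM resumes the guest on failure, versus a tight retry loop.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* On failure, *old is updated to the current value, mirroring how
 * __try_cmpxchg_user() refreshes the caller's old value. */
static bool try_cmpxchg64(uint64_t *ptr, uint64_t *old, uint64_t new)
{
	return __atomic_compare_exchange_n(ptr, old, new, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

static void set_bit_loop(uint64_t *ptr, uint64_t bit)
{
	uint64_t old = __atomic_load_n(ptr, __ATOMIC_RELAXED);

	while (!try_cmpxchg64(ptr, &old, old | bit))
		;	/* 'old' was refreshed, just retry */
}

int main(void)
{
	uint64_t pte = 0x100;
	uint64_t expected = 0;	/* stale guess: the single try fails */

	printf("single try: %d, pte=%#llx\n",
	       try_cmpxchg64(&pte, &expected, 0x160),
	       (unsigned long long)pte);
	set_bit_loop(&pte, 0x20);	/* e.g. setting an Accessed-style bit */
	printf("after loop:  pte=%#llx\n", (unsigned long long)pte);
	return 0;
}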
Cc: stable@vger.kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Co-developed-by: Sean Christopherson seanjc@google.com Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20220202004945.2540433-3-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/include/asm/uaccess.h | 142 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+)
--- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -382,6 +382,103 @@ do { \
#endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
+#ifdef CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT +#define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm_volatile_goto("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + _ASM_EXTABLE_UA(1b, %l[label]) \ + : CC_OUT(z) (success), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] ltype (__new) \ + : "memory" \ + : label); \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + +#ifdef CONFIG_X86_32 +#define __try_cmpxchg64_user_asm(_ptr, _pold, _new, label) ({ \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm_volatile_goto("\n" \ + "1: " LOCK_PREFIX "cmpxchg8b %[ptr]\n" \ + _ASM_EXTABLE_UA(1b, %l[label]) \ + : CC_OUT(z) (success), \ + "+A" (__old), \ + [ptr] "+m" (*_ptr) \ + : "b" ((u32)__new), \ + "c" ((u32)((u64)__new >> 32)) \ + : "memory" \ + : label); \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) +#endif // CONFIG_X86_32 +#else // !CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT +#define __try_cmpxchg_user_asm(itype, ltype, _ptr, _pold, _new, label) ({ \ + int __err = 0; \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm volatile("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + CC_SET(z) \ + "2:\n" \ + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, \ + %[errout]) \ + : CC_OUT(z) (success), \ + [errout] "+r" (__err), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] ltype (__new) \ + : "memory", "cc"); \ + if (unlikely(__err)) \ + goto label; \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + +#ifdef CONFIG_X86_32 +/* + * Unlike the normal CMPXCHG, hardcode ECX for both success/fail and error. + * There are only six GPRs available and four (EAX, EBX, ECX, and EDX) are + * hardcoded by CMPXCHG8B, leaving only ESI and EDI. If the compiler uses + * both ESI and EDI for the memory operand, compilation will fail if the error + * is an input+output as there will be no register available for input. + */ +#define __try_cmpxchg64_user_asm(_ptr, _pold, _new, label) ({ \ + int __result; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm volatile("\n" \ + "1: " LOCK_PREFIX "cmpxchg8b %[ptr]\n" \ + "mov $0, %%ecx\n\t" \ + "setz %%cl\n" \ + "2:\n" \ + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %%ecx) \ + : [result]"=c" (__result), \ + "+A" (__old), \ + [ptr] "+m" (*_ptr) \ + : "b" ((u32)__new), \ + "c" ((u32)((u64)__new >> 32)) \ + : "memory", "cc"); \ + if (unlikely(__result < 0)) \ + goto label; \ + if (unlikely(!__result)) \ + *_old = __old; \ + likely(__result); }) +#endif // CONFIG_X86_32 +#endif // CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT + /* FIXME: this hack is definitely wrong -AK */ struct __large_struct { unsigned long buf[100]; }; #define __m(x) (*(struct __large_struct __user *)(x)) @@ -474,6 +571,51 @@ do { \ } while (0) #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
+extern void __try_cmpxchg_user_wrong_size(void); + +#ifndef CONFIG_X86_32 +#define __try_cmpxchg64_user_asm(_ptr, _oldp, _nval, _label) \ + __try_cmpxchg_user_asm("q", "r", (_ptr), (_oldp), (_nval), _label) +#endif + +/* + * Force the pointer to u<size> to match the size expected by the asm helper. + * clang/LLVM compiles all cases and only discards the unused paths after + * processing errors, which breaks i386 if the pointer is an 8-byte value. + */ +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ + bool __ret; \ + __chk_user_ptr(_ptr); \ + switch (sizeof(*(_ptr))) { \ + case 1: __ret = __try_cmpxchg_user_asm("b", "q", \ + (__force u8 *)(_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 2: __ret = __try_cmpxchg_user_asm("w", "r", \ + (__force u16 *)(_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 4: __ret = __try_cmpxchg_user_asm("l", "r", \ + (__force u32 *)(_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 8: __ret = __try_cmpxchg64_user_asm((__force u64 *)(_ptr), (_oldp),\ + (_nval), _label); \ + break; \ + default: __try_cmpxchg_user_wrong_size(); \ + } \ + __ret; }) + +/* "Returns" 0 on success, 1 on failure, -EFAULT if the access faults. */ +#define __try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ + int __ret = -EFAULT; \ + __uaccess_begin_nospec(); \ + __ret = !unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label); \ +_label: \ + __uaccess_end(); \ + __ret; \ + }) + /* * We want the unsafe accessors to always be inlined and use * the error labels - thus the macro games.
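As an illustration of the try_cmpxchg calling convention used above, here is a minimal user-space sketch; it relies on the GCC/Clang __atomic builtin rather than the kernel's inline asm and exception tables, so the -EFAULT path is not modeled, only the success/failure result and the write-back of the old value.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Analogue of unsafe_try_cmpxchg_user(): returns true on success and, on
 * failure, updates *old with the value currently stored at *ptr. */
static bool try_cmpxchg_u32(uint32_t *ptr, uint32_t *old, uint32_t new)
{
	return __atomic_compare_exchange_n(ptr, old, new, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

int main(void)
{
	uint32_t val = 5, expected = 4;

	if (!try_cmpxchg_u32(&val, &expected, 9))
		printf("failed, current value is %u\n", expected);	/* 5 */

	expected = 5;
	if (try_cmpxchg_u32(&val, &expected, 9))
		printf("succeeded, val is now %u\n", val);		/* 9 */
	return 0;
}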
From: Sean Christopherson seanjc@google.com
commit f122dfe4476890d60b8c679128cd2259ec96a24c upstream.
Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D bits instead of mapping the PTE into kernel address space. The VM_PFNMAP path is broken as it assumes that vm_pgoff is the base pfn of the mapped VMA range, which is conceptually wrong as vm_pgoff is the offset relative to the file and has nothing to do with the pfn. The horrific hack worked for the original use case (backing guest memory with /dev/mem), but leads to accessing "random" pfns for pretty much any other VM_PFNMAP case.
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Debugged-by: Tadeusz Struk tadeusz.struk@linaro.org Tested-by: Tadeusz Struk tadeusz.struk@linaro.org Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20220202004945.2540433-4-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/mmu/paging_tmpl.h | 38 +------------------------------------- 1 file changed, 1 insertion(+), 37 deletions(-)
--- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -144,42 +144,6 @@ static bool FNAME(is_rsvd_bits_set)(stru FNAME(is_bad_mt_xwr)(&mmu->guest_rsvd_check, gpte); }
-static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - pt_element_t __user *ptep_user, unsigned index, - pt_element_t orig_pte, pt_element_t new_pte) -{ - signed char r; - - if (!user_access_begin(ptep_user, sizeof(pt_element_t))) - return -EFAULT; - -#ifdef CMPXCHG - asm volatile("1:" LOCK_PREFIX CMPXCHG " %[new], %[ptr]\n" - "setnz %b[r]\n" - "2:" - _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r]) - : [ptr] "+m" (*ptep_user), - [old] "+a" (orig_pte), - [r] "=q" (r) - : [new] "r" (new_pte) - : "memory"); -#else - asm volatile("1:" LOCK_PREFIX "cmpxchg8b %[ptr]\n" - "setnz %b[r]\n" - "2:" - _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r]) - : [ptr] "+m" (*ptep_user), - [old] "+A" (orig_pte), - [r] "=q" (r) - : [new_lo] "b" ((u32)new_pte), - [new_hi] "c" ((u32)(new_pte >> 32)) - : "memory"); -#endif - - user_access_end(); - return r; -} - static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *spte, u64 gpte) @@ -278,7 +242,7 @@ static int FNAME(update_accessed_dirty_b if (unlikely(!walker->pte_writable[level - 1])) continue;
- ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); + ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault); if (ret) return ret;
From: Sean Christopherson seanjc@google.com
commit 1c2361f667f3648855ceae25f1332c18413fdb9f upstream.
Use the recently introduced __try_cmpxchg_user() to emulate atomic guest accesses via the associated userspace address instead of mapping the backing pfn into kernel address space. Using kvm_vcpu_map() is unsafe as it does not coordinate with KVM's mmu_notifier to ensure the hva=>pfn translation isn't changed/unmapped in the memremap() path, i.e. when there's no struct page and thus no elevated refcount.
Fixes: 42e35f8072c3 ("KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20220202004945.2540433-5-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/x86.c | 35 ++++++++++++++--------------------- 1 file changed, 14 insertions(+), 21 deletions(-)
--- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7229,15 +7229,8 @@ static int emulator_write_emulated(struc exception, &write_emultor); }
-#define CMPXCHG_TYPE(t, ptr, old, new) \ - (cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old)) - -#ifdef CONFIG_X86_64 -# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new) -#else -# define CMPXCHG64(ptr, old, new) \ - (cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old)) -#endif +#define emulator_try_cmpxchg_user(t, ptr, old, new) \ + (__try_cmpxchg_user((t __user *)(ptr), (t *)(old), *(t *)(new), efault ## t))
static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt, unsigned long addr, @@ -7246,12 +7239,11 @@ static int emulator_cmpxchg_emulated(str unsigned int bytes, struct x86_exception *exception) { - struct kvm_host_map map; struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); u64 page_line_mask; + unsigned long hva; gpa_t gpa; - char *kaddr; - bool exchanged; + int r;
/* guests cmpxchg8b have to be emulated atomically */ if (bytes > 8 || (bytes & (bytes - 1))) @@ -7275,31 +7267,32 @@ static int emulator_cmpxchg_emulated(str if (((gpa + bytes - 1) & page_line_mask) != (gpa & page_line_mask)) goto emul_write;
- if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map)) + hva = kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa)); + if (kvm_is_error_hva(addr)) goto emul_write;
- kaddr = map.hva + offset_in_page(gpa); + hva += offset_in_page(gpa);
switch (bytes) { case 1: - exchanged = CMPXCHG_TYPE(u8, kaddr, old, new); + r = emulator_try_cmpxchg_user(u8, hva, old, new); break; case 2: - exchanged = CMPXCHG_TYPE(u16, kaddr, old, new); + r = emulator_try_cmpxchg_user(u16, hva, old, new); break; case 4: - exchanged = CMPXCHG_TYPE(u32, kaddr, old, new); + r = emulator_try_cmpxchg_user(u32, hva, old, new); break; case 8: - exchanged = CMPXCHG64(kaddr, old, new); + r = emulator_try_cmpxchg_user(u64, hva, old, new); break; default: BUG(); }
- kvm_vcpu_unmap(vcpu, &map, true); - - if (!exchanged) + if (r < 0) + goto emul_write; + if (r) return X86EMUL_CMPXCHG_FAILED;
kvm_page_track_write(vcpu, gpa, new, bytes);
From: Maxim Levitsky mlevitsk@redhat.com
commit 33fbe6befa622c082f7d417896832856814bdde0 upstream.
This shows up as a TDP MMU leak when running nested. A non-working cmpxchg on L0 makes L1 install two different shadow pages under the same spte, and one of them is leaked.
Fixes: 1c2361f667f36 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses") Signed-off-by: Maxim Levitsky mlevitsk@redhat.com Message-Id: 20220512101420.306759-1-mlevitsk@redhat.com Reviewed-by: Sean Christopherson seanjc@google.com Reviewed-by: Vitaly Kuznetsov vkuznets@redhat.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7268,7 +7268,7 @@ static int emulator_cmpxchg_emulated(str goto emul_write;
hva = kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa)); - if (kvm_is_error_hva(addr)) + if (kvm_is_error_hva(hva)) goto emul_write;
hva += offset_in_page(gpa);
From: Sean Christopherson seanjc@google.com
commit fee060cd52d69c114b62d1a2948ea9648b5131f9 upstream.
Whenever x86_decode_emulated_instruction() detects a breakpoint, it returns the value that kvm_vcpu_check_breakpoint() writes into its pass-by-reference second argument. Unfortunately this is completely bogus because the expected outcome of x86_decode_emulated_instruction is an EMULATION_* value.
Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK and x86_emulate_instruction() is called without having decoded the instruction. This causes various havoc from running with a stale emulation context.
The fix is to move the call to kvm_vcpu_check_breakpoint() where it was before commit 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") introduced x86_decode_emulated_instruction(). The other caller of the function does not need breakpoint checks, because it is invoked as part of a vmexit and the processor has already checked those before executing the instruction that #GP'd.
This fixes CVE-2022-1852.
Reported-by: Qiuhao Li qiuhao@sysec.org Reported-by: Gaoning Pan pgn@zju.edu.cn Reported-by: Yongkang Jia kangel@zju.edu.cn Fixes: 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20220311032801.3467418-2-seanjc@google.com [Rewrote commit message according to Qiuhao's report, since a patch already existed to fix the bug. - Paolo] Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/x86.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-)
--- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8244,7 +8244,7 @@ int kvm_skip_emulated_instruction(struct } EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
-static bool kvm_vcpu_check_breakpoint(struct kvm_vcpu *vcpu, int *r) +static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r) { if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) && (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) { @@ -8313,25 +8313,23 @@ static bool is_vmware_backdoor_opcode(st }
/* - * Decode to be emulated instruction. Return EMULATION_OK if success. + * Decode an instruction for emulation. The caller is responsible for handling + * code breakpoints. Note, manually detecting code breakpoints is unnecessary + * (and wrong) when emulating on an intercepted fault-like exception[*], as + * code breakpoints have higher priority and thus have already been done by + * hardware. + * + * [*] Except #MC, which is higher priority, but KVM should never emulate in + * response to a machine check. */ int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type, void *insn, int insn_len) { - int r = EMULATION_OK; struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; + int r;
init_emulate_ctxt(vcpu);
- /* - * We will reenter on the same instruction since we do not set - * complete_userspace_io. This does not handle watchpoints yet, - * those would be handled in the emulate_ops. - */ - if (!(emulation_type & EMULTYPE_SKIP) && - kvm_vcpu_check_breakpoint(vcpu, &r)) - return r; - r = x86_decode_insn(ctxt, insn, insn_len, emulation_type);
trace_kvm_emulate_insn_start(vcpu); @@ -8364,6 +8362,15 @@ int x86_emulate_instruction(struct kvm_v if (!(emulation_type & EMULTYPE_NO_DECODE)) { kvm_clear_exception_queue(vcpu);
+ /* + * Return immediately if RIP hits a code breakpoint, such #DBs + * are fault-like and are higher priority than any faults on + * the code fetch itself. + */ + if (!(emulation_type & EMULTYPE_SKIP) && + kvm_vcpu_check_code_breakpoint(vcpu, &r)) + return r; + r = x86_decode_emulated_instruction(vcpu, emulation_type, insn, insn_len); if (r != EMULATION_OK) {
From: Maxim Levitsky mlevitsk@redhat.com
commit 6fcee03df6a1a3101a77344be37bb85c6142d56c upstream.
This can cause various unexpected issues, since the VM is partially destroyed at that point.
For example, when AVIC is enabled, this causes avic_vcpu_load to access a physical id page entry that was already freed by .vm_destroy.
Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky mlevitsk@redhat.com Message-Id: 20220322172449.235575-2-mlevitsk@redhat.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/x86.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-)
--- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11747,20 +11747,15 @@ static void kvm_unload_vcpu_mmu(struct k vcpu_put(vcpu); }
-static void kvm_free_vcpus(struct kvm *kvm) +static void kvm_unload_vcpu_mmus(struct kvm *kvm) { unsigned long i; struct kvm_vcpu *vcpu;
- /* - * Unpin any mmu pages first. - */ kvm_for_each_vcpu(i, vcpu, kvm) { kvm_clear_async_pf_completion_queue(vcpu); kvm_unload_vcpu_mmu(vcpu); } - - kvm_destroy_vcpus(kvm); }
void kvm_arch_sync_events(struct kvm *kvm) @@ -11866,11 +11861,12 @@ void kvm_arch_destroy_vm(struct kvm *kvm __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0); mutex_unlock(&kvm->slots_lock); } + kvm_unload_vcpu_mmus(kvm); static_call_cond(kvm_x86_vm_destroy)(kvm); kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); kvm_pic_destroy(kvm); kvm_ioapic_destroy(kvm); - kvm_free_vcpus(kvm); + kvm_destroy_vcpus(kvm); kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); kvm_mmu_uninit_vm(kvm);
From: Yanfei Xu yanfei.xu@intel.com
commit ffd1925a596ce68bed7d81c61cb64bc35f788a9d upstream.
When the kernel handles a VM-exit caused by an external interrupt or NMI, it always sets kvm_intr_type to indicate whether it is dealing with an IRQ or an NMI. For the PMI scenario, it could be either.
However, intel_pt PMIs are only generated for HARDWARE perf events, and HARDWARE events are always configured to generate NMIs. Use kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI came from the guest; this avoids false positives if an intel_pt PMI/NMI arrives while the host is handling an unrelated IRQ VM-Exit.
Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI") Signed-off-by: Yanfei Xu yanfei.xu@intel.com Message-Id: 20220523140821.1345605-1-yanfei.xu@intel.com Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7856,7 +7856,7 @@ static unsigned int vmx_handle_intel_pt_ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
/* '0' on failure so that the !PT case can use a RET0 static call. */ - if (!kvm_arch_pmi_in_guest(vcpu)) + if (!vcpu || !kvm_handling_nmi_from_guest(vcpu)) return 0;
kvm_make_request(KVM_REQ_PMI, vcpu);
From: Sean Christopherson seanjc@google.com
commit 45846661d10422ce9e22da21f8277540b29eca22 upstream.
Remove WARNs that sanity check that KVM never lets a triple fault for L2 escape and incorrectly end up in L1. In normal operation, the sanity check is perfectly valid, but it incorrectly assumes that it's impossible for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through KVM_RUN (which guarantees kvm_check_nested_state() will see and handle the triple fault).
The WARN can currently be triggered if userspace injects a machine check while L2 is active and CR4.MCE=0. And a future fix to allow save/restore of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't lost on migration, will make it trivially easy for userspace to trigger the WARN.
Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is tempting, but wrong, especially if/when the request is saved/restored, e.g. if userspace restores events (including a triple fault) and then restores nested state (which may forcibly leave guest mode). Ignoring the fact that KVM doesn't currently provide the necessary APIs, it's userspace's responsibility to manage pending events during save/restore.
------------[ cut here ]------------ WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Modules linked in: kvm_intel kvm irqbypass CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Call Trace: <TASK> vmx_leave_nested+0x30/0x40 [kvm_intel] vmx_set_nested_state+0xca/0x3e0 [kvm_intel] kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm] kvm_vcpu_ioctl+0x4b9/0x660 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]---
Fixes: cb6a32c2b877 ("KVM: x86: Handle triple fault in L2 without killing L1") Cc: stable@vger.kernel.org Cc: Chenyi Qiang chenyi.qiang@intel.com Signed-off-by: Sean Christopherson seanjc@google.com Message-Id: 20220407002315.78092-2-seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/svm/nested.c | 3 --- arch/x86/kvm/vmx/nested.c | 3 --- 2 files changed, 6 deletions(-)
--- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -819,9 +819,6 @@ int nested_svm_vmexit(struct vcpu_svm *s struct kvm_host_map map; int rc;
- /* Triple faults in L2 should never escape. */ - WARN_ON_ONCE(kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)); - rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map); if (rc) { if (rc == -EINVAL) --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4518,9 +4518,6 @@ void nested_vmx_vmexit(struct kvm_vcpu * /* trying to cancel vmlaunch/vmresume is a bug */ WARN_ON_ONCE(vmx->nested.nested_run_pending);
- /* Similarly, triple faults in L2 should never escape. */ - WARN_ON_ONCE(kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)); - if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { /* * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
From: Hou Wenlong houwenlong.hwl@antgroup.com
commit 8d5678a76689acbf91245a3791fe853ab773090f upstream.
Before commit c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed"), the return value of kvm_sync_page() indicated whether the page was synced, and kvm_mmu_get_page() would rebuild the page when the sync failed. But now kvm_sync_page() returns false when the page is synced and no TLB flush is required, which leads kvm_mmu_get_page() to rebuild the page. So return the value of mmu->sync_page() directly and check it in kvm_mmu_get_page(). If the sync fails, the page will be zapped and the invalid_list will not be empty, so setting flush to true is acceptable in mmu_sync_children().
Cc: stable@vger.kernel.org Fixes: c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed") Signed-off-by: Hou Wenlong houwenlong.hwl@antgroup.com Acked-by: Lai Jiangshan jiangshanlai@gmail.com Message-Id: 0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com [mmu_sync_children should not flush if the page is zapped. - Paolo] Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/mmu/mmu.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
--- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1843,17 +1843,14 @@ static void kvm_mmu_commit_zap_page(stru &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \ if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
-static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, struct list_head *invalid_list) { int ret = vcpu->arch.mmu->sync_page(vcpu, sp);
- if (ret < 0) { + if (ret < 0) kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); - return false; - } - - return !!ret; + return ret; }
static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, @@ -1975,7 +1972,7 @@ static int mmu_sync_children(struct kvm_
for_each_sp(pages, sp, parents, i) { kvm_unlink_unsync_page(vcpu->kvm, sp); - flush |= kvm_sync_page(vcpu, sp, &invalid_list); + flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0; mmu_pages_clear_parents(&parents); } if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) { @@ -2016,6 +2013,7 @@ static struct kvm_mmu_page *kvm_mmu_get_ struct hlist_head *sp_list; unsigned quadrant; struct kvm_mmu_page *sp; + int ret; int collisions = 0; LIST_HEAD(invalid_list);
@@ -2068,11 +2066,13 @@ static struct kvm_mmu_page *kvm_mmu_get_ * If the sync fails, the page is zapped. If so, break * in order to rebuild it. */ - if (!kvm_sync_page(vcpu, sp, &invalid_list)) + ret = kvm_sync_page(vcpu, sp, &invalid_list); + if (ret < 0) break;
WARN_ON(!list_empty(&invalid_list)); - kvm_flush_remote_tlbs(vcpu->kvm); + if (ret > 0) + kvm_flush_remote_tlbs(vcpu->kvm); }
__clear_sp_write_flooding_count(sp);
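As a small illustration of the tri-state convention adopted here (negative means the page was zapped, zero means synced with no flush needed, positive means a remote TLB flush is required), the user-space sketch below uses a stand-in for mmu->sync_page() with made-up outcome values and shows why only the positive case may contribute to "flush".

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for mmu->sync_page(); the outcome values are hypothetical. */
static int sync_page(int outcome)
{
	return outcome;
}

int main(void)
{
	bool flush = false;
	int outcomes[] = { 0, -1, 1 };	/* synced, zapped, needs flush */

	for (int i = 0; i < 3; i++)
		flush |= sync_page(outcomes[i]) > 0;	/* -1 must not set flush */

	printf("flush = %d\n", flush);	/* 1, solely due to the positive case */
	return 0;
}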
From: Ashish Kalra ashish.kalra@amd.com
commit d22d2474e3953996f03528b84b7f52cc26a39403 upstream.
For some SEV ioctl interfaces, the length parameter that is passed may be less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data that the PSP firmware returns. In this case, kmalloc will allocate memory that is the size of the input rather than the size of the data. Since the PSP firmware doesn't fully overwrite the allocated buffer, these SEV ioctl interfaces may return uninitialized kernel slab memory.
Reported-by: Andy Nguyen theflow@google.com Suggested-by: David Rientjes rientjes@google.com Suggested-by: Peter Gonda pgonda@google.com Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: eaf78265a4ab3 ("KVM: SVM: Move SEV code to separate file") Fixes: 2c07ded06427d ("KVM: SVM: add support for SEV attestation command") Fixes: 4cfdd47d6d95a ("KVM: SVM: Add KVM_SEV SEND_START command") Fixes: d3d1af85e2c75 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command") Fixes: eba04b20e4861 ("KVM: x86: Account a variety of miscellaneous allocations") Signed-off-by: Ashish Kalra ashish.kalra@amd.com Reviewed-by: Peter Gonda pgonda@google.com Message-Id: 20220516154310.3685678-1-Ashish.Kalra@amd.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kvm/svm/sev.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
--- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -688,7 +688,7 @@ static int sev_launch_measure(struct kvm if (params.len > SEV_FW_BLOB_MAX_SIZE) return -EINVAL;
- blob = kmalloc(params.len, GFP_KERNEL_ACCOUNT); + blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT); if (!blob) return -ENOMEM;
@@ -808,7 +808,7 @@ static int __sev_dbg_decrypt_user(struct if (!IS_ALIGNED(dst_paddr, 16) || !IS_ALIGNED(paddr, 16) || !IS_ALIGNED(size, 16)) { - tpage = (void *)alloc_page(GFP_KERNEL); + tpage = (void *)alloc_page(GFP_KERNEL | __GFP_ZERO); if (!tpage) return -ENOMEM;
@@ -1094,7 +1094,7 @@ static int sev_get_attestation_report(st if (params.len > SEV_FW_BLOB_MAX_SIZE) return -EINVAL;
- blob = kmalloc(params.len, GFP_KERNEL_ACCOUNT); + blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT); if (!blob) return -ENOMEM;
@@ -1176,7 +1176,7 @@ static int sev_send_start(struct kvm *kv return -EINVAL;
/* allocate the memory to hold the session data blob */ - session_data = kmalloc(params.session_len, GFP_KERNEL_ACCOUNT); + session_data = kzalloc(params.session_len, GFP_KERNEL_ACCOUNT); if (!session_data) return -ENOMEM;
@@ -1300,11 +1300,11 @@ static int sev_send_update_data(struct k
/* allocate memory for header and transport buffer */ ret = -ENOMEM; - hdr = kmalloc(params.hdr_len, GFP_KERNEL_ACCOUNT); + hdr = kzalloc(params.hdr_len, GFP_KERNEL_ACCOUNT); if (!hdr) goto e_unpin;
- trans_data = kmalloc(params.trans_len, GFP_KERNEL_ACCOUNT); + trans_data = kzalloc(params.trans_len, GFP_KERNEL_ACCOUNT); if (!trans_data) goto e_free_hdr;
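To make the leak concrete, here is a user-space sketch of the same pattern under assumed lengths (USER_LEN and FW_LEN are made-up values): the producer writes fewer bytes than were allocated and the whole buffer is later copied back, so a zeroing allocator (calloc() standing in for kzalloc()) is what keeps stale memory from escaping.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define USER_LEN 64	/* hypothetical length requested by the caller */
#define FW_LEN   16	/* hypothetical length actually produced */

/* Stand-in for the PSP firmware: writes only the first FW_LEN bytes. */
static void fake_firmware(unsigned char *buf)
{
	memset(buf, 0xAB, FW_LEN);
}

int main(void)
{
	unsigned char *blob = calloc(1, USER_LEN);	/* kzalloc analogue */

	if (!blob)
		return 1;
	fake_firmware(blob);

	/* Bytes FW_LEN..USER_LEN-1 are guaranteed to be zero, so copying
	 * USER_LEN bytes back to the caller cannot leak stale memory. */
	for (int i = FW_LEN; i < USER_LEN; i++)
		if (blob[i] != 0)
			printf("unexpected non-zero byte at %d\n", i);

	free(blob);
	return 0;
}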
From: Fabio Estevam festevam@denx.de
commit 4ee4cdad368a26de3967f2975806a9ee2fa245df upstream.
Since commit 358ba762d9f1 ("crypto: caam - enable prediction resistance in HRWNG") the following CAAM errors can be seen on i.MX6SX:
caam_jr 2101000.jr: 20003c5b: CCB: desc idx 60: RNG: Hardware error hwrng: no data available
This error is due to an incorrect entropy delay for i.MX6SX.
Fix it by increasing the minimum entropy delay for i.MX6SX as done in U-Boot: https://patchwork.ozlabs.org/project/uboot/patch/20220415111049.2565744-1-ga...
As explained in the U-Boot patch:
"RNG self tests are run to determine the correct entropy delay. Such tests are executed with different voltages and temperatures to identify the worst case value for the entropy delay. For i.MX6SX, it was determined that after adding a margin value of 1000 the minimum entropy delay should be at least 12000."
Cc: stable@vger.kernel.org Fixes: 358ba762d9f1 ("crypto: caam - enable prediction resistance in HRWNG") Signed-off-by: Fabio Estevam festevam@denx.de Reviewed-by: Horia Geantă horia.geanta@nxp.com Reviewed-by: Vabhav Sharma vabhav.sharma@nxp.com Reviewed-by: Gaurav Jain gaurav.jain@nxp.com Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/crypto/caam/ctrl.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
--- a/drivers/crypto/caam/ctrl.c +++ b/drivers/crypto/caam/ctrl.c @@ -609,6 +609,13 @@ static bool check_version(struct fsl_mc_ } #endif
+static bool needs_entropy_delay_adjustment(void) +{ + if (of_machine_is_compatible("fsl,imx6sx")) + return true; + return false; +} + /* Probe routine for CAAM top (controller) level */ static int caam_probe(struct platform_device *pdev) { @@ -855,6 +862,8 @@ static int caam_probe(struct platform_de * Also, if a handle was instantiated, do not change * the TRNG parameters. */ + if (needs_entropy_delay_adjustment()) + ent_delay = 12000; if (!(ctrlpriv->rng4_sh_init || inst_handles)) { dev_info(dev, "Entropy delay = %u\n", @@ -871,6 +880,15 @@ static int caam_probe(struct platform_de */ ret = instantiate_rng(dev, inst_handles, gen_sk); + /* + * Entropy delay is determined via TRNG characterization. + * TRNG characterization is run across different voltages + * and temperatures. + * If worst case value for ent_dly is identified, + * the loop can be skipped for that platform. + */ + if (needs_entropy_delay_adjustment()) + break; if (ret == -EAGAIN) /* * if here, the loop will rerun,
From: Vitaly Chikunov vt@altlinux.org
commit 7cc7ab73f83ee6d50dc9536bc3355495d8600fad upstream.
Correctly compare values that shall be greater-or-equal and not just greater.
Fixes: 0d7a78643f69 ("crypto: ecrdsa - add EC-RDSA (GOST 34.10) algorithm") Cc: stable@vger.kernel.org Signed-off-by: Vitaly Chikunov vt@altlinux.org Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- crypto/ecrdsa.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/crypto/ecrdsa.c +++ b/crypto/ecrdsa.c @@ -113,15 +113,15 @@ static int ecrdsa_verify(struct akcipher
/* Step 1: verify that 0 < r < q, 0 < s < q */ if (vli_is_zero(r, ndigits) || - vli_cmp(r, ctx->curve->n, ndigits) == 1 || + vli_cmp(r, ctx->curve->n, ndigits) >= 0 || vli_is_zero(s, ndigits) || - vli_cmp(s, ctx->curve->n, ndigits) == 1) + vli_cmp(s, ctx->curve->n, ndigits) >= 0) return -EKEYREJECTED;
/* Step 2: calculate hash (h) of the message (passed as input) */ /* Step 3: calculate e = h \mod q */ vli_from_le64(e, digest, ndigits); - if (vli_cmp(e, ctx->curve->n, ndigits) == 1) + if (vli_cmp(e, ctx->curve->n, ndigits) >= 0) vli_sub(e, e, ctx->curve->n, ndigits); if (vli_is_zero(e, ndigits)) e[0] = 1; @@ -137,7 +137,7 @@ static int ecrdsa_verify(struct akcipher /* Step 6: calculate point C = z_1P + z_2Q, and R = x_c \mod q */ ecc_point_mult_shamir(&cc, z1, &ctx->curve->g, z2, &ctx->pub_key, ctx->curve); - if (vli_cmp(cc.x, ctx->curve->n, ndigits) == 1) + if (vli_cmp(cc.x, ctx->curve->n, ndigits) >= 0) vli_sub(cc.x, cc.x, ctx->curve->n, ndigits);
/* Step 7: if R == r signature is valid */
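A tiny sketch of why the comparison must be ">= 0" for the range check "0 < r < q": with the old "== 1" test a value exactly equal to the curve order would be accepted. cmp() below mimics the tri-state return of vli_cmp(); the numbers are made up.

#include <stdio.h>

/* Tri-state compare, like vli_cmp(): -1, 0 or 1. */
static int cmp(unsigned long a, unsigned long b)
{
	return (a > b) - (a < b);
}

int main(void)
{
	unsigned long q = 17;	/* hypothetical curve order */
	unsigned long r = 17;	/* boundary value r == q, must be rejected */

	int rejected_old = (r == 0) || (cmp(r, q) == 1);	/* 0: wrongly accepted */
	int rejected_new = (r == 0) || (cmp(r, q) >= 0);	/* 1: correctly rejected */

	printf("old check rejects: %d, new check rejects: %d\n",
	       rejected_old, rejected_new);
	return 0;
}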
From: Marco Chiappero marco.chiappero@intel.com
commit c690c7f6312ce69b426af08ae1da2b9e48a0246f upstream.
Change the VF2PF interrupt handler in the PF ISR and the definition of the internal PFVF API to correct the current implementation, which can result in missed interrupts.
More specifically, current HW generations treat a write to the mask register, regardless of the value, as an acknowledgement of any pending VF2PF interrupt. Therefore, if an interrupt arrives between the source register read and the mask register write, it will be silently acknowledged without ever being delivered, resulting in a lost VF2PF message.
To work around the problem, rather than disabling specific interrupts, disable all the interrupts and re-enable only the ones that we are not serving (excluding the already disabled ones too). This will force any other pending interrupt to be triggered and be serviced by a subsequent ISR.
This new approach requires, however, changes to the interrupt related pfvf_ops functions. In particular, get_vf2pf_sources() has now been removed in favor of disable_pending_vf2pf_interrupts(), which not only retrieves and returns the pending (and enabled) sources, but also disables them. As a consequence, introduce the adf_disable_pending_vf2pf_interrupts() utility in place of adf_disable_vf2pf_interrupts_irq(), which is no longer needed.
Cc: stable@vger.kernel.org Fixes: 993161d ("crypto: qat - fix handling of VF to PF interrupts") Signed-off-by: Marco Chiappero marco.chiappero@intel.com Co-developed-by: Giovanni Cabiddu giovanni.cabiddu@intel.com Signed-off-by: Giovanni Cabiddu giovanni.cabiddu@intel.com Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/crypto/qat/qat_common/adf_accel_devices.h | 2 drivers/crypto/qat/qat_common/adf_gen2_pfvf.c | 58 ++++++++---- drivers/crypto/qat/qat_common/adf_gen4_pfvf.c | 44 +++++++-- drivers/crypto/qat/qat_common/adf_isr.c | 17 +-- drivers/crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c | 76 +++++++++++------ 5 files changed, 132 insertions(+), 65 deletions(-)
--- a/drivers/crypto/qat/qat_common/adf_accel_devices.h +++ b/drivers/crypto/qat/qat_common/adf_accel_devices.h @@ -152,9 +152,9 @@ struct adf_pfvf_ops { int (*enable_comms)(struct adf_accel_dev *accel_dev); u32 (*get_pf2vf_offset)(u32 i); u32 (*get_vf2pf_offset)(u32 i); - u32 (*get_vf2pf_sources)(void __iomem *pmisc_addr); void (*enable_vf2pf_interrupts)(void __iomem *pmisc_addr, u32 vf_mask); void (*disable_vf2pf_interrupts)(void __iomem *pmisc_addr, u32 vf_mask); + u32 (*disable_pending_vf2pf_interrupts)(void __iomem *pmisc_addr); int (*send_msg)(struct adf_accel_dev *accel_dev, struct pfvf_message msg, u32 pfvf_offset, struct mutex *csr_lock); struct pfvf_message (*recv_msg)(struct adf_accel_dev *accel_dev, --- a/drivers/crypto/qat/qat_common/adf_gen2_pfvf.c +++ b/drivers/crypto/qat/qat_common/adf_gen2_pfvf.c @@ -13,6 +13,7 @@ #include "adf_pfvf_utils.h"
/* VF2PF interrupts */ +#define ADF_GEN2_VF_MSK 0xFFFF #define ADF_GEN2_ERR_REG_VF2PF(vf_src) (((vf_src) & 0x01FFFE00) >> 9) #define ADF_GEN2_ERR_MSK_VF2PF(vf_mask) (((vf_mask) & 0xFFFF) << 9)
@@ -50,23 +51,6 @@ static u32 adf_gen2_vf_get_pfvf_offset(u return ADF_GEN2_VF_PF2VF_OFFSET; }
-static u32 adf_gen2_get_vf2pf_sources(void __iomem *pmisc_addr) -{ - u32 errsou3, errmsk3, vf_int_mask; - - /* Get the interrupt sources triggered by VFs */ - errsou3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRSOU3); - vf_int_mask = ADF_GEN2_ERR_REG_VF2PF(errsou3); - - /* To avoid adding duplicate entries to work queue, clear - * vf_int_mask_sets bits that are already masked in ERRMSK register. - */ - errmsk3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRMSK3); - vf_int_mask &= ~ADF_GEN2_ERR_REG_VF2PF(errmsk3); - - return vf_int_mask; -} - static void adf_gen2_enable_vf2pf_interrupts(void __iomem *pmisc_addr, u32 vf_mask) { @@ -89,6 +73,44 @@ static void adf_gen2_disable_vf2pf_inter } }
+static u32 adf_gen2_disable_pending_vf2pf_interrupts(void __iomem *pmisc_addr) +{ + u32 sources, disabled, pending; + u32 errsou3, errmsk3; + + /* Get the interrupt sources triggered by VFs */ + errsou3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRSOU3); + sources = ADF_GEN2_ERR_REG_VF2PF(errsou3); + + if (!sources) + return 0; + + /* Get the already disabled interrupts */ + errmsk3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRMSK3); + disabled = ADF_GEN2_ERR_REG_VF2PF(errmsk3); + + pending = sources & ~disabled; + if (!pending) + return 0; + + /* Due to HW limitations, when disabling the interrupts, we can't + * just disable the requested sources, as this would lead to missed + * interrupts if ERRSOU3 changes just before writing to ERRMSK3. + * To work around it, disable all and re-enable only the sources that + * are not in vf_mask and were not already disabled. Re-enabling will + * trigger a new interrupt for the sources that have changed in the + * meantime, if any. + */ + errmsk3 |= ADF_GEN2_ERR_MSK_VF2PF(ADF_GEN2_VF_MSK); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK3, errmsk3); + + errmsk3 &= ADF_GEN2_ERR_MSK_VF2PF(sources | disabled); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK3, errmsk3); + + /* Return the sources of the (new) interrupt(s) */ + return pending; +} + static u32 gen2_csr_get_int_bit(enum gen2_csr_pos offset) { return ADF_PFVF_INT << offset; @@ -362,9 +384,9 @@ void adf_gen2_init_pf_pfvf_ops(struct ad pfvf_ops->enable_comms = adf_enable_pf2vf_comms; pfvf_ops->get_pf2vf_offset = adf_gen2_pf_get_pfvf_offset; pfvf_ops->get_vf2pf_offset = adf_gen2_pf_get_pfvf_offset; - pfvf_ops->get_vf2pf_sources = adf_gen2_get_vf2pf_sources; pfvf_ops->enable_vf2pf_interrupts = adf_gen2_enable_vf2pf_interrupts; pfvf_ops->disable_vf2pf_interrupts = adf_gen2_disable_vf2pf_interrupts; + pfvf_ops->disable_pending_vf2pf_interrupts = adf_gen2_disable_pending_vf2pf_interrupts; pfvf_ops->send_msg = adf_gen2_pf2vf_send; pfvf_ops->recv_msg = adf_gen2_vf2pf_recv; } --- a/drivers/crypto/qat/qat_common/adf_gen4_pfvf.c +++ b/drivers/crypto/qat/qat_common/adf_gen4_pfvf.c @@ -15,6 +15,7 @@ /* VF2PF interrupt source registers */ #define ADF_4XXX_VM2PF_SOU 0x41A180 #define ADF_4XXX_VM2PF_MSK 0x41A1C0 +#define ADF_GEN4_VF_MSK 0xFFFF
#define ADF_PFVF_GEN4_MSGTYPE_SHIFT 2 #define ADF_PFVF_GEN4_MSGTYPE_MASK 0x3F @@ -36,16 +37,6 @@ static u32 adf_gen4_pf_get_vf2pf_offset( return ADF_4XXX_VM2PF_OFFSET(i); }
-static u32 adf_gen4_get_vf2pf_sources(void __iomem *pmisc_addr) -{ - u32 sou, mask; - - sou = ADF_CSR_RD(pmisc_addr, ADF_4XXX_VM2PF_SOU); - mask = ADF_CSR_RD(pmisc_addr, ADF_4XXX_VM2PF_MSK); - - return sou & ~mask; -} - static void adf_gen4_enable_vf2pf_interrupts(void __iomem *pmisc_addr, u32 vf_mask) { @@ -64,6 +55,37 @@ static void adf_gen4_disable_vf2pf_inter ADF_CSR_WR(pmisc_addr, ADF_4XXX_VM2PF_MSK, val); }
+static u32 adf_gen4_disable_pending_vf2pf_interrupts(void __iomem *pmisc_addr) +{ + u32 sources, disabled, pending; + + /* Get the interrupt sources triggered by VFs */ + sources = ADF_CSR_RD(pmisc_addr, ADF_4XXX_VM2PF_SOU); + if (!sources) + return 0; + + /* Get the already disabled interrupts */ + disabled = ADF_CSR_RD(pmisc_addr, ADF_4XXX_VM2PF_MSK); + + pending = sources & ~disabled; + if (!pending) + return 0; + + /* Due to HW limitations, when disabling the interrupts, we can't + * just disable the requested sources, as this would lead to missed + * interrupts if VM2PF_SOU changes just before writing to VM2PF_MSK. + * To work around it, disable all and re-enable only the sources that + * are not in vf_mask and were not already disabled. Re-enabling will + * trigger a new interrupt for the sources that have changed in the + * meantime, if any. + */ + ADF_CSR_WR(pmisc_addr, ADF_4XXX_VM2PF_MSK, ADF_GEN4_VF_MSK); + ADF_CSR_WR(pmisc_addr, ADF_4XXX_VM2PF_MSK, disabled | sources); + + /* Return the sources of the (new) interrupt(s) */ + return pending; +} + static int adf_gen4_pfvf_send(struct adf_accel_dev *accel_dev, struct pfvf_message msg, u32 pfvf_offset, struct mutex *csr_lock) @@ -115,9 +137,9 @@ void adf_gen4_init_pf_pfvf_ops(struct ad pfvf_ops->enable_comms = adf_enable_pf2vf_comms; pfvf_ops->get_pf2vf_offset = adf_gen4_pf_get_pf2vf_offset; pfvf_ops->get_vf2pf_offset = adf_gen4_pf_get_vf2pf_offset; - pfvf_ops->get_vf2pf_sources = adf_gen4_get_vf2pf_sources; pfvf_ops->enable_vf2pf_interrupts = adf_gen4_enable_vf2pf_interrupts; pfvf_ops->disable_vf2pf_interrupts = adf_gen4_disable_vf2pf_interrupts; + pfvf_ops->disable_pending_vf2pf_interrupts = adf_gen4_disable_pending_vf2pf_interrupts; pfvf_ops->send_msg = adf_gen4_pfvf_send; pfvf_ops->recv_msg = adf_gen4_pfvf_recv; } --- a/drivers/crypto/qat/qat_common/adf_isr.c +++ b/drivers/crypto/qat/qat_common/adf_isr.c @@ -76,32 +76,29 @@ void adf_disable_vf2pf_interrupts(struct spin_unlock_irqrestore(&accel_dev->pf.vf2pf_ints_lock, flags); }
-static void adf_disable_vf2pf_interrupts_irq(struct adf_accel_dev *accel_dev, - u32 vf_mask) +static u32 adf_disable_pending_vf2pf_interrupts(struct adf_accel_dev *accel_dev) { void __iomem *pmisc_addr = adf_get_pmisc_base(accel_dev); + u32 pending;
spin_lock(&accel_dev->pf.vf2pf_ints_lock); - GET_PFVF_OPS(accel_dev)->disable_vf2pf_interrupts(pmisc_addr, vf_mask); + pending = GET_PFVF_OPS(accel_dev)->disable_pending_vf2pf_interrupts(pmisc_addr); spin_unlock(&accel_dev->pf.vf2pf_ints_lock); + + return pending; }
static bool adf_handle_vf2pf_int(struct adf_accel_dev *accel_dev) { - void __iomem *pmisc_addr = adf_get_pmisc_base(accel_dev); bool irq_handled = false; unsigned long vf_mask;
- /* Get the interrupt sources triggered by VFs */ - vf_mask = GET_PFVF_OPS(accel_dev)->get_vf2pf_sources(pmisc_addr); - + /* Get the interrupt sources triggered by VFs, except for those already disabled */ + vf_mask = adf_disable_pending_vf2pf_interrupts(accel_dev); if (vf_mask) { struct adf_accel_vf_info *vf_info; int i;
- /* Disable VF2PF interrupts for VFs with pending ints */ - adf_disable_vf2pf_interrupts_irq(accel_dev, vf_mask); - /* * Handle VF2PF interrupt unless the VF is malicious and * is attempting to flood the host OS with VF2PF interrupts. --- a/drivers/crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c +++ b/drivers/crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c @@ -7,6 +7,8 @@ #include "adf_dh895xcc_hw_data.h" #include "icp_qat_hw.h"
+#define ADF_DH895XCC_VF_MSK 0xFFFFFFFF + /* Worker thread to service arbiter mappings */ static const u32 thrd_to_arb_map[ADF_DH895XCC_MAX_ACCELENGINES] = { 0x12222AAA, 0x11666666, 0x12222AAA, 0x11666666, @@ -114,29 +116,6 @@ static void adf_enable_ints(struct adf_a ADF_DH895XCC_SMIA1_MASK); }
-static u32 get_vf2pf_sources(void __iomem *pmisc_bar) -{ - u32 errsou3, errmsk3, errsou5, errmsk5, vf_int_mask; - - /* Get the interrupt sources triggered by VFs */ - errsou3 = ADF_CSR_RD(pmisc_bar, ADF_GEN2_ERRSOU3); - vf_int_mask = ADF_DH895XCC_ERR_REG_VF2PF_L(errsou3); - - /* To avoid adding duplicate entries to work queue, clear - * vf_int_mask_sets bits that are already masked in ERRMSK register. - */ - errmsk3 = ADF_CSR_RD(pmisc_bar, ADF_GEN2_ERRMSK3); - vf_int_mask &= ~ADF_DH895XCC_ERR_REG_VF2PF_L(errmsk3); - - /* Do the same for ERRSOU5 */ - errsou5 = ADF_CSR_RD(pmisc_bar, ADF_GEN2_ERRSOU5); - errmsk5 = ADF_CSR_RD(pmisc_bar, ADF_GEN2_ERRMSK5); - vf_int_mask |= ADF_DH895XCC_ERR_REG_VF2PF_U(errsou5); - vf_int_mask &= ~ADF_DH895XCC_ERR_REG_VF2PF_U(errmsk5); - - return vf_int_mask; -} - static void enable_vf2pf_interrupts(void __iomem *pmisc_addr, u32 vf_mask) { /* Enable VF2PF Messaging Ints - VFs 0 through 15 per vf_mask[15:0] */ @@ -150,7 +129,6 @@ static void enable_vf2pf_interrupts(void if (vf_mask >> 16) { u32 val = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRMSK5) & ~ADF_DH895XCC_ERR_MSK_VF2PF_U(vf_mask); - ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK5, val); } } @@ -173,6 +151,54 @@ static void disable_vf2pf_interrupts(voi } }
+static u32 disable_pending_vf2pf_interrupts(void __iomem *pmisc_addr) +{ + u32 sources, pending, disabled; + u32 errsou3, errmsk3; + u32 errsou5, errmsk5; + + /* Get the interrupt sources triggered by VFs */ + errsou3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRSOU3); + errsou5 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRSOU5); + sources = ADF_DH895XCC_ERR_REG_VF2PF_L(errsou3) + | ADF_DH895XCC_ERR_REG_VF2PF_U(errsou5); + + if (!sources) + return 0; + + /* Get the already disabled interrupts */ + errmsk3 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRMSK3); + errmsk5 = ADF_CSR_RD(pmisc_addr, ADF_GEN2_ERRMSK5); + disabled = ADF_DH895XCC_ERR_REG_VF2PF_L(errmsk3) + | ADF_DH895XCC_ERR_REG_VF2PF_U(errmsk5); + + pending = sources & ~disabled; + if (!pending) + return 0; + + /* Due to HW limitations, when disabling the interrupts, we can't + * just disable the requested sources, as this would lead to missed + * interrupts if sources changes just before writing to ERRMSK3 and + * ERRMSK5. + * To work around it, disable all and re-enable only the sources that + * are not in vf_mask and were not already disabled. Re-enabling will + * trigger a new interrupt for the sources that have changed in the + * meantime, if any. + */ + errmsk3 |= ADF_DH895XCC_ERR_MSK_VF2PF_L(ADF_DH895XCC_VF_MSK); + errmsk5 |= ADF_DH895XCC_ERR_MSK_VF2PF_U(ADF_DH895XCC_VF_MSK); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK3, errmsk3); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK5, errmsk5); + + errmsk3 &= ADF_DH895XCC_ERR_MSK_VF2PF_L(sources | disabled); + errmsk5 &= ADF_DH895XCC_ERR_MSK_VF2PF_U(sources | disabled); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK3, errmsk3); + ADF_CSR_WR(pmisc_addr, ADF_GEN2_ERRMSK5, errmsk5); + + /* Return the sources of the (new) interrupt(s) */ + return pending; +} + static void configure_iov_threads(struct adf_accel_dev *accel_dev, bool enable) { adf_gen2_cfg_iov_thds(accel_dev, enable, @@ -220,9 +246,9 @@ void adf_init_hw_data_dh895xcc(struct ad hw_data->disable_iov = adf_disable_sriov;
adf_gen2_init_pf_pfvf_ops(&hw_data->pfvf_ops); - hw_data->pfvf_ops.get_vf2pf_sources = get_vf2pf_sources; hw_data->pfvf_ops.enable_vf2pf_interrupts = enable_vf2pf_interrupts; hw_data->pfvf_ops.disable_vf2pf_interrupts = disable_vf2pf_interrupts; + hw_data->pfvf_ops.disable_pending_vf2pf_interrupts = disable_pending_vf2pf_interrupts; adf_gen2_init_hw_csr_ops(&hw_data->csr_ops); }
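The register dance common to all three variants can be seen in isolation in the user-space sketch below, which models the GEN4 style (a set mask bit disables a source) with plain variables; the values are made up, and the two consecutive writes are kept to mirror the hardware sequence, where the first write masks everything and the second re-enables only the sources that were neither pending nor already disabled.

#include <stdio.h>

#define VF_MSK 0xFFFFu

static unsigned int msk;	/* stands in for the VM2PF_MSK register */

static unsigned int disable_pending(unsigned int sou)
{
	unsigned int disabled = msk;
	unsigned int pending = sou & ~disabled;

	if (!pending)
		return 0;

	/* Mask everything first, then unmask only what was neither pending
	 * nor already disabled; a source that fired in between re-triggers. */
	msk = VF_MSK;
	msk = disabled | sou;
	return pending;
}

int main(void)
{
	unsigned int pending;

	msk = 0x0001;				/* VF0 already disabled */
	pending = disable_pending(0x0005);	/* VF0 and VF2 raised */

	printf("serviced: 0x%04x, new mask: 0x%04x\n", pending, msk);
	/* serviced: 0x0004 (only VF2), new mask: 0x0005 */
	return 0;
}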
From: Sultan Alsawaf sultan@kerneltoast.com
commit 2505a981114dcb715f8977b8433f7540854851d8 upstream.
The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races.
It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process).
Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration.
Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf sultan@kerneltoast.com Acked-by: Minchan Kim minchan@kernel.org Cc: Nitin Gupta ngupta@vflare.org Cc: Sergey Senozhatsky senozhatsky@chromium.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/zsmalloc.c | 37 +++++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-)
--- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1718,11 +1718,40 @@ static enum fullness_group putback_zspag */ static void lock_zspage(struct zspage *zspage) { - struct page *page = get_first_page(zspage); + struct page *curr_page, *page;
- do { - lock_page(page); - } while ((page = get_next_page(page)) != NULL); + /* + * Pages we haven't locked yet can be migrated off the list while we're + * trying to lock them, so we need to be careful and only attempt to + * lock each page under migrate_read_lock(). Otherwise, the page we lock + * may no longer belong to the zspage. This means that we may wait for + * the wrong page to unlock, so we must take a reference to the page + * prior to waiting for it to unlock outside migrate_read_lock(). + */ + while (1) { + migrate_read_lock(zspage); + page = get_first_page(zspage); + if (trylock_page(page)) + break; + get_page(page); + migrate_read_unlock(zspage); + wait_on_page_locked(page); + put_page(page); + } + + curr_page = page; + while ((page = get_next_page(curr_page))) { + if (trylock_page(page)) { + curr_page = page; + } else { + get_page(page); + migrate_read_unlock(zspage); + wait_on_page_locked(page); + put_page(page); + migrate_read_lock(zspage); + } + } + migrate_read_unlock(zspage); }
static int zs_init_fs_context(struct fs_context *fc)
From: Akira Yokosawa akiyks@gmail.com
commit 5b759db44195bb779828a188bad6b745c18dcd55 upstream.
EXPORT_SYMBOL of do_exit() was removed in v5.17. Unfortunately, kernel modules from klitmus7 7.56 have do_exit() at the end of each kthread.
herdtools7 7.56.1 has addressed the issue.
Update the compatibility table accordingly.
Signed-off-by: Akira Yokosawa akiyks@gmail.com Cc: Luc Maranget luc.maranget@inria.fr Cc: Jade Alglave j.alglave@ucl.ac.uk Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- tools/memory-model/README | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/tools/memory-model/README +++ b/tools/memory-model/README @@ -54,7 +54,8 @@ klitmus7 Compatibility Table -- 4.14 7.48 -- 4.15 -- 4.19 7.49 -- 4.20 -- 5.5 7.54 -- - 5.6 -- 7.56 -- + 5.6 -- 5.16 7.56 -- + 5.17 -- 7.56.1 -- ============ ==========
From: Takashi Iwai tiwai@suse.de
commit 5ce0b06ae5e69e23142e73c5c3c0260e9f2ccb4b upstream.
Maris reported that the TEAC UD-501 (0644:8043) doesn't work on recent kernels, failing with the typical "clock source 41 is not valid, cannot use" errors. The only workaround known so far is to (partially) restore what we used to do unconditionally at the clock setup; namely, re-set up the USB interface immediately after the clock is changed. This patch re-introduces that behavior conditionally for TEAC devices.
Further notes:
- The USB interface shall be set later in snd_usb_endpoint_configure(), but this seems to be too late.
- Even calling usb_set_interface() right after snd_usb_init_sample_rate() doesn't help; so this must be related to the clock validation, too.
- The device may still spew the "clock source 41 is not valid" error at the first clock setup. This seems to happen only at the very first attempt and disappears on later attempts. The error is likely harmless because the driver retries the clock setup (such an error is more or less expected on some devices).
Fixes: bf6313a0ff76 ("ALSA: usb-audio: Refactor endpoint management") Reported-and-tested-by: Maris Abele maris7abele@gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220521064627.29292-1-tiwai@suse.de Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/usb/clock.c | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/sound/usb/clock.c +++ b/sound/usb/clock.c @@ -572,6 +572,13 @@ static int set_sample_rate_v2v3(struct s /* continue processing */ }
+ /* FIXME - TEAC devices require the immediate interface setup */ + if (rate != prev_rate && USB_ID_VENDOR(chip->usb_id) == 0x0644) { + usb_set_interface(chip->dev, fmt->iface, fmt->altsetting); + if (chip->quirk_flags & QUIRK_FLAG_IFACE_DELAY) + msleep(50); + } + validation: /* validate clock after rate change */ if (!uac_clock_source_is_valid(chip, fmt, clock))
From: Takashi Iwai tiwai@suse.de
commit 7b0efea4baf02f5e2f89e5f9b75ef891571b45f1 upstream.
The quirk entry for the Focusrite Saffire 6 had no proper ep_idx for the capture endpoint, which confused the driver and resulted in broken sound. This patch adds the missing ep_idx to the entry.
While we are at it, a couple of other entries (for Digidesign MBox and MOTU MicroBook II) seem to have the same problem, and those are covered as well.
Fixes: bf6313a0ff76 ("ALSA: usb-audio: Refactor endpoint management") Reported-by: André Kapelrud a.kapelrud@gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220521065325.426-1-tiwai@suse.de Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/usb/quirks-table.h | 3 +++ 1 file changed, 3 insertions(+)
--- a/sound/usb/quirks-table.h +++ b/sound/usb/quirks-table.h @@ -2672,6 +2672,7 @@ YAMAHA_DEVICE(0x7010, "UB99"), .altset_idx = 1, .attributes = 0, .endpoint = 0x82, + .ep_idx = 1, .ep_attr = USB_ENDPOINT_XFER_ISOC, .datainterval = 1, .maxpacksize = 0x0126, @@ -2875,6 +2876,7 @@ YAMAHA_DEVICE(0x7010, "UB99"), .altset_idx = 1, .attributes = 0x4, .endpoint = 0x81, + .ep_idx = 1, .ep_attr = USB_ENDPOINT_XFER_ISOC | USB_ENDPOINT_SYNC_ASYNC, .maxpacksize = 0x130, @@ -3391,6 +3393,7 @@ YAMAHA_DEVICE(0x7010, "UB99"), .altset_idx = 1, .attributes = 0, .endpoint = 0x03, + .ep_idx = 1, .rates = SNDRV_PCM_RATE_96000, .ep_attr = USB_ENDPOINT_XFER_ISOC | USB_ENDPOINT_SYNC_ASYNC,
From: Craig McLure craig@mclure.net
commit 0e85a22d01dfe9ad9a9d9e87cd4a88acce1aad65 upstream.
Devices such as the TC-Helicon GoXLR require the sync endpoint to be configured in advance of the data endpoint in order for sound output to work.
This patch simply changes the ordering of EP configuration to resolve this.
Fixes: bf6313a0ff76 ("ALSA: usb-audio: Refactor endpoint management") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215079 Signed-off-by: Craig McLure craig@mclure.net Reviewed-by: Jaroslav Kysela perex@perex.cz Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524062115.25968-1-tiwai@suse.de Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/usb/pcm.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-)
--- a/sound/usb/pcm.c +++ b/sound/usb/pcm.c @@ -439,16 +439,21 @@ static int configure_endpoints(struct sn /* stop any running stream beforehand */ if (stop_endpoints(subs, false)) sync_pending_stops(subs); + if (subs->sync_endpoint) { + err = snd_usb_endpoint_configure(chip, subs->sync_endpoint); + if (err < 0) + return err; + } err = snd_usb_endpoint_configure(chip, subs->data_endpoint); if (err < 0) return err; snd_usb_set_format_quirk(subs, subs->cur_audiofmt); - } - - if (subs->sync_endpoint) { - err = snd_usb_endpoint_configure(chip, subs->sync_endpoint); - if (err < 0) - return err; + } else { + if (subs->sync_endpoint) { + err = snd_usb_endpoint_configure(chip, subs->sync_endpoint); + if (err < 0) + return err; + } }
return 0;
From: Steven Rostedt rostedt@goodmis.org
commit 72ef98445aca568a81c2da050532500a8345ad3a upstream.
A crash report showed a timer list being corrupted, which usually happens when a timer is freed while still active. This is commonly triggered by code calling del_timer() instead of del_timer_sync() just before freeing.
One possible culprit is the hci_qca driver, which does exactly that.
Eric mentioned that wake_retrans_timer could be rearmed via the work queue, so also move the destruction of the work queue before del_timer_sync().
Cc: Eric Dumazet eric.dumazet@gmail.com Cc: stable@vger.kernel.org Fixes: 0ff252c1976da ("Bluetooth: hciuart: Add support QCA chipset for UART") Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Marcel Holtmann marcel@holtmann.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/bluetooth/hci_qca.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/bluetooth/hci_qca.c +++ b/drivers/bluetooth/hci_qca.c @@ -696,9 +696,9 @@ static int qca_close(struct hci_uart *hu skb_queue_purge(&qca->tx_wait_q); skb_queue_purge(&qca->txq); skb_queue_purge(&qca->rx_memdump_q); - del_timer(&qca->tx_idle_timer); - del_timer(&qca->wake_retrans_timer); destroy_workqueue(qca->workqueue); + del_timer_sync(&qca->tx_idle_timer); + del_timer_sync(&qca->wake_retrans_timer); qca->hu = NULL;
kfree_skb(qca->rx_skb);
From: Jonathan Bakker xc-racer2@live.ca
commit 3f5e3d3a8b895c8a11da8b0063ba2022dd9e2045 upstream.
Correct the name of the bluetooth interrupt from host-wake to host-wakeup.
Fixes: 1c65b6184441b ("ARM: dts: s5pv210: Correct BCM4329 bluetooth node") Cc: stable@vger.kernel.org Signed-off-by: Jonathan Bakker xc-racer2@live.ca Link: https://lore.kernel.org/r/CY4PR04MB0567495CFCBDC8D408D44199CB1C9@CY4PR04MB05... Signed-off-by: Krzysztof Kozlowski krzysztof.kozlowski@linaro.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/arm/boot/dts/s5pv210-aries.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/arm/boot/dts/s5pv210-aries.dtsi +++ b/arch/arm/boot/dts/s5pv210-aries.dtsi @@ -895,7 +895,7 @@ device-wakeup-gpios = <&gpg3 4 GPIO_ACTIVE_HIGH>; interrupt-parent = <&gph2>; interrupts = <5 IRQ_TYPE_LEVEL_HIGH>; - interrupt-names = "host-wake"; + interrupt-names = "host-wakeup"; }; };
From: Dan Carpenter dan.carpenter@oracle.com
commit d3f2a14b8906df913cb04a706367b012db94a6e8 upstream.
The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this:
drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r'
Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Reviewed-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-integrity.c | 2 -- 1 file changed, 2 deletions(-)
--- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -4494,8 +4494,6 @@ try_smaller_buffer: }
if (should_write_sb) { - int r; - init_journal(ic, 0, ic->journal_sections, 0); r = dm_integrity_failed(ic); if (unlikely(r)) {
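The effect of the shadowed variable is easy to reproduce in isolation; the sketch below (with a made-up may_fail() helper) shows how the error set inside the inner block never reaches the final return, which is exactly what Smatch flagged.

#include <stdio.h>

static int may_fail(void)
{
	return -5;	/* pretend the sub-operation failed */
}

static int ctr_with_shadow(void)
{
	int r = 0;

	{
		int r;	/* shadows the outer r, as in the original bug */

		r = may_fail();
		if (r)
			goto out;	/* meant to propagate the error */
	}
out:
	return r;	/* still 0: the error was silently lost */
}

int main(void)
{
	printf("returned %d (expected -5)\n", ctr_with_shadow());
	return 0;
}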
From: Mikulas Patocka mpatocka@redhat.com
commit 567dd8f34560fa221a6343729474536aa7ede4fd upstream.
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor.
Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc(), which carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ASCII character.
Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Tested-by: Milan Broz gmazyland@gmail.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-crypt.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
--- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -3439,6 +3439,11 @@ static int crypt_map(struct dm_target *t return DM_MAPIO_SUBMITTED; }
+static char hex2asc(unsigned char c) +{ + return c + '0' + ((unsigned)(9 - c) >> 4 & 0x27); +} + static void crypt_status(struct dm_target *ti, status_type_t type, unsigned status_flags, char *result, unsigned maxlen) { @@ -3457,9 +3462,12 @@ static void crypt_status(struct dm_targe if (cc->key_size > 0) { if (cc->key_string) DMEMIT(":%u:%s", cc->key_size, cc->key_string); - else - for (i = 0; i < cc->key_size; i++) - DMEMIT("%02x", cc->key[i]); + else { + for (i = 0; i < cc->key_size; i++) { + DMEMIT("%c%c", hex2asc(cc->key[i] >> 4), + hex2asc(cc->key[i] & 0xf)); + } + } } else DMEMIT("-");
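The new helper can be checked quickly in user space; the snippet below just copies hex2asc() as added above and prints the full mapping, confirming it yields '0'..'9' and 'a'..'f' without any branch or table lookup.

#include <stdio.h>

static char hex2asc(unsigned char c)
{
	return c + '0' + ((unsigned)(9 - c) >> 4 & 0x27);
}

int main(void)
{
	for (unsigned char c = 0; c < 16; c++)
		putchar(hex2asc(c));
	putchar('\n');	/* prints 0123456789abcdef */
	return 0;
}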
From: Mikulas Patocka mpatocka@redhat.com
commit bfe2b0146c4d0230b68f5c71a64380ff8d361f8b upstream.
dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries.
Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-stats.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/drivers/md/dm-stats.c +++ b/drivers/md/dm-stats.c @@ -225,6 +225,7 @@ void dm_stats_cleanup(struct dm_stats *s atomic_read(&shared->in_flight[READ]), atomic_read(&shared->in_flight[WRITE])); } + cond_resched(); } dm_stat_free(&s->rcu_head); } @@ -330,6 +331,7 @@ static int dm_stats_create(struct dm_sta for (ni = 0; ni < n_entries; ni++) { atomic_set(&s->stat_shared[ni].in_flight[READ], 0); atomic_set(&s->stat_shared[ni].in_flight[WRITE], 0); + cond_resched(); }
if (s->n_histogram_entries) { @@ -342,6 +344,7 @@ static int dm_stats_create(struct dm_sta for (ni = 0; ni < n_entries; ni++) { s->stat_shared[ni].tmp.histogram = hi; hi += s->n_histogram_entries + 1; + cond_resched(); } }
@@ -362,6 +365,7 @@ static int dm_stats_create(struct dm_sta for (ni = 0; ni < n_entries; ni++) { p[ni].histogram = hi; hi += s->n_histogram_entries + 1; + cond_resched(); } } } @@ -497,6 +501,7 @@ static int dm_stats_list(struct dm_stats } DMEMIT("\n"); } + cond_resched(); } mutex_unlock(&stats->mutex);
@@ -774,6 +779,7 @@ static void __dm_stat_clear(struct dm_st local_irq_enable(); } } + cond_resched(); } }
@@ -889,6 +895,8 @@ static int dm_stats_print(struct dm_stat
if (unlikely(sz + 1 >= maxlen)) goto buffer_overflow; + + cond_resched(); }
if (clear)
From: Sarthak Kukreti sarthakkukreti@google.com
commit 4caae58406f8ceb741603eee460d79bacca9b1b5 upstream.
The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type.
Fixes: a4ffc152198e ("dm: add verity target") Cc: stable@vger.kernel.org Signed-off-by: Sarthak Kukreti sarthakkukreti@google.com Reviewed-by: Kees Cook keescook@chromium.org Signed-off-by: Mike Snitzer snitzer@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/dm-verity-target.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/md/dm-verity-target.c +++ b/drivers/md/dm-verity-target.c @@ -1312,6 +1312,7 @@ bad:
static struct target_type verity_target = { .name = "verity", + .features = DM_TARGET_IMMUTABLE, .version = {1, 8, 0}, .module = THIS_MODULE, .ctr = verity_ctr,
From: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com
commit 57668f0a4cc4083a120cc8c517ca0055c4543b59 upstream.
The raid456 module used to allow the array to reach a failed state. That was changed by fb73b357fb9 ("raid5: block failing device if raid will be failed"), but the change introduced a bug: now if raid5 fails during IO, it may end up with a hung task that never completes. The Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow setting Faulty on a drive again and set MD_BROKEN if the raid is failed.

As a result, this level is allowed to reach the failed state again, but communication with userspace (via -EBUSY status) is preserved.

This restores the possibility of failing the array via the #mdadm --set-faulty command; additional verification will be added on the mdadm side.
Reproduction steps:
mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
mkfs.xfs /dev/md126 -f
mount /dev/md126 /mnt/root/
fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 &
echo 1 > /sys/block/nvme2n1/device/device/remove
echo 1 > /sys/block/nvme1n1/device/device/remove
[ 1475.787779] Call Trace: [ 1475.793111] __schedule+0x2a6/0x700 [ 1475.799460] schedule+0x38/0xa0 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] [ 1475.813856] ? finish_wait+0x80/0x80 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] [ 1475.828281] ? finish_wait+0x80/0x80 [ 1475.834727] ? finish_wait+0x80/0x80 [ 1475.841127] ? finish_wait+0x80/0x80 [ 1475.847480] md_handle_request+0x119/0x190 [ 1475.854390] md_make_request+0x8a/0x190 [ 1475.861041] generic_make_request+0xcf/0x310 [ 1475.868145] submit_bio+0x3c/0x160 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 [ 1475.889149] iomap_apply+0xff/0x310 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.909974] iomap_dio_rw+0x2f2/0x490 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.923680] ? atime_needs_update+0x77/0xe0 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] [ 1475.953403] aio_read+0xd5/0x180 [ 1475.959395] ? _cond_resched+0x15/0x30 [ 1475.965907] io_submit_one+0x20b/0x3c0 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 [ 1475.979335] ? do_io_getevents+0x7c/0xc0 [ 1475.986009] do_syscall_64+0x5b/0x1a0 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1476.000255] RIP: 0033:0x7f11fc27978d [ 1476.006631] Code: Bad RIP value. [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
Cc: stable@vger.kernel.org Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed") Reviewed-by: Xiao Ni xni@redhat.com Signed-off-by: Mariusz Tkaczyk mariusz.tkaczyk@linux.intel.com Signed-off-by: Song Liu song@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/md/raid5.c | 47 ++++++++++++++++++++++------------------------- 1 file changed, 22 insertions(+), 25 deletions(-)
--- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -686,17 +686,17 @@ int raid5_calc_degraded(struct r5conf *c return degraded; }
-static int has_failed(struct r5conf *conf) +static bool has_failed(struct r5conf *conf) { - int degraded; + int degraded = conf->mddev->degraded;
- if (conf->mddev->reshape_position == MaxSector) - return conf->mddev->degraded > conf->max_degraded; + if (test_bit(MD_BROKEN, &conf->mddev->flags)) + return true;
- degraded = raid5_calc_degraded(conf); - if (degraded > conf->max_degraded) - return 1; - return 0; + if (conf->mddev->reshape_position != MaxSector) + degraded = raid5_calc_degraded(conf); + + return degraded > conf->max_degraded; }
struct stripe_head * @@ -2863,34 +2863,31 @@ static void raid5_error(struct mddev *md unsigned long flags; pr_debug("raid456: error called\n");
+ pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n", + mdname(mddev), bdevname(rdev->bdev, b)); + spin_lock_irqsave(&conf->device_lock, flags); + set_bit(Faulty, &rdev->flags); + clear_bit(In_sync, &rdev->flags); + mddev->degraded = raid5_calc_degraded(conf);
- if (test_bit(In_sync, &rdev->flags) && - mddev->degraded == conf->max_degraded) { - /* - * Don't allow to achieve failed state - * Don't try to recover this device - */ + if (has_failed(conf)) { + set_bit(MD_BROKEN, &conf->mddev->flags); conf->recovery_disabled = mddev->recovery_disabled; - spin_unlock_irqrestore(&conf->device_lock, flags); - return; + + pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n", + mdname(mddev), mddev->degraded, conf->raid_disks); + } else { + pr_crit("md/raid:%s: Operation continuing on %d devices.\n", + mdname(mddev), conf->raid_disks - mddev->degraded); }
- set_bit(Faulty, &rdev->flags); - clear_bit(In_sync, &rdev->flags); - mddev->degraded = raid5_calc_degraded(conf); spin_unlock_irqrestore(&conf->device_lock, flags); set_bit(MD_RECOVERY_INTR, &mddev->recovery);
set_bit(Blocked, &rdev->flags); set_mask_bits(&mddev->sb_flags, 0, BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING)); - pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n" - "md/raid:%s: Operation continuing on %d devices.\n", - mdname(mddev), - bdevname(rdev->bdev, b), - mdname(mddev), - conf->raid_disks - mddev->degraded); r5c_update_on_rdev_error(mddev, rdev); }
From: Randy Dunlap rdunlap@infradead.org
commit a3b774342fa752a5290c0de36375289dfcf4a260 upstream.
When the NTFS BOOT sectors_per_clusters field is > 0x80, it represents a shift value. Make sure that the shift value is not too large before using it (NTFS max cluster size is 2MB). Return -EINVAL if it is too large.
This prevents negative shift values and shift values that are larger than the field size.
Prevents this UBSAN error:
UBSAN: shift-out-of-bounds in ../fs/ntfs3/super.c:673:16 shift exponent -192 is negative
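For reference, a small user-space sketch of the on-disk encoding being validated (this mirrors the usual NTFS convention rather than the ntfs3 code verbatim): values up to 0x80 are a plain sector count, values from 0xf4 upward are a negative two's-complement shift, and 0xf4 (i.e. -12) corresponds to the 2MB maximum with 512-byte sectors.

#include <stdio.h>

static int sectors_per_cluster(unsigned char b)
{
	if (b <= 0x80)
		return b;			/* direct sector count */
	if (b >= 0xf4)
		return 1 << (unsigned char)-b;	/* 0xf4 -> 1 << 12 = 4096 sectors */
	return -1;				/* out of range, the driver returns -EINVAL */
}

int main(void)
{
	printf("0x08 -> %d sectors\n", sectors_per_cluster(0x08));	/* 8 */
	printf("0xf8 -> %d sectors\n", sectors_per_cluster(0xf8));	/* 256 */
	printf("0xf4 -> %d sectors\n", sectors_per_cluster(0xf4));	/* 4096 */
	printf("0xc0 -> %d (rejected)\n", sectors_per_cluster(0xc0));
	return 0;
}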
Link: https://lkml.kernel.org/r/20220502175342.20296-1-rdunlap@infradead.org Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block") Signed-off-by: Randy Dunlap rdunlap@infradead.org Reported-by: syzbot+1631f09646bc214d2e76@syzkaller.appspotmail.com Reviewed-by: Namjae Jeon linkinjeon@kernel.org Cc: Konstantin Komarov almaz.alexandrovich@paragon-software.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Kari Argillander kari.argillander@stargateuniverse.net Cc: Namjae Jeon linkinjeon@kernel.org Cc: Matthew Wilcox willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ntfs3/super.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
--- a/fs/ntfs3/super.c +++ b/fs/ntfs3/super.c @@ -668,9 +668,11 @@ static u32 format_size_gb(const u64 byte
static u32 true_sectors_per_clst(const struct NTFS_BOOT *boot) { - return boot->sectors_per_clusters <= 0x80 - ? boot->sectors_per_clusters - : (1u << (0 - boot->sectors_per_clusters)); + if (boot->sectors_per_clusters <= 0x80) + return boot->sectors_per_clusters; + if (boot->sectors_per_clusters >= 0xf4) /* limit shift to 2MB max */ + return 1U << (0 - boot->sectors_per_clusters); + return -EINVAL; }
/* @@ -713,6 +715,8 @@ static int ntfs_init_from_boot(struct su
/* cluster size: 512, 1K, 2K, 4K, ... 2M */ sct_per_clst = true_sectors_per_clst(boot); + if ((int)sct_per_clst < 0) + goto out; if (!is_power_of_2(sct_per_clst)) goto out;
From: Marek Maślanka mm@semihalf.com
commit 1d07cef7fd7599450b3d03e1915efc2a96e1f03f upstream.
The Google Whiskers touchpad does not work properly with the default multitouch configuration. Instead, use the same configuration as Google Rose.
Signed-off-by: Marek Maslanka mm@semihalf.com Acked-by: Benjamin Tissoires benjamin.tissoires@redhat.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-multitouch.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/hid/hid-multitouch.c +++ b/drivers/hid/hid-multitouch.c @@ -2178,6 +2178,9 @@ static const struct hid_device_id mt_dev { .driver_data = MT_CLS_GOOGLE, HID_DEVICE(HID_BUS_ANY, HID_GROUP_ANY, USB_VENDOR_ID_GOOGLE, USB_DEVICE_ID_GOOGLE_TOUCH_ROSE) }, + { .driver_data = MT_CLS_GOOGLE, + HID_DEVICE(BUS_USB, HID_GROUP_MULTITOUCH_WIN_8, USB_VENDOR_ID_GOOGLE, + USB_DEVICE_ID_GOOGLE_WHISKERS) },
/* Generic MT device */ { HID_DEVICE(HID_BUS_ANY, HID_GROUP_MULTITOUCH, HID_ANY_ID, HID_ANY_ID) },
From: Tao Jin tao-j@outlook.com
commit 95cd2cdc88c755dcd0a58b951faeb77742c733a4 upstream.
This applies to the X12 tablet the same quirks used by previous generation devices such as the X1 tablet, so that the trackpoint and buttons can work.
This patch was applied and tested to work on 5.17.1.
Cc: stable@vger.kernel.org # 5.8+ given that it relies on 40d5bb87377a Signed-off-by: Tao Jin tao-j@outlook.com Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com Link: https://lore.kernel.org/r/CO6PR03MB6241CB276FCDC7F4CEDC34F6E1E29@CO6PR03MB62... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-ids.h | 1 + drivers/hid/hid-multitouch.c | 6 ++++++ 2 files changed, 7 insertions(+)
--- a/drivers/hid/hid-ids.h +++ b/drivers/hid/hid-ids.h @@ -768,6 +768,7 @@ #define USB_DEVICE_ID_LENOVO_X1_COVER 0x6085 #define USB_DEVICE_ID_LENOVO_X1_TAB 0x60a3 #define USB_DEVICE_ID_LENOVO_X1_TAB3 0x60b5 +#define USB_DEVICE_ID_LENOVO_X12_TAB 0x60fe #define USB_DEVICE_ID_LENOVO_OPTICAL_USB_MOUSE_600E 0x600e #define USB_DEVICE_ID_LENOVO_PIXART_USB_MOUSE_608D 0x608d #define USB_DEVICE_ID_LENOVO_PIXART_USB_MOUSE_6019 0x6019 --- a/drivers/hid/hid-multitouch.c +++ b/drivers/hid/hid-multitouch.c @@ -2034,6 +2034,12 @@ static const struct hid_device_id mt_dev USB_VENDOR_ID_LENOVO, USB_DEVICE_ID_LENOVO_X1_TAB3) },
+ /* Lenovo X12 TAB Gen 1 */ + { .driver_data = MT_CLS_WIN_8_FORCE_MULTI_INPUT, + HID_DEVICE(BUS_USB, HID_GROUP_MULTITOUCH_WIN_8, + USB_VENDOR_ID_LENOVO, + USB_DEVICE_ID_LENOVO_X12_TAB) }, + /* MosArt panels */ { .driver_data = MT_CLS_CONFIDENCE_MINUS_ONE, MT_USB_DEVICE(USB_VENDOR_ID_ASUS,
From: Reinette Chatre reinette.chatre@intel.com
commit 6bd429643cc265e94a9d19839c771bcc5d008fa8 upstream.
SGX uses shmem backing storage to store encrypted enclave pages and their crypto metadata when enclave pages are moved out of enclave memory. Two shmem backing storage pages are associated with each enclave page - one backing page to contain the encrypted enclave page data and one backing page (shared by a few enclave pages) to contain the crypto metadata used by the processor to verify the enclave page when it is loaded back into the enclave.
sgx_encl_put_backing() is used to release references to the backing storage and, optionally, mark both backing store pages as dirty.
Managing references and dirty status together in this way results in both backing store pages being marked as dirty, even if only one of the backing store pages is changed.
Additionally, waiting until the page reference is dropped to set the page dirty risks a race with the page fault handler that may load outdated data into the enclave when a page is faulted right after it is reclaimed.
Consider what happens if the reclaimer writes a page to the backing store and the page is immediately faulted back, before the reclaimer is able to set the dirty bit of the page:
sgx_reclaim_pages() { sgx_vma_fault() { ... sgx_encl_get_backing(); ... ... sgx_reclaimer_write() { mutex_lock(&encl->lock); /* Write data to backing store */ mutex_unlock(&encl->lock); } mutex_lock(&encl->lock); __sgx_encl_eldu() { ... /* * Enclave backing store * page not released * nor marked dirty - * contents may not be * up to date. */ sgx_encl_get_backing(); ... /* * Enclave data restored * from backing store * and PCMD pages that * are not up to date. * ENCLS[ELDU] faults * because of MAC or PCMD * checking failure. */ sgx_encl_put_backing(); } ... /* set page dirty */ sgx_encl_put_backing(); ... mutex_unlock(&encl->lock); } }
Remove the option to sgx_encl_put_backing() to set the backing pages as dirty and set the needed pages as dirty right after receiving important data while enclave mutex is held. This ensures that the page fault handler can get up to date data from a page and prepares the code for a following change where only one of the backing pages need to be marked as dirty.
Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Suggested-by: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Haitao Huang haitao.huang@intel.com Link: https://lore.kernel.org/linux-sgx/8922e48f-6646-c7cc-6393-7c78dcf23d23@intel... Link: https://lkml.kernel.org/r/fa9f98986923f43e72ef4c6702a50b2a0b3c42e3.165238982... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/sgx/encl.c | 10 ++-------- arch/x86/kernel/cpu/sgx/encl.h | 2 +- arch/x86/kernel/cpu/sgx/main.c | 6 ++++-- 3 files changed, 7 insertions(+), 11 deletions(-)
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -94,7 +94,7 @@ static int __sgx_encl_eldu(struct sgx_en kunmap_atomic(pcmd_page); kunmap_atomic((void *)(unsigned long)pginfo.contents);
- sgx_encl_put_backing(&b, false); + sgx_encl_put_backing(&b);
sgx_encl_truncate_backing_page(encl, page_index);
@@ -645,15 +645,9 @@ int sgx_encl_get_backing(struct sgx_encl /** * sgx_encl_put_backing() - Unpin the backing storage * @backing: data for accessing backing storage for the page - * @do_write: mark pages dirty */ -void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write) +void sgx_encl_put_backing(struct sgx_backing *backing) { - if (do_write) { - set_page_dirty(backing->pcmd); - set_page_dirty(backing->contents); - } - put_page(backing->pcmd); put_page(backing->contents); } --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -107,7 +107,7 @@ void sgx_encl_release(struct kref *ref); int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm); int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, struct sgx_backing *backing); -void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write); +void sgx_encl_put_backing(struct sgx_backing *backing); int sgx_encl_test_and_clear_young(struct mm_struct *mm, struct sgx_encl_page *page);
--- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -191,6 +191,8 @@ static int __sgx_encl_ewb(struct sgx_epc backing->pcmd_offset;
ret = __ewb(&pginfo, sgx_get_epc_virt_addr(epc_page), va_slot); + set_page_dirty(backing->pcmd); + set_page_dirty(backing->contents);
kunmap_atomic((void *)(unsigned long)(pginfo.metadata - backing->pcmd_offset)); @@ -320,7 +322,7 @@ static void sgx_reclaimer_write(struct s sgx_encl_free_epc_page(encl->secs.epc_page); encl->secs.epc_page = NULL;
- sgx_encl_put_backing(&secs_backing, true); + sgx_encl_put_backing(&secs_backing); }
out: @@ -411,7 +413,7 @@ skip:
encl_page = epc_page->owner; sgx_reclaimer_write(epc_page, &backing[i]); - sgx_encl_put_backing(&backing[i], true); + sgx_encl_put_backing(&backing[i]);
kref_put(&encl_page->encl->refcount, sgx_encl_release); epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
From: Reinette Chatre reinette.chatre@intel.com
commit 2154e1c11b7080aa19f47160bd26b6f39bbd7824 upstream.
Recent commit 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") expanded __sgx_encl_eldu() to clear an enclave page's PCMD (Paging Crypto MetaData) from the PCMD page in the backing store after the enclave page is restored to the enclave.
Since the PCMD page in the backing store is modified, the page should be marked as dirty to ensure the modified data is retained.
Cc: stable@vger.kernel.org Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") Signed-off-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Haitao Huang haitao.huang@intel.com Link: https://lkml.kernel.org/r/00cd2ac480db01058d112e347b32599c1a806bc4.165238982... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/sgx/encl.c | 1 + 1 file changed, 1 insertion(+)
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -84,6 +84,7 @@ static int __sgx_encl_eldu(struct sgx_en }
memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd)); + set_page_dirty(b.pcmd);
/* * The area for the PCMD in the page was zeroed above. Check if the
From: Reinette Chatre reinette.chatre@intel.com
commit 0e4e729a830c1e7f31d3b3fbf8feb355a402b117 upstream.
Haitao reported encountering a WARN triggered by the ENCLS[ELDU] instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of pages from the enclave when the same pages are faulted back right away.
The SGX backing storage is accessed on two paths: when there are insufficient free pages in the EPC the reclaimer works to move enclave pages to the backing storage and as enclaves access pages that have been moved to the backing storage they are retrieved from there as part of page fault handling.
An oversubscribed SGX system will often run the reclaimer and page fault handler concurrently and needs to ensure that the backing store is accessed safely between the reclaimer and the page fault handler. This is not the case because the reclaimer accesses the backing store without the enclave mutex while the page fault handler accesses the backing store with the enclave mutex.
Consider the scenario where a page is faulted while a page sharing a PCMD page with the faulted page is being reclaimed. The consequence is a race between the reclaimer and the page fault handler, with the reclaimer attempting to access a PCMD page at the same time it is truncated by the page fault handler. This could result in lost PCMD data. Data may still be lost if the reclaimer wins the race; this is addressed in the following patch.
The reclaimer accesses pages from the backing storage without holding the enclave mutex and runs the risk of concurrently accessing the backing storage with the page fault handler that does access the backing storage with the enclave mutex held.
In the scenario below a PCMD page is truncated from the backing store after all its pages have been loaded in to the enclave at the same time the PCMD page is loaded from the backing store when one of its pages are reclaimed:
sgx_reclaim_pages() { sgx_vma_fault() { ... mutex_lock(&encl->lock); ... __sgx_encl_eldu() { ... if (pcmd_page_empty) { /* * EPC page being reclaimed /* * shares a PCMD page with an * PCMD page truncated * enclave page that is being * while requested from * faulted in. * reclaimer. */ */ sgx_encl_get_backing() <----------> sgx_encl_truncate_backing_page() } mutex_unlock(&encl->lock); } }
In this scenario there is a race between the reclaimer and the page fault handler when the reclaimer attempts to get access to the same PCMD page that is being truncated. This could result in the reclaimer writing to the PCMD page that is then truncated, causing the PCMD data to be lost, or in a new PCMD page being allocated. PCMD data may still be lost after protecting the backing store access with the mutex - this is fixed in the next patch. By ensuring the backing store is accessed with the mutex held, the enclave page state can be kept accurate, with the SGX_ENCL_PAGE_BEING_RECLAIMED flag accurately reflecting that a page is in the process of being reclaimed.
Consistently protect the reclaimer's backing store access with the enclave's mutex to ensure that it can safely run concurrently with the page fault handler.
Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Haitao Huang haitao.huang@intel.com Signed-off-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Haitao Huang haitao.huang@intel.com Link: https://lkml.kernel.org/r/fa2e04c561a8555bfe1f4e7adc37d60efc77387b.165238982... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/sgx/main.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
--- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -310,6 +310,7 @@ static void sgx_reclaimer_write(struct s sgx_encl_ewb(epc_page, backing); encl_page->epc_page = NULL; encl->secs_child_cnt--; + sgx_encl_put_backing(backing);
if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) { ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size), @@ -381,11 +382,14 @@ static void sgx_reclaim_pages(void) goto skip;
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base); + + mutex_lock(&encl_page->encl->lock); ret = sgx_encl_get_backing(encl_page->encl, page_index, &backing[i]); - if (ret) + if (ret) { + mutex_unlock(&encl_page->encl->lock); goto skip; + }
- mutex_lock(&encl_page->encl->lock); encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED; mutex_unlock(&encl_page->encl->lock); continue; @@ -413,7 +417,6 @@ skip:
encl_page = epc_page->owner; sgx_reclaimer_write(epc_page, &backing[i]); - sgx_encl_put_backing(&backing[i]);
kref_put(&encl_page->encl->refcount, sgx_encl_release); epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
From: Reinette Chatre reinette.chatre@intel.com
commit af117837ceb9a78e995804ade4726ad2c2c8981f upstream.
Haitao reported encountering a WARN triggered by the ENCLS[ELDU] instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of pages from the enclave when the same pages are faulted back right away.
Consider two enclave pages (ENCLAVE_A and ENCLAVE_B) sharing a PCMD page (PCMD_AB). ENCLAVE_A is in the enclave memory and ENCLAVE_B is in the backing store. PCMD_AB contains just one entry, that of ENCLAVE_B.
Scenario proceeds where ENCLAVE_A is being evicted from the enclave while ENCLAVE_B is faulted in.
sgx_reclaim_pages() {
...
/* * Reclaim ENCLAVE_A */ mutex_lock(&encl->lock); /* * Get a reference to ENCLAVE_A's * shmem page where enclave page * encrypted data will be stored * as well as a reference to the * enclave page's PCMD data page, * PCMD_AB. * Release mutex before writing * any data to the shmem pages. */ sgx_encl_get_backing(...); encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED; mutex_unlock(&encl->lock);
/* * Fault ENCLAVE_B */
sgx_vma_fault() {
mutex_lock(&encl->lock); /* * Get reference to * ENCLAVE_B's shmem page * as well as PCMD_AB. */ sgx_encl_get_backing(...) /* * Load page back into * enclave via ELDU. */ /* * Release reference to * ENCLAVE_B' shmem page and * PCMD_AB. */ sgx_encl_put_backing(...); /* * PCMD_AB is found empty so * it and ENCLAVE_B's shmem page * are truncated. */ /* Truncate ENCLAVE_B backing page */ sgx_encl_truncate_backing_page(); /* Truncate PCMD_AB */ sgx_encl_truncate_backing_page();
mutex_unlock(&encl->lock);
... } mutex_lock(&encl->lock); encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED; /* * Write encrypted contents of * ENCLAVE_A to ENCLAVE_A shmem * page and its PCMD data to * PCMD_AB. */ sgx_encl_put_backing(...)
/* * Reference to PCMD_AB is * dropped and it is truncated. * ENCLAVE_A's PCMD data is lost. */ mutex_unlock(&encl->lock); }
What happens next depends on whether it is ENCLAVE_A being faulted in or ENCLAVE_B being evicted - but both end up with ENCLS[ELDU] faulting with a #GP.
If ENCLAVE_A is faulted, then at the time sgx_encl_get_backing() is called a new PCMD page is allocated, and providing the empty PCMD data for ENCLAVE_A would cause ENCLS[ELDU] to #GP.
If ENCLAVE_B is evicted first then a new PCMD_AB would be allocated by the reclaimer but later when ENCLAVE_A is faulted the ENCLS[ELDU] instruction would #GP during its checks of the PCMD value and the WARN would be encountered.
Noting that the reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time it obtains a reference to the backing store pages of an enclave page it is in the process of reclaiming, fix the race by only truncating the PCMD page after ensuring that no page sharing the PCMD page is in the process of being reclaimed.
Cc: stable@vger.kernel.org Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") Reported-by: Haitao Huang haitao.huang@intel.com Signed-off-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Haitao Huang haitao.huang@intel.com Link: https://lkml.kernel.org/r/ed20a5db516aa813873268e125680041ae11dfcf.165238982... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/sgx/encl.c | 94 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 93 insertions(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -12,6 +12,92 @@ #include "encls.h" #include "sgx.h"
+#define PCMDS_PER_PAGE (PAGE_SIZE / sizeof(struct sgx_pcmd)) +/* + * 32 PCMD entries share a PCMD page. PCMD_FIRST_MASK is used to + * determine the page index associated with the first PCMD entry + * within a PCMD page. + */ +#define PCMD_FIRST_MASK GENMASK(4, 0) + +/** + * reclaimer_writing_to_pcmd() - Query if any enclave page associated with + * a PCMD page is in process of being reclaimed. + * @encl: Enclave to which PCMD page belongs + * @start_addr: Address of enclave page using first entry within the PCMD page + * + * When an enclave page is reclaimed some Paging Crypto MetaData (PCMD) is + * stored. The PCMD data of a reclaimed enclave page contains enough + * information for the processor to verify the page at the time + * it is loaded back into the Enclave Page Cache (EPC). + * + * The backing storage to which enclave pages are reclaimed is laid out as + * follows: + * Encrypted enclave pages:SECS page:PCMD pages + * + * Each PCMD page contains the PCMD metadata of + * PAGE_SIZE/sizeof(struct sgx_pcmd) enclave pages. + * + * A PCMD page can only be truncated if it is (a) empty, and (b) not in the + * process of getting data (and thus soon being non-empty). (b) is tested with + * a check if an enclave page sharing the PCMD page is in the process of being + * reclaimed. + * + * The reclaimer sets the SGX_ENCL_PAGE_BEING_RECLAIMED flag when it + * intends to reclaim that enclave page - it means that the PCMD page + * associated with that enclave page is about to get some data and thus + * even if the PCMD page is empty, it should not be truncated. + * + * Context: Enclave mutex (&sgx_encl->lock) must be held. + * Return: 1 if the reclaimer is about to write to the PCMD page + * 0 if the reclaimer has no intention to write to the PCMD page + */ +static int reclaimer_writing_to_pcmd(struct sgx_encl *encl, + unsigned long start_addr) +{ + int reclaimed = 0; + int i; + + /* + * PCMD_FIRST_MASK is based on number of PCMD entries within + * PCMD page being 32. + */ + BUILD_BUG_ON(PCMDS_PER_PAGE != 32); + + for (i = 0; i < PCMDS_PER_PAGE; i++) { + struct sgx_encl_page *entry; + unsigned long addr; + + addr = start_addr + i * PAGE_SIZE; + + /* + * Stop when reaching the SECS page - it does not + * have a page_array entry and its reclaim is + * started and completed with enclave mutex held so + * it does not use the SGX_ENCL_PAGE_BEING_RECLAIMED + * flag. + */ + if (addr == encl->base + encl->size) + break; + + entry = xa_load(&encl->page_array, PFN_DOWN(addr)); + if (!entry) + continue; + + /* + * VA page slot ID uses same bit as the flag so it is important + * to ensure that the page is not already in backing store. + */ + if (entry->epc_page && + (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) { + reclaimed = 1; + break; + } + } + + return reclaimed; +} + /* * Calculate byte offset of a PCMD struct associated with an enclave page. PCMD's * follow right after the EPC data in the backing storage. In addition to the @@ -47,6 +133,7 @@ static int __sgx_encl_eldu(struct sgx_en unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK; struct sgx_encl *encl = encl_page->encl; pgoff_t page_index, page_pcmd_off; + unsigned long pcmd_first_page; struct sgx_pageinfo pginfo; struct sgx_backing b; bool pcmd_page_empty; @@ -58,6 +145,11 @@ static int __sgx_encl_eldu(struct sgx_en else page_index = PFN_DOWN(encl->size);
+ /* + * Address of enclave page using the first entry within the PCMD page. + */ + pcmd_first_page = PFN_PHYS(page_index & ~PCMD_FIRST_MASK) + encl->base; + page_pcmd_off = sgx_encl_get_backing_page_pcmd_offset(encl, page_index);
ret = sgx_encl_get_backing(encl, page_index, &b); @@ -99,7 +191,7 @@ static int __sgx_encl_eldu(struct sgx_en
sgx_encl_truncate_backing_page(encl, page_index);
- if (pcmd_page_empty) + if (pcmd_page_empty && !reclaimer_writing_to_pcmd(encl, pcmd_first_page)) sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
return ret;
From: Reinette Chatre reinette.chatre@intel.com
commit e3a3bbe3e99de73043a1d32d36cf4d211dc58c7e upstream.
A PCMD (Paging Crypto MetaData) page contains the PCMD structures of enclave pages that have been encrypted and moved to the shmem backing store. When all enclave pages sharing a PCMD page are loaded in the enclave, there is no need for the PCMD page and it can be truncated from the backing store.
A few issues appeared around the truncation of PCMD pages. The known issues have been addressed but the PCMD handling code could be made more robust by loudly complaining if any new issue appears in this area.
Add a check that will complain with a warning if the PCMD page is not actually empty after it has been truncated. There should never be data in the PCMD page at this point, since it was just checked to be empty and truncated with the enclave mutex held, and is updated with the enclave mutex held.
Suggested-by: Dave Hansen dave.hansen@linux.intel.com Signed-off-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Tested-by: Haitao Huang haitao.huang@intel.com Link: https://lkml.kernel.org/r/6495120fed43fafc1496d09dd23df922b9a32709.165238982... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/sgx/encl.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -187,12 +187,20 @@ static int __sgx_encl_eldu(struct sgx_en kunmap_atomic(pcmd_page); kunmap_atomic((void *)(unsigned long)pginfo.contents);
+ get_page(b.pcmd); sgx_encl_put_backing(&b);
sgx_encl_truncate_backing_page(encl, page_index);
- if (pcmd_page_empty && !reclaimer_writing_to_pcmd(encl, pcmd_first_page)) + if (pcmd_page_empty && !reclaimer_writing_to_pcmd(encl, pcmd_first_page)) { sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off)); + pcmd_page = kmap_atomic(b.pcmd); + if (memchr_inv(pcmd_page, 0, PAGE_SIZE)) + pr_warn("PCMD page not empty after truncate.\n"); + kunmap_atomic(pcmd_page); + } + + put_page(b.pcmd);
return ret; }
From: Bryan O'Donoghue bryan.odonoghue@linaro.org
commit bb25f071fc92d3d227178a45853347c7b3b45a6b upstream.
The imx412/imx577 sensor has a reset line that is active low not active high. Currently the logic for this is inverted.
The right way to define the reset line is to declare it active low in the DTS and invert the logic currently contained in the driver.
The DTS should represent what the hardware does, i.e. reset is active low. So:
+ reset-gpios = <&tlmm 78 GPIO_ACTIVE_LOW>;
not:
- reset-gpios = <&tlmm 78 GPIO_ACTIVE_HIGH>;
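For background, the gpiod API works in logical levels, so once the DTS declares the pin GPIO_ACTIVE_LOW the inversion happens in gpiolib and the driver can state its intent directly. A short sketch (not the imx412 code verbatim; reset_gpio stands for the driver's descriptor):

	gpiod_set_value_cansleep(reset_gpio, 1);	/* logical "assert"   -> pin driven low  */
	gpiod_set_value_cansleep(reset_gpio, 0);	/* logical "deassert" -> pin driven high */

This is why the DTS flag and the driver logic have to be flipped together, as done here.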
I was a bit reticent about changing this logic since I thought it might negatively impact @intel.com users. Googling a bit though I believe this sensor is used on "Keem Bay" which is clearly a DTS based system and is not upstream yet.
Fixes: 9214e86c0cc1 ("media: i2c: Add imx412 camera sensor driver") Cc: stable@vger.kernel.org Signed-off-by: Bryan O'Donoghue bryan.odonoghue@linaro.org Reviewed-by: Jacopo Mondi jacopo@jmondi.org Reviewed-by: Daniele Alessandrelli daniele.alessandrelli@intel.com Signed-off-by: Sakari Ailus sakari.ailus@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/media/i2c/imx412.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/media/i2c/imx412.c +++ b/drivers/media/i2c/imx412.c @@ -1011,7 +1011,7 @@ static int imx412_power_on(struct device struct imx412 *imx412 = to_imx412(sd); int ret;
- gpiod_set_value_cansleep(imx412->reset_gpio, 1); + gpiod_set_value_cansleep(imx412->reset_gpio, 0);
ret = clk_prepare_enable(imx412->inclk); if (ret) { @@ -1024,7 +1024,7 @@ static int imx412_power_on(struct device return 0;
error_reset: - gpiod_set_value_cansleep(imx412->reset_gpio, 0); + gpiod_set_value_cansleep(imx412->reset_gpio, 1);
return ret; } @@ -1040,7 +1040,7 @@ static int imx412_power_off(struct devic struct v4l2_subdev *sd = dev_get_drvdata(dev); struct imx412 *imx412 = to_imx412(sd);
- gpiod_set_value_cansleep(imx412->reset_gpio, 0); + gpiod_set_value_cansleep(imx412->reset_gpio, 1);
clk_disable_unprepare(imx412->inclk);
From: Bryan O'Donoghue bryan.odonoghue@linaro.org
commit 9a199694c6a1519522ec73a4571f68abe9f13d5d upstream.
The enable path does
- gpio
- clock

The disable path does
- gpio
- clock
Fix the order on the power-off path so that power-off and power-on have the same ordering for clock and gpio.
Fixes: 9214e86c0cc1 ("media: i2c: Add imx412 camera sensor driver") Cc: stable@vger.kernel.org Signed-off-by: Bryan O'Donoghue bryan.odonoghue@linaro.org Reviewed-by: Jacopo Mondi jacopo@jmondi.org Reviewed-by: Daniele Alessandrelli daniele.alessandrelli@intel.com Signed-off-by: Sakari Ailus sakari.ailus@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/media/i2c/imx412.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/media/i2c/imx412.c +++ b/drivers/media/i2c/imx412.c @@ -1040,10 +1040,10 @@ static int imx412_power_off(struct devic struct v4l2_subdev *sd = dev_get_drvdata(dev); struct imx412 *imx412 = to_imx412(sd);
- gpiod_set_value_cansleep(imx412->reset_gpio, 1); - clk_disable_unprepare(imx412->inclk);
+ gpiod_set_value_cansleep(imx412->reset_gpio, 1); + return 0; }
From: Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com
commit e57b2523bd37e6434f4e64c7a685e3715ad21e9a upstream.
Under certain conditions uninitialized memory will be accessed. As described in the TCG Trusted Platform Module Library Specification, rev. 1.59 (Part 3: Commands), if a TPM2_GetCapability command requesting a capability is received, a TPM in field upgrade mode may return a zero-length list. Check the property count in tpm2_get_tpm_pt().
Fixes: 2ab3241161b3 ("tpm: migrate tpm2_get_tpm_pt() to use struct tpm_buf") Cc: stable@vger.kernel.org Signed-off-by: Stefan Mahnke-Hartmann stefan.mahnke-hartmann@infineon.com Reviewed-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/char/tpm/tpm2-cmd.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
--- a/drivers/char/tpm/tpm2-cmd.c +++ b/drivers/char/tpm/tpm2-cmd.c @@ -400,7 +400,16 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip if (!rc) { out = (struct tpm2_get_cap_out *) &buf.data[TPM_HEADER_SIZE]; - *value = be32_to_cpu(out->value); + /* + * To prevent failing boot up of some systems, Infineon TPM2.0 + * returns SUCCESS on TPM2_Startup in field upgrade mode. Also + * the TPM2_Getcapability command returns a zero length list + * in field upgrade mode. + */ + if (be32_to_cpu(out->property_cnt) > 0) + *value = be32_to_cpu(out->value); + else + rc = -ENODATA; } tpm_buf_destroy(&buf); return rc;
From: Xiu Jianfeng xiujianfeng@huawei.com
commit d0dc1a7100f19121f6e7450f9cdda11926aa3838 upstream.
Currently it returns zero when the CRQ response times out; it should return an error code instead.
Fixes: d8d74ea3c002 ("tpm: ibmvtpm: Wait for buffer to be set before proceeding") Signed-off-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Stefan Berger stefanb@linux.ibm.com Acked-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Jarkko Sakkinen jarkko@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/char/tpm/tpm_ibmvtpm.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/char/tpm/tpm_ibmvtpm.c +++ b/drivers/char/tpm/tpm_ibmvtpm.c @@ -681,6 +681,7 @@ static int tpm_ibmvtpm_probe(struct vio_ if (!wait_event_timeout(ibmvtpm->crq_queue.wq, ibmvtpm->rtce_buf != NULL, HZ)) { + rc = -ENODEV; dev_err(dev, "CRQ response timed out\n"); goto init_irq_cleanup; }
From: Akira Yokosawa akiyks@gmail.com
commit 6d5aa418b3bd42cdccc36e94ee199af423ef7c84 upstream.
The reference to `explicit_in_reply_to` is pointless: when the reference was added in the form of "#15" [1], Section 15) was "The canonical patch format". The "#15" reference had not been properly updated through a couple of reorganizations during the plain-text SubmittingPatches era.
Fix it by using `the_canonical_patch_format`.
[1]: 2ae19acaa50a ("Documentation: Add "how to write a good patch summary" to SubmittingPatches")
Signed-off-by: Akira Yokosawa akiyks@gmail.com Fixes: 5903019b2a5e ("Documentation/SubmittingPatches: convert it to ReST markup") Fixes: 9b2c76777acc ("Documentation/SubmittingPatches: enrich the Sphinx output") Cc: Jonathan Corbet corbet@lwn.net Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: stable@vger.kernel.org # v4.9+ Link: https://lore.kernel.org/r/64e105a5-50be-23f2-6cae-903a2ea98e18@gmail.com Signed-off-by: Jonathan Corbet corbet@lwn.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- Documentation/process/submitting-patches.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -77,7 +77,7 @@ as you intend it to.
The maintainer will thank you if you write your patch description in a form which can be easily pulled into Linux's source code management -system, ``git``, as a "commit log". See :ref:`explicit_in_reply_to`. +system, ``git``, as a "commit log". See :ref:`the_canonical_patch_format`.
Solve only one problem per patch. If your description starts to get long, that's a sign that you probably need to split up your patch.
From: Trond Myklebust trond.myklebust@hammerspace.com
commit 452284407c18d8a522c3039339b1860afa0025a8 upstream.
We need to filter out ENOMEM in nfs_error_is_fatal_on_server(), because running out of memory on our client is not a server error.
Reported-by: Olga Kornievskaia aglo@umich.edu Fixes: 2dc23afffbca ("NFS: ENOMEM should also be a fatal error.") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Anna Schumaker Anna.Schumaker@Netapp.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/nfs/internal.h | 1 + 1 file changed, 1 insertion(+)
--- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -841,6 +841,7 @@ static inline bool nfs_error_is_fatal_on case 0: case -ERESTARTSYS: case -EINTR: + case -ENOMEM: return false; } return nfs_error_is_fatal(err);
From: Chuck Lever chuck.lever@oracle.com
commit ce3c4ad7f4ce5db7b4f08a1e237d8dd94b39180b upstream.
nfsd4_release_lockowner() holds clp->cl_lock when it calls check_for_locks(). However, check_for_locks() calls nfsd_file_get() / nfsd_file_put() to access the backing inode's flc_posix list, and nfsd_file_put() can sleep if the inode was recently removed.
Let's instead rely on the stateowner's reference count to gate whether the release is permitted. This should be a reliable indication of locks-in-use since file lock operations and ->lm_get_owner take appropriate references, which are released appropriately when file locks are removed.
Reported-by: Dai Ngo dai.ngo@oracle.com Signed-off-by: Chuck Lever chuck.lever@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/nfsd/nfs4state.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-)
--- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -7330,16 +7330,12 @@ nfsd4_release_lockowner(struct svc_rqst if (sop->so_is_open_owner || !same_owner_str(sop, owner)) continue;
- /* see if there are still any locks associated with it */ - lo = lockowner(sop); - list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) { - if (check_for_locks(stp->st_stid.sc_file, lo)) { - status = nfserr_locks_held; - spin_unlock(&clp->cl_lock); - return status; - } + if (atomic_read(&sop->so_count) != 1) { + spin_unlock(&clp->cl_lock); + return nfserr_locks_held; }
+ lo = lockowner(sop); nfs4_get_stateowner(sop); break; }
From: Song Liu song@kernel.org
commit d88bb5eed04ce50cc20e7f9282977841728be798 upstream.
bpf_prog_pack enables sharing huge pages among multiple BPF programs. These pages are marked as executable before the JIT engine fills them with BPF programs. To make these pages safe, fill the holes in bpf_prog_pack with illegal instructions before making them executable.
Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator") Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]") Reported-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Song Liu song@kernel.org Signed-off-by: Daniel Borkmann daniel@iogearbox.net Link: https://lore.kernel.org/bpf/20220520235758.1858153-2-song@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/bpf/core.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
--- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -873,7 +873,7 @@ static size_t select_bpf_prog_pack_size( return size; }
-static struct bpf_prog_pack *alloc_new_pack(void) +static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns) { struct bpf_prog_pack *pack;
@@ -886,6 +886,7 @@ static struct bpf_prog_pack *alloc_new_p kfree(pack); return NULL; } + bpf_fill_ill_insns(pack->ptr, bpf_prog_pack_size); bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE); list_add_tail(&pack->list, &pack_list);
@@ -895,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_p return pack; }
-static void *bpf_prog_pack_alloc(u32 size) +static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) { unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size); struct bpf_prog_pack *pack; @@ -910,6 +911,7 @@ static void *bpf_prog_pack_alloc(u32 siz size = round_up(size, PAGE_SIZE); ptr = module_alloc(size); if (ptr) { + bpf_fill_ill_insns(ptr, size); set_vm_flush_reset_perms(ptr); set_memory_ro((unsigned long)ptr, size / PAGE_SIZE); set_memory_x((unsigned long)ptr, size / PAGE_SIZE); @@ -923,7 +925,7 @@ static void *bpf_prog_pack_alloc(u32 siz goto found_free_area; }
- pack = alloc_new_pack(); + pack = alloc_new_pack(bpf_fill_ill_insns); if (!pack) goto out;
@@ -1102,7 +1104,7 @@ bpf_jit_binary_pack_alloc(unsigned int p
if (bpf_jit_charge_modmem(size)) return NULL; - ro_header = bpf_prog_pack_alloc(size); + ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns); if (!ro_header) { bpf_jit_uncharge_modmem(size); return NULL;
From: Yuntao Wang ytcoode@gmail.com
commit a2aa95b71c9bbec793b5c5fa50f0a80d882b3e8d upstream.
The cnt value in the 'cnt >= BPF_MAX_TRAMP_PROGS' check does not include BPF_TRAMP_MODIFY_RETURN bpf programs, so the number of the attached BPF_TRAMP_MODIFY_RETURN bpf programs in a trampoline can exceed BPF_MAX_TRAMP_PROGS.
When this happens, the assignment '*progs++ = aux->prog' in bpf_trampoline_get_progs() will cause progs array overflow as the progs field in the bpf_tramp_progs struct can only hold at most BPF_MAX_TRAMP_PROGS bpf programs.
Fixes: 88fd9e5352fe ("bpf: Refactor trampoline update code") Signed-off-by: Yuntao Wang ytcoode@gmail.com Link: https://lore.kernel.org/r/20220430130803.210624-1-ytcoode@gmail.com Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/bpf/trampoline.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-)
--- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -411,7 +411,7 @@ int bpf_trampoline_link_prog(struct bpf_ { enum bpf_tramp_prog_type kind; int err = 0; - int cnt; + int cnt = 0, i;
kind = bpf_attach_type_to_tramp(prog); mutex_lock(&tr->mutex); @@ -422,7 +422,10 @@ int bpf_trampoline_link_prog(struct bpf_ err = -EBUSY; goto out; } - cnt = tr->progs_cnt[BPF_TRAMP_FENTRY] + tr->progs_cnt[BPF_TRAMP_FEXIT]; + + for (i = 0; i < BPF_TRAMP_MAX; i++) + cnt += tr->progs_cnt[i]; + if (kind == BPF_TRAMP_REPLACE) { /* Cannot attach extension if fentry/fexit are in use. */ if (cnt) { @@ -500,16 +503,19 @@ out:
void bpf_trampoline_put(struct bpf_trampoline *tr) { + int i; + if (!tr) return; mutex_lock(&trampoline_mutex); if (!refcount_dec_and_test(&tr->refcnt)) goto out; WARN_ON_ONCE(mutex_is_locked(&tr->mutex)); - if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FENTRY]))) - goto out; - if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT]))) - goto out; + + for (i = 0; i < BPF_TRAMP_MAX; i++) + if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[i]))) + goto out; + /* This code will be executed even when the last bpf_tramp_image * is alive. All progs are detached from the trampoline and the * trampoline image is patched with jmp into epilogue to skip
From: Alexei Starovoitov ast@kernel.org
commit 4b6313cf99b0d51b49aeaea98ec76ca8161ecb80 upstream.
The combination of jit blinding and pointers to bpf subprogs causes: [ 36.989548] BUG: unable to handle page fault for address: 0000000100000001 [ 36.990342] #PF: supervisor instruction fetch in kernel mode [ 36.990968] #PF: error_code(0x0010) - not-present page [ 36.994859] RIP: 0010:0x100000001 [ 36.995209] Code: Unable to access opcode bytes at RIP 0xffffffd7. [ 37.004091] Call Trace: [ 37.004351] <TASK> [ 37.004576] ? bpf_loop+0x4d/0x70 [ 37.004932] ? bpf_prog_3899083f75e4c5de_F+0xe3/0x13b
The jit blinding logic didn't recognize that an ld_imm64 with the address of a bpf subprogram is a special instruction and proceeded to randomize it. By itself this wouldn't have been an issue, but the jit_subprogs() logic relies on a two-step process: JIT all subprogs, then JIT them again once the addresses of all subprogs are known. The blinding in the first JIT phase caused the second JIT to miss the adjustment of the special ld_imm64.
Fix this issue by ignoring special ld_imm64 instructions that don't have user controlled constants and shouldn't be blinded.
Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Reported-by: Andrii Nakryiko andrii@kernel.org Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Andrii Nakryiko andrii@kernel.org Acked-by: Martin KaFai Lau kafai@fb.com Link: https://lore.kernel.org/bpf/20220513011025.13344-1-alexei.starovoitov@gmail.... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/bpf/core.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
--- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1436,6 +1436,16 @@ struct bpf_prog *bpf_jit_blind_constants insn = clone->insnsi;
for (i = 0; i < insn_cnt; i++, insn++) { + if (bpf_pseudo_func(insn)) { + /* ld_imm64 with an address of bpf subprog is not + * a user controlled constant. Don't randomize it, + * since it will conflict with jit_subprogs() logic. + */ + insn++; + i++; + continue; + } + /* We temporarily need to hold the original ld64 insn * so that we can still access the first part in the * second blinding run.
From: Liu Jian liujian56@huawei.com
commit 45969b4152c1752089351cd6836a42a566d49bcf upstream.
The data length of skb frags + frag_list may be greater than 0xffff, and skb_header_pointer cannot handle a negative offset. So INT_MAX is used here to check the validity of the offset. Add the same change to the related function skb_store_bytes.
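As a rough illustration of the bound being chosen (a user-space sketch, not kernel code): an offset may be a valid u32 yet exceed INT_MAX, and once it reaches an API that takes a signed offset it turns negative, so INT_MAX is the largest offset that stays representable.

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned int offset = 0x80000000u;	/* valid u32, but > INT_MAX */
	int signed_off = (int)offset;		/* what a signed offset parameter would see */

	printf("u32 offset = %u, as int = %d\n", offset, signed_off);
	printf("rejected by 'offset > INT_MAX' check: %s\n",
	       offset > INT_MAX ? "yes" : "no");
	return 0;
}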
Fixes: 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper") Signed-off-by: Liu Jian liujian56@huawei.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Song Liu songliubraving@fb.com Link: https://lore.kernel.org/bpf/20220416105801.88708-2-liujian56@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/core/filter.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/net/core/filter.c +++ b/net/core/filter.c @@ -1687,7 +1687,7 @@ BPF_CALL_5(bpf_skb_store_bytes, struct s
if (unlikely(flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))) return -EINVAL; - if (unlikely(offset > 0xffff)) + if (unlikely(offset > INT_MAX)) return -EFAULT; if (unlikely(bpf_try_make_writable(skb, offset + len))) return -EFAULT; @@ -1722,7 +1722,7 @@ BPF_CALL_4(bpf_skb_load_bytes, const str { void *ptr;
- if (unlikely(offset > 0xffff)) + if (unlikely(offset > INT_MAX)) goto err_clear;
ptr = skb_header_pointer(skb, offset, len, to);
From: KP Singh kpsingh@kernel.org
commit dcf456c9a095a6e71f53d6f6f004133ee851ee70 upstream.
bpf_{sk,task,inode}_storage_free() do not need to use call_rcu_tasks_trace as no BPF program should be accessing the owner as it's being destroyed. The only other reader at this point is bpf_local_storage_map_free() which uses normal RCU.
The only paths that need trace RCU are:
* bpf_local_storage_{delete,update} helpers
* map_{delete,update}_elem() syscalls
Fixes: 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs") Signed-off-by: KP Singh kpsingh@kernel.org Signed-off-by: Alexei Starovoitov ast@kernel.org Acked-by: Martin KaFai Lau kafai@fb.com Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/bpf_local_storage.h | 4 ++-- kernel/bpf/bpf_inode_storage.c | 4 ++-- kernel/bpf/bpf_local_storage.c | 29 +++++++++++++++++++---------- kernel/bpf/bpf_task_storage.c | 4 ++-- net/core/bpf_sk_storage.c | 6 +++--- 5 files changed, 28 insertions(+), 19 deletions(-)
--- a/include/linux/bpf_local_storage.h +++ b/include/linux/bpf_local_storage.h @@ -143,9 +143,9 @@ void bpf_selem_link_storage_nolock(struc
bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage, struct bpf_local_storage_elem *selem, - bool uncharge_omem); + bool uncharge_omem, bool use_trace_rcu);
-void bpf_selem_unlink(struct bpf_local_storage_elem *selem); +void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu);
void bpf_selem_link_map(struct bpf_local_storage_map *smap, struct bpf_local_storage_elem *selem);
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -90,7 +90,7 @@ void bpf_inode_storage_free(struct inode
 		 */
 		bpf_selem_unlink_map(selem);
 		free_inode_storage = bpf_selem_unlink_storage_nolock(
-			local_storage, selem, false);
+			local_storage, selem, false, false);
 	}
 	raw_spin_unlock_bh(&local_storage->lock);
 	rcu_read_unlock();
@@ -149,7 +149,7 @@ static int inode_storage_delete(struct i
 	if (!sdata)
 		return -ENOENT;
 
-	bpf_selem_unlink(SELEM(sdata));
+	bpf_selem_unlink(SELEM(sdata), true);
 
 	return 0;
 }
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -106,7 +106,7 @@ static void bpf_selem_free_rcu(struct rc
  */
 bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
 				     struct bpf_local_storage_elem *selem,
-				     bool uncharge_mem)
+				     bool uncharge_mem, bool use_trace_rcu)
 {
 	struct bpf_local_storage_map *smap;
 	bool free_local_storage;
@@ -150,11 +150,16 @@ bool bpf_selem_unlink_storage_nolock(str
 		    SDATA(selem))
 		RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
 
-	call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+	if (use_trace_rcu)
+		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+	else
+		kfree_rcu(selem, rcu);
+
 	return free_local_storage;
 }
 
-static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem)
+static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem,
+				       bool use_trace_rcu)
 {
 	struct bpf_local_storage *local_storage;
 	bool free_local_storage = false;
@@ -169,12 +174,16 @@ static void __bpf_selem_unlink_storage(s
 	raw_spin_lock_irqsave(&local_storage->lock, flags);
 	if (likely(selem_linked_to_storage(selem)))
 		free_local_storage = bpf_selem_unlink_storage_nolock(
-			local_storage, selem, true);
+			local_storage, selem, true, use_trace_rcu);
 	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
 
-	if (free_local_storage)
-		call_rcu_tasks_trace(&local_storage->rcu,
+	if (free_local_storage) {
+		if (use_trace_rcu)
+			call_rcu_tasks_trace(&local_storage->rcu,
 					bpf_local_storage_free_rcu);
+		else
+			kfree_rcu(local_storage, rcu);
+	}
 }
 
 void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
@@ -214,14 +223,14 @@ void bpf_selem_link_map(struct bpf_local
 	raw_spin_unlock_irqrestore(&b->lock, flags);
 }
 
-void bpf_selem_unlink(struct bpf_local_storage_elem *selem)
+void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
 {
 	/* Always unlink from map before unlinking from local_storage
 	 * because selem will be freed after successfully unlinked from
 	 * the local_storage.
 	 */
 	bpf_selem_unlink_map(selem);
-	__bpf_selem_unlink_storage(selem);
+	__bpf_selem_unlink_storage(selem, use_trace_rcu);
 }
 
 struct bpf_local_storage_data *
@@ -466,7 +475,7 @@ bpf_local_storage_update(void *owner, st
 		if (old_sdata) {
 			bpf_selem_unlink_map(SELEM(old_sdata));
 			bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata),
-							false);
+							false, true);
 		}
 
 unlock:
@@ -548,7 +557,7 @@ void bpf_local_storage_map_free(struct b
 				migrate_disable();
 				__this_cpu_inc(*busy_counter);
 			}
-			bpf_selem_unlink(selem);
+			bpf_selem_unlink(selem, false);
 			if (busy_counter) {
 				__this_cpu_dec(*busy_counter);
 				migrate_enable();
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -102,7 +102,7 @@ void bpf_task_storage_free(struct task_s
 		 */
 		bpf_selem_unlink_map(selem);
 		free_task_storage = bpf_selem_unlink_storage_nolock(
-			local_storage, selem, false);
+			local_storage, selem, false, false);
 	}
 	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
 	bpf_task_storage_unlock();
@@ -192,7 +192,7 @@ static int task_storage_delete(struct ta
 	if (!sdata)
 		return -ENOENT;
 
-	bpf_selem_unlink(SELEM(sdata));
+	bpf_selem_unlink(SELEM(sdata), true);
 
 	return 0;
 }
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -40,7 +40,7 @@ static int bpf_sk_storage_del(struct soc
 	if (!sdata)
 		return -ENOENT;
 
-	bpf_selem_unlink(SELEM(sdata));
+	bpf_selem_unlink(SELEM(sdata), true);
 
 	return 0;
 }
@@ -75,8 +75,8 @@ void bpf_sk_storage_free(struct sock *sk
 		 * sk_storage.
 		 */
 		bpf_selem_unlink_map(selem);
-		free_sk_storage = bpf_selem_unlink_storage_nolock(sk_storage,
-								  selem, true);
+		free_sk_storage = bpf_selem_unlink_storage_nolock(
+			sk_storage, selem, true, false);
 	}
 	raw_spin_unlock_bh(&sk_storage->lock);
 	rcu_read_unlock();
From: Yuntao Wang ytcoode@gmail.com
commit b45043192b3e481304062938a6561da2ceea46a6 upstream.
The 'n_buckets * (value_size + sizeof(struct stack_map_bucket))' part of the allocated memory for 'smap' is never used after the memlock accounting was removed, thus get rid of it.
[ Note, Daniel:
Commit b936ca643ade ("bpf: rework memlock-based memory accounting for maps") moved `cost += n_buckets * (value_size + sizeof(struct stack_map_bucket))` up, and therefore before the bpf_map_area_alloc() allocation, sigh. In a later step, commit c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()") moved the overflow check of `cost >= U32_MAX - PAGE_SIZE` into bpf_map_charge_init(). And then 370868107bf6 ("bpf: Eliminate rlimit-based memory accounting for stackmap maps") finally removed the bpf_map_charge_init() call. Anyway, the original code did the allocation the same way as /after/ this fix. ]
Fixes: b936ca643ade ("bpf: rework memlock-based memory accounting for maps")
Signed-off-by: Yuntao Wang ytcoode@gmail.com
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Link: https://lore.kernel.org/bpf/20220407130423.798386-1-ytcoode@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/bpf/stackmap.c | 1 -
 1 file changed, 1 deletion(-)
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -100,7 +100,6 @@ static struct bpf_map *stack_map_alloc(u
 		return ERR_PTR(-E2BIG);
 
 	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
-	cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));
 	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
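[ For scale, here is a small standalone C sketch of how much the removed line over-sized the bpf_map_area_alloc() request. This is illustrative arithmetic only; the struct layout and sizes below are stand-ins, not the real kernel ones. ]

/* Toy calculation: stand-in sizes, not the real kernel layouts. */
#include <stdio.h>
#include <stddef.h>

struct fake_bucket { unsigned int hash, nr; };	/* stand-in for stack_map_bucket */

int main(void)
{
	unsigned int n_buckets = 1024;
	unsigned int value_size = 127 * 8;	/* e.g. 127 u64 stack entries */
	size_t smap_size = 256;			/* stand-in for sizeof(*smap) */

	size_t after_fix  = n_buckets * sizeof(struct fake_bucket *) + smap_size;
	size_t before_fix = after_fix +
			    n_buckets * (value_size + sizeof(struct fake_bucket));

	printf("after the fix:  %zu bytes\n", after_fix);
	printf("before the fix: %zu bytes (%zu bytes allocated but never used)\n",
	       before_fix, before_fix - after_fix);
	return 0;
}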
From: Kumar Kartikeya Dwivedi memxor@gmail.com
commit 7b3552d3f9f6897851fc453b5131a967167e43c2 upstream.
It is not permitted to write to PTR_TO_MAP_KEY, but the current code in check_helper_mem_access would allow it. Reject this case as well, since helpers taking ARG_PTR_TO_UNINIT_MEM can also be passed a PTR_TO_MAP_KEY.
Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Kumar Kartikeya Dwivedi memxor@gmail.com
Link: https://lore.kernel.org/r/20220319080827.73251-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4861,6 +4861,11 @@ static int check_helper_mem_access(struc
 		return check_packet_access(env, regno, reg->off, access_size,
 					   zero_size_allowed);
 	case PTR_TO_MAP_KEY:
+		if (meta && meta->raw_mode) {
+			verbose(env, "R%d cannot write into %s\n", regno,
+				reg_type_str(env, reg->type));
+			return -EACCES;
+		}
 		return check_mem_region_access(env, regno, reg->off, access_size,
 					       reg->map_ptr->key_size, false);
 	case PTR_TO_MAP_VALUE:
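[ A rough, untested BPF-side sketch of the pattern this hunk now rejects: a bpf_for_each_map_elem() callback handing the key pointer (PTR_TO_MAP_KEY) to a helper that writes through its destination argument. The map shape, section name and choice of helper are illustrative assumptions, not taken from the selftests. ]

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, __u64);
} arr SEC(".maps");

static long cb(struct bpf_map *map, __u32 *key, __u64 *val, void *ctx)
{
	/* The destination of bpf_get_current_comm() is ARG_PTR_TO_UNINIT_MEM,
	 * i.e. the helper writes through it.  'key' is PTR_TO_MAP_KEY here,
	 * so with the fix the verifier refuses this ("cannot write into
	 * map_key"); before it, the write was silently accepted. */
	bpf_get_current_comm(key, sizeof(*key));
	return 0;
}

SEC("kprobe/do_nanosleep")
int write_into_map_key(void *ctx)
{
	__u64 unused = 0;

	bpf_for_each_map_elem(&arr, cb, &unused, 0);
	return 0;
}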
From: Kumar Kartikeya Dwivedi memxor@gmail.com
commit 97e6d7dab1ca4648821c790a2b7913d6d5d549db upstream.
The commit being fixed aimed to disallow users from incorrectly obtaining a writable pointer to memory that is only meant to be read. This is now enforced using the MEM_RDONLY flag.

For instance, in the case of global percpu variables whose BTF type is not a struct (e.g. bpf_prog_active), the verifier marks the register type as PTR_TO_MEM | MEM_RDONLY when it comes from the bpf_this_cpu_ptr or bpf_per_cpu_ptr helpers. However, when such a pointer is passed to a kfunc, global func, or BPF helper, check_helper_mem_access does not expect the MEM_RDONLY flag to be set, so the pointer is checked as if it pointed to writable memory. Later, the verifier sets up the argument type of a global func as PTR_TO_MEM | PTR_MAYBE_NULL, so a user can use a global func to get around the limitations imposed by this flag.

This check will also cover global non-percpu variables that may be introduced in kernel BTF in the future.

Also, update the log message for the PTR_TO_BUF case to match the PTR_TO_MEM case, so that the reason for the error is clear to the user.
Fixes: 34d3a78c681e ("bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.")
Reviewed-by: Hao Luo haoluo@google.com
Signed-off-by: Kumar Kartikeya Dwivedi memxor@gmail.com
Link: https://lore.kernel.org/r/20220319080827.73251-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/bpf/verifier.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4876,13 +4876,23 @@ static int check_helper_mem_access(struc
 		return check_map_access(env, regno, reg->off, access_size,
 					zero_size_allowed);
 	case PTR_TO_MEM:
+		if (type_is_rdonly_mem(reg->type)) {
+			if (meta && meta->raw_mode) {
+				verbose(env, "R%d cannot write into %s\n", regno,
+					reg_type_str(env, reg->type));
+				return -EACCES;
+			}
+		}
 		return check_mem_region_access(env, regno, reg->off, access_size,
 					       reg->mem_size, zero_size_allowed);
 	case PTR_TO_BUF:
 		if (type_is_rdonly_mem(reg->type)) {
-			if (meta && meta->raw_mode)
+			if (meta && meta->raw_mode) {
+				verbose(env, "R%d cannot write into %s\n", regno,
+					reg_type_str(env, reg->type));
 				return -EACCES;
+			}
 
 			max_access = &env->prog->aux->max_rdonly_access;
 		} else {
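[ A rough, untested sketch of the laundering the changelog describes: a read-only per-CPU pointer handed to a global function that treats its argument as writable. The ksym, section name and function names are illustrative assumptions. ]

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

extern const int bpf_prog_active __ksym;	/* non-struct per-CPU kernel variable */

__noinline int scribble(int *p)
{
	/* A global func's pointer arg is set up as writable PTR_TO_MEM |
	 * PTR_MAYBE_NULL, so this store is fine from the callee's side. */
	if (p)
		*p = 0;
	return 0;
}

SEC("raw_tp/sys_enter")
int launder_rdonly_mem(void *ctx)
{
	/* For a non-struct ksym, bpf_this_cpu_ptr() yields PTR_TO_MEM |
	 * MEM_RDONLY.  Before the fix, passing it on to scribble() was
	 * accepted; now the call is rejected with something like
	 * "R1 cannot write into rdonly_mem". */
	int *p = (int *)bpf_this_cpu_ptr(&bpf_prog_active);

	return scribble(p);
}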
From: Kumar Kartikeya Dwivedi memxor@gmail.com
commit be77354a3d7ebd4897ee18eca26dca6df9224c76 upstream.
When passing a pointer to some map value to a kfunc or global func, the verifier currently passes meta as NULL to the various functions that use meta->raw_mode to check whether memory is being written to. Since some kfuncs or global funcs may also write to the memory pointers they receive as arguments, we must check for write access to that memory. E.g. in some cases the map may be read-only, which is missed by the current checks.
However, meta->raw_mode also allows for uninitialized memory (e.g. on the stack); since there is not enough information available through BTF, we must perform one call for read access (raw_mode = false) and one for write access (raw_mode = true).
Fixes: e5069b9c23b3 ("bpf: Support pointers in global func args")
Fixes: d583691c47dc ("bpf: Introduce mem, size argument pair support for kfunc")
Signed-off-by: Kumar Kartikeya Dwivedi memxor@gmail.com
Link: https://lore.kernel.org/r/20220319080827.73251-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/bpf/verifier.c | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4934,8 +4934,7 @@ static int check_mem_size_reg(struct bpf
 	 * out. Only upper bounds can be learned because retval is an
 	 * int type and negative retvals are allowed.
 	 */
-	if (meta)
-		meta->msize_max_value = reg->umax_value;
+	meta->msize_max_value = reg->umax_value;
 
 	/* The register is SCALAR_VALUE; the access check
 	 * happens using its boundaries.
@@ -4978,24 +4977,33 @@ static int check_mem_size_reg(struct bpf
 int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
 		  u32 regno, u32 mem_size)
 {
+	bool may_be_null = type_may_be_null(reg->type);
+	struct bpf_reg_state saved_reg;
+	struct bpf_call_arg_meta meta;
+	int err;
+
 	if (register_is_null(reg))
 		return 0;
 
-	if (type_may_be_null(reg->type)) {
-		/* Assuming that the register contains a value check if the memory
-		 * access is safe. Temporarily save and restore the register's state as
-		 * the conversion shouldn't be visible to a caller.
-		 */
-		const struct bpf_reg_state saved_reg = *reg;
-		int rv;
-
+	memset(&meta, 0, sizeof(meta));
+	/* Assuming that the register contains a value check if the memory
	 * access is safe. Temporarily save and restore the register's state as
	 * the conversion shouldn't be visible to a caller.
	 */
+	if (may_be_null) {
+		saved_reg = *reg;
 		mark_ptr_not_null_reg(reg);
-		rv = check_helper_mem_access(env, regno, mem_size, true, NULL);
-		*reg = saved_reg;
-		return rv;
 	}
 
-	return check_helper_mem_access(env, regno, mem_size, true, NULL);
+	err = check_helper_mem_access(env, regno, mem_size, true, &meta);
+	/* Check access for BPF_WRITE */
+	meta.raw_mode = true;
+	err = err ?: check_helper_mem_access(env, regno, mem_size, true, &meta);
+
+	if (may_be_null)
+		*reg = saved_reg;
+
+	return err;
 }
 
 int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
@@ -5004,16 +5012,22 @@ int check_kfunc_mem_size_reg(struct bpf_
 	struct bpf_reg_state *mem_reg = &cur_regs(env)[regno - 1];
 	bool may_be_null = type_may_be_null(mem_reg->type);
 	struct bpf_reg_state saved_reg;
+	struct bpf_call_arg_meta meta;
 	int err;
 
 	WARN_ON_ONCE(regno < BPF_REG_2 || regno > BPF_REG_5);
 
+	memset(&meta, 0, sizeof(meta));
+
 	if (may_be_null) {
 		saved_reg = *mem_reg;
 		mark_ptr_not_null_reg(mem_reg);
 	}
 
-	err = check_mem_size_reg(env, reg, regno, true, NULL);
+	err = check_mem_size_reg(env, reg, regno, true, &meta);
+	/* Check access for BPF_WRITE */
+	meta.raw_mode = true;
+	err = err ?: check_mem_size_reg(env, reg, regno, true, &meta);
 
 	if (may_be_null)
 		*mem_reg = saved_reg;
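[ A userspace toy model (not kernel code) of the two-pass check the hunks above add: the same region is validated once for reads and once, with raw_mode set, for writes. Names and the -EACCES value are illustrative. ]

#include <stdbool.h>
#include <stdio.h>

struct toy_meta { bool raw_mode; };

/* Stand-in for check_helper_mem_access(): pretend the region behind
 * regno 1 is read-only, so only the write pass can fail. */
static int toy_check_access(int regno, const struct toy_meta *meta)
{
	bool rdonly = (regno == 1);

	if (meta->raw_mode && rdonly) {
		fprintf(stderr, "R%d cannot write into rdonly_mem\n", regno);
		return -13;	/* -EACCES */
	}
	return 0;
}

int main(void)
{
	struct toy_meta meta = { .raw_mode = false };
	int err;

	/* Pass 1: read access (raw_mode = false). */
	err = toy_check_access(1, &meta);
	/* Pass 2: write access (raw_mode = true); the kernel chains the two
	 * calls with "err = err ?: ..." instead of the explicit test here. */
	meta.raw_mode = true;
	if (!err)
		err = toy_check_access(1, &meta);

	printf("verdict: %d\n", err);	/* -13: the region may not be written */
	return 0;
}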
On Fri, Jun 03, 2022 at 07:43:01PM +0200, Greg Kroah-Hartman wrote:
Tested rc1 against the Fedora build system (aarch64, armv7, ppc64le, s390x, x86_64), and boot tested x86_64. No regressions noted.
Tested-by: Justin M. Forbes jforbes@fedoraproject.org
On 6/3/22 10:43 AM, Greg Kroah-Hartman wrote:
Built and booted successfully on RISC-V RV64 (HiFive Unmatched).
Tested-by: Ron Economos re@w6rz.net
On Fri, 3 Jun 2022 at 23:26, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
## Build
* kernel: 5.18.2-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-5.18.y
* git commit: 20fa00749a26c2347ca02a2eb231d61ecd877c90
* git describe: v5.18-116-g20fa00749a26
* test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.18.y/build/v5.18-...

## Test Regressions (compared to v5.18-48-g10e6e3d47333)
No test regressions found.

## Metric Regressions (compared to v5.18-48-g10e6e3d47333)
No metric regressions found.

## Test Fixes (compared to v5.18-48-g10e6e3d47333)
No test fixes found.

## Metric Fixes (compared to v5.18-48-g10e6e3d47333)
No metric fixes found.

## Test result summary
total: 132732, pass: 121296, fail: 1353, skip: 10083, xfail: 0

## Build Summary
* arc: 10 total, 10 passed, 0 failed
* arm: 313 total, 313 passed, 0 failed
* arm64: 58 total, 58 passed, 0 failed
* i386: 52 total, 49 passed, 3 failed
* mips: 37 total, 37 passed, 0 failed
* parisc: 12 total, 12 passed, 0 failed
* powerpc: 60 total, 54 passed, 6 failed
* riscv: 27 total, 22 passed, 5 failed
* s390: 18 total, 18 passed, 0 failed
* sh: 24 total, 24 passed, 0 failed
* sparc: 12 total, 12 passed, 0 failed
* x86_64: 56 total, 54 passed, 2 failed
## Test suites summary * fwts * igt-gpu-tools * kselftest-android * kselftest-breakpoints * kselftest-capabilities * kselftest-cgroup * kselftest-clone3 * kselftest-core * kselftest-cpu-hotplug * kselftest-cpufreq * kselftest-drivers-dma-buf * kselftest-efivarfs * kselftest-filesystems * kselftest-filesystems-binderfs * kselftest-firmware * kselftest-fpu * kselftest-gpio * kselftest-ipc * kselftest-ir * kselftest-kcmp * kselftest-lib * kselftest-membarrier * kselftest-netfilter * kselftest-nsfs * kselftest-openat2 * kselftest-pid_namespace * kselftest-pidfd * kselftest-proc * kselftest-pstore * kselftest-rseq * kselftest-rtc * kselftest-seccomp * kselftest-sigaltstack * kselftest-size * kselftest-splice * kselftest-static_keys * kselftest-sync * kselftest-sysctl * kselftest-tc-testing * kselftest-timens * kselftest-timers * kselftest-tmpfs * kselftest-tpm2 * kselftest-user * kselftest-vm * kselftest-zram * kunit * kunit/15 * kunit/261 * kunit/3 * kunit/427 * kunit/90 * kvm-unit-tests * libgpiod * libhugetlbfs * log-parser-boot * log-parser-test * ltp-cap_bounds * ltp-cap_bounds-tests * ltp-commands * ltp-commands-tests * ltp-containers * ltp-containers-tests * ltp-controllers-tests * ltp-cpuhotplug-tests * ltp-crypto * ltp-crypto-tests * ltp-cve-tests * ltp-dio-tests * ltp-fcntl-locktests * ltp-fcntl-locktests-tests * ltp-filecaps * ltp-filecaps-tests * ltp-fs * ltp-fs-tests * ltp-fs_bind * ltp-fs_bind-tests * ltp-fs_perms_simple * ltp-fs_perms_simple-tests * ltp-fsx * ltp-fsx-tests * ltp-hugetlb * ltp-hugetlb-tests * ltp-io * ltp-io-tests * ltp-ipc * ltp-ipc-tests * ltp-math-tests * ltp-mm-tests * ltp-nptl * ltp-nptl-tests * ltp-open-posix-tests * ltp-pty * ltp-pty-tests * ltp-sched-tests * ltp-securebits * ltp-securebits-tests * ltp-syscalls-tests * ltp-tracing-tests * network-basic-tests * packetdrill * perf * rcutorture * ssuite * v4l2-compliance * vdso
-- Linaro LKFT https://lkft.linaro.org
On Fri, Jun 03, 2022 at 07:43:01PM +0200, Greg Kroah-Hartman wrote:
Build results:
	total: 152 pass: 152 fail: 0
Qemu test results:
	total: 489 pass: 489 fail: 0
Tested-by: Guenter Roeck linux@roeck-us.net
Guenter
On Fri, Jun 03, 2022 at 07:43:01PM +0200, Greg Kroah-Hartman wrote:
Hi Greg,
5.18.2-rc1 tested.
Run tested on:
- Allwinner H6 (Tanix TX6)
- Intel Tiger Lake x86_64 (nuc11 i7-1165G7)

In addition - build tested for:
- Allwinner A64
- Allwinner H3
- Allwinner H5
- NXP iMX6
- NXP iMX8
- Qualcomm Dragonboard
- Rockchip RK3288
- Rockchip RK3328
- Rockchip RK3399pro
- Samsung Exynos
Tested-by: Rudi Heitbaum rudi@heitbaum.com -- Rudi
On Fri, Jun 03, 2022 at 07:43:01PM +0200, Greg Kroah-Hartman wrote:
Successfully cross-compiled for arm (multi_v7_defconfig, GCC 12.1.0, neon FPU) and arm64 (bcm2711_defconfig, GCC 12.1.0).
On the arm64 build, I found "partly outside array bounds" warnings:
  CC [M]  fs/jffs2/summary.o
  LD [M]  drivers/gpu/drm/drm_cma_helper.o
  LD [M]  drivers/gpu/drm/drm_shmem_helper.o
  LD [M]  drivers/gpu/drm/drm_kms_helper.o
  LD [M]  drivers/gpu/drm/drm.o
  AR      drivers/gpu/drm/built-in.a
  AR      drivers/gpu/vga/built-in.a
  AR      drivers/gpu/built-in.a
  CC      drivers/base/power/sysfs.o
In file included from fs/jffs2/summary.c:23:
In function 'jffs2_sum_add_mem',
    inlined from 'jffs2_sum_add_inode_mem' at fs/jffs2/summary.c:130:9:
fs/jffs2/nodelist.h:43:28: warning: array subscript 'union jffs2_sum_mem[0]' is partly outside array bounds of 'unsigned char[26]' [-Warray-bounds]
   43 | #define je16_to_cpu(x) ((x).v16)
      |                        ~~~~^~~~~
fs/jffs2/summary.c:71:17: note: in expansion of macro 'je16_to_cpu'
   71 |         switch (je16_to_cpu(item->u.nodetype)) {
      |                 ^~~~~~~~~~~
In file included from fs/jffs2/summary.c:17:
In function 'kmalloc',
    inlined from 'jffs2_sum_add_inode_mem' at fs/jffs2/summary.c:118:37:
./include/linux/slab.h:581:24: note: object of size 26 allocated by 'kmem_cache_alloc_trace'
  581 |                 return kmem_cache_alloc_trace(
      |                        ^~~~~~~~~~~~~~~~~~~~~~~
  582 |                                 kmalloc_caches[kmalloc_type(flags)][index],
      |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  583 |                                 flags, size);
      |                                 ~~~~~~~~~~~~
In file included from fs/jffs2/nodelist.h:22:
In function 'jffs2_sum_add_mem',
    inlined from 'jffs2_sum_add_inode_mem' at fs/jffs2/summary.c:130:9:
fs/jffs2/summary.c:79:73: warning: array subscript 'union jffs2_sum_mem[0]' is partly outside array bounds of 'unsigned char[26]' [-Warray-bounds]
   79 |                 s->sum_size += JFFS2_SUMMARY_DIRENT_SIZE(item->d.nsize);
fs/jffs2/summary.h:34:80: note: in definition of macro 'JFFS2_SUMMARY_DIRENT_SIZE'
   34 | S2_SUMMARY_DIRENT_SIZE(x) (sizeof(struct jffs2_sum_dirent_flash) + (x))
      |                                                                        ^
In function 'kmalloc',
    inlined from 'jffs2_sum_add_inode_mem' at fs/jffs2/summary.c:118:37:
./include/linux/slab.h:581:24: note: object of size 26 allocated by 'kmem_cache_alloc_trace'
  581 |                 return kmem_cache_alloc_trace(
      |                        ^~~~~~~~~~~~~~~~~~~~~~~
  582 |                                 kmalloc_caches[kmalloc_type(flags)][index],
      |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  583 |                                 flags, size);
      |                                 ~~~~~~~~~~~~
These warnings are introduced by commit e631ddba5887833 ("[JFFS2] Add erase block summary support (mount time improvement)") and, unfortunately, by the initial mainline commit 1da177e4c3f415 ("Linux-2.6.12-rc2").
These don't cause a build failure, however.
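For reference, a minimal userspace sketch of the pattern GCC 12 objects to (the types below are simplified stand-ins, not the real jffs2 structures): the allocation is sized for one small union member, but the object is then read through the full union type, whose size is larger than the 26 bytes allocated. Compiled with gcc-12 -O2 -Warray-bounds it may or may not emit exactly the kernel's diagnostic.

#include <stdint.h>
#include <stdlib.h>

union sum_mem {
	struct { uint16_t nodetype; } u;			/* common header   */
	struct { uint16_t nodetype; uint32_t inode;
		 uint32_t version, offset, totlen; } i;		/* "inode" variant */
	struct { uint16_t nodetype; uint8_t nsize;
		 char name[255]; } d;				/* "dirent" variant */
};

uint16_t peek_nodetype(void)
{
	/* Only the small variant is allocated (like the 26-byte
	 * jffs2_sum_inode_mem in the report above)... */
	union sum_mem *item = malloc(sizeof(item->i));

	if (!item)
		return 0;
	/* ...but the read goes through the union type, so GCC 12 sees an
	 * access whose enclosing object is larger than the allocation and
	 * warns, even though the field actually touched fits inside it. */
	return item->u.nodetype;
}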
Cc-ing committers and authors of these commits (Linus is already on the Cc list).
Tested-by: Bagas Sanjaya bagasdotme@gmail.com
Reported-by: Bagas Sanjaya bagasdotme@gmail.com
On Sun, Jun 05, 2022 at 02:02:21PM +0700, Bagas Sanjaya wrote:
Successfully cross-compiled for arm (multi_v7_defconfig, GCC 12.1.0, neon FPU) and arm64 (bcm2711_defconfig, GCC 12.1.0).
On the arm64 build, I found "partly outside array bounds" warnings:
The kernel does not build cleanly on gcc12 just yet, so it will be a while before stuff like this goes away. Please submit patches to the proper subsystem developers to help resolve this.
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org