This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all of which will be posted as responses to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.15.65-rc1
Yang Yingliang yangyingliang@huawei.com net: neigh: don't call kfree_skb() under spin_lock_irqsave()
Zhengchao Shao shaozhengchao@huawei.com net/af_packet: check len when min_header_len equals to 0
Liam Howlett liam.howlett@oracle.com android: binder: fix lockdep check on clearing vma
Omar Sandoval osandov@fb.com btrfs: fix space cache corruption and potential double allocations
Kuniyuki Iwashima kuniyu@amazon.com kprobes: don't call disarm_kprobe() for disabled kprobes
Josef Bacik josef@toxicpanda.com btrfs: tree-checker: check for overlapping extent items
Josef Bacik josef@toxicpanda.com btrfs: fix lockdep splat with reloc root extent buffers
Josef Bacik josef@toxicpanda.com btrfs: move lockdep class helpers to locking.c
Florian Westphal fw@strlen.de testing: selftests: nft_flowtable.sh: use random netns names
Geert Uytterhoeven geert@linux-m68k.org netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y
Charlene Liu Charlene.Liu@amd.com drm/amd/display: avoid doing vm_init multiple time
Dusica Milinkovic Dusica.Milinkovic@amd.com drm/amdgpu: Increase tlb flush timeout for sriov
Ilya Bakoulin Ilya.Bakoulin@amd.com drm/amd/display: Fix pixel clock programming
Evan Quan evan.quan@amd.com drm/amd/pm: add missing ->fini_microcode interface for Sienna Cichlid
Namjae Jeon linkinjeon@kernel.org ksmbd: don't remove dos attribute xattr on O_TRUNC open
Juergen Gross jgross@suse.com s390/hypfs: avoid error message under KVM
Denis V. Lunev den@openvz.org neigh: fix possible DoS due to net iface start/stop loop
Namjae Jeon linkinjeon@kernel.org ksmbd: return STATUS_BAD_NETWORK_NAME error status if share is not configured
Fudong Wang Fudong.Wang@amd.com drm/amd/display: clear optc underflow before turn off odm clock
Alvin Lee alvin.lee2@amd.com drm/amd/display: For stereo keep "FLIP_ANY_FRAME"
Leo Ma hanghong.ma@amd.com drm/amd/display: Fix HDMI VSIF V3 incorrect issue
Josip Pavic Josip.Pavic@amd.com drm/amd/display: Avoid MPC infinite loop
Biju Das biju.das.jz@bp.renesas.com ASoC: sh: rz-ssi: Improve error handling in rz_ssi_probe() error path
Konstantin Komarov almaz.alexandrovich@paragon-software.com fs/ntfs3: Fix work with fragmented xattr
Filipe Manana fdmanana@suse.com btrfs: fix warning during log replay when bumping inode link count
Filipe Manana fdmanana@suse.com btrfs: add and use helper for unlinking inode during log replay
Filipe Manana fdmanana@suse.com btrfs: remove no longer needed logic for replaying directory deletes
Filipe Manana fdmanana@suse.com btrfs: remove root argument from btrfs_unlink_inode()
Liming Sun limings@nvidia.com mmc: sdhci-of-dwcmshc: Re-enable support for the BlueField-3 SoC
Sebastian Reichel sebastian.reichel@collabora.com mmc: sdhci-of-dwcmshc: rename rk3568 to rk35xx
Yifeng Zhao yifeng.zhao@rock-chips.com mmc: sdhci-of-dwcmshc: add reset call back for rockchip Socs
Wenbin Mei wenbin.mei@mediatek.com mmc: mtk-sd: Clear interrupts when cqe off/disable
Chris Wilson chris.p.wilson@intel.com drm/i915/gt: Skip TLB invalidations once wedged
Michael Hübner michaelh.95@t-online.de HID: thrustmaster: Add sparco wheel and fix array length
Josh Kilmer srjek2@gmail.com HID: asus: ROG NKey: Ignore portion of 0x5a report
Akihiko Odaki akihiko.odaki@gmail.com HID: AMD_SFH: Add a DMI quirk entry for Chromebooks
Steev Klimaszewski steev@kali.org HID: add Lenovo Yoga C630 battery quirk
Takashi Iwai tiwai@suse.de ALSA: usb-audio: Add quirk for LH Labs Geek Out HD Audio 1V5
Jann Horn jannh@google.com mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse
Zhengchao Shao shaozhengchao@huawei.com bpf: Don't redirect packets with invalid pkt_len
Yang Jihong yangjihong1@huawei.com ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
Letu Ren fantasquex@gmail.com fbdev: fb_pm2fb: Avoid potential divide by zero error
Hawkins Jiawei yin31149@gmail.com net: fix refcount bug in sk_psock_get (2)
Karthik Alapati mail@karthek.com HID: hidraw: fix memory leak in hidraw_release()
Dongliang Mu mudongliangabcd@gmail.com media: pvrusb2: fix memory leak in pvr_probe
Vivek Kasireddy vivek.kasireddy@intel.com udmabuf: Set the DMA mask for the udmabuf device (v2)
Lee Jones lee.jones@linaro.org HID: steam: Prevent NULL pointer dereference in steam_{recv,send}_report
Greg Kroah-Hartman gregkh@linuxfoundation.org Revert "PCI/portdrv: Don't disable AER reporting in get_port_device_capability()"
Luiz Augusto von Dentz luiz.von.dentz@intel.com Bluetooth: L2CAP: Fix build errors in some archs
Jing Leng jleng@ambarella.com kbuild: Fix include path in scripts/Makefile.modpost
Pavel Begunkov asml.silence@gmail.com io_uring: fix UAF due to missing POLLFREE handling
Pavel Begunkov asml.silence@gmail.com io_uring: fix wrong arm_poll error handling
Pavel Begunkov asml.silence@gmail.com io_uring: fail links when poll fails
Jens Axboe axboe@kernel.dk io_uring: bump poll refs to full 31-bits
Jens Axboe axboe@kernel.dk io_uring: remove poll entry from list when canceling all
Jiapeng Chong jiapeng.chong@linux.alibaba.com io_uring: Remove unused function req_ref_put
Pavel Begunkov asml.silence@gmail.com io_uring: poll rework
Pavel Begunkov asml.silence@gmail.com io_uring: inline io_poll_complete
Pavel Begunkov asml.silence@gmail.com io_uring: kill poll linking optimisation
Pavel Begunkov asml.silence@gmail.com io_uring: move common poll bits
Pavel Begunkov asml.silence@gmail.com io_uring: refactor poll update
Pavel Begunkov asml.silence@gmail.com io_uring: clean cqe filling functions
Pavel Begunkov asml.silence@gmail.com io_uring: correct fill events helpers types
James Morse james.morse@arm.com arm64: errata: Add Cortex-A510 to the repeat tlbi list
Miaohe Lin linmiaohe@huawei.com mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
Boqun Feng boqun.feng@gmail.com Drivers: hv: balloon: Support status report for larger page sizes
Eric Biggers ebiggers@google.com crypto: lib - remove unneeded selection of XOR_BLOCKS
Timo Alho talho@nvidia.com firmware: tegra: bpmp: Do only aligned access to IPC memory area
Maxime Ripard maxime@cerno.tech drm/vc4: hdmi: Depends on CONFIG_PM
Maxime Ripard maxime@cerno.tech drm/vc4: hdmi: Rework power up
Adam Borowski kilobyte@angband.pl ACPI: thermal: drop an always true check
Maxime Ripard maxime@cerno.tech drm/bridge: Add stubs for devm_drm_of_get_bridge when OF is disabled
Jann Horn jannh@google.com mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()
-------------
Diffstat:
 Documentation/arm64/silicon-errata.rst | 2 +
 Makefile | 4 +-
 arch/arm64/Kconfig | 17 +
 arch/arm64/kernel/cpu_errata.c | 8 +-
 arch/s390/hypfs/hypfs_diag.c | 2 +-
 arch/s390/hypfs/inode.c | 2 +-
 drivers/acpi/thermal.c | 2 -
 drivers/android/binder_alloc.c | 9 +-
 drivers/dma-buf/udmabuf.c | 18 +-
 drivers/firmware/tegra/bpmp.c | 6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 +-
 .../gpu/drm/amd/display/dc/dce/dce_clock_source.c | 2 +
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c | 6 +
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c | 5 +
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c | 6 +
 .../gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c | 8 +-
 drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c | 2 +-
 .../drm/amd/display/modules/freesync/freesync.c | 15 +-
 .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 1 +
 drivers/gpu/drm/i915/gt/intel_gt.c | 3 +
 drivers/gpu/drm/vc4/Kconfig | 1 +
 drivers/gpu/drm/vc4/vc4_hdmi.c | 17 +-
 drivers/hid/amd-sfh-hid/amd_sfh_pcie.c | 18 +
 drivers/hid/hid-asus.c | 7 +
 drivers/hid/hid-ids.h | 1 +
 drivers/hid/hid-input.c | 2 +
 drivers/hid/hid-steam.c | 10 +
 drivers/hid/hid-thrustmaster.c | 3 +-
 drivers/hid/hidraw.c | 3 +
 drivers/hv/hv_balloon.c | 13 +-
 drivers/media/usb/pvrusb2/pvrusb2-hdw.c | 1 +
 drivers/mmc/host/mtk-sd.c | 6 +
 drivers/mmc/host/sdhci-of-dwcmshc.c | 88 ++-
 drivers/pci/pcie/portdrv_core.c | 9 +-
 drivers/video/fbdev/pm2fb.c | 5 +
 fs/btrfs/block-group.c | 47 +-
 fs/btrfs/block-group.h | 4 +-
 fs/btrfs/ctree.c | 3 +
 fs/btrfs/ctree.h | 4 +-
 fs/btrfs/disk-io.c | 82 ---
 fs/btrfs/disk-io.h | 10 -
 fs/btrfs/extent-tree.c | 48 +-
 fs/btrfs/extent_io.c | 11 +-
 fs/btrfs/inode.c | 25 +-
 fs/btrfs/locking.c | 91 +++
 fs/btrfs/locking.h | 14 +
 fs/btrfs/relocation.c | 2 +
 fs/btrfs/tree-checker.c | 25 +-
 fs/btrfs/tree-log.c | 221 +++---
 fs/io_uring.c | 746 ++++++++++-----------
 fs/ksmbd/mgmt/tree_connect.c | 2 +-
 fs/ksmbd/smb2pdu.c | 21 +-
 fs/ntfs3/xattr.c | 7 +-
 include/drm/drm_bridge.h | 13 +-
 include/linux/rmap.h | 7 +-
 include/linux/skbuff.h | 8 +
 include/linux/skmsg.h | 3 +-
 include/net/sock.h | 68 +-
 include/uapi/linux/btrfs_tree.h | 4 +-
 kernel/kprobes.c | 9 +-
 kernel/trace/ftrace.c | 10 +
 lib/crypto/Kconfig | 1 -
 mm/hugetlb.c | 2 +-
 mm/mmap.c | 12 +
 mm/rmap.c | 29 +-
 net/bluetooth/l2cap_core.c | 10 +-
 net/bpf/test_run.c | 3 +
 net/core/dev.c | 1 +
 net/core/neighbour.c | 27 +-
 net/core/skmsg.c | 4 +-
 net/netfilter/Kconfig | 1 -
 net/packet/af_packet.c | 4 +-
 scripts/Makefile.modpost | 3 +-
 sound/soc/sh/rz-ssi.c | 26 +-
 sound/usb/quirks.c | 2 +
 tools/testing/selftests/netfilter/nft_flowtable.sh | 246 +++----
 78 files changed, 1195 insertions(+), 971 deletions(-)
From: Jann Horn jannh@google.com
commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream.
Some drivers rely on having all VMAs through which a PFN might be accessible listed in the rmap for correctness. However, on X86, it was possible for a VMA with stale TLB entries to not be listed in the rmap.
This was fixed in mainline with commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"), but that commit relies on preceding refactoring in commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma() implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch tlb_{start,end}_vma()").
This patch provides equivalent protection without needing that refactoring, by forcing a TLB flush between removing PTEs in unmap_vmas() and the call to unlink_file_vma() in free_pgtables().
[This is a stable-specific rewrite of the upstream commit!]
Signed-off-by: Jann Horn jannh@google.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/mmap.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2643,6 +2643,18 @@ static void unmap_region(struct mm_struc
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end);
+
+	/*
+	 * Ensure we have no stale TLB entries by the time this mapping is
+	 * removed from the rmap.
+	 * Note that we don't have to worry about nested flushes here because
+	 * we're holding the mm semaphore for removing the mapping - so any
+	 * concurrent flush in this region has to be coming through the rmap,
+	 * and we synchronize against that using the rmap lock.
+	 */
+	if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) != 0)
+		tlb_flush_mmu(&tlb);
+
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
From: Maxime Ripard maxime@cerno.tech
commit 59050d783848d9b62e9d8fb6ce0cd00771c2bf87 upstream.
If CONFIG_OF is disabled, devm_drm_of_get_bridge won't be compiled in and drivers using that function will fail to build.
Add an inline stub so that we can still build-test those cases.
Reported-by: Randy Dunlap rdunlap@infradead.org
Signed-off-by: Maxime Ripard maxime@cerno.tech
Acked-by: Randy Dunlap rdunlap@infradead.org # build-tested
Link: https://patchwork.freedesktop.org/patch/msgid/20210928181333.1176840-1-maxim...
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 include/drm/drm_bridge.h | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
--- a/include/drm/drm_bridge.h
+++ b/include/drm/drm_bridge.h
@@ -911,9 +911,20 @@ struct drm_bridge *devm_drm_panel_bridge
 struct drm_bridge *devm_drm_panel_bridge_add_typed(struct device *dev,
						   struct drm_panel *panel,
						   u32 connector_type);
+struct drm_connector *drm_panel_bridge_connector(struct drm_bridge *bridge);
+#endif
+
+#if defined(CONFIG_OF) && defined(CONFIG_DRM_PANEL_BRIDGE)
 struct drm_bridge *devm_drm_of_get_bridge(struct device *dev, struct device_node *node,
					  u32 port, u32 endpoint);
-struct drm_connector *drm_panel_bridge_connector(struct drm_bridge *bridge);
+#else
+static inline struct drm_bridge *devm_drm_of_get_bridge(struct device *dev,
+							struct device_node *node,
+							u32 port,
+							u32 endpoint)
+{
+	return ERR_PTR(-ENODEV);
+}
 #endif

#endif
From: Adam Borowski kilobyte@angband.pl
commit e5b5d25444e9ee3ae439720e62769517d331fa39 upstream.
Address of a field inside a struct can't possibly be null; gcc-12 warns about this.
Signed-off-by: Adam Borowski kilobyte@angband.pl
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Cc: Sudip Mukherjee sudipm.mukherjee@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/acpi/thermal.c | 2 --
 1 file changed, 2 deletions(-)
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -1098,8 +1098,6 @@ static int acpi_thermal_resume(struct de
 		return -EINVAL;

 	for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE; i++) {
-		if (!(&tz->trips.active[i]))
-			break;
 		if (!tz->trips.active[i].flags.valid)
 			break;
 		tz->trips.active[i].flags.enabled = 1;
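To see why gcc-12 flags the removed check, here is a self-contained sketch (the struct layout is simplified and invented, not the ACPI driver's): the address of an array element inside a struct is just the base pointer plus a constant offset, so for any valid pointer it can never be NULL.

#include <stdio.h>

struct trips { struct { int valid; } active[10]; };
struct thermal_zone { struct trips trips; };

static int check(struct thermal_zone *tz, int i)
{
	/* gcc-12 warns here: the address of a struct member will never be NULL,
	 * so this comparison always evaluates to true. */
	return &tz->trips.active[i] != NULL;
}

int main(void)
{
	struct thermal_zone tz = { 0 };

	printf("%d\n", check(&tz, 3));	/* always prints 1 */
	return 0;
}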
From: Maxime Ripard maxime@cerno.tech
commit 258e483a4d5e97a6a8caa74381ddc1f395ac1c71 upstream.
The current code tries to handle the case where CONFIG_PM isn't selected by first calling our runtime_resume implementation and then properly reporting the power state to the runtime_pm core.
This allows us to have a functioning device even if the pm_runtime_get_* functions are no-ops.
However, if CONFIG_PM is enabled the device power state is RPM_SUSPENDED, and thus our vc4_hdmi_write() and vc4_hdmi_read() calls in the runtime_pm hooks will now report a warning since the device might not be properly powered.
Even more so, we need CONFIG_PM enabled, since the earlier Raspberry Pi models have a power domain that needs to be powered up for the HDMI controller to be usable.
The previous patch created a dependency on CONFIG_PM, so now we can just assume it's there and only call pm_runtime_resume_and_get() in bind to make sure our device is powered.
Link: https://lore.kernel.org/r/20220629123510.1915022-39-maxime@cerno.tech
Acked-by: Thomas Zimmermann tzimmermann@suse.de
Tested-by: Stefan Wahren stefan.wahren@i2se.com
Signed-off-by: Maxime Ripard maxime@cerno.tech
(cherry picked from commit 53565c28e6af2cef6bbf438c34250135e3564459)
Signed-off-by: Maxime Ripard maxime@cerno.tech
Cc: "Sudip Mukherjee (Codethink)" sudipm.mukherjee@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/gpu/drm/vc4/vc4_hdmi.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -2219,17 +2219,15 @@ static int vc4_hdmi_bind(struct device *
 	if (ret)
 		goto err_put_ddc;

+	pm_runtime_enable(dev);
+
 	/*
-	 * We need to have the device powered up at this point to call
-	 * our reset hook and for the CEC init.
+	 * We need to have the device powered up at this point to call
+	 * our reset hook and for the CEC init.
 	 */
-	ret = vc4_hdmi_runtime_resume(dev);
+	ret = pm_runtime_resume_and_get(dev);
 	if (ret)
-		goto err_put_ddc;
-
-	pm_runtime_get_noresume(dev);
-	pm_runtime_set_active(dev);
-	pm_runtime_enable(dev);
+		goto err_disable_runtime_pm;

 	if (vc4_hdmi->variant->reset)
 		vc4_hdmi->variant->reset(vc4_hdmi);
@@ -2278,6 +2276,7 @@ err_destroy_conn:
 err_destroy_encoder:
 	drm_encoder_cleanup(encoder);
 	pm_runtime_put_sync(dev);
+err_disable_runtime_pm:
 	pm_runtime_disable(dev);
 err_put_ddc:
 	put_device(&vc4_hdmi->ddc->dev);
From: Maxime Ripard maxime@cerno.tech
commit 72e2329e7c9bbe15e7a813670497ec9c6f919af3 upstream.
We already depend on runtime PM to get the power domains and clocks for most of the devices supported by the vc4 driver, so let's just select it to make sure it's there.
Link: https://lore.kernel.org/r/20220629123510.1915022-38-maxime@cerno.tech
Acked-by: Thomas Zimmermann tzimmermann@suse.de
Tested-by: Stefan Wahren stefan.wahren@i2se.com
Signed-off-by: Maxime Ripard maxime@cerno.tech
(cherry picked from commit f1bc386b319e93e56453ae27e9e83817bb1f6f95)
Signed-off-by: Maxime Ripard maxime@cerno.tech
Cc: "Sudip Mukherjee (Codethink)" sudipm.mukherjee@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/gpu/drm/vc4/Kconfig    | 1 +
 drivers/gpu/drm/vc4/vc4_hdmi.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/gpu/drm/vc4/Kconfig
+++ b/drivers/gpu/drm/vc4/Kconfig
@@ -5,6 +5,7 @@ config DRM_VC4
 	depends on DRM
 	depends on SND && SND_SOC
 	depends on COMMON_CLK
+	depends on PM
 	select DRM_KMS_HELPER
 	select DRM_KMS_CMA_HELPER
 	select DRM_GEM_CMA_HELPER
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -2122,7 +2122,7 @@ static int vc5_hdmi_init_resources(struc
 	return 0;
 }

-static int __maybe_unused vc4_hdmi_runtime_suspend(struct device *dev)
+static int vc4_hdmi_runtime_suspend(struct device *dev)
 {
 	struct vc4_hdmi *vc4_hdmi = dev_get_drvdata(dev);
From: Timo Alho talho@nvidia.com
commit a4740b148a04dc60e14fe6a1dfe216d3bae214fd upstream.
Use the memcpy_toio and memcpy_fromio variants of memcpy to guarantee no unaligned access to the IPC memory area. This is to allow the IPC memory to be mapped as Device memory, to further suppress speculative reads from happening within the 64 kB memory area above the IPC memory when 64 kB memory pages are used.
Signed-off-by: Timo Alho talho@nvidia.com
Signed-off-by: Mikko Perttunen mperttunen@nvidia.com
Signed-off-by: Thierry Reding treding@nvidia.com
Cc: Jon Hunter jonathanh@nvidia.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/firmware/tegra/bpmp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/firmware/tegra/bpmp.c
+++ b/drivers/firmware/tegra/bpmp.c
@@ -201,7 +201,7 @@ static ssize_t __tegra_bpmp_channel_read
 	int err;

 	if (data && size > 0)
-		memcpy(data, channel->ib->data, size);
+		memcpy_fromio(data, channel->ib->data, size);

 	err = tegra_bpmp_ack_response(channel);
 	if (err < 0)
@@ -245,7 +245,7 @@ static ssize_t __tegra_bpmp_channel_writ
 	channel->ob->flags = flags;

 	if (data && size > 0)
-		memcpy(channel->ob->data, data, size);
+		memcpy_toio(channel->ob->data, data, size);

 	return tegra_bpmp_post_request(channel);
 }
@@ -420,7 +420,7 @@ void tegra_bpmp_mrq_return(struct tegra_
 	channel->ob->code = code;

 	if (data && size > 0)
-		memcpy(channel->ob->data, data, size);
+		memcpy_toio(channel->ob->data, data, size);

 	err = tegra_bpmp_post_response(channel);
 	if (WARN_ON(err < 0))
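As a rough sketch of the access pattern this enforces (invented mailbox layout and helper names, not the bpmp driver itself):

/* Sketch only: 'shmem' is an ioremap()ed, Device-memory mapping of an IPC
 * area.  memcpy_fromio()/memcpy_toio() guarantee no unaligned access to it,
 * which plain memcpy() does not. */
#include <linux/io.h>
#include <linux/types.h>

struct mbox_msg {
	u32 code;
	u8 data[120];
};

static void mbox_read_msg(void __iomem *shmem, struct mbox_msg *msg)
{
	memcpy_fromio(msg, shmem, sizeof(*msg));
}

static void mbox_write_msg(void __iomem *shmem, const struct mbox_msg *msg)
{
	memcpy_toio(shmem, msg, sizeof(*msg));
}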
From: Eric Biggers ebiggers@google.com
commit 874b301985ef2f89b8b592ad255e03fb6fbfe605 upstream.
CRYPTO_LIB_CHACHA_GENERIC doesn't need to select XOR_BLOCKS. It perhaps was thought that it's needed for __crypto_xor, but that's not the case.
Enabling XOR_BLOCKS is problematic because the XOR_BLOCKS code runs a benchmark when it is initialized. That causes a boot time regression on systems that didn't have it enabled before.
Therefore, remove this unnecessary and problematic selection.
Fixes: e56e18985596 ("lib/crypto: add prompts back to crypto libraries")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers ebiggers@google.com
Signed-off-by: Herbert Xu herbert@gondor.apana.org.au
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 lib/crypto/Kconfig | 1 -
 1 file changed, 1 deletion(-)
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -33,7 +33,6 @@ config CRYPTO_ARCH_HAVE_LIB_CHACHA

 config CRYPTO_LIB_CHACHA_GENERIC
 	tristate
-	select XOR_BLOCKS
 	help
 	  This symbol can be depended upon by arch implementations of the
 	  ChaCha library interface that require the generic code as a
From: Boqun Feng boqun.feng@gmail.com
commit b3d6dd09ff00fdcf4f7c0cb54700ffd5dd343502 upstream.
DM_STATUS_REPORT expects the numbers of pages in the unit of 4k pages (HV_HYP_PAGE) instead of guest pages, so to make it work when guest page sizes are larger than 4k, convert the numbers of guest pages into the numbers of HV_HYP_PAGEs.
Note that the numbers of guest pages are still used for tracing because tracing is internal to the guest kernel.
Reported-by: Vitaly Kuznetsov vkuznets@redhat.com
Signed-off-by: Boqun Feng boqun.feng@gmail.com
Reviewed-by: Michael Kelley mikelley@microsoft.com
Link: https://lore.kernel.org/r/20220325023212.1570049-2-boqun.feng@gmail.com
Signed-off-by: Wei Liu wei.liu@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hv/hv_balloon.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -17,6 +17,7 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/completion.h>
+#include <linux/count_zeros.h>
 #include <linux/memory_hotplug.h>
 #include <linux/memory.h>
 #include <linux/notifier.h>
@@ -1130,6 +1131,7 @@ static void post_status(struct hv_dynmem
 	struct dm_status status;
 	unsigned long now = jiffies;
 	unsigned long last_post = last_post_time;
+	unsigned long num_pages_avail, num_pages_committed;

 	if (pressure_report_delay > 0) {
 		--pressure_report_delay;
@@ -1154,16 +1156,21 @@ static void post_status(struct hv_dynmem
 	 * num_pages_onlined) as committed to the host, otherwise it can try
 	 * asking us to balloon them out.
 	 */
-	status.num_avail = si_mem_available();
-	status.num_committed = vm_memory_committed() +
+	num_pages_avail = si_mem_available();
+	num_pages_committed = vm_memory_committed() +
		dm->num_pages_ballooned +
		(dm->num_pages_added > dm->num_pages_onlined ?
		 dm->num_pages_added - dm->num_pages_onlined : 0) +
		compute_balloon_floor();

-	trace_balloon_status(status.num_avail, status.num_committed,
+	trace_balloon_status(num_pages_avail, num_pages_committed,
			     vm_memory_committed(), dm->num_pages_ballooned,
			     dm->num_pages_added, dm->num_pages_onlined);
+
+	/* Convert numbers of pages into numbers of HV_HYP_PAGEs. */
+	status.num_avail = num_pages_avail * NR_HV_HYP_PAGES_IN_PAGE;
+	status.num_committed = num_pages_committed * NR_HV_HYP_PAGES_IN_PAGE;
+
 	/*
 	 * If our transaction ID is no longer current, just don't
 	 * send the status. This can happen if we were interrupted
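A worked example of the unit conversion (illustrative values; NR_HV_HYP_PAGES_IN_PAGE is, to my reading of the Hyper-V headers, PAGE_SIZE / HV_HYP_PAGE_SIZE):

/* Illustrative only: an arm64 guest using 64 KiB pages reporting to a host
 * that counts in 4 KiB HV_HYP_PAGEs. */
#include <stdio.h>

#define HV_HYP_PAGE_SIZE	4096UL
#define GUEST_PAGE_SIZE		65536UL		/* assumed 64 KiB guest pages */
#define NR_HV_HYP_PAGES_IN_PAGE	(GUEST_PAGE_SIZE / HV_HYP_PAGE_SIZE)	/* 16 */

int main(void)
{
	unsigned long num_pages_avail = 1000;	/* guest pages, still used for tracing */

	/* what post_status() now reports to the host */
	printf("num_avail = %lu\n", num_pages_avail * NR_HV_HYP_PAGES_IN_PAGE);	/* 16000 */
	return 0;
}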
From: Miaohe Lin linmiaohe@huawei.com
commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6 upstream.
In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page cache are installed in the ptes. But hugepage_add_new_anon_rmap is called for them mistakenly because they're not vm_shared. This will corrupt the page->mapping used by page cache code.
Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com
Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: Peter Xu peterx@redhat.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5371,7 +5371,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 	if (!huge_pte_none(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;

-	if (vm_shared) {
+	if (page_in_pagecache) {
 		page_dup_rmap(page, true);
 	} else {
 		ClearHPageRestoreReserve(page);
From: James Morse james.morse@arm.com
commit 39fdb65f52e9a53d32a6ba719f96669fd300ae78 upstream.
Cortex-A510 is affected by an erratum where in rare circumstances the CPUs may not handle a race between a break-before-make sequence on one CPU, and another CPU accessing the same page. This could allow a store to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs TLB sequences to be done twice.
Signed-off-by: James Morse james.morse@arm.com
Link: https://lore.kernel.org/r/20220704155732.21216-1-james.morse@arm.com
Signed-off-by: Will Deacon will@kernel.org
Signed-off-by: Lucas Wei lucaswei@google.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 Documentation/arm64/silicon-errata.rst |  2 +
 arch/arm64/Kconfig                     | 17 +++++++++++++++++
 arch/arm64/kernel/cpu_errata.c         |  8 +++++++-
 3 files changed, 26 insertions(+), 1 deletion(-)
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -92,6 +92,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A77      | #1508412        | ARM64_ERRATUM_1508412       |
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A510     | #2441009        | ARM64_ERRATUM_2441009       |
++----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Neoverse-N1     | #1188873,1418040| ARM64_ERRATUM_1418040       |
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Neoverse-N1     | #1349291        | N/A                         |
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -666,6 +666,23 @@ config ARM64_ERRATUM_1508412

 	  If unsure, say Y.

+config ARM64_ERRATUM_2441009
+	bool "Cortex-A510: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+	default y
+	select ARM64_WORKAROUND_REPEAT_TLBI
+	help
+	  This option adds a workaround for ARM Cortex-A510 erratum #2441009.
+
+	  Under very rare circumstances, affected Cortex-A510 CPUs
+	  may not handle a race between a break-before-make sequence on one
+	  CPU, and another CPU accessing the same page. This could allow a
+	  store to a page that has been unmapped.
+
+	  Work around this by adding the affected CPUs to the list that needs
+	  TLB sequences to be done twice.
+
+	  If unsure, say Y.
+
 config CAVIUM_ERRATUM_22375
 	bool "Cavium erratum 22375, 24313"
 	default y
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -214,6 +214,12 @@ static const struct arm64_cpu_capabiliti
 		ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
 	},
 #endif
+#ifdef CONFIG_ARM64_ERRATUM_2441009
+	{
+		/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
+		ERRATA_MIDR_RANGE(MIDR_CORTEX_A510, 0, 0, 1, 1),
+	},
+#endif
 	{},
 };
 #endif
@@ -429,7 +435,7 @@ const struct arm64_cpu_capabilities arm6
 #endif
 #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
 	{
-		.desc = "Qualcomm erratum 1009, or ARM erratum 1286807",
+		.desc = "Qualcomm erratum 1009, or ARM erratum 1286807, 2441009",
 		.capability = ARM64_WORKAROUND_REPEAT_TLBI,
 		.type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM,
 		.matches = cpucap_multi_entry_cap_matches,
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 54daa9b2d80ab35824464b35a99f716e1cdf2ccb ]
The CQE result is a 32-bit integer, so the functions generating CQEs should take an int rather than a long. Convert io_cqring_fill_event() and other helpers.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/7ca6f15255e9117eae28adcac272744cae29b113.163337330...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1080,7 +1080,7 @@ static void io_uring_try_cancel_requests static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
static bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags); + s32 res, u32 cflags); static void io_put_req(struct io_kiocb *req); static void io_put_req_deferred(struct io_kiocb *req); static void io_dismantle_req(struct io_kiocb *req); @@ -1763,7 +1763,7 @@ static __cold void io_uring_drop_tctx_re }
static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { struct io_overflow_cqe *ocqe;
@@ -1791,7 +1791,7 @@ static bool io_cqring_event_overflow(str }
static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { struct io_uring_cqe *cqe;
@@ -1814,13 +1814,13 @@ static inline bool __io_cqring_fill_even
/* not as hot to bloat with inlining */ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { return __io_cqring_fill_event(ctx, user_data, res, cflags); }
-static void io_req_complete_post(struct io_kiocb *req, long res, - unsigned int cflags) +static void io_req_complete_post(struct io_kiocb *req, s32 res, + u32 cflags) { struct io_ring_ctx *ctx = req->ctx;
@@ -1861,8 +1861,8 @@ static inline bool io_req_needs_clean(st return req->flags & IO_REQ_CLEAN_FLAGS; }
-static void io_req_complete_state(struct io_kiocb *req, long res, - unsigned int cflags) +static inline void io_req_complete_state(struct io_kiocb *req, s32 res, + u32 cflags) { if (io_req_needs_clean(req)) io_clean_op(req); @@ -1872,7 +1872,7 @@ static void io_req_complete_state(struct }
static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags, - long res, unsigned cflags) + s32 res, u32 cflags) { if (issue_flags & IO_URING_F_COMPLETE_DEFER) io_req_complete_state(req, res, cflags); @@ -1880,12 +1880,12 @@ static inline void __io_req_complete(str io_req_complete_post(req, res, cflags); }
-static inline void io_req_complete(struct io_kiocb *req, long res) +static inline void io_req_complete(struct io_kiocb *req, s32 res) { __io_req_complete(req, 0, res, 0); }
-static void io_req_complete_failed(struct io_kiocb *req, long res) +static void io_req_complete_failed(struct io_kiocb *req, s32 res) { req_set_fail(req); io_req_complete_post(req, res, 0); @@ -2707,7 +2707,7 @@ static bool __io_complete_rw_common(stru static void io_req_task_complete(struct io_kiocb *req, bool *locked) { unsigned int cflags = io_put_rw_kbuf(req); - long res = req->result; + int res = req->result;
if (*locked) { struct io_ring_ctx *ctx = req->ctx;
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 913a571affedd17239c4d4ea90c8874b32fc2191 ]
Split io_cqring_fill_event() into a couple of more targeted functions. The first one is io_fill_cqe_aux(), for completions that are not associated with request completions; it also does the ->cq_extra accounting. Examples are additional CQEs from multishot poll and rsrc notifications.
The second is io_fill_cqe_req(), which should be called when it's a normal request completion. There is nothing more to it at the moment; it will be used in later patches.
The last one is the inlined __io_fill_cqe(), for finer-grained control; it should be used with caution and only in the hottest places (a condensed sketch of the three helpers follows the diff below).
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/59a9117a4a44fc9efcf04b3afa51e0d080f5943c.163655911...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 57 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 29 insertions(+), 28 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1079,8 +1079,8 @@ static void io_uring_try_cancel_requests bool cancel_all); static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
-static bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags); +static void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags); + static void io_put_req(struct io_kiocb *req); static void io_put_req_deferred(struct io_kiocb *req); static void io_dismantle_req(struct io_kiocb *req); @@ -1515,7 +1515,7 @@ static void io_kill_timeout(struct io_ki atomic_set(&req->ctx->cq_timeouts, atomic_read(&req->ctx->cq_timeouts) + 1); list_del_init(&req->timeout.list); - io_cqring_fill_event(req->ctx, req->user_data, status, 0); + io_fill_cqe_req(req, status, 0); io_put_req_deferred(req); } } @@ -1790,8 +1790,8 @@ static bool io_cqring_event_overflow(str return true; }
-static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags) +static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data, + s32 res, u32 cflags) { struct io_uring_cqe *cqe;
@@ -1812,11 +1812,16 @@ static inline bool __io_cqring_fill_even return io_cqring_event_overflow(ctx, user_data, res, cflags); }
-/* not as hot to bloat with inlining */ -static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags) +static noinline void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags) +{ + __io_fill_cqe(req->ctx, req->user_data, res, cflags); +} + +static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, + s32 res, u32 cflags) { - return __io_cqring_fill_event(ctx, user_data, res, cflags); + ctx->cq_extra++; + return __io_fill_cqe(ctx, user_data, res, cflags); }
static void io_req_complete_post(struct io_kiocb *req, s32 res, @@ -1825,7 +1830,7 @@ static void io_req_complete_post(struct struct io_ring_ctx *ctx = req->ctx;
spin_lock(&ctx->completion_lock); - __io_cqring_fill_event(ctx, req->user_data, res, cflags); + __io_fill_cqe(ctx, req->user_data, res, cflags); /* * If we're the last reference to this request, add to our locked * free_list cache. @@ -2051,8 +2056,7 @@ static bool io_kill_linked_timeout(struc link->timeout.head = NULL; if (hrtimer_try_to_cancel(&io->timer) != -1) { list_del(&link->timeout.list); - io_cqring_fill_event(link->ctx, link->user_data, - -ECANCELED, 0); + io_fill_cqe_req(link, -ECANCELED, 0); io_put_req_deferred(link); return true; } @@ -2076,7 +2080,7 @@ static void io_fail_links(struct io_kioc link->link = NULL;
trace_io_uring_fail_link(req, link); - io_cqring_fill_event(link->ctx, link->user_data, res, 0); + io_fill_cqe_req(link, res, 0); io_put_req_deferred(link); link = nxt; } @@ -2093,8 +2097,7 @@ static bool io_disarm_next(struct io_kio req->flags &= ~REQ_F_ARM_LTIMEOUT; if (link && link->opcode == IORING_OP_LINK_TIMEOUT) { io_remove_next_linked(req); - io_cqring_fill_event(link->ctx, link->user_data, - -ECANCELED, 0); + io_fill_cqe_req(link, -ECANCELED, 0); io_put_req_deferred(link); posted = true; } @@ -2370,8 +2373,8 @@ static void io_submit_flush_completions( for (i = 0; i < nr; i++) { struct io_kiocb *req = state->compl_reqs[i];
- __io_cqring_fill_event(ctx, req->user_data, req->result, - req->compl.cflags); + __io_fill_cqe(ctx, req->user_data, req->result, + req->compl.cflags); } io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); @@ -2482,8 +2485,7 @@ static void io_iopoll_complete(struct io req = list_first_entry(done, struct io_kiocb, inflight_entry); list_del(&req->inflight_entry);
- __io_cqring_fill_event(ctx, req->user_data, req->result, - io_put_rw_kbuf(req)); + io_fill_cqe_req(req, req->result, io_put_rw_kbuf(req)); (*nr_events)++;
if (req_ref_put_and_test(req)) @@ -5413,13 +5415,13 @@ static bool __io_poll_complete(struct io } if (req->poll.events & EPOLLONESHOT) flags = 0; - if (!io_cqring_fill_event(ctx, req->user_data, error, flags)) { + + if (!(flags & IORING_CQE_F_MORE)) { + io_fill_cqe_req(req, error, flags); + } else if (!io_fill_cqe_aux(ctx, req->user_data, error, flags)) { req->poll.events |= EPOLLONESHOT; flags = 0; } - if (flags & IORING_CQE_F_MORE) - ctx->cq_extra++; - return !(flags & IORING_CQE_F_MORE); }
@@ -5746,9 +5748,9 @@ static bool io_poll_remove_one(struct io do_complete = __io_poll_remove_one(req, io_poll_get_single(req), true);
if (do_complete) { - io_cqring_fill_event(req->ctx, req->user_data, -ECANCELED, 0); - io_commit_cqring(req->ctx); req_set_fail(req); + io_fill_cqe_req(req, -ECANCELED, 0); + io_commit_cqring(req->ctx); io_put_req_deferred(req); } return do_complete; @@ -6045,7 +6047,7 @@ static int io_timeout_cancel(struct io_r return PTR_ERR(req);
req_set_fail(req); - io_cqring_fill_event(ctx, req->user_data, -ECANCELED, 0); + io_fill_cqe_req(req, -ECANCELED, 0); io_put_req_deferred(req); return 0; } @@ -8271,8 +8273,7 @@ static void __io_rsrc_put_work(struct io
io_ring_submit_lock(ctx, lock_ring); spin_lock(&ctx->completion_lock); - io_cqring_fill_event(ctx, prsrc->tag, 0, 0); - ctx->cq_extra++; + io_fill_cqe_aux(ctx, prsrc->tag, 0, 0); io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); io_cqring_ev_posted(ctx);
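Condensed from the hunks above, the three helpers this patch ends up with look roughly like this (bodies trimmed; see the diff for the real definitions):

/* hot path: writes the CQE or falls back to the overflow path */
static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data,
				 s32 res, u32 cflags);

/* normal request completions */
static noinline void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags)
{
	__io_fill_cqe(req->ctx, req->user_data, res, cflags);
}

/* CQEs not tied to a request (multishot poll extras, rsrc notifications) */
static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
				     s32 res, u32 cflags)
{
	ctx->cq_extra++;
	return __io_fill_cqe(ctx, user_data, res, cflags);
}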
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 2bbb146d96f4b45e17d6aeede300796bc1a96d68 ]
Clean up io_poll_update() and unify cancellation paths for remove and update.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.163960518...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 62 ++++++++++++++++++++++++----------------------------------
 1 file changed, 26 insertions(+), 36 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5931,61 +5931,51 @@ static int io_poll_update(struct io_kioc struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *preq; bool completing; - int ret; + int ret2, ret = 0;
spin_lock(&ctx->completion_lock); preq = io_poll_find(ctx, req->poll_update.old_user_data, true); if (!preq) { ret = -ENOENT; - goto err; - } - - if (!req->poll_update.update_events && !req->poll_update.update_user_data) { - completing = true; - ret = io_poll_remove_one(preq) ? 0 : -EALREADY; - goto err; +fail: + spin_unlock(&ctx->completion_lock); + goto out; } - + io_poll_remove_double(preq); /* * Don't allow racy completion with singleshot, as we cannot safely * update those. For multishot, if we're racing with completion, just * let completion re-add it. */ - io_poll_remove_double(preq); completing = !__io_poll_remove_one(preq, &preq->poll, false); if (completing && (preq->poll.events & EPOLLONESHOT)) { ret = -EALREADY; - goto err; - } - /* we now have a detached poll request. reissue. */ - ret = 0; -err: - if (ret < 0) { - spin_unlock(&ctx->completion_lock); - req_set_fail(req); - io_req_complete(req, ret); - return 0; - } - /* only mask one event flags, keep behavior flags */ - if (req->poll_update.update_events) { - preq->poll.events &= ~0xffff; - preq->poll.events |= req->poll_update.events & 0xffff; - preq->poll.events |= IO_POLL_UNMASK; + goto fail; } - if (req->poll_update.update_user_data) - preq->user_data = req->poll_update.new_user_data; spin_unlock(&ctx->completion_lock);
+ if (req->poll_update.update_events || req->poll_update.update_user_data) { + /* only mask one event flags, keep behavior flags */ + if (req->poll_update.update_events) { + preq->poll.events &= ~0xffff; + preq->poll.events |= req->poll_update.events & 0xffff; + preq->poll.events |= IO_POLL_UNMASK; + } + if (req->poll_update.update_user_data) + preq->user_data = req->poll_update.new_user_data; + + ret2 = io_poll_add(preq, issue_flags); + /* successfully updated, don't complete poll request */ + if (!ret2) + goto out; + } + req_set_fail(preq); + io_req_complete(preq, -ECANCELED); +out: + if (ret < 0) + req_set_fail(req); /* complete update request, we're done with it */ io_req_complete(req, ret); - - if (!completing) { - ret = io_poll_add(preq, issue_flags); - if (ret < 0) { - req_set_fail(preq); - io_req_complete(preq, ret); - } - } return 0; }
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 5641897a5e8fb8abeb07e89c71a788d3db3ec75e ]
Move some poll helpers/etc. up; we'll need them there shortly.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.163960518...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 74 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 37 insertions(+), 37 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5318,6 +5318,43 @@ struct io_poll_table { int error; };
+static struct io_poll_iocb *io_poll_get_double(struct io_kiocb *req) +{ + /* pure poll stashes this in ->async_data, poll driven retry elsewhere */ + if (req->opcode == IORING_OP_POLL_ADD) + return req->async_data; + return req->apoll->double_poll; +} + +static struct io_poll_iocb *io_poll_get_single(struct io_kiocb *req) +{ + if (req->opcode == IORING_OP_POLL_ADD) + return &req->poll; + return &req->apoll->poll; +} + +static void io_poll_req_insert(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + struct hlist_head *list; + + list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; + hlist_add_head(&req->hash_node, list); +} + +static void io_init_poll_iocb(struct io_poll_iocb *poll, __poll_t events, + wait_queue_func_t wake_func) +{ + poll->head = NULL; + poll->done = false; + poll->canceled = false; +#define IO_POLL_UNMASK (EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDHUP) + /* mask in events that we always want/need */ + poll->events = events | IO_POLL_UNMASK; + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, wake_func); +} + static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, __poll_t mask, io_req_tw_func_t func) { @@ -5366,21 +5403,6 @@ static bool io_poll_rewait(struct io_kio return false; }
-static struct io_poll_iocb *io_poll_get_double(struct io_kiocb *req) -{ - /* pure poll stashes this in ->async_data, poll driven retry elsewhere */ - if (req->opcode == IORING_OP_POLL_ADD) - return req->async_data; - return req->apoll->double_poll; -} - -static struct io_poll_iocb *io_poll_get_single(struct io_kiocb *req) -{ - if (req->opcode == IORING_OP_POLL_ADD) - return &req->poll; - return &req->apoll->poll; -} - static void io_poll_remove_double(struct io_kiocb *req) __must_hold(&req->ctx->completion_lock) { @@ -5505,19 +5527,6 @@ static int io_poll_double_wake(struct wa return 1; }
-static void io_init_poll_iocb(struct io_poll_iocb *poll, __poll_t events, - wait_queue_func_t wake_func) -{ - poll->head = NULL; - poll->done = false; - poll->canceled = false; -#define IO_POLL_UNMASK (EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDHUP) - /* mask in events that we always want/need */ - poll->events = events | IO_POLL_UNMASK; - INIT_LIST_HEAD(&poll->wait.entry); - init_waitqueue_func_entry(&poll->wait, wake_func); -} - static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, struct wait_queue_head *head, struct io_poll_iocb **poll_ptr) @@ -5612,15 +5621,6 @@ static int io_async_wake(struct wait_que return __io_async_wake(req, poll, key_to_poll(key), io_async_task_func); }
-static void io_poll_req_insert(struct io_kiocb *req) -{ - struct io_ring_ctx *ctx = req->ctx; - struct hlist_head *list; - - list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; - hlist_add_head(&req->hash_node, list); -} - static __poll_t __io_arm_poll_handler(struct io_kiocb *req, struct io_poll_iocb *poll, struct io_poll_table *ipt, __poll_t mask,
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit ab1dab960b8352cee082db0f8a54dc92a948bfd7 ]
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll requests doesn't make much sense, and in any case re-adding it shouldn't be a problem considering batching in tctx_task_work(). We can remove it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.163960518...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5460,7 +5460,6 @@ static inline bool io_poll_complete(stru
 static void io_poll_task_func(struct io_kiocb *req, bool *locked)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt;

 	if (io_poll_rewait(req, &req->poll)) {
 		spin_unlock(&ctx->completion_lock);
@@ -5484,11 +5483,8 @@ static void io_poll_task_func(struct io_
 		spin_unlock(&ctx->completion_lock);
 		io_cqring_ev_posted(ctx);

-		if (done) {
-			nxt = io_put_req_find_next(req);
-			if (nxt)
-				io_req_task_submit(nxt, locked);
-		}
+		if (done)
+			io_put_req(req);
 	}
 }
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit eb6e6f0690c846f7de46181bab3954c12c96e11e ]
Inline io_poll_complete(); it's simple and doesn't serve any particular purpose on its own.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/933d7ee3e4450749a2d892235462c8f18d030293.163337330...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5447,16 +5447,6 @@ static bool __io_poll_complete(struct io
 	return !(flags & IORING_CQE_F_MORE);
 }

-static inline bool io_poll_complete(struct io_kiocb *req, __poll_t mask)
-	__must_hold(&req->ctx->completion_lock)
-{
-	bool done;
-
-	done = __io_poll_complete(req, mask);
-	io_commit_cqring(req->ctx);
-	return done;
-}
-
 static void io_poll_task_func(struct io_kiocb *req, bool *locked)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -5910,7 +5900,8 @@ static int io_poll_add(struct io_kiocb *
 	if (mask) {
 		/* no async, we'd stolen it */
 		ipt.error = 0;
-		done = io_poll_complete(req, mask);
+		done = __io_poll_complete(req, mask);
+		io_commit_cqring(req->ctx);
 	}
 	spin_unlock(&ctx->completion_lock);
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit aa43477b040251f451db0d844073ac00a8ab66ee ]
It's not possible to go forward with the current state of io_uring polling; we need more straightforward and easier synchronisation. There are a lot of problems with how it is at the moment, including missing events on rewait.
The main idea here is to introduce a notion of request ownership while polling: no one but the owner can modify any part of struct io_kiocb other than ->poll_refs, which grants us protection against all sorts of races.
The main users of such exclusivity are the poll task_work handlers, so before queueing a tw one should have/acquire ownership, which is then handed off to the tw handler. The other user is __io_arm_poll_handler(), which does the initial poll arming. It starts by taking the ownership, so tw handlers won't be run until it's released later in the function after vfs_poll. Note: this also prevents races in __io_queue_proc(). Poll wake/etc. may not be able to get ownership; in that case they need to increase the poll refcount, and the task_work should notice it and retry if necessary, see io_poll_check_events(). There is also an IO_POLL_CANCEL_FLAG flag to notify that we want to kill the request.
It makes cancellations more reliable, enables double multishot polling, fixes double poll rewait, fixes missing poll events and fixes another bunch of races.
Even though it adds some overhead for the new refcounting, there are a couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be the case for 1-2% of all apoll requests), it doesn't add atomics and removes the spin_lock/unlock pair
- it works well with multishots: we don't do remove-from-queue / add-to-queue for each new poll event
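A minimal, self-contained sketch of that ownership rule (userspace C11 atomics standing in for the kernel's atomic_t; the constants mirror the patch below):

#include <stdatomic.h>
#include <stdbool.h>

#define IO_POLL_CANCEL_FLAG	(1u << 31)
#define IO_POLL_REF_MASK	((1u << 20) - 1)

struct req { atomic_uint poll_refs; };

/* Whoever bumps the refcount from zero becomes the owner and may modify the
 * request; everyone else only leaves a reference behind, so the owner's
 * task_work knows to loop and re-check events before dropping ownership. */
static bool poll_get_ownership(struct req *req)
{
	return !(atomic_fetch_add(&req->poll_refs, 1) & IO_POLL_REF_MASK);
}

static void poll_mark_cancelled(struct req *req)
{
	atomic_fetch_or(&req->poll_refs, IO_POLL_CANCEL_FLAG);
}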
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Link: https://lore.kernel.org/r/6b652927c77ed9580ea4330ac5612f0e0848c946.163960518...
Signed-off-by: Jens Axboe axboe@kernel.dk
[pavel: backport]
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/io_uring.c | 527 +++++++++++++++++++++++++---------------------------------
 1 file changed, 228 insertions(+), 299 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -486,8 +486,6 @@ struct io_poll_iocb { struct file *file; struct wait_queue_head *head; __poll_t events; - bool done; - bool canceled; struct wait_queue_entry wait; };
@@ -885,6 +883,9 @@ struct io_kiocb {
/* store used ubuf, so we can prevent reloading */ struct io_mapped_ubuf *imu; + /* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */ + struct io_buffer *kbuf; + atomic_t poll_refs; };
struct io_tctx_node { @@ -5318,6 +5319,25 @@ struct io_poll_table { int error; };
+#define IO_POLL_CANCEL_FLAG BIT(31) +#define IO_POLL_REF_MASK ((1u << 20)-1) + +/* + * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can + * bump it and acquire ownership. It's disallowed to modify requests while not + * owning it, that prevents from races for enqueueing task_work's and b/w + * arming poll and wakeups. + */ +static inline bool io_poll_get_ownership(struct io_kiocb *req) +{ + return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK); +} + +static void io_poll_mark_cancelled(struct io_kiocb *req) +{ + atomic_or(IO_POLL_CANCEL_FLAG, &req->poll_refs); +} + static struct io_poll_iocb *io_poll_get_double(struct io_kiocb *req) { /* pure poll stashes this in ->async_data, poll driven retry elsewhere */ @@ -5346,8 +5366,6 @@ static void io_init_poll_iocb(struct io_ wait_queue_func_t wake_func) { poll->head = NULL; - poll->done = false; - poll->canceled = false; #define IO_POLL_UNMASK (EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDHUP) /* mask in events that we always want/need */ poll->events = events | IO_POLL_UNMASK; @@ -5355,161 +5373,168 @@ static void io_init_poll_iocb(struct io_ init_waitqueue_func_entry(&poll->wait, wake_func); }
-static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, - __poll_t mask, io_req_tw_func_t func) +static inline void io_poll_remove_entry(struct io_poll_iocb *poll) { - /* for instances that support it check for an event match first: */ - if (mask && !(mask & poll->events)) - return 0; - - trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask); + struct wait_queue_head *head = poll->head;
+ spin_lock_irq(&head->lock); list_del_init(&poll->wait.entry); + poll->head = NULL; + spin_unlock_irq(&head->lock); +}
- req->result = mask; - req->io_task_work.func = func; +static void io_poll_remove_entries(struct io_kiocb *req) +{ + struct io_poll_iocb *poll = io_poll_get_single(req); + struct io_poll_iocb *poll_double = io_poll_get_double(req);
- /* - * If this fails, then the task is exiting. When a task exits, the - * work gets canceled, so just cancel this request as well instead - * of executing it. We can't safely execute it anyway, as we may not - * have the needed state needed for it anyway. - */ - io_req_task_work_add(req); - return 1; + if (poll->head) + io_poll_remove_entry(poll); + if (poll_double && poll_double->head) + io_poll_remove_entry(poll_double); }
-static bool io_poll_rewait(struct io_kiocb *req, struct io_poll_iocb *poll) - __acquires(&req->ctx->completion_lock) +/* + * All poll tw should go through this. Checks for poll events, manages + * references, does rewait, etc. + * + * Returns a negative error on failure. >0 when no action require, which is + * either spurious wakeup or multishot CQE is served. 0 when it's done with + * the request, then the mask is stored in req->result. + */ +static int io_poll_check_events(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; + struct io_poll_iocb *poll = io_poll_get_single(req); + int v;
/* req->task == current here, checking PF_EXITING is safe */ if (unlikely(req->task->flags & PF_EXITING)) - WRITE_ONCE(poll->canceled, true); - - if (!req->result && !READ_ONCE(poll->canceled)) { - struct poll_table_struct pt = { ._key = poll->events }; + io_poll_mark_cancelled(req);
- req->result = vfs_poll(req->file, &pt) & poll->events; - } + do { + v = atomic_read(&req->poll_refs);
- spin_lock(&ctx->completion_lock); - if (!req->result && !READ_ONCE(poll->canceled)) { - add_wait_queue(poll->head, &poll->wait); - return true; - } + /* tw handler should be the owner, and so have some references */ + if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK))) + return 0; + if (v & IO_POLL_CANCEL_FLAG) + return -ECANCELED;
- return false; -} + if (!req->result) { + struct poll_table_struct pt = { ._key = poll->events };
-static void io_poll_remove_double(struct io_kiocb *req) - __must_hold(&req->ctx->completion_lock) -{ - struct io_poll_iocb *poll = io_poll_get_double(req); + req->result = vfs_poll(req->file, &pt) & poll->events; + }
- lockdep_assert_held(&req->ctx->completion_lock); + /* multishot, just fill an CQE and proceed */ + if (req->result && !(poll->events & EPOLLONESHOT)) { + __poll_t mask = mangle_poll(req->result & poll->events); + bool filled; + + spin_lock(&ctx->completion_lock); + filled = io_fill_cqe_aux(ctx, req->user_data, mask, + IORING_CQE_F_MORE); + io_commit_cqring(ctx); + spin_unlock(&ctx->completion_lock); + if (unlikely(!filled)) + return -ECANCELED; + io_cqring_ev_posted(ctx); + } else if (req->result) { + return 0; + }
- if (poll && poll->head) { - struct wait_queue_head *head = poll->head; + /* + * Release all references, retry if someone tried to restart + * task_work while we were executing it. + */ + } while (atomic_sub_return(v & IO_POLL_REF_MASK, &req->poll_refs));
- spin_lock_irq(&head->lock); - list_del_init(&poll->wait.entry); - if (poll->wait.private) - req_ref_put(req); - poll->head = NULL; - spin_unlock_irq(&head->lock); - } + return 1; }
-static bool __io_poll_complete(struct io_kiocb *req, __poll_t mask) - __must_hold(&req->ctx->completion_lock) +static void io_poll_task_func(struct io_kiocb *req, bool *locked) { struct io_ring_ctx *ctx = req->ctx; - unsigned flags = IORING_CQE_F_MORE; - int error; + int ret; + + ret = io_poll_check_events(req); + if (ret > 0) + return;
- if (READ_ONCE(req->poll.canceled)) { - error = -ECANCELED; - req->poll.events |= EPOLLONESHOT; + if (!ret) { + req->result = mangle_poll(req->result & req->poll.events); } else { - error = mangle_poll(mask); + req->result = ret; + req_set_fail(req); } - if (req->poll.events & EPOLLONESHOT) - flags = 0;
- if (!(flags & IORING_CQE_F_MORE)) { - io_fill_cqe_req(req, error, flags); - } else if (!io_fill_cqe_aux(ctx, req->user_data, error, flags)) { - req->poll.events |= EPOLLONESHOT; - flags = 0; - } - return !(flags & IORING_CQE_F_MORE); + io_poll_remove_entries(req); + spin_lock(&ctx->completion_lock); + hash_del(&req->hash_node); + spin_unlock(&ctx->completion_lock); + io_req_complete_post(req, req->result, 0); }
-static void io_poll_task_func(struct io_kiocb *req, bool *locked) +static void io_apoll_task_func(struct io_kiocb *req, bool *locked) { struct io_ring_ctx *ctx = req->ctx; + int ret;
- if (io_poll_rewait(req, &req->poll)) { - spin_unlock(&ctx->completion_lock); - } else { - bool done; + ret = io_poll_check_events(req); + if (ret > 0) + return;
- if (req->poll.done) { - spin_unlock(&ctx->completion_lock); - return; - } - done = __io_poll_complete(req, req->result); - if (done) { - io_poll_remove_double(req); - hash_del(&req->hash_node); - req->poll.done = true; - } else { - req->result = 0; - add_wait_queue(req->poll.head, &req->poll.wait); - } - io_commit_cqring(ctx); - spin_unlock(&ctx->completion_lock); - io_cqring_ev_posted(ctx); + io_poll_remove_entries(req); + spin_lock(&ctx->completion_lock); + hash_del(&req->hash_node); + spin_unlock(&ctx->completion_lock);
- if (done) - io_put_req(req); - } + if (!ret) + io_req_task_submit(req, locked); + else + io_req_complete_failed(req, ret); +} + +static void __io_poll_execute(struct io_kiocb *req, int mask) +{ + req->result = mask; + if (req->opcode == IORING_OP_POLL_ADD) + req->io_task_work.func = io_poll_task_func; + else + req->io_task_work.func = io_apoll_task_func; + + trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask); + io_req_task_work_add(req); +} + +static inline void io_poll_execute(struct io_kiocb *req, int res) +{ + if (io_poll_get_ownership(req)) + __io_poll_execute(req, res); }
-static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, - int sync, void *key) +static void io_poll_cancel_req(struct io_kiocb *req) +{ + io_poll_mark_cancelled(req); + /* kick tw, which should complete the request */ + io_poll_execute(req, 0); +} + +static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, + void *key) { struct io_kiocb *req = wait->private; - struct io_poll_iocb *poll = io_poll_get_single(req); + struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb, + wait); __poll_t mask = key_to_poll(key); - unsigned long flags;
- /* for instances that support it check for an event match first: */ + /* for instances that support it check for an event match first */ if (mask && !(mask & poll->events)) return 0; - if (!(poll->events & EPOLLONESHOT)) - return poll->wait.func(&poll->wait, mode, sync, key);
- list_del_init(&wait->entry); - - if (poll->head) { - bool done; - - spin_lock_irqsave(&poll->head->lock, flags); - done = list_empty(&poll->wait.entry); - if (!done) - list_del_init(&poll->wait.entry); - /* make sure double remove sees this as being gone */ - wait->private = NULL; - spin_unlock_irqrestore(&poll->head->lock, flags); - if (!done) { - /* use wait func handler, so it matches the rq type */ - poll->wait.func(&poll->wait, mode, sync, key); - } - } - req_ref_put(req); + if (io_poll_get_ownership(req)) + __io_poll_execute(req, mask); return 1; }
@@ -5525,10 +5550,10 @@ static void __io_queue_proc(struct io_po * if this happens. */ if (unlikely(pt->nr_entries)) { - struct io_poll_iocb *poll_one = poll; + struct io_poll_iocb *first = poll;
/* double add on the same waitqueue head, ignore */ - if (poll_one->head == head) + if (first->head == head) return; /* already have a 2nd entry, fail a third attempt */ if (*poll_ptr) { @@ -5537,25 +5562,19 @@ static void __io_queue_proc(struct io_po pt->error = -EINVAL; return; } - /* - * Can't handle multishot for double wait for now, turn it - * into one-shot mode. - */ - if (!(poll_one->events & EPOLLONESHOT)) - poll_one->events |= EPOLLONESHOT; + poll = kmalloc(sizeof(*poll), GFP_ATOMIC); if (!poll) { pt->error = -ENOMEM; return; } - io_init_poll_iocb(poll, poll_one->events, io_poll_double_wake); - req_ref_get(req); - poll->wait.private = req; + io_init_poll_iocb(poll, first->events, first->wait.func); *poll_ptr = poll; }
pt->nr_entries++; poll->head = head; + poll->wait.private = req;
if (poll->events & EPOLLEXCLUSIVE) add_wait_queue_exclusive(head, &poll->wait); @@ -5563,61 +5582,24 @@ static void __io_queue_proc(struct io_po add_wait_queue(head, &poll->wait); }
-static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, +static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, struct poll_table_struct *p) { struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); - struct async_poll *apoll = pt->req->apoll; - - __io_queue_proc(&apoll->poll, pt, head, &apoll->double_poll); -} - -static void io_async_task_func(struct io_kiocb *req, bool *locked) -{ - struct async_poll *apoll = req->apoll; - struct io_ring_ctx *ctx = req->ctx; - - trace_io_uring_task_run(req->ctx, req, req->opcode, req->user_data); - - if (io_poll_rewait(req, &apoll->poll)) { - spin_unlock(&ctx->completion_lock); - return; - } - - hash_del(&req->hash_node); - io_poll_remove_double(req); - apoll->poll.done = true; - spin_unlock(&ctx->completion_lock); - - if (!READ_ONCE(apoll->poll.canceled)) - io_req_task_submit(req, locked); - else - io_req_complete_failed(req, -ECANCELED); -} - -static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync, - void *key) -{ - struct io_kiocb *req = wait->private; - struct io_poll_iocb *poll = &req->apoll->poll;
- trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data, - key_to_poll(key)); - - return __io_async_wake(req, poll, key_to_poll(key), io_async_task_func); + __io_queue_proc(&pt->req->poll, pt, head, + (struct io_poll_iocb **) &pt->req->async_data); }
-static __poll_t __io_arm_poll_handler(struct io_kiocb *req, - struct io_poll_iocb *poll, - struct io_poll_table *ipt, __poll_t mask, - wait_queue_func_t wake_func) - __acquires(&ctx->completion_lock) +static int __io_arm_poll_handler(struct io_kiocb *req, + struct io_poll_iocb *poll, + struct io_poll_table *ipt, __poll_t mask) { struct io_ring_ctx *ctx = req->ctx; - bool cancel = false; + int v;
INIT_HLIST_NODE(&req->hash_node); - io_init_poll_iocb(poll, mask, wake_func); + io_init_poll_iocb(poll, mask, io_poll_wake); poll->file = req->file; poll->wait.private = req;
@@ -5626,31 +5608,54 @@ static __poll_t __io_arm_poll_handler(st ipt->error = 0; ipt->nr_entries = 0;
+ /* + * Take the ownership to delay any tw execution up until we're done + * with poll arming. see io_poll_get_ownership(). + */ + atomic_set(&req->poll_refs, 1); mask = vfs_poll(req->file, &ipt->pt) & poll->events; - if (unlikely(!ipt->nr_entries) && !ipt->error) - ipt->error = -EINVAL; + + if (mask && (poll->events & EPOLLONESHOT)) { + io_poll_remove_entries(req); + /* no one else has access to the req, forget about the ref */ + return mask; + } + if (!mask && unlikely(ipt->error || !ipt->nr_entries)) { + io_poll_remove_entries(req); + if (!ipt->error) + ipt->error = -EINVAL; + return 0; + }
spin_lock(&ctx->completion_lock); - if (ipt->error || (mask && (poll->events & EPOLLONESHOT))) - io_poll_remove_double(req); - if (likely(poll->head)) { - spin_lock_irq(&poll->head->lock); - if (unlikely(list_empty(&poll->wait.entry))) { - if (ipt->error) - cancel = true; - ipt->error = 0; - mask = 0; - } - if ((mask && (poll->events & EPOLLONESHOT)) || ipt->error) - list_del_init(&poll->wait.entry); - else if (cancel) - WRITE_ONCE(poll->canceled, true); - else if (!poll->done) /* actually waiting for an event */ - io_poll_req_insert(req); - spin_unlock_irq(&poll->head->lock); + io_poll_req_insert(req); + spin_unlock(&ctx->completion_lock); + + if (mask) { + /* can't multishot if failed, just queue the event we've got */ + if (unlikely(ipt->error || !ipt->nr_entries)) + poll->events |= EPOLLONESHOT; + __io_poll_execute(req, mask); + return 0; }
- return mask; + /* + * Release ownership. If someone tried to queue a tw while it was + * locked, kick it off for them. + */ + v = atomic_dec_return(&req->poll_refs); + if (unlikely(v & IO_POLL_REF_MASK)) + __io_poll_execute(req, 0); + return 0; +} + +static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, + struct poll_table_struct *p) +{ + struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + struct async_poll *apoll = pt->req->apoll; + + __io_queue_proc(&apoll->poll, pt, head, &apoll->double_poll); }
enum { @@ -5665,7 +5670,8 @@ static int io_arm_poll_handler(struct io struct io_ring_ctx *ctx = req->ctx; struct async_poll *apoll; struct io_poll_table ipt; - __poll_t ret, mask = EPOLLONESHOT | POLLERR | POLLPRI; + __poll_t mask = EPOLLONESHOT | POLLERR | POLLPRI; + int ret;
if (!req->file || !file_can_poll(req->file)) return IO_APOLL_ABORTED; @@ -5692,11 +5698,8 @@ static int io_arm_poll_handler(struct io req->apoll = apoll; req->flags |= REQ_F_POLLED; ipt.pt._qproc = io_async_queue_proc; - io_req_set_refcount(req);
- ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask, - io_async_wake); - spin_unlock(&ctx->completion_lock); + ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask); if (ret || ipt.error) return ret ? IO_APOLL_READY : IO_APOLL_ABORTED;
@@ -5705,43 +5708,6 @@ static int io_arm_poll_handler(struct io return IO_APOLL_OK; }
-static bool __io_poll_remove_one(struct io_kiocb *req, - struct io_poll_iocb *poll, bool do_cancel) - __must_hold(&req->ctx->completion_lock) -{ - bool do_complete = false; - - if (!poll->head) - return false; - spin_lock_irq(&poll->head->lock); - if (do_cancel) - WRITE_ONCE(poll->canceled, true); - if (!list_empty(&poll->wait.entry)) { - list_del_init(&poll->wait.entry); - do_complete = true; - } - spin_unlock_irq(&poll->head->lock); - hash_del(&req->hash_node); - return do_complete; -} - -static bool io_poll_remove_one(struct io_kiocb *req) - __must_hold(&req->ctx->completion_lock) -{ - bool do_complete; - - io_poll_remove_double(req); - do_complete = __io_poll_remove_one(req, io_poll_get_single(req), true); - - if (do_complete) { - req_set_fail(req); - io_fill_cqe_req(req, -ECANCELED, 0); - io_commit_cqring(req->ctx); - io_put_req_deferred(req); - } - return do_complete; -} - /* * Returns true if we found and killed one or more poll requests */ @@ -5750,7 +5716,8 @@ static bool io_poll_remove_all(struct io { struct hlist_node *tmp; struct io_kiocb *req; - int posted = 0, i; + bool found = false; + int i;
spin_lock(&ctx->completion_lock); for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { @@ -5758,16 +5725,14 @@ static bool io_poll_remove_all(struct io
list = &ctx->cancel_hash[i]; hlist_for_each_entry_safe(req, tmp, list, hash_node) { - if (io_match_task_safe(req, tsk, cancel_all)) - posted += io_poll_remove_one(req); + if (io_match_task_safe(req, tsk, cancel_all)) { + io_poll_cancel_req(req); + found = true; + } } } spin_unlock(&ctx->completion_lock); - - if (posted) - io_cqring_ev_posted(ctx); - - return posted != 0; + return found; }
static struct io_kiocb *io_poll_find(struct io_ring_ctx *ctx, __u64 sqe_addr, @@ -5788,19 +5753,26 @@ static struct io_kiocb *io_poll_find(str return NULL; }
+static bool io_poll_disarm(struct io_kiocb *req) + __must_hold(&ctx->completion_lock) +{ + if (!io_poll_get_ownership(req)) + return false; + io_poll_remove_entries(req); + hash_del(&req->hash_node); + return true; +} + static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr, bool poll_only) __must_hold(&ctx->completion_lock) { - struct io_kiocb *req; + struct io_kiocb *req = io_poll_find(ctx, sqe_addr, poll_only);
- req = io_poll_find(ctx, sqe_addr, poll_only); if (!req) return -ENOENT; - if (io_poll_remove_one(req)) - return 0; - - return -EALREADY; + io_poll_cancel_req(req); + return 0; }
static __poll_t io_poll_parse_events(const struct io_uring_sqe *sqe, @@ -5850,23 +5822,6 @@ static int io_poll_update_prep(struct io return 0; }
-static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, - void *key) -{ - struct io_kiocb *req = wait->private; - struct io_poll_iocb *poll = &req->poll; - - return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func); -} - -static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, - struct poll_table_struct *p) -{ - struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); - - __io_queue_proc(&pt->req->poll, pt, head, (struct io_poll_iocb **) &pt->req->async_data); -} - static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_poll_iocb *poll = &req->poll; @@ -5888,57 +5843,31 @@ static int io_poll_add_prep(struct io_ki static int io_poll_add(struct io_kiocb *req, unsigned int issue_flags) { struct io_poll_iocb *poll = &req->poll; - struct io_ring_ctx *ctx = req->ctx; struct io_poll_table ipt; - __poll_t mask; - bool done; + int ret;
ipt.pt._qproc = io_poll_queue_proc;
- mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events, - io_poll_wake); - - if (mask) { /* no async, we'd stolen it */ - ipt.error = 0; - done = __io_poll_complete(req, mask); - io_commit_cqring(req->ctx); - } - spin_unlock(&ctx->completion_lock); - - if (mask) { - io_cqring_ev_posted(ctx); - if (done) - io_put_req(req); - } - return ipt.error; + ret = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events); + ret = ret ?: ipt.error; + if (ret) + __io_req_complete(req, issue_flags, ret, 0); + return 0; }
static int io_poll_update(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *preq; - bool completing; int ret2, ret = 0;
spin_lock(&ctx->completion_lock); preq = io_poll_find(ctx, req->poll_update.old_user_data, true); - if (!preq) { - ret = -ENOENT; -fail: + if (!preq || !io_poll_disarm(preq)) { spin_unlock(&ctx->completion_lock); + ret = preq ? -EALREADY : -ENOENT; goto out; } - io_poll_remove_double(preq); - /* - * Don't allow racy completion with singleshot, as we cannot safely - * update those. For multishot, if we're racing with completion, just - * let completion re-add it. - */ - completing = !__io_poll_remove_one(preq, &preq->poll, false); - if (completing && (preq->poll.events & EPOLLONESHOT)) { - ret = -EALREADY; - goto fail; - } spin_unlock(&ctx->completion_lock);
if (req->poll_update.update_events || req->poll_update.update_user_data) {
From: Jiapeng Chong jiapeng.chong@linux.alibaba.com
[ upstream commit c84b8a3fef663933007e885535591b9d30bdc860 ]
Fix the following clang warnings:
fs/io_uring.c:1195:20: warning: unused function 'req_ref_put' [-Wunused-function].
Fixes: aa43477b0402 ("io_uring: poll rework") Reported-by: Abaci Robot abaci@linux.alibaba.com Signed-off-by: Jiapeng Chong jiapeng.chong@linux.alibaba.com Link: https://lore.kernel.org/r/20220113162005.3011-1-jiapeng.chong@linux.alibaba.... Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 6 ------ 1 file changed, 6 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1155,12 +1155,6 @@ static inline bool req_ref_put_and_test( return atomic_dec_and_test(&req->refs); }
-static inline void req_ref_put(struct io_kiocb *req) -{ - WARN_ON_ONCE(!(req->flags & REQ_F_REFCOUNT)); - WARN_ON_ONCE(req_ref_put_and_test(req)); -} - static inline void req_ref_get(struct io_kiocb *req) { WARN_ON_ONCE(!(req->flags & REQ_F_REFCOUNT));
From: Jens Axboe axboe@kernel.dk
[ upstream commit 61bc84c4008812d784c398cfb54118c1ba396dfc ]
When the ring is exiting, as part of the shutdown, poll requests are removed. But io_poll_remove_all() does not remove entries when finding them, and since completions are done out-of-band, we can find and remove the same entry multiple times.
We do guard the poll execution by poll ownership, but that does not exclude us from reissuing a new one once the previous removal ownership goes away.
This can race with poll execution as well, where we then end up seeing req->apoll be NULL because a previous task_work requeue finished the request.
Remove the poll entry when we find it and get ownership of it. This prevents multiple invocations from finding it.
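For context, the "ownership" referred to here is the poll_refs scheme from the poll rework; a minimal sketch of how ownership is taken (modeled on the reworked poll code and shown only to make the race description concrete) is:

static inline bool io_poll_get_ownership(struct io_kiocb *req)
{
	/*
	 * Only the caller that bumps the ref part of poll_refs from zero
	 * owns the request and may run or queue its completion task_work.
	 */
	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
}

Because only one caller can win that increment at a time, deleting the hash entry while holding ownership ensures the cancel path runs at most once per request.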
Fixes: aa43477b0402 ("io_uring: poll rework") Reported-by: Dylan Yudaken dylany@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5720,6 +5720,7 @@ static bool io_poll_remove_all(struct io list = &ctx->cancel_hash[i]; hlist_for_each_entry_safe(req, tmp, list, hash_node) { if (io_match_task_safe(req, tsk, cancel_all)) { + hlist_del_init(&req->hash_node); io_poll_cancel_req(req); found = true; }
From: Jens Axboe axboe@kernel.dk
[ upstream commit e2c0cb7c0cc72939b61a7efee376206725796625 ]
The previous commit:
61bc84c40088 ("io_uring: remove poll entry from list when canceling all")
removed a potential overflow condition for the poll references. They are currently limited to 20 bits, even though we have 31 bits available. The upper bit is used to mark for cancelation.
Bump the poll ref space to 31 bits, making that kind of situation much harder to trigger in general. We'll separately add overflow checking and handling.
Fixes: aa43477b0402 ("io_uring: poll rework") Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5314,7 +5314,7 @@ struct io_poll_table { };
#define IO_POLL_CANCEL_FLAG BIT(31) -#define IO_POLL_REF_MASK ((1u << 20)-1) +#define IO_POLL_REF_MASK GENMASK(30, 0)
/* * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit c487a5ad48831afa6784b368ec40d0ee50f2fe1b ]
Don't forget to cancel all linked requests of the poll request when __io_arm_poll_handler() fails.
Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Link: https://lore.kernel.org/r/a78aad962460f9fdfe4aa4c0b62425c88f9415bc.165585224... Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5844,6 +5844,8 @@ static int io_poll_add(struct io_kiocb * ipt.pt._qproc = io_poll_queue_proc;
ret = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events); + if (!ret && ipt.error) + req_set_fail(req); ret = ret ?: ipt.error; if (ret) __io_req_complete(req, issue_flags, ret, 0);
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 9d2ad2947a53abf5e5e6527a9eeed50a3a4cbc72 ]
Leaving ipt.error set when a request was punted to task_work execution is problematic; don't forget to clear it.
Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Link: https://lore.kernel.org/r/a6c84ef4182c6962380aebe11b35bdcb25b0ccfb.165585224... Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5627,8 +5627,10 @@ static int __io_arm_poll_handler(struct
if (mask) { /* can't multishot if failed, just queue the event we've got */ - if (unlikely(ipt->error || !ipt->nr_entries)) + if (unlikely(ipt->error || !ipt->nr_entries)) { poll->events |= EPOLLONESHOT; + ipt->error = 0; + } __io_poll_execute(req, mask); return 0; }
From: Pavel Begunkov asml.silence@gmail.com
[ upstream commit 791f3465c4afde02d7f16cf7424ca87070b69396 ]
Fixes a problem described in 50252e4b5e989 ("aio: fix use-after-free due to missing POLLFREE handling") and copies the approach used there.
In short, we have to forcibly eject a poll entry when we meet POLLFREE. We can't rely on io_poll_get_ownership() as we can't wait for potentially running tw handlers, so we use the fact that wqs are RCU freed. See Eric's patch and comments for more details.
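For context only (not part of this patch, and struct example_obj below is hypothetical), the waitqueue owner's side of the POLLFREE contract looks roughly like this:

#include <linux/wait.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct example_obj {
	struct wait_queue_head wqh;
};

static void example_obj_free(struct example_obj *obj)
{
	/* Wake all waiters with EPOLLHUP | POLLFREE so they detach themselves. */
	wake_up_pollfree(&obj->wqh);
	/*
	 * Waiters such as io_uring may still touch the waitqueue under
	 * rcu_read_lock(), so the memory may only be freed after a grace
	 * period (kfree_rcu() would work as well).
	 */
	synchronize_rcu();
	kfree(obj);
}

This is why io_poll_remove_entries() can take rcu_read_lock() and re-check the head pointer instead of waiting for concurrent task_work.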
Reported-by: Eric Biggers ebiggers@google.com Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org Reported-and-tested-by: syzbot+5426c7ed6868c705ca14@syzkaller.appspotmail.com Fixes: 221c5eb233823 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Link: https://lore.kernel.org/r/4ed56b6f548f7ea337603a82315750449412748a.164216125... [axboe: drop non-functional change from patch] Signed-off-by: Jens Axboe axboe@kernel.dk [pavel: backport] Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/io_uring.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 50 insertions(+), 8 deletions(-)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5369,12 +5369,14 @@ static void io_init_poll_iocb(struct io_
static inline void io_poll_remove_entry(struct io_poll_iocb *poll) { - struct wait_queue_head *head = poll->head; + struct wait_queue_head *head = smp_load_acquire(&poll->head);
- spin_lock_irq(&head->lock); - list_del_init(&poll->wait.entry); - poll->head = NULL; - spin_unlock_irq(&head->lock); + if (head) { + spin_lock_irq(&head->lock); + list_del_init(&poll->wait.entry); + poll->head = NULL; + spin_unlock_irq(&head->lock); + } }
static void io_poll_remove_entries(struct io_kiocb *req) @@ -5382,10 +5384,26 @@ static void io_poll_remove_entries(struc struct io_poll_iocb *poll = io_poll_get_single(req); struct io_poll_iocb *poll_double = io_poll_get_double(req);
- if (poll->head) - io_poll_remove_entry(poll); - if (poll_double && poll_double->head) + /* + * While we hold the waitqueue lock and the waitqueue is nonempty, + * wake_up_pollfree() will wait for us. However, taking the waitqueue + * lock in the first place can race with the waitqueue being freed. + * + * We solve this as eventpoll does: by taking advantage of the fact that + * all users of wake_up_pollfree() will RCU-delay the actual free. If + * we enter rcu_read_lock() and see that the pointer to the queue is + * non-NULL, we can then lock it without the memory being freed out from + * under us. + * + * Keep holding rcu_read_lock() as long as we hold the queue lock, in + * case the caller deletes the entry from the queue, leaving it empty. + * In that case, only RCU prevents the queue memory from being freed. + */ + rcu_read_lock(); + io_poll_remove_entry(poll); + if (poll_double) io_poll_remove_entry(poll_double); + rcu_read_unlock(); }
/* @@ -5523,6 +5541,30 @@ static int io_poll_wake(struct wait_queu wait); __poll_t mask = key_to_poll(key);
+ if (unlikely(mask & POLLFREE)) { + io_poll_mark_cancelled(req); + /* we have to kick tw in case it's not already */ + io_poll_execute(req, 0); + + /* + * If the waitqueue is being freed early but someone is already + * holds ownership over it, we have to tear down the request as + * best we can. That means immediately removing the request from + * its waitqueue and preventing all further accesses to the + * waitqueue via the request. + */ + list_del_init(&poll->wait.entry); + + /* + * Careful: this *must* be the last step, since as soon + * as req->head is NULL'ed out, the request can be + * completed and freed, since aio_poll_complete_work() + * will no longer need to take the waitqueue lock. + */ + smp_store_release(&poll->head, NULL); + return 1; + } + /* for instances that support it check for an event match first */ if (mask && !(mask & poll->events)) return 0;
From: Jing Leng jleng@ambarella.com
commit 23a0cb8e3225122496bfa79172005c587c2d64bf upstream.
When building an external module, if users don't need to separate the compilation output and source code, they run the following command: "make -C $(LINUX_SRC_DIR) M=$(PWD)". At this point, "$(KBUILD_EXTMOD)" and "$(src)" are the same.
If they need to separate them, they run "make -C $(KERNEL_SRC_DIR) O=$(KERNEL_OUT_DIR) M=$(OUT_DIR) src=$(PWD)". Before running the command, they need to copy "Kbuild" or "Makefile" to "$(OUT_DIR)" to prevent compilation failure.
So the kernel should include the Makefile from $(src) instead, to avoid the copy operation.
Signed-off-by: Jing Leng jleng@ambarella.com [masahiro: I do not think "M=$(OUT_DIR) src=$(PWD)" is the official way, but this patch is a nice clean up anyway.] Signed-off-by: Masahiro Yamada masahiroy@kernel.org Signed-off-by: Nicolas Schier n.schier@avm.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- scripts/Makefile.modpost | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/scripts/Makefile.modpost +++ b/scripts/Makefile.modpost @@ -87,8 +87,7 @@ obj := $(KBUILD_EXTMOD) src := $(obj)
# Include the module's Makefile to find KBUILD_EXTRA_SYMBOLS -include $(if $(wildcard $(KBUILD_EXTMOD)/Kbuild), \ - $(KBUILD_EXTMOD)/Kbuild, $(KBUILD_EXTMOD)/Makefile) +include $(if $(wildcard $(src)/Kbuild), $(src)/Kbuild, $(src)/Makefile)
# modpost option for external modules MODPOST += -e
From: Luiz Augusto von Dentz luiz.von.dentz@intel.com
commit b840304fb46cdf7012722f456bce06f151b3e81b upstream.
This attempts to fix the following errors:
In function 'memcmp', inlined from 'bacmp' at ./include/net/bluetooth/bluetooth.h:347:9, inlined from 'l2cap_global_chan_by_psm' at net/bluetooth/l2cap_core.c:2003:15: ./include/linux/fortify-string.h:44:33: error: '__builtin_memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread] 44 | #define __underlying_memcmp __builtin_memcmp | ^ ./include/linux/fortify-string.h:420:16: note: in expansion of macro '__underlying_memcmp' 420 | return __underlying_memcmp(p, q, size); | ^~~~~~~~~~~~~~~~~~~ In function 'memcmp', inlined from 'bacmp' at ./include/net/bluetooth/bluetooth.h:347:9, inlined from 'l2cap_global_chan_by_psm' at net/bluetooth/l2cap_core.c:2004:15: ./include/linux/fortify-string.h:44:33: error: '__builtin_memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread] 44 | #define __underlying_memcmp __builtin_memcmp | ^ ./include/linux/fortify-string.h:420:16: note: in expansion of macro '__underlying_memcmp' 420 | return __underlying_memcmp(p, q, size); | ^~~~~~~~~~~~~~~~~~~
Fixes: 332f1795ca20 ("Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm regression") Signed-off-by: Luiz Augusto von Dentz luiz.von.dentz@intel.com Cc: Sudip Mukherjee sudipm.mukherjee@gmail.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- net/bluetooth/l2cap_core.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
--- a/net/bluetooth/l2cap_core.c +++ b/net/bluetooth/l2cap_core.c @@ -1992,11 +1992,11 @@ static struct l2cap_chan *l2cap_global_c src_match = !bacmp(&c->src, src); dst_match = !bacmp(&c->dst, dst); if (src_match && dst_match) { - c = l2cap_chan_hold_unless_zero(c); - if (c) { - read_unlock(&chan_list_lock); - return c; - } + if (!l2cap_chan_hold_unless_zero(c)) + continue; + + read_unlock(&chan_list_lock); + return c; }
/* Closest match */
From: Greg Kroah-Hartman gregkh@linuxfoundation.org
This reverts commit c968af565ca6c18b2f2af60fc1493c8db15abb3c which is commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 upstream.
It is reported to cause problems, so drop it from the stable trees for now until it gets sorted out.
Link: https://lore.kernel.org/r/47b775c5-57fa-5edf-b59e-8a9041ffbee7@candelatech.c... Reported-by: Ben Greear greearb@candelatech.com Cc: Stefan Roese sr@denx.de Cc: Bjorn Helgaas bhelgaas@google.com Cc: Pali Rohár pali@kernel.org Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Bharat Kumar Gogada bharat.kumar.gogada@xilinx.com Cc: Michal Simek michal.simek@xilinx.com Cc: Yao Hongbo yaohongbo@linux.alibaba.com Cc: Naveen Naidu naveennaidu479@gmail.com Cc: Sasha Levin sashal@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/pci/pcie/portdrv_core.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
--- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -222,8 +222,15 @@ static int get_port_device_capability(st
#ifdef CONFIG_PCIEAER if (dev->aer_cap && pci_aer_available() && - (pcie_ports_native || host->native_aer)) + (pcie_ports_native || host->native_aer)) { services |= PCIE_PORT_SERVICE_AER; + + /* + * Disable AER on this port in case it's been enabled by the + * BIOS (the AER service driver will enable it when necessary). + */ + pci_disable_pcie_error_reporting(dev); + } #endif
/* Root Ports and Root Complex Event Collectors may generate PMEs */
From: Lee Jones lee.jones@linaro.org
commit cd11d1a6114bd4bc6450ae59f6e110ec47362126 upstream.
It is possible for a malicious device to forgo submitting a Feature Report. The HID Steam driver presently makes no provision for this and dereferences the 'struct hid_report' pointer obtained from the HID device without first checking its validity. Let's change that.
Cc: Jiri Kosina jikos@kernel.org Cc: Benjamin Tissoires benjamin.tissoires@redhat.com Cc: linux-input@vger.kernel.org Fixes: c164d6abf3841 ("HID: add driver for Valve Steam Controller") Signed-off-by: Lee Jones lee.jones@linaro.org Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-steam.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
--- a/drivers/hid/hid-steam.c +++ b/drivers/hid/hid-steam.c @@ -134,6 +134,11 @@ static int steam_recv_report(struct stea int ret;
r = steam->hdev->report_enum[HID_FEATURE_REPORT].report_id_hash[0]; + if (!r) { + hid_err(steam->hdev, "No HID_FEATURE_REPORT submitted - nothing to read\n"); + return -EINVAL; + } + if (hid_report_len(r) < 64) return -EINVAL;
@@ -165,6 +170,11 @@ static int steam_send_report(struct stea int ret;
r = steam->hdev->report_enum[HID_FEATURE_REPORT].report_id_hash[0]; + if (!r) { + hid_err(steam->hdev, "No HID_FEATURE_REPORT submitted - nothing to read\n"); + return -EINVAL; + } + if (hid_report_len(r) < 64) return -EINVAL;
From: Vivek Kasireddy vivek.kasireddy@intel.com
commit 9e9fa6a9198b767b00f48160800128e83a038f9f upstream.
If the DMA mask is not set explicitly, the following warning occurs when userspace tries to access the dma-buf via the CPU, as reported by syzbot here:
WARNING: CPU: 1 PID: 3595 at kernel/dma/mapping.c:188 __dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 Modules linked in: CPU: 0 PID: 3595 Comm: syz-executor249 Not tainted 5.17.0-rc2-syzkaller-00316-g0457e5153e0e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__dma_map_sg_attrs+0x181/0x1f0 kernel/dma/mapping.c:188 Code: 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 75 71 4c 8b 3d c0 83 b5 0d e9 db fe ff ff e8 b6 0f 13 00 0f 0b e8 af 0f 13 00 <0f> 0b 45 31 e4 e9 54 ff ff ff e8 a0 0f 13 00 49 8d 7f 50 48 b8 00 RSP: 0018:ffffc90002a07d68 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88807e25e2c0 RSI: ffffffff81649e91 RDI: ffff88801b848408 RBP: ffff88801b848000 R08: 0000000000000002 R09: ffff88801d86c74f R10: ffffffff81649d72 R11: 0000000000000001 R12: 0000000000000002 R13: ffff88801d86c680 R14: 0000000000000001 R15: 0000000000000000 FS: 0000555556e30300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200000cc CR3: 000000001d74a000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dma_map_sgtable+0x70/0xf0 kernel/dma/mapping.c:264 get_sg_table.isra.0+0xe0/0x160 drivers/dma-buf/udmabuf.c:72 begin_cpu_udmabuf+0x130/0x1d0 drivers/dma-buf/udmabuf.c:126 dma_buf_begin_cpu_access+0xfd/0x1d0 drivers/dma-buf/dma-buf.c:1164 dma_buf_ioctl+0x259/0x2b0 drivers/dma-buf/dma-buf.c:363 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f62fcf530f9 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe3edab9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f62fcf530f9 RDX: 0000000020000200 RSI: 0000000040086200 RDI: 0000000000000006 RBP: 00007f62fcf170e0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f62fcf17170 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK>
v2: Don't forget to deregister if DMA mask setup fails.
Reported-by: syzbot+10e27961f4da37c443b2@syzkaller.appspotmail.com Cc: Gerd Hoffmann kraxel@redhat.com Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Link: http://patchwork.freedesktop.org/patch/msgid/20220520205235.3687336-1-vivek.... Signed-off-by: Gerd Hoffmann kraxel@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/dma-buf/udmabuf.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-)
--- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -368,7 +368,23 @@ static struct miscdevice udmabuf_misc =
static int __init udmabuf_dev_init(void) { - return misc_register(&udmabuf_misc); + int ret; + + ret = misc_register(&udmabuf_misc); + if (ret < 0) { + pr_err("Could not initialize udmabuf device\n"); + return ret; + } + + ret = dma_coerce_mask_and_coherent(udmabuf_misc.this_device, + DMA_BIT_MASK(64)); + if (ret < 0) { + pr_err("Could not setup DMA mask for udmabuf device\n"); + misc_deregister(&udmabuf_misc); + return ret; + } + + return 0; }
static void __exit udmabuf_dev_exit(void)
From: Dongliang Mu mudongliangabcd@gmail.com
commit 945a9a8e448b65bec055d37eba58f711b39f66f0 upstream.
The error handling code in pvr2_hdw_create forgets to unregister the v4l2 device. When pvr2_hdw_create returns back to pvr2_context_create, it calls pvr2_context_destroy to destroy the context, but mp->hdw is NULL, which means pvr2_hdw_destroy returns immediately.
Fix this by adding v4l2_device_unregister to decrease the refcount of the usb interface.
Reported-by: syzbot+77b432d57c4791183ed4@syzkaller.appspotmail.com Signed-off-by: Dongliang Mu mudongliangabcd@gmail.com Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/media/usb/pvrusb2/pvrusb2-hdw.c | 1 + 1 file changed, 1 insertion(+)
--- a/drivers/media/usb/pvrusb2/pvrusb2-hdw.c +++ b/drivers/media/usb/pvrusb2/pvrusb2-hdw.c @@ -2610,6 +2610,7 @@ struct pvr2_hdw *pvr2_hdw_create(struct del_timer_sync(&hdw->encoder_run_timer); del_timer_sync(&hdw->encoder_wait_timer); flush_work(&hdw->workpoll); + v4l2_device_unregister(&hdw->v4l2_dev); usb_free_urb(hdw->ctl_read_urb); usb_free_urb(hdw->ctl_write_urb); kfree(hdw->ctl_read_buffer);
From: Karthik Alapati mail@karthek.com
commit a5623a203cffe2d2b84d2f6c989d9017db1856af upstream.
Free the buffered reports before deleting the list entry.
BUG: memory leak unreferenced object 0xffff88810e72f180 (size 32): comm "softirq", pid 0, jiffies 4294945143 (age 16.080s) hex dump (first 32 bytes): 64 f3 c6 6a d1 88 07 04 00 00 00 00 00 00 00 00 d..j............ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814ac6c3>] kmemdup+0x23/0x50 mm/util.c:128 [<ffffffff8357c1d2>] kmemdup include/linux/fortify-string.h:440 [inline] [<ffffffff8357c1d2>] hidraw_report_event+0xa2/0x150 drivers/hid/hidraw.c:521 [<ffffffff8356ddad>] hid_report_raw_event+0x27d/0x740 drivers/hid/hid-core.c:1992 [<ffffffff8356e41e>] hid_input_report+0x1ae/0x270 drivers/hid/hid-core.c:2065 [<ffffffff835f0d3f>] hid_irq_in+0x1ff/0x250 drivers/hid/usbhid/hid-core.c:284 [<ffffffff82d3c7f9>] __usb_hcd_giveback_urb+0xf9/0x230 drivers/usb/core/hcd.c:1670 [<ffffffff82d3cc26>] usb_hcd_giveback_urb+0x1b6/0x1d0 drivers/usb/core/hcd.c:1747 [<ffffffff82ef1e14>] dummy_timer+0x8e4/0x14c0 drivers/usb/gadget/udc/dummy_hcd.c:1988 [<ffffffff812f50a8>] call_timer_fn+0x38/0x200 kernel/time/timer.c:1474 [<ffffffff812f5586>] expire_timers kernel/time/timer.c:1519 [inline] [<ffffffff812f5586>] __run_timers.part.0+0x316/0x430 kernel/time/timer.c:1790 [<ffffffff812f56e4>] __run_timers kernel/time/timer.c:1768 [inline] [<ffffffff812f56e4>] run_timer_softirq+0x44/0x90 kernel/time/timer.c:1803 [<ffffffff848000e6>] __do_softirq+0xe6/0x2ea kernel/softirq.c:571 [<ffffffff81246db0>] invoke_softirq kernel/softirq.c:445 [inline] [<ffffffff81246db0>] __irq_exit_rcu kernel/softirq.c:650 [inline] [<ffffffff81246db0>] irq_exit_rcu+0xc0/0x110 kernel/softirq.c:662 [<ffffffff84574f02>] sysvec_apic_timer_interrupt+0xa2/0xd0 arch/x86/kernel/apic/apic.c:1106 [<ffffffff84600c8b>] asm_sysvec_apic_timer_interrupt+0x1b/0x20 arch/x86/include/asm/idtentry.h:649 [<ffffffff8458a070>] native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline] [<ffffffff8458a070>] arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline] [<ffffffff8458a070>] acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] [<ffffffff8458a070>] acpi_idle_do_entry+0xc0/0xd0 drivers/acpi/processor_idle.c:554
Link: https://syzkaller.appspot.com/bug?id=19a04b43c75ed1092021010419b5e560a8172c4... Reported-by: syzbot+f59100a0428e6ded9443@syzkaller.appspotmail.com Signed-off-by: Karthik Alapati mail@karthek.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hidraw.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/hid/hidraw.c +++ b/drivers/hid/hidraw.c @@ -346,10 +346,13 @@ static int hidraw_release(struct inode * unsigned int minor = iminor(inode); struct hidraw_list *list = file->private_data; unsigned long flags; + int i;
mutex_lock(&minors_lock);
spin_lock_irqsave(&hidraw_table[minor]->list_lock, flags); + for (i = list->tail; i < list->head; i++) + kfree(list->buffer[i].value); list_del(&list->node); spin_unlock_irqrestore(&hidraw_table[minor]->list_lock, flags); kfree(list);
From: Hawkins Jiawei yin31149@gmail.com
commit 2a0133723f9ebeb751cfce19f74ec07e108bef1f upstream.
Syzkaller reports refcount bug as follows: ------------[ cut here ]------------ refcount_t: saturated; leaking memory. WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 Modules linked in: CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0 <TASK> __refcount_add_not_zero include/linux/refcount.h:163 [inline] __refcount_inc_not_zero include/linux/refcount.h:227 [inline] refcount_inc_not_zero include/linux/refcount.h:245 [inline] sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682 sk_backlog_rcv include/net/sock.h:1061 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2849 release_sock+0x54/0x1b0 net/core/sock.c:3404 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909 __sys_shutdown_sock net/socket.c:2331 [inline] __sys_shutdown_sock net/socket.c:2325 [inline] __sys_shutdown+0xf1/0x1b0 net/socket.c:2343 __do_sys_shutdown net/socket.c:2351 [inline] __se_sys_shutdown net/socket.c:2349 [inline] __x64_sys_shutdown+0x50/0x70 net/socket.c:2349 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK>
During the SMC fallback process in the connect syscall, the kernel replaces TCP with SMC. In order to forward wakeups to the smc socket waitqueue after fallback, the kernel sets clcsk->sk_user_data to the original smc socket in smc_fback_replace_callbacks().
Later, in the shutdown syscall, the kernel calls sk_psock_get(), which treats clcsk->sk_user_data as a psock pointer, triggering the refcnt warning.
So the root cause is that both smc and psock use the sk_user_data field, and they can easily mismatch it.
This patch solves it by using another flag bit (defined as SK_USER_DATA_PSOCK), masked out of the pointer by SK_USER_DATA_PTRMASK, to mark whether sk_user_data points to a psock object or not. This patch depends on the PTRMASK introduced in commit f1ff5ce2cd5e ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
Since more flags may be added to the sk_user_data field in the future, this patch also refactors the sk_user_data flags code to be more generic and easier to maintain.
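As a minimal usage sketch (the example_ helpers are made up; the macros are the ones introduced by this patch), a pointer stored without SK_USER_DATA_PSOCK simply becomes invisible to a psock lookup instead of being misinterpreted:

#include <net/sock.h>
#include <linux/skmsg.h>

static void example_store_plain(struct sock *sk, void *ptr)
{
	/* flags == 0: an ordinary, non-psock user pointer */
	rcu_assign_sk_user_data(sk, ptr);
}

static struct sk_psock *example_lookup_psock(const struct sock *sk)
{
	/*
	 * Caller must be in an RCU read-side critical section; returns NULL
	 * unless the pointer was stored with SK_USER_DATA_PSOCK set.
	 */
	return __rcu_dereference_sk_user_data_with_flags(sk, SK_USER_DATA_PSOCK);
}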
Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com Suggested-by: Jakub Kicinski kuba@kernel.org Acked-by: Wen Gu guwen@linux.alibaba.com Signed-off-by: Hawkins Jiawei yin31149@gmail.com Reviewed-by: Jakub Sitnicki jakub@cloudflare.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/skmsg.h | 3 +- include/net/sock.h | 68 +++++++++++++++++++++++++++++++++++--------------- net/core/skmsg.c | 4 ++ 3 files changed, 53 insertions(+), 22 deletions(-)
--- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -283,7 +283,8 @@ static inline void sk_msg_sg_copy_clear(
static inline struct sk_psock *sk_psock(const struct sock *sk) { - return rcu_dereference_sk_user_data(sk); + return __rcu_dereference_sk_user_data_with_flags(sk, + SK_USER_DATA_PSOCK); }
static inline void sk_psock_set_state(struct sk_psock *psock, --- a/include/net/sock.h +++ b/include/net/sock.h @@ -543,14 +543,26 @@ enum sk_pacing { SK_PACING_FQ = 2, };
-/* Pointer stored in sk_user_data might not be suitable for copying - * when cloning the socket. For instance, it can point to a reference - * counted object. sk_user_data bottom bit is set if pointer must not - * be copied. +/* flag bits in sk_user_data + * + * - SK_USER_DATA_NOCOPY: Pointer stored in sk_user_data might + * not be suitable for copying when cloning the socket. For instance, + * it can point to a reference counted object. sk_user_data bottom + * bit is set if pointer must not be copied. + * + * - SK_USER_DATA_BPF: Mark whether sk_user_data field is + * managed/owned by a BPF reuseport array. This bit should be set + * when sk_user_data's sk is added to the bpf's reuseport_array. + * + * - SK_USER_DATA_PSOCK: Mark whether pointer stored in + * sk_user_data points to psock type. This bit should be set + * when sk_user_data is assigned to a psock object. */ #define SK_USER_DATA_NOCOPY 1UL -#define SK_USER_DATA_BPF 2UL /* Managed by BPF */ -#define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF) +#define SK_USER_DATA_BPF 2UL +#define SK_USER_DATA_PSOCK 4UL +#define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF |\ + SK_USER_DATA_PSOCK)
/** * sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied @@ -563,24 +575,40 @@ static inline bool sk_user_data_is_nocop
#define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
+/** + * __rcu_dereference_sk_user_data_with_flags - return the pointer + * only if argument flags all has been set in sk_user_data. Otherwise + * return NULL + * + * @sk: socket + * @flags: flag bits + */ +static inline void * +__rcu_dereference_sk_user_data_with_flags(const struct sock *sk, + uintptr_t flags) +{ + uintptr_t sk_user_data = (uintptr_t)rcu_dereference(__sk_user_data(sk)); + + WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK); + + if ((sk_user_data & flags) == flags) + return (void *)(sk_user_data & SK_USER_DATA_PTRMASK); + return NULL; +} + #define rcu_dereference_sk_user_data(sk) \ + __rcu_dereference_sk_user_data_with_flags(sk, 0) +#define __rcu_assign_sk_user_data_with_flags(sk, ptr, flags) \ ({ \ - void *__tmp = rcu_dereference(__sk_user_data((sk))); \ - (void *)((uintptr_t)__tmp & SK_USER_DATA_PTRMASK); \ -}) -#define rcu_assign_sk_user_data(sk, ptr) \ -({ \ - uintptr_t __tmp = (uintptr_t)(ptr); \ - WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK); \ - rcu_assign_pointer(__sk_user_data((sk)), __tmp); \ -}) -#define rcu_assign_sk_user_data_nocopy(sk, ptr) \ -({ \ - uintptr_t __tmp = (uintptr_t)(ptr); \ - WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK); \ + uintptr_t __tmp1 = (uintptr_t)(ptr), \ + __tmp2 = (uintptr_t)(flags); \ + WARN_ON_ONCE(__tmp1 & ~SK_USER_DATA_PTRMASK); \ + WARN_ON_ONCE(__tmp2 & SK_USER_DATA_PTRMASK); \ rcu_assign_pointer(__sk_user_data((sk)), \ - __tmp | SK_USER_DATA_NOCOPY); \ + __tmp1 | __tmp2); \ }) +#define rcu_assign_sk_user_data(sk, ptr) \ + __rcu_assign_sk_user_data_with_flags(sk, ptr, 0)
/* * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -731,7 +731,9 @@ struct sk_psock *sk_psock_init(struct so sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED); refcount_set(&psock->refcnt, 1);
- rcu_assign_sk_user_data_nocopy(sk, psock); + __rcu_assign_sk_user_data_with_flags(sk, psock, + SK_USER_DATA_NOCOPY | + SK_USER_DATA_PSOCK); sock_hold(sk);
out:
From: Letu Ren fantasquex@gmail.com
commit 19f953e7435644b81332dd632ba1b2d80b1e37af upstream.
In `do_fb_ioctl()` of fbmem.c, if cmd is FBIOPUT_VSCREENINFO, var will be copied from user and then passed through `fb_set_var()` and `info->fbops->fb_check_var()`, which may be `pm2fb_check_var()`. Along the path, `var->pixclock` won't be modified. This function checks whether the reciprocal of `var->pixclock` is too high. If `var->pixclock` is zero, there will be a divide by zero error. So it is necessary to check whether the denominator is zero to avoid the crash. As this bug is found by Syzkaller, logs are listed below.
divide error in pm2fb_check_var Call Trace: <TASK> fb_set_var+0x367/0xeb0 drivers/video/fbdev/core/fbmem.c:1015 do_fb_ioctl+0x234/0x670 drivers/video/fbdev/core/fbmem.c:1110 fb_ioctl+0xdd/0x130 drivers/video/fbdev/core/fbmem.c:1189
Reported-by: Zheyu Ma zheyuma97@gmail.com Signed-off-by: Letu Ren fantasquex@gmail.com Signed-off-by: Helge Deller deller@gmx.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/video/fbdev/pm2fb.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/drivers/video/fbdev/pm2fb.c +++ b/drivers/video/fbdev/pm2fb.c @@ -617,6 +617,11 @@ static int pm2fb_check_var(struct fb_var return -EINVAL; }
+ if (!var->pixclock) { + DPRINTK("pixclock is zero\n"); + return -EINVAL; + } + if (PICOS2KHZ(var->pixclock) > PM2_MAX_PIXCLOCK) { DPRINTK("pixclock too high (%ldKHz)\n", PICOS2KHZ(var->pixclock));
From: Yang Jihong yangjihong1@huawei.com
commit c3b0f72e805f0801f05fa2aa52011c4bfc694c44 upstream.
ftrace_startup does not remove ops from ftrace_ops_list when ftrace_startup_enable fails:
register_ftrace_function
  ftrace_startup
    __register_ftrace_function
      ...
      add_ftrace_ops(&ftrace_ops_list, ops)
      ...
    ...
    ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
    ...
  return 0 // ops is in the ftrace_ops_list.
When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything:

unregister_ftrace_function
  ftrace_shutdown
    if (unlikely(ftrace_disabled))
      return -ENODEV; // return here, __unregister_ftrace_function is not executed,
                      // as a result, ops is still in the ftrace_ops_list
    __unregister_ftrace_function
    ...
If ops is dynamically allocated, it will be freed later; in this case, is_ftrace_trampoline accesses a NULL pointer:
is_ftrace_trampoline
  ftrace_ops_trampoline
    do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!
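A hypothetical, simplified caller (example_callback and example_register are illustrative names, not code from the tree) hits the resulting use-after-free like this:

#include <linux/ftrace.h>
#include <linux/slab.h>

static void example_callback(unsigned long ip, unsigned long parent_ip,
			     struct ftrace_ops *op, struct ftrace_regs *fregs)
{
}

static int example_register(void)
{
	struct ftrace_ops *ops;

	ops = kzalloc(sizeof(*ops), GFP_KERNEL);
	if (!ops)
		return -ENOMEM;
	ops->func = example_callback;

	register_ftrace_function(ops);   /* ftrace_startup_enable() fails, ops stays listed */
	unregister_ftrace_function(ops); /* returns -ENODEV before __unregister_ftrace_function() */
	kfree(ops);                      /* ftrace_ops_list still references the freed ops */
	return 0;
}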
Syzkaller reports as follows: [ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b [ 1203.508039] #PF: supervisor read access in kernel mode [ 1203.508798] #PF: error_code(0x0000) - not-present page [ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0 [ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI [ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8 [ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0 [ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00 [ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246 [ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866 [ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b [ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07 [ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399 [ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008 [ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000 [ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0 [ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Therefore, when ftrace_startup_enable fails, we need to roll back the registration process and remove ops from ftrace_ops_list.
Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com
Suggested-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Yang Jihong yangjihong1@huawei.com Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/trace/ftrace.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
--- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2901,6 +2901,16 @@ int ftrace_startup(struct ftrace_ops *op
ftrace_startup_enable(command);
+ /* + * If ftrace is in an undefined state, we just remove ops from list + * to prevent the NULL pointer, instead of totally rolling it back and + * free trampoline, because those actions could cause further damage. + */ + if (unlikely(ftrace_disabled)) { + __unregister_ftrace_function(ops); + return -ENODEV; + } + ops->flags &= ~FTRACE_OPS_FL_ADDING;
return 0;
From: Zhengchao Shao shaozhengchao@huawei.com
commit fd1894224407c484f652ad456e1ce423e89bb3eb upstream.
Syzbot found an issue [1]: fq_codel_drop() tries to drop a flow without any skbs, that is, the flow->head is null. The root cause, as [2] says, is that bpf_prog_test_run_skb() runs a bpf prog which redirects empty skbs. So we should determine whether the length of the packet, as modified by a bpf prog or by callers like bpf_prog_test, is valid before forwarding it directly.
LINK: [1] https://syzkaller.appspot.com/bug?id=0b84da80c2917757915afa89f7738a9d16ec96c... LINK: [2] https://www.spinics.net/lists/netdev/msg777503.html
Reported-by: syzbot+7a12909485b94426aceb@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com Reviewed-by: Stanislav Fomichev sdf@google.com Link: https://lore.kernel.org/r/20220715115559.139691-1-shaozhengchao@huawei.com Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/skbuff.h | 8 ++++++++ net/bpf/test_run.c | 3 +++ net/core/dev.c | 1 + 3 files changed, 12 insertions(+)
--- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2328,6 +2328,14 @@ static inline void skb_set_tail_pointer(
#endif /* NET_SKBUFF_DATA_USES_OFFSET */
+static inline void skb_assert_len(struct sk_buff *skb) +{ +#ifdef CONFIG_DEBUG_NET + if (WARN_ONCE(!skb->len, "%s\n", __func__)) + DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false); +#endif /* CONFIG_DEBUG_NET */ +} + /* * Add data to an sk_buff */ --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -469,6 +469,9 @@ static int convert___skb_to_skb(struct s { struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;
+ if (!skb->len) + return -EINVAL; + if (!__skb) return 0;
--- a/net/core/dev.c +++ b/net/core/dev.c @@ -4147,6 +4147,7 @@ static int __dev_queue_xmit(struct sk_bu bool again = false;
skb_reset_mac_header(skb); + skb_assert_len(skb);
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
From: Jann Horn jannh@google.com
commit 2555283eb40df89945557273121e9393ef9b542b upstream.
anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma.
anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA.
This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks.
Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma.
Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko mhocko@suse.com Acked-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Jann Horn jannh@google.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/rmap.h | 7 +++++-- mm/rmap.c | 29 ++++++++++++++++------------- 2 files changed, 21 insertions(+), 15 deletions(-)
--- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -39,12 +39,15 @@ struct anon_vma { atomic_t refcount;
/* - * Count of child anon_vmas and VMAs which points to this anon_vma. + * Count of child anon_vmas. Equals to the count of all anon_vmas that + * have ->parent pointing to this one, including itself. * * This counter is used for making decision about reusing anon_vma * instead of forking new one. See comments in function anon_vma_clone. */ - unsigned degree; + unsigned long num_children; + /* Count of VMAs whose ->anon_vma pointer points to this object. */ + unsigned long num_active_vmas;
struct anon_vma *parent; /* Parent of this anon_vma */
--- a/mm/rmap.c +++ b/mm/rmap.c @@ -90,7 +90,8 @@ static inline struct anon_vma *anon_vma_ anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); if (anon_vma) { atomic_set(&anon_vma->refcount, 1); - anon_vma->degree = 1; /* Reference for first vma */ + anon_vma->num_children = 0; + anon_vma->num_active_vmas = 0; anon_vma->parent = anon_vma; /* * Initialise the anon_vma root to point to itself. If called @@ -198,6 +199,7 @@ int __anon_vma_prepare(struct vm_area_st anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) goto out_enomem_free_avc; + anon_vma->num_children++; /* self-parent link for new root */ allocated = anon_vma; }
@@ -207,8 +209,7 @@ int __anon_vma_prepare(struct vm_area_st if (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; anon_vma_chain_link(vma, avc, anon_vma); - /* vma reference or self-parent link for new root */ - anon_vma->degree++; + anon_vma->num_active_vmas++; allocated = NULL; avc = NULL; } @@ -293,19 +294,19 @@ int anon_vma_clone(struct vm_area_struct anon_vma_chain_link(dst, avc, anon_vma);
/* - * Reuse existing anon_vma if its degree lower than two, - * that means it has no vma and only one anon_vma child. + * Reuse existing anon_vma if it has no vma and only one + * anon_vma child. * - * Do not chose parent anon_vma, otherwise first child - * will always reuse it. Root anon_vma is never reused: + * Root anon_vma is never reused: * it has self-parent reference and at least one child. */ if (!dst->anon_vma && src->anon_vma && - anon_vma != src->anon_vma && anon_vma->degree < 2) + anon_vma->num_children < 2 && + anon_vma->num_active_vmas == 0) dst->anon_vma = anon_vma; } if (dst->anon_vma) - dst->anon_vma->degree++; + dst->anon_vma->num_active_vmas++; unlock_anon_vma_root(root); return 0;
@@ -355,6 +356,7 @@ int anon_vma_fork(struct vm_area_struct anon_vma = anon_vma_alloc(); if (!anon_vma) goto out_error; + anon_vma->num_active_vmas++; avc = anon_vma_chain_alloc(GFP_KERNEL); if (!avc) goto out_error_free_anon_vma; @@ -375,7 +377,7 @@ int anon_vma_fork(struct vm_area_struct vma->anon_vma = anon_vma; anon_vma_lock_write(anon_vma); anon_vma_chain_link(vma, avc, anon_vma); - anon_vma->parent->degree++; + anon_vma->parent->num_children++; anon_vma_unlock_write(anon_vma);
return 0; @@ -407,7 +409,7 @@ void unlink_anon_vmas(struct vm_area_str * to free them outside the lock. */ if (RB_EMPTY_ROOT(&anon_vma->rb_root.rb_root)) { - anon_vma->parent->degree--; + anon_vma->parent->num_children--; continue; }
@@ -415,7 +417,7 @@ void unlink_anon_vmas(struct vm_area_str anon_vma_chain_free(avc); } if (vma->anon_vma) { - vma->anon_vma->degree--; + vma->anon_vma->num_active_vmas--;
/* * vma would still be needed after unlink, and anon_vma will be prepared @@ -433,7 +435,8 @@ void unlink_anon_vmas(struct vm_area_str list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { struct anon_vma *anon_vma = avc->anon_vma;
- VM_WARN_ON(anon_vma->degree); + VM_WARN_ON(anon_vma->num_children); + VM_WARN_ON(anon_vma->num_active_vmas); put_anon_vma(anon_vma);
list_del(&avc->same_vma);
From: Takashi Iwai tiwai@suse.de
commit 5f3d9e8161bb8cb23ab3b4678cd13f6e90a06186 upstream.
The USB DAC from LH Labs (2522:0007) seems to require the same quirk as the Sony Walkman to set up the interface like UAC1; otherwise it produces constant "usb_set_interface failed (-71)" errors. This patch adds a quirk entry for addressing the buggy behavior.
Reported-by: Lennert Van Alboom lennert@vanalboom.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/T3VPXtCc4uFws9Gfh2RjX6OdwM1RqfC6VqQr--_LMDyB2x5N3p... Link: https://lore.kernel.org/r/20220828074143.14736-1-tiwai@suse.de Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/usb/quirks.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/sound/usb/quirks.c +++ b/sound/usb/quirks.c @@ -1903,6 +1903,8 @@ static const struct usb_audio_quirk_flag QUIRK_FLAG_SHARE_MEDIA_DEVICE | QUIRK_FLAG_ALIGN_TRANSFER), DEVICE_FLG(0x21b4, 0x0081, /* AudioQuest DragonFly */ QUIRK_FLAG_GET_SAMPLE_RATE), + DEVICE_FLG(0x2522, 0x0007, /* LH Labs Geek Out HD Audio 1V5 */ + QUIRK_FLAG_SET_IFACE_FIRST), DEVICE_FLG(0x2708, 0x0002, /* Audient iD14 */ QUIRK_FLAG_IGNORE_CTL_ERROR), DEVICE_FLG(0x2912, 0x30c8, /* Audioengine D1 */
From: Steev Klimaszewski steev@kali.org
commit 3a47fa7b14c7d9613909a844aba27f99d3c58634 upstream.
Similar to the Surface Go devices, the Elantech touchscreen/digitizer in the Lenovo Yoga C630 mistakenly reports a stylus battery and always reports it as empty.
Apply the HID_BATTERY_QUIRK_IGNORE quirk to ignore this battery and prevent the erroneous low battery warnings.
Signed-off-by: Steev Klimaszewski steev@kali.org Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-ids.h | 1 + drivers/hid/hid-input.c | 2 ++ 2 files changed, 3 insertions(+)
--- a/drivers/hid/hid-ids.h +++ b/drivers/hid/hid-ids.h @@ -399,6 +399,7 @@ #define USB_DEVICE_ID_ASUS_UX550_TOUCHSCREEN 0x2706 #define I2C_DEVICE_ID_SURFACE_GO_TOUCHSCREEN 0x261A #define I2C_DEVICE_ID_SURFACE_GO2_TOUCHSCREEN 0x2A1C +#define I2C_DEVICE_ID_LENOVO_YOGA_C630_TOUCHSCREEN 0x279F
#define USB_VENDOR_ID_ELECOM 0x056e #define USB_DEVICE_ID_ELECOM_BM084 0x0061 --- a/drivers/hid/hid-input.c +++ b/drivers/hid/hid-input.c @@ -335,6 +335,8 @@ static const struct hid_device_id hid_ba HID_BATTERY_QUIRK_IGNORE }, { HID_I2C_DEVICE(USB_VENDOR_ID_ELAN, I2C_DEVICE_ID_SURFACE_GO2_TOUCHSCREEN), HID_BATTERY_QUIRK_IGNORE }, + { HID_I2C_DEVICE(USB_VENDOR_ID_ELAN, I2C_DEVICE_ID_LENOVO_YOGA_C630_TOUCHSCREEN), + HID_BATTERY_QUIRK_IGNORE }, {} };
From: Akihiko Odaki akihiko.odaki@gmail.com
commit adada3f4930ac084740ea340bd8e94028eba4f22 upstream.
Google Chromebooks use the Chrome OS Embedded Controller Sensor Hub instead of Sensor Hub Fusion and leave MP2 uninitialized, which disables all functionality, including the registers necessary for feature detection.
The behavior was observed with Lenovo ThinkPad C13 Yoga.
Signed-off-by: Akihiko Odaki akihiko.odaki@gmail.com Suggested-by: Mario Limonciello mario.limonciello@amd.com Acked-by: Basavaraj Natikar Basavaraj.Natikar@amd.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/amd-sfh-hid/amd_sfh_pcie.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
--- a/drivers/hid/amd-sfh-hid/amd_sfh_pcie.c +++ b/drivers/hid/amd-sfh-hid/amd_sfh_pcie.c @@ -281,11 +281,29 @@ static int amd_sfh_irq_init(struct amd_m return 0; }
+static const struct dmi_system_id dmi_nodevs[] = { + { + /* + * Google Chromebooks use Chrome OS Embedded Controller Sensor + * Hub instead of Sensor Hub Fusion and leaves MP2 + * uninitialized, which disables all functionalities, even + * including the registers necessary for feature detections. + */ + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "Google"), + }, + }, + { } +}; + static int amd_mp2_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct amd_mp2_dev *privdata; int rc;
+ if (dmi_first_match(dmi_nodevs)) + return -ENODEV; + privdata = devm_kzalloc(&pdev->dev, sizeof(*privdata), GFP_KERNEL); if (!privdata) return -ENOMEM;
From: Josh Kilmer srjek2@gmail.com
commit 1c0cc9d11c665020cbeb80e660fb8929164407f4 upstream.
On an Asus G513QY, of the 5 bytes in a 0x5a report, only the first byte is a meaningful keycode. The other bytes are zeroed out or hold garbage from the last packet sent to the keyboard.
This patch fixes up the report descriptor for this event so that the general HID code will only process 1 byte for keycodes, avoiding spurious key events and unmapped Asus vendor usage page code warnings.
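For context, a hedged annotation of the two descriptor items the fixup checks (the helper below is hypothetical, not part of hid-asus.c): in HID report descriptors, 0x85 introduces a Report ID item and 0x95 a Report Count item.

#include <linux/types.h>

/* Hypothetical helper annotating the descriptor bytes the fixup keys on:
 *   rdesc[190..191] = 0x85 0x5a -> Report ID 0x5a
 *   rdesc[204..205] = 0x95 0x05 -> Report Count 5
 */
static bool asus_0x5a_report_has_count_5(const __u8 *rdesc, unsigned int rsize)
{
	return rsize == 331 && rdesc[190] == 0x85 && rdesc[191] == 0x5a &&
	       rdesc[204] == 0x95 && rdesc[205] == 0x05;
}
/* Rewriting rdesc[205] to 0x01 (Report Count 1) makes the HID core parse
 * only the first byte of each 0x5a report as a keycode. */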
Signed-off-by: Josh Kilmer srjek2@gmail.com Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-asus.c | 7 +++++++ 1 file changed, 7 insertions(+)
--- a/drivers/hid/hid-asus.c +++ b/drivers/hid/hid-asus.c @@ -1212,6 +1212,13 @@ static __u8 *asus_report_fixup(struct hi rdesc = new_rdesc; }
+ if (drvdata->quirks & QUIRK_ROG_NKEY_KEYBOARD && + *rsize == 331 && rdesc[190] == 0x85 && rdesc[191] == 0x5a && + rdesc[204] == 0x95 && rdesc[205] == 0x05) { + hid_info(hdev, "Fixing up Asus N-KEY keyb report descriptor\n"); + rdesc[205] = 0x01; + } + return rdesc; }
From: Michael Hübner michaelh.95@t-online.de
commit d9a17651f3749e69890db57ca66e677dfee70829 upstream.
Add device id for the Sparco R383 Mod wheel.
Fix wheel info array length to match actual wheel count present in the array.
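As a side note, a hedged sketch of the alternative design choice: deriving the length at compile time with ARRAY_SIZE() instead of hardcoding it, so adding an entry can never leave the count stale. The structure and table below are stand-ins, not the driver's actual definitions.

#include <linux/kernel.h>	/* ARRAY_SIZE() */
#include <linux/types.h>

struct wheel_info {		/* stand-in for the driver's tm_wheel_info */
	u16 model_id;
	u16 attachment;
	const char *name;
};

static const struct wheel_info wheel_infos[] = {
	{ 0x0206, 0x0005, "Thrustmaster T300RS" },
	{ 0x020a, 0x0005, "Thrustmaster T300RS (Sparco R383 Mod)" },
};

/* Length always tracks the number of entries in the table above. */
static const u8 wheel_infos_length = ARRAY_SIZE(wheel_infos);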
Signed-off-by: Michael Hübner michaelh.95@t-online.de Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/hid/hid-thrustmaster.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/hid/hid-thrustmaster.c +++ b/drivers/hid/hid-thrustmaster.c @@ -67,12 +67,13 @@ static const struct tm_wheel_info tm_whe {0x0200, 0x0005, "Thrustmaster T300RS (Missing Attachment)"}, {0x0206, 0x0005, "Thrustmaster T300RS"}, {0x0209, 0x0005, "Thrustmaster T300RS (Open Wheel Attachment)"}, + {0x020a, 0x0005, "Thrustmaster T300RS (Sparco R383 Mod)"}, {0x0204, 0x0005, "Thrustmaster T300 Ferrari Alcantara Edition"}, {0x0002, 0x0002, "Thrustmaster T500RS"} //{0x0407, 0x0001, "Thrustmaster TMX"} };
-static const uint8_t tm_wheels_infos_length = 4; +static const uint8_t tm_wheels_infos_length = 7;
/* * This structs contains (in little endian) the response data
From: Chris Wilson chris.p.wilson@intel.com
[ Upstream commit e5a95c83ed1492c0f442b448b20c90c8faaf702b ]
Skip all further TLB invalidations once the device is wedged and has been reset, as, in such cases, it can no longer process instructions on the GPU and the user no longer has access to the TLBs in each engine.
So, an attempt to do a TLB cache invalidation will produce a timeout.
That helps to reduce the performance regression introduced by the TLB invalidation logic.
Cc: stable@vger.kernel.org Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store") Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Tvrtko Ursulin tvrtko.ursulin@intel.com Reviewed-by: Andi Shyti andi.shyti@linux.intel.com Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Link: https://patchwork.freedesktop.org/patch/msgid/5aa86564b9ec5fe7fe605c1dd7de76... (cherry picked from commit be0366f168033374a93e4c43fdaa1a90ab905184) Signed-off-by: Rodrigo Vivi rodrigo.vivi@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 3a76000d15bfd..ed8ad3b263959 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -949,6 +949,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) if (I915_SELFTEST_ONLY(gt->awake == -ENODEV)) return;
+ if (intel_gt_is_wedged(gt)) + return; + if (GRAPHICS_VER(i915) == 12) { regs = gen12_regs; num = ARRAY_SIZE(gen12_regs);
From: Wenbin Mei wenbin.mei@mediatek.com
[ Upstream commit cc5d1692600613e72f32af60e27330fe0c79f4fe ]
Currently we don't clear MSDC interrupts when the cqe is turned off or disabled, so the data-complete interrupt can be left pending for the next command. If the next command after cqe off/disable involves a data transfer, we process the CMD-ready interrupt and trigger DMA start for the data, but the data-complete interrupt already exists, so SW assumes the data transfer is complete and triggers DMA stop, even though the data may not have been transmitted yet or is still transmitting. We may then encounter the following error: mtk-msdc 11230000.mmc: CMD bus busy detected.
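The fix follows the common write-1-to-clear pattern: reading the interrupt status register and writing the same value back acknowledges whatever is currently pending. A minimal sketch of the pattern, with a hypothetical register name standing in for MSDC_INT (not the mtk-sd code itself):

#include <linux/io.h>
#include <linux/types.h>

#define HOST_INT_STATUS	0x0c	/* hypothetical W1C status register offset */

/* Acknowledge all pending events in a write-1-to-clear status register. */
static void host_clear_pending_irqs(void __iomem *base)
{
	u32 val;

	val = readl(base + HOST_INT_STATUS);	/* snapshot pending bits */
	writel(val, base + HOST_INT_STATUS);	/* write 1s back to clear them */
}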
Signed-off-by: Wenbin Mei wenbin.mei@mediatek.com Fixes: 88bd652b3c74 ("mmc: mediatek: command queue support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220728080048.21336-1-wenbin.mei@mediatek.com Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/mmc/host/mtk-sd.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c index f9b2897569bb4..99d8881a7d6c2 100644 --- a/drivers/mmc/host/mtk-sd.c +++ b/drivers/mmc/host/mtk-sd.c @@ -2345,6 +2345,9 @@ static void msdc_cqe_disable(struct mmc_host *mmc, bool recovery) /* disable busy check */ sdr_clr_bits(host->base + MSDC_PATCH_BIT1, MSDC_PB1_BUSY_CHECK_SEL);
+ val = readl(host->base + MSDC_INT); + writel(val, host->base + MSDC_INT); + if (recovery) { sdr_set_field(host->base + MSDC_DMA_CTRL, MSDC_DMA_CTRL_STOP, 1); @@ -2785,11 +2788,14 @@ static int __maybe_unused msdc_suspend(struct device *dev) { struct mmc_host *mmc = dev_get_drvdata(dev); int ret; + u32 val;
if (mmc->caps2 & MMC_CAP2_CQE) { ret = cqhci_suspend(mmc); if (ret) return ret; + val = readl(((struct msdc_host *)mmc_priv(mmc))->base + MSDC_INT); + writel(val, ((struct msdc_host *)mmc_priv(mmc))->base + MSDC_INT); }
return pm_runtime_force_suspend(dev);
From: Yifeng Zhao yifeng.zhao@rock-chips.com
[ Upstream commit 70f832206fe72e9998b46363e8e59e89b0b757bc ]
The reset function built into the SDHCI will not reset the logic circuit related to the tuning function, which may cause data read errors. Resetting the complete SDHCI controller through the reset controller fixes the issue.
Signed-off-by: Yifeng Zhao yifeng.zhao@rock-chips.com [rebase, use optional variant of reset getter] Acked-by: Adrian Hunter adrian.hunter@intel.com Signed-off-by: Sebastian Reichel sebastian.reichel@collabora.com Link: https://lore.kernel.org/r/20220504213251.264819-10-sebastian.reichel@collabo... Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/mmc/host/sdhci-of-dwcmshc.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c index bac874ab0b33a..3a1b5ba364051 100644 --- a/drivers/mmc/host/sdhci-of-dwcmshc.c +++ b/drivers/mmc/host/sdhci-of-dwcmshc.c @@ -15,6 +15,7 @@ #include <linux/module.h> #include <linux/of.h> #include <linux/of_device.h> +#include <linux/reset.h> #include <linux/sizes.h>
#include "sdhci-pltfm.h" @@ -63,6 +64,7 @@ struct rk3568_priv { /* Rockchip specified optional clocks */ struct clk_bulk_data rockchip_clks[RK3568_MAX_CLKS]; + struct reset_control *reset; u8 txclk_tapnum; };
@@ -255,6 +257,21 @@ static void dwcmshc_rk3568_set_clock(struct sdhci_host *host, unsigned int clock sdhci_writel(host, extra, DWCMSHC_EMMC_DLL_STRBIN); }
+static void rk35xx_sdhci_reset(struct sdhci_host *host, u8 mask) +{ + struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); + struct dwcmshc_priv *dwc_priv = sdhci_pltfm_priv(pltfm_host); + struct rk35xx_priv *priv = dwc_priv->priv; + + if (mask & SDHCI_RESET_ALL && priv->reset) { + reset_control_assert(priv->reset); + udelay(1); + reset_control_deassert(priv->reset); + } + + sdhci_reset(host, mask); +} + static const struct sdhci_ops sdhci_dwcmshc_ops = { .set_clock = sdhci_set_clock, .set_bus_width = sdhci_set_bus_width, @@ -269,7 +286,7 @@ static const struct sdhci_ops sdhci_dwcmshc_rk3568_ops = { .set_bus_width = sdhci_set_bus_width, .set_uhs_signaling = dwcmshc_set_uhs_signaling, .get_max_clock = sdhci_pltfm_clk_get_max_clock, - .reset = sdhci_reset, + .reset = rk35xx_sdhci_reset, .adma_write_desc = dwcmshc_adma_write_desc, };
@@ -292,6 +309,13 @@ static int dwcmshc_rk3568_init(struct sdhci_host *host, struct dwcmshc_priv *dwc int err; struct rk3568_priv *priv = dwc_priv->priv;
+ priv->reset = devm_reset_control_array_get_optional_exclusive(mmc_dev(host->mmc)); + if (IS_ERR(priv->reset)) { + err = PTR_ERR(priv->reset); + dev_err(mmc_dev(host->mmc), "failed to get reset control %d\n", err); + return err; + } + priv->rockchip_clks[0].id = "axi"; priv->rockchip_clks[1].id = "block"; priv->rockchip_clks[2].id = "timer";
From: Sebastian Reichel sebastian.reichel@collabora.com
[ Upstream commit 86e1a8e1f9b555af342c53ae06284eeeab9a4263 ]
Prepare driver for rk3588 support by renaming the internal data structures.
Acked-by: Adrian Hunter adrian.hunter@intel.com Signed-off-by: Sebastian Reichel sebastian.reichel@collabora.com Link: https://lore.kernel.org/r/20220504213251.264819-11-sebastian.reichel@collabo... Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/mmc/host/sdhci-of-dwcmshc.c | 46 ++++++++++++++--------------- 1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c index 3a1b5ba364051..f5fd88c7adef1 100644 --- a/drivers/mmc/host/sdhci-of-dwcmshc.c +++ b/drivers/mmc/host/sdhci-of-dwcmshc.c @@ -56,14 +56,14 @@ #define DLL_LOCK_WO_TMOUT(x) \ ((((x) & DWCMSHC_EMMC_DLL_LOCKED) == DWCMSHC_EMMC_DLL_LOCKED) && \ (((x) & DWCMSHC_EMMC_DLL_TIMEOUT) == 0)) -#define RK3568_MAX_CLKS 3 +#define RK35xx_MAX_CLKS 3
#define BOUNDARY_OK(addr, len) \ ((addr | (SZ_128M - 1)) == ((addr + len - 1) | (SZ_128M - 1)))
-struct rk3568_priv { +struct rk35xx_priv { /* Rockchip specified optional clocks */ - struct clk_bulk_data rockchip_clks[RK3568_MAX_CLKS]; + struct clk_bulk_data rockchip_clks[RK35xx_MAX_CLKS]; struct reset_control *reset; u8 txclk_tapnum; }; @@ -178,7 +178,7 @@ static void dwcmshc_rk3568_set_clock(struct sdhci_host *host, unsigned int clock { struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); struct dwcmshc_priv *dwc_priv = sdhci_pltfm_priv(pltfm_host); - struct rk3568_priv *priv = dwc_priv->priv; + struct rk35xx_priv *priv = dwc_priv->priv; u8 txclk_tapnum = DLL_TXCLK_TAPNUM_DEFAULT; u32 extra, reg; int err; @@ -281,7 +281,7 @@ static const struct sdhci_ops sdhci_dwcmshc_ops = { .adma_write_desc = dwcmshc_adma_write_desc, };
-static const struct sdhci_ops sdhci_dwcmshc_rk3568_ops = { +static const struct sdhci_ops sdhci_dwcmshc_rk35xx_ops = { .set_clock = dwcmshc_rk3568_set_clock, .set_bus_width = sdhci_set_bus_width, .set_uhs_signaling = dwcmshc_set_uhs_signaling, @@ -296,18 +296,18 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_pdata = { .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN, };
-static const struct sdhci_pltfm_data sdhci_dwcmshc_rk3568_pdata = { - .ops = &sdhci_dwcmshc_rk3568_ops, +static const struct sdhci_pltfm_data sdhci_dwcmshc_rk35xx_pdata = { + .ops = &sdhci_dwcmshc_rk35xx_ops, .quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN | SDHCI_QUIRK_BROKEN_TIMEOUT_VAL, .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN | SDHCI_QUIRK2_CLOCK_DIV_ZERO_BROKEN, };
-static int dwcmshc_rk3568_init(struct sdhci_host *host, struct dwcmshc_priv *dwc_priv) +static int dwcmshc_rk35xx_init(struct sdhci_host *host, struct dwcmshc_priv *dwc_priv) { int err; - struct rk3568_priv *priv = dwc_priv->priv; + struct rk35xx_priv *priv = dwc_priv->priv;
priv->reset = devm_reset_control_array_get_optional_exclusive(mmc_dev(host->mmc)); if (IS_ERR(priv->reset)) { @@ -319,14 +319,14 @@ static int dwcmshc_rk3568_init(struct sdhci_host *host, struct dwcmshc_priv *dwc priv->rockchip_clks[0].id = "axi"; priv->rockchip_clks[1].id = "block"; priv->rockchip_clks[2].id = "timer"; - err = devm_clk_bulk_get_optional(mmc_dev(host->mmc), RK3568_MAX_CLKS, + err = devm_clk_bulk_get_optional(mmc_dev(host->mmc), RK35xx_MAX_CLKS, priv->rockchip_clks); if (err) { dev_err(mmc_dev(host->mmc), "failed to get clocks %d\n", err); return err; }
- err = clk_bulk_prepare_enable(RK3568_MAX_CLKS, priv->rockchip_clks); + err = clk_bulk_prepare_enable(RK35xx_MAX_CLKS, priv->rockchip_clks); if (err) { dev_err(mmc_dev(host->mmc), "failed to enable clocks %d\n", err); return err; @@ -348,7 +348,7 @@ static int dwcmshc_rk3568_init(struct sdhci_host *host, struct dwcmshc_priv *dwc static const struct of_device_id sdhci_dwcmshc_dt_ids[] = { { .compatible = "rockchip,rk3568-dwcmshc", - .data = &sdhci_dwcmshc_rk3568_pdata, + .data = &sdhci_dwcmshc_rk35xx_pdata, }, { .compatible = "snps,dwcmshc-sdhci", @@ -371,7 +371,7 @@ static int dwcmshc_probe(struct platform_device *pdev) struct sdhci_pltfm_host *pltfm_host; struct sdhci_host *host; struct dwcmshc_priv *priv; - struct rk3568_priv *rk_priv = NULL; + struct rk35xx_priv *rk_priv = NULL; const struct sdhci_pltfm_data *pltfm_data; int err; u32 extra; @@ -426,8 +426,8 @@ static int dwcmshc_probe(struct platform_device *pdev) host->mmc_host_ops.request = dwcmshc_request; host->mmc_host_ops.hs400_enhanced_strobe = dwcmshc_hs400_enhanced_strobe;
- if (pltfm_data == &sdhci_dwcmshc_rk3568_pdata) { - rk_priv = devm_kzalloc(&pdev->dev, sizeof(struct rk3568_priv), GFP_KERNEL); + if (pltfm_data == &sdhci_dwcmshc_rk35xx_pdata) { + rk_priv = devm_kzalloc(&pdev->dev, sizeof(struct rk35xx_priv), GFP_KERNEL); if (!rk_priv) { err = -ENOMEM; goto err_clk; @@ -435,7 +435,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
priv->priv = rk_priv;
- err = dwcmshc_rk3568_init(host, priv); + err = dwcmshc_rk35xx_init(host, priv); if (err) goto err_clk; } @@ -452,7 +452,7 @@ static int dwcmshc_probe(struct platform_device *pdev) clk_disable_unprepare(pltfm_host->clk); clk_disable_unprepare(priv->bus_clk); if (rk_priv) - clk_bulk_disable_unprepare(RK3568_MAX_CLKS, + clk_bulk_disable_unprepare(RK35xx_MAX_CLKS, rk_priv->rockchip_clks); free_pltfm: sdhci_pltfm_free(pdev); @@ -464,14 +464,14 @@ static int dwcmshc_remove(struct platform_device *pdev) struct sdhci_host *host = platform_get_drvdata(pdev); struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); struct dwcmshc_priv *priv = sdhci_pltfm_priv(pltfm_host); - struct rk3568_priv *rk_priv = priv->priv; + struct rk35xx_priv *rk_priv = priv->priv;
sdhci_remove_host(host, 0);
clk_disable_unprepare(pltfm_host->clk); clk_disable_unprepare(priv->bus_clk); if (rk_priv) - clk_bulk_disable_unprepare(RK3568_MAX_CLKS, + clk_bulk_disable_unprepare(RK35xx_MAX_CLKS, rk_priv->rockchip_clks); sdhci_pltfm_free(pdev);
@@ -484,7 +484,7 @@ static int dwcmshc_suspend(struct device *dev) struct sdhci_host *host = dev_get_drvdata(dev); struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); struct dwcmshc_priv *priv = sdhci_pltfm_priv(pltfm_host); - struct rk3568_priv *rk_priv = priv->priv; + struct rk35xx_priv *rk_priv = priv->priv; int ret;
ret = sdhci_suspend_host(host); @@ -496,7 +496,7 @@ static int dwcmshc_suspend(struct device *dev) clk_disable_unprepare(priv->bus_clk);
if (rk_priv) - clk_bulk_disable_unprepare(RK3568_MAX_CLKS, + clk_bulk_disable_unprepare(RK35xx_MAX_CLKS, rk_priv->rockchip_clks);
return ret; @@ -507,7 +507,7 @@ static int dwcmshc_resume(struct device *dev) struct sdhci_host *host = dev_get_drvdata(dev); struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); struct dwcmshc_priv *priv = sdhci_pltfm_priv(pltfm_host); - struct rk3568_priv *rk_priv = priv->priv; + struct rk35xx_priv *rk_priv = priv->priv; int ret;
ret = clk_prepare_enable(pltfm_host->clk); @@ -521,7 +521,7 @@ static int dwcmshc_resume(struct device *dev) }
if (rk_priv) { - ret = clk_bulk_prepare_enable(RK3568_MAX_CLKS, + ret = clk_bulk_prepare_enable(RK35xx_MAX_CLKS, rk_priv->rockchip_clks); if (ret) return ret;
From: Liming Sun limings@nvidia.com
[ Upstream commit a0753ef66c34c1739580219dca664eda648164b7 ]
Commit 08f3dff799d4 ("mmc: sdhci-of-dwcmshc: add rockchip platform support") introduced the use of of_device_get_match_data() to check for some chips. Unfortunately, it also breaks the BlueField-3 FW, which uses ACPI.
To fix the problem, let's add the ACPI match data and the corresponding quirks to re-enable the support for the BlueField-3 SoC.
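The underlying issue is that of_device_get_match_data() only consults the OF (device tree) match table, so an ACPI-enumerated device gets no match data, while device_get_match_data() looks at whichever firmware interface enumerated the device. A minimal sketch, with hypothetical IDs and structures (not the dwcmshc driver), of serving both paths:

#include <linux/acpi.h>
#include <linux/mod_devicetable.h>
#include <linux/of.h>
#include <linux/platform_device.h>
#include <linux/property.h>

struct demo_pdata {
	unsigned int quirks;
};

static const struct demo_pdata demo_dt_pdata   = { .quirks = 0 };
static const struct demo_pdata demo_acpi_pdata = { .quirks = 1 };

static const struct of_device_id demo_of_ids[] = {
	{ .compatible = "vendor,demo-dt", .data = &demo_dt_pdata },	/* hypothetical */
	{ }
};

static const struct acpi_device_id demo_acpi_ids[] = {
	{ "DEMO0001", (kernel_ulong_t)&demo_acpi_pdata },		/* hypothetical */
	{ }
};

static int demo_probe(struct platform_device *pdev)
{
	/* Works for both DT and ACPI enumeration, unlike
	 * of_device_get_match_data() which only handles DT. */
	const struct demo_pdata *pdata = device_get_match_data(&pdev->dev);

	if (!pdata)
		return -ENODEV;
	/* ... use pdata->quirks ... */
	return 0;
}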
Reviewed-by: David Woods davwoods@nvidia.com Signed-off-by: Liming Sun limings@nvidia.com Acked-by: Adrian Hunter adrian.hunter@intel.com Fixes: 08f3dff799d4 ("mmc: sdhci-of-dwcmshc: add rockchip platform support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220809173742.178440-1-limings@nvidia.com [Ulf: Clarified the commit message a bit] Signed-off-by: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/mmc/host/sdhci-of-dwcmshc.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c index f5fd88c7adef1..335c88fd849c4 100644 --- a/drivers/mmc/host/sdhci-of-dwcmshc.c +++ b/drivers/mmc/host/sdhci-of-dwcmshc.c @@ -296,6 +296,15 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_pdata = { .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN, };
+#ifdef CONFIG_ACPI +static const struct sdhci_pltfm_data sdhci_dwcmshc_bf3_pdata = { + .ops = &sdhci_dwcmshc_ops, + .quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN, + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN | + SDHCI_QUIRK2_ACMD23_BROKEN, +}; +#endif + static const struct sdhci_pltfm_data sdhci_dwcmshc_rk35xx_pdata = { .ops = &sdhci_dwcmshc_rk35xx_ops, .quirks = SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN | @@ -360,7 +369,10 @@ MODULE_DEVICE_TABLE(of, sdhci_dwcmshc_dt_ids);
#ifdef CONFIG_ACPI static const struct acpi_device_id sdhci_dwcmshc_acpi_ids[] = { - { .id = "MLNXBF30" }, + { + .id = "MLNXBF30", + .driver_data = (kernel_ulong_t)&sdhci_dwcmshc_bf3_pdata, + }, {} }; #endif @@ -376,7 +388,7 @@ static int dwcmshc_probe(struct platform_device *pdev) int err; u32 extra;
- pltfm_data = of_device_get_match_data(&pdev->dev); + pltfm_data = device_get_match_data(&pdev->dev); if (!pltfm_data) { dev_err(&pdev->dev, "Error: No device match data found\n"); return -ENODEV;
From: Filipe Manana fdmanana@suse.com
[ Upstream commit 4467af8809299c12529b5c21481c1d44a3b209f9 ]
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory.
Signed-off-by: Filipe Manana fdmanana@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/ctree.h | 1 - fs/btrfs/inode.c | 25 +++++++++++-------------- fs/btrfs/tree-log.c | 14 +++++++------- 3 files changed, 18 insertions(+), 22 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1831135fef1ab..cd72570d11f65 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3166,7 +3166,6 @@ void __btrfs_del_delalloc_inode(struct btrfs_root *root, struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry); int btrfs_set_inode_index(struct btrfs_inode *dir, u64 *index); int btrfs_unlink_inode(struct btrfs_trans_handle *trans, - struct btrfs_root *root, struct btrfs_inode *dir, struct btrfs_inode *inode, const char *name, int name_len); int btrfs_add_link(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 428a56f248bba..f8a01964a2169 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4097,11 +4097,11 @@ int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans, * also drops the back refs in the inode to the directory */ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans, - struct btrfs_root *root, struct btrfs_inode *dir, struct btrfs_inode *inode, const char *name, int name_len) { + struct btrfs_root *root = dir->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_path *path; int ret = 0; @@ -4201,15 +4201,14 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans, }
int btrfs_unlink_inode(struct btrfs_trans_handle *trans, - struct btrfs_root *root, struct btrfs_inode *dir, struct btrfs_inode *inode, const char *name, int name_len) { int ret; - ret = __btrfs_unlink_inode(trans, root, dir, inode, name, name_len); + ret = __btrfs_unlink_inode(trans, dir, inode, name, name_len); if (!ret) { drop_nlink(&inode->vfs_inode); - ret = btrfs_update_inode(trans, root, inode); + ret = btrfs_update_inode(trans, inode->root, inode); } return ret; } @@ -4238,7 +4237,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
static int btrfs_unlink(struct inode *dir, struct dentry *dentry) { - struct btrfs_root *root = BTRFS_I(dir)->root; struct btrfs_trans_handle *trans; struct inode *inode = d_inode(dentry); int ret; @@ -4250,7 +4248,7 @@ static int btrfs_unlink(struct inode *dir, struct dentry *dentry) btrfs_record_unlink_dir(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)), 0);
- ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir), + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)), dentry->d_name.name, dentry->d_name.len); if (ret) @@ -4264,7 +4262,7 @@ static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
out: btrfs_end_transaction(trans); - btrfs_btree_balance_dirty(root->fs_info); + btrfs_btree_balance_dirty(BTRFS_I(dir)->root->fs_info); return ret; }
@@ -4622,7 +4620,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) { struct inode *inode = d_inode(dentry); int err = 0; - struct btrfs_root *root = BTRFS_I(dir)->root; struct btrfs_trans_handle *trans; u64 last_unlink_trans;
@@ -4647,7 +4644,7 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) last_unlink_trans = BTRFS_I(inode)->last_unlink_trans;
/* now the directory is empty */ - err = btrfs_unlink_inode(trans, root, BTRFS_I(dir), + err = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)), dentry->d_name.name, dentry->d_name.len); if (!err) { @@ -4668,7 +4665,7 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) } out: btrfs_end_transaction(trans); - btrfs_btree_balance_dirty(root->fs_info); + btrfs_btree_balance_dirty(BTRFS_I(dir)->root->fs_info);
return err; } @@ -9571,7 +9568,7 @@ static int btrfs_rename_exchange(struct inode *old_dir, if (old_ino == BTRFS_FIRST_FREE_OBJECTID) { ret = btrfs_unlink_subvol(trans, old_dir, old_dentry); } else { /* src is an inode */ - ret = __btrfs_unlink_inode(trans, root, BTRFS_I(old_dir), + ret = __btrfs_unlink_inode(trans, BTRFS_I(old_dir), BTRFS_I(old_dentry->d_inode), old_dentry->d_name.name, old_dentry->d_name.len); @@ -9587,7 +9584,7 @@ static int btrfs_rename_exchange(struct inode *old_dir, if (new_ino == BTRFS_FIRST_FREE_OBJECTID) { ret = btrfs_unlink_subvol(trans, new_dir, new_dentry); } else { /* dest is an inode */ - ret = __btrfs_unlink_inode(trans, dest, BTRFS_I(new_dir), + ret = __btrfs_unlink_inode(trans, BTRFS_I(new_dir), BTRFS_I(new_dentry->d_inode), new_dentry->d_name.name, new_dentry->d_name.len); @@ -9862,7 +9859,7 @@ static int btrfs_rename(struct user_namespace *mnt_userns, */ btrfs_pin_log_trans(root); log_pinned = true; - ret = __btrfs_unlink_inode(trans, root, BTRFS_I(old_dir), + ret = __btrfs_unlink_inode(trans, BTRFS_I(old_dir), BTRFS_I(d_inode(old_dentry)), old_dentry->d_name.name, old_dentry->d_name.len); @@ -9882,7 +9879,7 @@ static int btrfs_rename(struct user_namespace *mnt_userns, ret = btrfs_unlink_subvol(trans, new_dir, new_dentry); BUG_ON(new_inode->i_nlink == 0); } else { - ret = btrfs_unlink_inode(trans, dest, BTRFS_I(new_dir), + ret = btrfs_unlink_inode(trans, BTRFS_I(new_dir), BTRFS_I(d_inode(new_dentry)), new_dentry->d_name.name, new_dentry->d_name.len); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1d7e9812f55e1..6f51c4d922d48 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -926,7 +926,7 @@ static noinline int drop_one_dir_item(struct btrfs_trans_handle *trans, if (ret) goto out;
- ret = btrfs_unlink_inode(trans, root, dir, BTRFS_I(inode), name, + ret = btrfs_unlink_inode(trans, dir, BTRFS_I(inode), name, name_len); if (ret) goto out; @@ -1091,7 +1091,7 @@ static inline int __add_inode_ref(struct btrfs_trans_handle *trans, inc_nlink(&inode->vfs_inode); btrfs_release_path(path);
- ret = btrfs_unlink_inode(trans, root, dir, inode, + ret = btrfs_unlink_inode(trans, dir, inode, victim_name, victim_name_len); kfree(victim_name); if (ret) @@ -1165,7 +1165,7 @@ static inline int __add_inode_ref(struct btrfs_trans_handle *trans, inc_nlink(&inode->vfs_inode); btrfs_release_path(path);
- ret = btrfs_unlink_inode(trans, root, + ret = btrfs_unlink_inode(trans, BTRFS_I(victim_parent), inode, victim_name, @@ -1327,7 +1327,7 @@ static int unlink_old_inode_refs(struct btrfs_trans_handle *trans, kfree(name); goto out; } - ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir), + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), inode, name, namelen); kfree(name); iput(dir); @@ -1434,7 +1434,7 @@ static int add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root, ret = -ENOENT; goto out; } - ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir), BTRFS_I(other_inode), + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(other_inode), name, namelen); if (ret) goto out; @@ -1580,7 +1580,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, ret = btrfs_inode_ref_exists(inode, dir, key->type, name, namelen); if (ret > 0) { - ret = btrfs_unlink_inode(trans, root, + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(inode), name, namelen); @@ -2339,7 +2339,7 @@ static noinline int check_item_in_log(struct btrfs_trans_handle *trans, }
inc_nlink(inode); - ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir), + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(inode), name, name_len); if (!ret) ret = btrfs_run_delayed_items(trans);
From: Filipe Manana fdmanana@suse.com
[ Upstream commit ccae4a19c9140a34a0c5f0658812496dd8bbdeaf ]
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items.
So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys.
Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Filipe Manana fdmanana@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/tree-log.c | 158 ++++++++++++++------------------ include/uapi/linux/btrfs_tree.h | 4 +- 2 files changed, 72 insertions(+), 90 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6f51c4d922d48..4ab1bbc344760 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2197,7 +2197,7 @@ static noinline int replay_one_dir_item(struct btrfs_trans_handle *trans, */ static noinline int find_dir_range(struct btrfs_root *root, struct btrfs_path *path, - u64 dirid, int key_type, + u64 dirid, u64 *start_ret, u64 *end_ret) { struct btrfs_key key; @@ -2210,7 +2210,7 @@ static noinline int find_dir_range(struct btrfs_root *root, return 1;
key.objectid = dirid; - key.type = key_type; + key.type = BTRFS_DIR_LOG_INDEX_KEY; key.offset = *start_ret;
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); @@ -2224,7 +2224,7 @@ static noinline int find_dir_range(struct btrfs_root *root, if (ret != 0) btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
- if (key.type != key_type || key.objectid != dirid) { + if (key.type != BTRFS_DIR_LOG_INDEX_KEY || key.objectid != dirid) { ret = 1; goto next; } @@ -2251,7 +2251,7 @@ static noinline int find_dir_range(struct btrfs_root *root,
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
- if (key.type != key_type || key.objectid != dirid) { + if (key.type != BTRFS_DIR_LOG_INDEX_KEY || key.objectid != dirid) { ret = 1; goto out; } @@ -2282,95 +2282,82 @@ static noinline int check_item_in_log(struct btrfs_trans_handle *trans, int ret; struct extent_buffer *eb; int slot; - u32 item_size; struct btrfs_dir_item *di; - struct btrfs_dir_item *log_di; int name_len; - unsigned long ptr; - unsigned long ptr_end; char *name; - struct inode *inode; + struct inode *inode = NULL; struct btrfs_key location;
-again: + /* + * Currenly we only log dir index keys. Even if we replay a log created + * by an older kernel that logged both dir index and dir item keys, all + * we need to do is process the dir index keys, we (and our caller) can + * safely ignore dir item keys (key type BTRFS_DIR_ITEM_KEY). + */ + ASSERT(dir_key->type == BTRFS_DIR_INDEX_KEY); + eb = path->nodes[0]; slot = path->slots[0]; - item_size = btrfs_item_size_nr(eb, slot); - ptr = btrfs_item_ptr_offset(eb, slot); - ptr_end = ptr + item_size; - while (ptr < ptr_end) { - di = (struct btrfs_dir_item *)ptr; - name_len = btrfs_dir_name_len(eb, di); - name = kmalloc(name_len, GFP_NOFS); - if (!name) { - ret = -ENOMEM; - goto out; - } - read_extent_buffer(eb, name, (unsigned long)(di + 1), - name_len); - log_di = NULL; - if (log && dir_key->type == BTRFS_DIR_ITEM_KEY) { - log_di = btrfs_lookup_dir_item(trans, log, log_path, - dir_key->objectid, - name, name_len, 0); - } else if (log && dir_key->type == BTRFS_DIR_INDEX_KEY) { - log_di = btrfs_lookup_dir_index_item(trans, log, - log_path, - dir_key->objectid, - dir_key->offset, - name, name_len, 0); - } - if (!log_di) { - btrfs_dir_item_key_to_cpu(eb, di, &location); - btrfs_release_path(path); - btrfs_release_path(log_path); - inode = read_one_inode(root, location.objectid); - if (!inode) { - kfree(name); - return -EIO; - } + di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item); + name_len = btrfs_dir_name_len(eb, di); + name = kmalloc(name_len, GFP_NOFS); + if (!name) { + ret = -ENOMEM; + goto out; + }
- ret = link_to_fixup_dir(trans, root, - path, location.objectid); - if (ret) { - kfree(name); - iput(inode); - goto out; - } + read_extent_buffer(eb, name, (unsigned long)(di + 1), name_len);
- inc_nlink(inode); - ret = btrfs_unlink_inode(trans, BTRFS_I(dir), - BTRFS_I(inode), name, name_len); - if (!ret) - ret = btrfs_run_delayed_items(trans); - kfree(name); - iput(inode); - if (ret) - goto out; + if (log) { + struct btrfs_dir_item *log_di;
- /* there might still be more names under this key - * check and repeat if required - */ - ret = btrfs_search_slot(NULL, root, dir_key, path, - 0, 0); - if (ret == 0) - goto again; + log_di = btrfs_lookup_dir_index_item(trans, log, log_path, + dir_key->objectid, + dir_key->offset, + name, name_len, 0); + if (IS_ERR(log_di)) { + ret = PTR_ERR(log_di); + goto out; + } else if (log_di) { + /* The dentry exists in the log, we have nothing to do. */ ret = 0; goto out; - } else if (IS_ERR(log_di)) { - kfree(name); - return PTR_ERR(log_di); } - btrfs_release_path(log_path); - kfree(name); + }
- ptr = (unsigned long)(di + 1); - ptr += name_len; + btrfs_dir_item_key_to_cpu(eb, di, &location); + btrfs_release_path(path); + btrfs_release_path(log_path); + inode = read_one_inode(root, location.objectid); + if (!inode) { + ret = -EIO; + goto out; } - ret = 0; + + ret = link_to_fixup_dir(trans, root, path, location.objectid); + if (ret) + goto out; + + inc_nlink(inode); + ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(inode), name, + name_len); + if (ret) + goto out; + + ret = btrfs_run_delayed_items(trans); + if (ret) + goto out; + + /* + * Unlike dir item keys, dir index keys can only have one name (entry) in + * them, as there are no key collisions since each key has a unique offset + * (an index number), so we're done. + */ out: btrfs_release_path(path); btrfs_release_path(log_path); + kfree(name); + iput(inode); return ret; }
@@ -2490,7 +2477,6 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, { u64 range_start; u64 range_end; - int key_type = BTRFS_DIR_LOG_ITEM_KEY; int ret = 0; struct btrfs_key dir_key; struct btrfs_key found_key; @@ -2498,7 +2484,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, struct inode *dir;
dir_key.objectid = dirid; - dir_key.type = BTRFS_DIR_ITEM_KEY; + dir_key.type = BTRFS_DIR_INDEX_KEY; log_path = btrfs_alloc_path(); if (!log_path) return -ENOMEM; @@ -2512,14 +2498,14 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, btrfs_free_path(log_path); return 0; } -again: + range_start = 0; range_end = 0; while (1) { if (del_all) range_end = (u64)-1; else { - ret = find_dir_range(log, path, dirid, key_type, + ret = find_dir_range(log, path, dirid, &range_start, &range_end); if (ret < 0) goto out; @@ -2546,8 +2532,10 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); if (found_key.objectid != dirid || - found_key.type != dir_key.type) - goto next_type; + found_key.type != dir_key.type) { + ret = 0; + goto out; + }
if (found_key.offset > range_end) break; @@ -2566,15 +2554,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, break; range_start = range_end + 1; } - -next_type: ret = 0; - if (key_type == BTRFS_DIR_LOG_ITEM_KEY) { - key_type = BTRFS_DIR_LOG_INDEX_KEY; - dir_key.type = BTRFS_DIR_INDEX_KEY; - btrfs_release_path(path); - goto again; - } out: btrfs_release_path(path); btrfs_free_path(log_path); diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h index e1c4c732aabac..5416f1f1a77a8 100644 --- a/include/uapi/linux/btrfs_tree.h +++ b/include/uapi/linux/btrfs_tree.h @@ -146,7 +146,9 @@
/* * dir items are the name -> inode pointers in a directory. There is one - * for every name in a directory. + * for every name in a directory. BTRFS_DIR_LOG_ITEM_KEY is no longer used + * but it's still defined here for documentation purposes and to help avoid + * having its numerical value reused in the future. */ #define BTRFS_DIR_LOG_ITEM_KEY 60 #define BTRFS_DIR_LOG_INDEX_KEY 72
From: Filipe Manana fdmanana@suse.com
[ Upstream commit 313ab75399d0c7d0ebc718c545572c1b4d8d22ef ]
During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into a helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items().
Signed-off-by: Filipe Manana fdmanana@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/tree-log.c | 77 +++++++++++++++++---------------------------- 1 file changed, 29 insertions(+), 48 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 4ab1bbc344760..c56a89d224bbb 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -884,6 +884,26 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans, return ret; }
+static int unlink_inode_for_log_replay(struct btrfs_trans_handle *trans, + struct btrfs_inode *dir, + struct btrfs_inode *inode, + const char *name, + int name_len) +{ + int ret; + + ret = btrfs_unlink_inode(trans, dir, inode, name, name_len); + if (ret) + return ret; + /* + * Whenever we need to check if a name exists or not, we check the + * fs/subvolume tree. So after an unlink we must run delayed items, so + * that future checks for a name during log replay see that the name + * does not exists anymore. + */ + return btrfs_run_delayed_items(trans); +} + /* * when cleaning up conflicts between the directory names in the * subvolume, directory names in the log and directory names in the @@ -926,12 +946,8 @@ static noinline int drop_one_dir_item(struct btrfs_trans_handle *trans, if (ret) goto out;
- ret = btrfs_unlink_inode(trans, dir, BTRFS_I(inode), name, + ret = unlink_inode_for_log_replay(trans, dir, BTRFS_I(inode), name, name_len); - if (ret) - goto out; - else - ret = btrfs_run_delayed_items(trans); out: kfree(name); iput(inode); @@ -1091,12 +1107,9 @@ static inline int __add_inode_ref(struct btrfs_trans_handle *trans, inc_nlink(&inode->vfs_inode); btrfs_release_path(path);
- ret = btrfs_unlink_inode(trans, dir, inode, + ret = unlink_inode_for_log_replay(trans, dir, inode, victim_name, victim_name_len); kfree(victim_name); - if (ret) - return ret; - ret = btrfs_run_delayed_items(trans); if (ret) return ret; *search_done = 1; @@ -1165,14 +1178,11 @@ static inline int __add_inode_ref(struct btrfs_trans_handle *trans, inc_nlink(&inode->vfs_inode); btrfs_release_path(path);
- ret = btrfs_unlink_inode(trans, + ret = unlink_inode_for_log_replay(trans, BTRFS_I(victim_parent), inode, victim_name, victim_name_len); - if (!ret) - ret = btrfs_run_delayed_items( - trans); } iput(victim_parent); kfree(victim_name); @@ -1327,19 +1337,10 @@ static int unlink_old_inode_refs(struct btrfs_trans_handle *trans, kfree(name); goto out; } - ret = btrfs_unlink_inode(trans, BTRFS_I(dir), + ret = unlink_inode_for_log_replay(trans, BTRFS_I(dir), inode, name, namelen); kfree(name); iput(dir); - /* - * Whenever we need to check if a name exists or not, we - * check the subvolume tree. So after an unlink we must - * run delayed items, so that future checks for a name - * during log replay see that the name does not exists - * anymore. - */ - if (!ret) - ret = btrfs_run_delayed_items(trans); if (ret) goto out; goto again; @@ -1434,8 +1435,8 @@ static int add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root, ret = -ENOENT; goto out; } - ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(other_inode), - name, namelen); + ret = unlink_inode_for_log_replay(trans, BTRFS_I(dir), BTRFS_I(other_inode), + name, namelen); if (ret) goto out; /* @@ -1444,10 +1445,6 @@ static int add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root, */ if (other_inode->i_nlink == 0) inc_nlink(other_inode); - - ret = btrfs_run_delayed_items(trans); - if (ret) - goto out; add_link: ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode), name, namelen, 0, ref_index); @@ -1580,7 +1577,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, ret = btrfs_inode_ref_exists(inode, dir, key->type, name, namelen); if (ret > 0) { - ret = btrfs_unlink_inode(trans, + ret = unlink_inode_for_log_replay(trans, BTRFS_I(dir), BTRFS_I(inode), name, namelen); @@ -1591,15 +1588,6 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, */ if (!ret && inode->i_nlink == 0) inc_nlink(inode); - /* - * Whenever we need to check if a name exists or - * not, we check the subvolume tree. So after an - * unlink we must run delayed items, so that future - * checks for a name during log replay see that the - * name does not exists anymore. - */ - if (!ret) - ret = btrfs_run_delayed_items(trans); } if (ret < 0) goto out; @@ -2339,15 +2327,8 @@ static noinline int check_item_in_log(struct btrfs_trans_handle *trans, goto out;
inc_nlink(inode); - ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(inode), name, - name_len); - if (ret) - goto out; - - ret = btrfs_run_delayed_items(trans); - if (ret) - goto out; - + ret = unlink_inode_for_log_replay(trans, BTRFS_I(dir), BTRFS_I(inode), + name, name_len); /* * Unlike dir item keys, dir index keys can only have one name (entry) in * them, as there are no key collisions since each key has a unique offset
From: Filipe Manana fdmanana@suse.com
[ Upstream commit 769030e11847c5412270c0726ff21d3a1f0a3131 ]
During log replay, at add_link(), we may increment the link count of another inode that has a reference that conflicts with a new reference for the inode currently being processed.
During log replay, at add_link(), we may drop (unlink) a reference from some inode in the subvolume tree if that reference conflicts with a new reference found in the log for the inode we are currently processing.
After the unlink, if the link count has decreased from 1 to 0, then we increment the link count to prevent the inode from being deleted if it's evicted by an iput() call, because we may have references to add to that inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
	if (unlikely(inode->i_nlink == 0)) {
		WARN_ON(!(inode->i_state & I_LINKABLE));
		atomic_long_dec(&inode->i_sb->s_remove_count);
	}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the inode, nor was eviction triggered after the iput(). So at add_link(), even if we unlink the last reference of the inode, its link count ends up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is called, at add_link() we will read the inode back from the subvolume tree and have it with a correct link count, matching the number of references it has on the subvolume tree. So if when we are at add_link() the inode has exactly one reference only, its link count is 1, and after the unlink its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former accepts a transition from 0 to 1 and it's what we use in other similar contexts (like at link_to_fixup_dir()).
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to bump the link count from 0 to 1.
The warning is actually harmless, but it may scare users. Josef also ran into it recently.
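For reference, a minimal sketch (not the btrfs code) of the pattern the fix switches to: set_nlink() accepts a 0 -> 1 transition without warning, while inc_nlink() warns unless I_LINKABLE is set, which never happens during log replay.

#include <linux/fs.h>

/* Sketch: keep an inode alive for a later link-count fixup during replay. */
static void keep_inode_alive_for_fixup(struct inode *inode)
{
	if (inode->i_nlink == 0)
		set_nlink(inode, 1);	/* safe 0 -> 1, no WARN_ON */
	else
		inc_nlink(inode);
}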
CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana fdmanana@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/tree-log.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c56a89d224bbb..7272896587302 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -1444,7 +1444,7 @@ static int add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root, * on the inode will not free it. We will fixup the link count later. */ if (other_inode->i_nlink == 0) - inc_nlink(other_inode); + set_nlink(other_inode, 1); add_link: ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode), name, namelen, 0, ref_index); @@ -1587,7 +1587,7 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans, * free it. We will fixup the link count later. */ if (!ret && inode->i_nlink == 0) - inc_nlink(inode); + set_nlink(inode, 1); } if (ret < 0) goto out;
From: Konstantin Komarov almaz.alexandrovich@paragon-software.com
[ Upstream commit 42f86b1226a42bfc79a7125af435432ad4680a32 ]
In some cases the xattr is too fragmented, so we need to load it before writing it.
Signed-off-by: Konstantin Komarov almaz.alexandrovich@paragon-software.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ntfs3/xattr.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/ntfs3/xattr.c b/fs/ntfs3/xattr.c index e8bfa709270d1..4652b97969957 100644 --- a/fs/ntfs3/xattr.c +++ b/fs/ntfs3/xattr.c @@ -118,7 +118,7 @@ static int ntfs_read_ea(struct ntfs_inode *ni, struct EA_FULL **ea,
run_init(&run);
- err = attr_load_runs(attr_ea, ni, &run, NULL); + err = attr_load_runs_range(ni, ATTR_EA, NULL, 0, &run, 0, size); if (!err) err = ntfs_read_run_nb(sbi, &run, 0, ea_p, size, NULL); run_close(&run); @@ -443,6 +443,11 @@ static noinline int ntfs_set_ea(struct inode *inode, const char *name, /* Delete xattr, ATTR_EA */ ni_remove_attr_le(ni, attr, mi, le); } else if (attr->non_res) { + err = attr_load_runs_range(ni, ATTR_EA, NULL, 0, &ea_run, 0, + size); + if (err) + goto out; + err = ntfs_sb_write_run(sbi, &ea_run, 0, ea_all, size, 0); if (err) goto out;
From: Biju Das biju.das.jz@bp.renesas.com
[ Upstream commit c75ed9f54ce8d349fee557f2b471a4d637ed2a6b ]
We usually do cleanup in the reverse order of init. Currently, in case of error, rz_ssi_release_dma_channels() is not done in that reverse order. This patch improves error handling in the rz_ssi_probe() error path.
While at it, use "goto cleanup" style to reduce code duplication.
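The "goto cleanup" style referred to here unwinds resources in the reverse order of acquisition through a chain of labels. A minimal, self-contained sketch with hypothetical helpers (not the rz-ssi code):

/* Hypothetical resources standing in for DMA channels, resets, PM, etc. */
static int acquire_first(void)   { return 0; }
static int acquire_second(void)  { return 0; }
static int acquire_third(void)   { return 0; }
static void release_first(void)  { }
static void release_second(void) { }

static int demo_probe(void)
{
	int ret;

	ret = acquire_first();
	if (ret)
		return ret;

	ret = acquire_second();
	if (ret)
		goto err_first;

	ret = acquire_third();
	if (ret)
		goto err_second;

	return 0;

	/* Unwind strictly in the reverse order of acquisition. */
err_second:
	release_second();
err_first:
	release_first();
	return ret;
}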
Reported-by: Pavel Machek pavel@denx.de Signed-off-by: Biju Das biju.das.jz@bp.renesas.com Link: https://lore.kernel.org/r/20220728092612.38858-1-biju.das.jz@bp.renesas.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- sound/soc/sh/rz-ssi.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/sound/soc/sh/rz-ssi.c b/sound/soc/sh/rz-ssi.c index 6d794eaaf4c39..2e33a1fa0a6f4 100644 --- a/sound/soc/sh/rz-ssi.c +++ b/sound/soc/sh/rz-ssi.c @@ -1022,32 +1022,36 @@ static int rz_ssi_probe(struct platform_device *pdev)
ssi->rstc = devm_reset_control_get_exclusive(&pdev->dev, NULL); if (IS_ERR(ssi->rstc)) { - rz_ssi_release_dma_channels(ssi); - return PTR_ERR(ssi->rstc); + ret = PTR_ERR(ssi->rstc); + goto err_reset; }
reset_control_deassert(ssi->rstc); pm_runtime_enable(&pdev->dev); ret = pm_runtime_resume_and_get(&pdev->dev); if (ret < 0) { - rz_ssi_release_dma_channels(ssi); - pm_runtime_disable(ssi->dev); - reset_control_assert(ssi->rstc); - return dev_err_probe(ssi->dev, ret, "pm_runtime_resume_and_get failed\n"); + dev_err(&pdev->dev, "pm_runtime_resume_and_get failed\n"); + goto err_pm; }
ret = devm_snd_soc_register_component(&pdev->dev, &rz_ssi_soc_component, rz_ssi_soc_dai, ARRAY_SIZE(rz_ssi_soc_dai)); if (ret < 0) { - rz_ssi_release_dma_channels(ssi); - - pm_runtime_put(ssi->dev); - pm_runtime_disable(ssi->dev); - reset_control_assert(ssi->rstc); dev_err(&pdev->dev, "failed to register snd component\n"); + goto err_snd_soc; }
+ return 0; + +err_snd_soc: + pm_runtime_put(ssi->dev); +err_pm: + pm_runtime_disable(ssi->dev); + reset_control_assert(ssi->rstc); +err_reset: + rz_ssi_release_dma_channels(ssi); + return ret; }
From: Josip Pavic Josip.Pavic@amd.com
[ Upstream commit 8de297dc046c180651c0500f8611663ae1c3828a ]
[why] In some cases an MPC tree bottom pipe ends up pointing to itself. This causes iteration from top to bottom to hang the system in an infinite loop.
[how] When advancing to the next MPC bottom pipe, check that the pointer is not the same as the current one to avoid an infinite loop.
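The defensive pattern is generic: while walking a singly linked list, stop if a node's next pointer refers back to the node itself. A small self-contained sketch (not the dcn code):

#include <stddef.h>

struct node {
	int id;
	struct node *next;
};

/* Walk the list looking for 'id'; bail out if a node points to itself,
 * which would otherwise make the loop spin forever. */
static struct node *find_node(struct node *head, int id)
{
	struct node *cur = head;

	while (cur != NULL) {
		if (cur->id == id)
			return cur;

		if (cur == cur->next)	/* corrupted, self-referencing entry */
			break;

		cur = cur->next;
	}
	return NULL;
}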
Reviewed-by: Josip Pavic Josip.Pavic@amd.com Reviewed-by: Jun Lei Jun.Lei@amd.com Acked-by: Alex Hung alex.hung@amd.com Signed-off-by: Aric Cyr aric.cyr@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c | 6 ++++++ drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c | 6 ++++++ 2 files changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c index 11019c2c62ccb..8192f1967e924 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c +++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_mpc.c @@ -126,6 +126,12 @@ struct mpcc *mpc1_get_mpcc_for_dpp(struct mpc_tree *tree, int dpp_id) while (tmp_mpcc != NULL) { if (tmp_mpcc->dpp_id == dpp_id) return tmp_mpcc; + + /* avoid circular linked list */ + ASSERT(tmp_mpcc != tmp_mpcc->mpcc_bot); + if (tmp_mpcc == tmp_mpcc->mpcc_bot) + break; + tmp_mpcc = tmp_mpcc->mpcc_bot; } return NULL; diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c index 947eb0df3f125..142fc0a3a536c 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c +++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_mpc.c @@ -532,6 +532,12 @@ struct mpcc *mpc2_get_mpcc_for_dpp(struct mpc_tree *tree, int dpp_id) while (tmp_mpcc != NULL) { if (tmp_mpcc->dpp_id == 0xf || tmp_mpcc->dpp_id == dpp_id) return tmp_mpcc; + + /* avoid circular linked list */ + ASSERT(tmp_mpcc != tmp_mpcc->mpcc_bot); + if (tmp_mpcc == tmp_mpcc->mpcc_bot) + break; + tmp_mpcc = tmp_mpcc->mpcc_bot; } return NULL;
From: Leo Ma hanghong.ma@amd.com
[ Upstream commit 0591183699fceeafb4c4141072d47775de83ecfb ]
[Why] A customer reported that the checksum in the AMD VSIF V3 packet is incorrect, causing a blank screen issue.
[How] Fix the packet length issue on AMD HDMI VSIF V3.
Reviewed-by: Anthony Koo Anthony.Koo@amd.com Acked-by: Tom Chung chiahsuan.chung@amd.com Signed-off-by: Leo Ma hanghong.ma@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- .../drm/amd/display/modules/freesync/freesync.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c index b99aa232bd8b1..4bee6d018bfa9 100644 --- a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c +++ b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c @@ -567,10 +567,6 @@ static void build_vrr_infopacket_data_v1(const struct mod_vrr_params *vrr, * Note: We should never go above the field rate of the mode timing set. */ infopacket->sb[8] = (unsigned char)((vrr->max_refresh_in_uhz + 500000) / 1000000); - - /* FreeSync HDR */ - infopacket->sb[9] = 0; - infopacket->sb[10] = 0; }
static void build_vrr_infopacket_data_v3(const struct mod_vrr_params *vrr, @@ -638,10 +634,6 @@ static void build_vrr_infopacket_data_v3(const struct mod_vrr_params *vrr,
/* PB16 : Reserved bits 7:1, FixedRate bit 0 */ infopacket->sb[16] = (vrr->state == VRR_STATE_ACTIVE_FIXED) ? 1 : 0; - - //FreeSync HDR - infopacket->sb[9] = 0; - infopacket->sb[10] = 0; }
static void build_vrr_infopacket_fs2_data(enum color_transfer_func app_tf, @@ -726,8 +718,7 @@ static void build_vrr_infopacket_header_v2(enum signal_type signal, /* HB2 = [Bits 7:5 = 0] [Bits 4:0 = Length = 0x09] */ infopacket->hb2 = 0x09;
- *payload_size = 0x0A; - + *payload_size = 0x09; } else if (dc_is_dp_signal(signal)) {
/* HEADER */ @@ -776,9 +767,9 @@ static void build_vrr_infopacket_header_v3(enum signal_type signal, infopacket->hb1 = version;
/* HB2 = [Bits 7:5 = 0] [Bits 4:0 = Length] */ - *payload_size = 0x10; - infopacket->hb2 = *payload_size - 1; //-1 for checksum + infopacket->hb2 = 0x10;
+ *payload_size = 0x10; } else if (dc_is_dp_signal(signal)) {
/* HEADER */
From: Alvin Lee alvin.lee2@amd.com
[ Upstream commit 84ef99c728079dfd21d6bc70b4c3e4af20602b3c ]
[Description] Observed in stereo mode that programming FLIP_LEFT_EYE can cause hangs. Keep FLIP_ANY_FRAME in stereo mode so the surface flip can take place before either the left or the right eye.
Reviewed-by: Martin Leung Martin.Leung@amd.com Acked-by: Tom Chung chiahsuan.chung@amd.com Signed-off-by: Alvin Lee alvin.lee2@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c index f246125232482..33c2337c4edf3 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c +++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hubp.c @@ -86,7 +86,7 @@ bool hubp3_program_surface_flip_and_addr( VMID, address->vmid);
if (address->type == PLN_ADDR_TYPE_GRPH_STEREO) { - REG_UPDATE(DCSURF_FLIP_CONTROL, SURFACE_FLIP_MODE_FOR_STEREOSYNC, 0x1); + REG_UPDATE(DCSURF_FLIP_CONTROL, SURFACE_FLIP_MODE_FOR_STEREOSYNC, 0); REG_UPDATE(DCSURF_FLIP_CONTROL, SURFACE_FLIP_IN_STEREOSYNC, 0x1);
} else {
From: Fudong Wang Fudong.Wang@amd.com
[ Upstream commit b2a93490201300a749ad261b5c5d05cb50179c44 ]
[Why] After the ODM clock is turned off, the OPTC underflow bit stays set forever and can no longer be cleared. We need to clear it before turning the clock off.
[How] Clear the underflow bit, if set, before turning the clock off.
Reviewed-by: Alvin Lee alvin.lee2@amd.com Acked-by: Tom Chung chiahsuan.chung@amd.com Signed-off-by: Fudong Wang Fudong.Wang@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c index 37848f4577b18..92fee47278e5a 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c +++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c @@ -480,6 +480,11 @@ void optc1_enable_optc_clock(struct timing_generator *optc, bool enable) OTG_CLOCK_ON, 1, 1, 1000); } else { + + //last chance to clear underflow, otherwise, it will always there due to clock is off. + if (optc->funcs->is_optc_underflow_occurred(optc) == true) + optc->funcs->clear_optc_underflow(optc); + REG_UPDATE_2(OTG_CLOCK_CONTROL, OTG_CLOCK_GATE_DIS, 0, OTG_CLOCK_EN, 0);
From: Namjae Jeon linkinjeon@kernel.org
[ Upstream commit fe54833dc8d97ef387e86f7c80537d51c503ca75 ]
If a share is not configured in smb.conf, the smb2 tree connect should return STATUS_BAD_NETWORK_NAME instead of STATUS_BAD_NETWORK_PATH.
Signed-off-by: Namjae Jeon linkinjeon@kernel.org Reviewed-by: Hyunchul Lee hyc.lee@gmail.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ksmbd/mgmt/tree_connect.c | 2 +- fs/ksmbd/smb2pdu.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/ksmbd/mgmt/tree_connect.c b/fs/ksmbd/mgmt/tree_connect.c index 0d28e723a28c7..940385c6a9135 100644 --- a/fs/ksmbd/mgmt/tree_connect.c +++ b/fs/ksmbd/mgmt/tree_connect.c @@ -18,7 +18,7 @@ struct ksmbd_tree_conn_status ksmbd_tree_conn_connect(struct ksmbd_session *sess, char *share_name) { - struct ksmbd_tree_conn_status status = {-EINVAL, NULL}; + struct ksmbd_tree_conn_status status = {-ENOENT, NULL}; struct ksmbd_tree_connect_response *resp = NULL; struct ksmbd_share_config *sc; struct ksmbd_tree_connect *tree_conn = NULL; diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c index 28b5d20c8766e..824f17a101a9e 100644 --- a/fs/ksmbd/smb2pdu.c +++ b/fs/ksmbd/smb2pdu.c @@ -1932,8 +1932,9 @@ int smb2_tree_connect(struct ksmbd_work *work) rsp->hdr.Status = STATUS_SUCCESS; rc = 0; break; + case -ENOENT: case KSMBD_TREE_CONN_STATUS_NO_SHARE: - rsp->hdr.Status = STATUS_BAD_NETWORK_PATH; + rsp->hdr.Status = STATUS_BAD_NETWORK_NAME; break; case -ENOMEM: case KSMBD_TREE_CONN_STATUS_NOMEM:
From: Denis V. Lunev den@openvz.org
[ Upstream commit 66ba215cb51323e4e55e38fd5f250e0fae0cbc94 ]
Normal processing of an ARP request (usually an Ethernet broadcast packet) arriving at the host looks like the following: * the packet reaches the arp_process() call and is passed through the routing procedure * the request is put into the queue using pneigh_enqueue() if the corresponding ARP record is not local (the common case for container records on the host) * the request is processed by a timer (within 80 jiffies by default) and the ARP reply is sent from the same arp_process() using the NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (the flag is set inside pneigh_enqueue())
And here the problem comes in. The Linux kernel calls pneigh_queue_purge(), which destroys the whole queue of ARP requests, on ANY network interface start/stop event through __neigh_ifdown().
This was not actually a problem in the original world, as network interface start/stop was accessible to the host 'root' only, which could do far more destructive things anyway. But the world has changed and Linux containers are now available. Here the container 'root' has access to this API and must be considered an untrusted user in the hosting (container's) world.
Thus there is an attack vector against other containers on the node when a container's root endlessly starts/stops interfaces. We have observed such a situation on a real production node, where a docker container was doing exactly this and other containers on the node became unreachable as a result.
The proposed patch does a very simple thing: in pneigh_queue_purge(), where the network interface state change is detected, it drops only the packets from the same namespace. This is enough to prevent the problem for the whole node while preserving the original semantics of the code.
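A minimal user-space model of that idea (simplified stand-in types, not the kernel structures): walk the queued requests and unlink only the ones whose device belongs to the namespace going down; passing NULL keeps the old purge-everything behaviour used by neigh_table_clear().

#include <stdio.h>
#include <stdlib.h>

struct net { int id; };

struct req {                 /* stand-in for a queued skb */
    struct net *net;         /* namespace of the originating device */
    struct req *next;
};

/* Drop only requests belonging to @net; @net == NULL drops everything. */
static struct req *pneigh_queue_purge_model(struct req *head, const struct net *net)
{
    struct req **pp = &head;

    while (*pp) {
        struct req *r = *pp;

        if (!net || r->net == net) {
            *pp = r->next;   /* __skb_unlink() stand-in */
            free(r);         /* dev_put() + kfree_skb() stand-in */
        } else {
            pp = &r->next;
        }
    }
    return head;
}

int main(void)
{
    struct net a = { 1 }, b = { 2 };
    struct req *q = NULL;

    for (int i = 0; i < 4; i++) {
        struct req *r = calloc(1, sizeof(*r));
        r->net = (i % 2) ? &a : &b;
        r->next = q;
        q = r;
    }

    q = pneigh_queue_purge_model(q, &a);   /* only netns "a" entries removed */
    for (struct req *r = q; r; r = r->next)
        printf("remaining entry from netns %d\n", r->net->id);
    return 0;
}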
v2: - do del_timer_sync() if queue is empty after pneigh_queue_purge() v3: - rebase to net tree
Cc: "David S. Miller" davem@davemloft.net Cc: Eric Dumazet edumazet@google.com Cc: Jakub Kicinski kuba@kernel.org Cc: Paolo Abeni pabeni@redhat.com Cc: Daniel Borkmann daniel@iogearbox.net Cc: David Ahern dsahern@kernel.org Cc: Yajun Deng yajun.deng@linux.dev Cc: Roopa Prabhu roopa@nvidia.com Cc: Christian Brauner brauner@kernel.org Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Alexey Kuznetsov kuznet@ms2.inr.ac.ru Cc: Alexander Mikhalitsyn alexander.mikhalitsyn@virtuozzo.com Cc: Konstantin Khorenko khorenko@virtuozzo.com Cc: kernel@openvz.org Cc: devel@openvz.org Investigated-by: Alexander Mikhalitsyn alexander.mikhalitsyn@virtuozzo.com Signed-off-by: Denis V. Lunev den@openvz.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/neighbour.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c index ff049733cceeb..f0be42c140b91 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -279,14 +279,23 @@ static int neigh_del_timer(struct neighbour *n) return 0; }
-static void pneigh_queue_purge(struct sk_buff_head *list) +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) { + unsigned long flags; struct sk_buff *skb;
- while ((skb = skb_dequeue(list)) != NULL) { - dev_put(skb->dev); - kfree_skb(skb); + spin_lock_irqsave(&list->lock, flags); + skb = skb_peek(list); + while (skb != NULL) { + struct sk_buff *skb_next = skb_peek_next(skb, list); + if (net == NULL || net_eq(dev_net(skb->dev), net)) { + __skb_unlink(skb, list); + dev_put(skb->dev); + kfree_skb(skb); + } + skb = skb_next; } + spin_unlock_irqrestore(&list->lock, flags); }
static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev, @@ -357,9 +366,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev, write_lock_bh(&tbl->lock); neigh_flush_dev(tbl, dev, skip_perm); pneigh_ifdown_and_unlock(tbl, dev); - - del_timer_sync(&tbl->proxy_timer); - pneigh_queue_purge(&tbl->proxy_queue); + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); + if (skb_queue_empty_lockless(&tbl->proxy_queue)) + del_timer_sync(&tbl->proxy_timer); return 0; }
@@ -1735,7 +1744,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl) /* It is not clean... Fix it to unload IPv6 module safely */ cancel_delayed_work_sync(&tbl->gc_work); del_timer_sync(&tbl->proxy_timer); - pneigh_queue_purge(&tbl->proxy_queue); + pneigh_queue_purge(&tbl->proxy_queue, NULL); neigh_ifdown(tbl, NULL); if (atomic_read(&tbl->entries)) pr_crit("neighbour leakage\n");
From: Juergen Gross jgross@suse.com
[ Upstream commit 7b6670b03641ac308aaa6fa2e6f964ac993b5ea3 ]
When booting under KVM the following error messages are issued:
hypfs.7f5705: The hardware system does not support hypfs hypfs.7a79f0: Initialization of hypfs failed with rc=-61
Demote the severity of the first message from "error" to "info" and issue the second message only in other error cases.
Signed-off-by: Juergen Gross jgross@suse.com Acked-by: Heiko Carstens hca@linux.ibm.com Acked-by: Christian Borntraeger borntraeger@linux.ibm.com Link: https://lore.kernel.org/r/20220620094534.18967-1-jgross@suse.com [arch/s390/hypfs/hypfs_diag.c changed description] Signed-off-by: Alexander Gordeev agordeev@linux.ibm.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/s390/hypfs/hypfs_diag.c | 2 +- arch/s390/hypfs/inode.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/s390/hypfs/hypfs_diag.c b/arch/s390/hypfs/hypfs_diag.c index f0bc4dc3e9bf0..6511d15ace45e 100644 --- a/arch/s390/hypfs/hypfs_diag.c +++ b/arch/s390/hypfs/hypfs_diag.c @@ -437,7 +437,7 @@ __init int hypfs_diag_init(void) int rc;
if (diag204_probe()) { - pr_err("The hardware system does not support hypfs\n"); + pr_info("The hardware system does not support hypfs\n"); return -ENODATA; }
diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c index 5c97f48cea91d..ee919bfc81867 100644 --- a/arch/s390/hypfs/inode.c +++ b/arch/s390/hypfs/inode.c @@ -496,9 +496,9 @@ static int __init hypfs_init(void) hypfs_vm_exit(); fail_hypfs_diag_exit: hypfs_diag_exit(); + pr_err("Initialization of hypfs failed with rc=%i\n", rc); fail_dbfs_exit: hypfs_dbfs_exit(); - pr_err("Initialization of hypfs failed with rc=%i\n", rc); return rc; } device_initcall(hypfs_init)
From: Namjae Jeon linkinjeon@kernel.org
[ Upstream commit 17661ecf6a64eb11ae7f1108fe88686388b2acd5 ]
When an smb client opens a file in a ksmbd share with O_TRUNC, the dos attribute xattr is removed along with the data in the file. This causes the subsequent FSCTL_SET_SPARSE request from the client to fail because ksmbd can't update the dos attribute after setting ATTR_SPARSE_FILE. This patch also fixes the xfstests generic/469 test.
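A rough user-space model of the tightened filter (the prefix strings below are assumptions for illustration; the real values come from ksmbd's headers): only stream xattrs under the user. namespace are removed, so the dos attribute xattr survives the truncating open.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Illustrative prefixes; ksmbd defines the actual strings. */
#define XATTR_USER_PREFIX "user."
#define STREAM_PREFIX     "DosStream."

static bool should_remove_on_trunc(const char *name)
{
    size_t ulen = strlen(XATTR_USER_PREFIX);

    /* Remove only "user.DosStream.*"; keep e.g. the dos attribute xattr. */
    return strncmp(name, XATTR_USER_PREFIX, ulen) == 0 &&
           strncmp(name + ulen, STREAM_PREFIX, strlen(STREAM_PREFIX)) == 0;
}

int main(void)
{
    const char *names[] = { "user.DosStream.foo:$DATA", "user.DOSATTRIB", "security.selinux" };

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        printf("%-28s -> %s\n", names[i],
               should_remove_on_trunc(names[i]) ? "remove" : "keep");
    return 0;
}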
Signed-off-by: Namjae Jeon linkinjeon@kernel.org Reviewed-by: Hyunchul Lee hyc.lee@gmail.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ksmbd/smb2pdu.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c index 824f17a101a9e..55ee639703ff0 100644 --- a/fs/ksmbd/smb2pdu.c +++ b/fs/ksmbd/smb2pdu.c @@ -2319,15 +2319,15 @@ static int smb2_remove_smb_xattrs(struct path *path) name += strlen(name) + 1) { ksmbd_debug(SMB, "%s, len %zd\n", name, strlen(name));
- if (strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN) && - strncmp(&name[XATTR_USER_PREFIX_LEN], DOS_ATTRIBUTE_PREFIX, - DOS_ATTRIBUTE_PREFIX_LEN) && - strncmp(&name[XATTR_USER_PREFIX_LEN], STREAM_PREFIX, STREAM_PREFIX_LEN)) - continue; - - err = ksmbd_vfs_remove_xattr(user_ns, path->dentry, name); - if (err) - ksmbd_debug(SMB, "remove xattr failed : %s\n", name); + if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN) && + !strncmp(&name[XATTR_USER_PREFIX_LEN], STREAM_PREFIX, + STREAM_PREFIX_LEN)) { + err = ksmbd_vfs_remove_xattr(user_ns, path->dentry, + name); + if (err) + ksmbd_debug(SMB, "remove xattr failed : %s\n", + name); + } } out: kvfree(xattr_list);
From: Evan Quan evan.quan@amd.com
[ Upstream commit 0a2d922a5618377cdf8fa476351362733ef55342 ]
To avoid any potential memory leak.
Signed-off-by: Evan Quan evan.quan@amd.com Reviewed-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c index 918d5c7c2328b..79976921dc46f 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c @@ -3915,6 +3915,7 @@ static const struct pptable_funcs sienna_cichlid_ppt_funcs = { .dump_pptable = sienna_cichlid_dump_pptable, .init_microcode = smu_v11_0_init_microcode, .load_microcode = smu_v11_0_load_microcode, + .fini_microcode = smu_v11_0_fini_microcode, .init_smc_tables = sienna_cichlid_init_smc_tables, .fini_smc_tables = smu_v11_0_fini_smc_tables, .init_power = smu_v11_0_init_power,
From: Ilya Bakoulin Ilya.Bakoulin@amd.com
[ Upstream commit 04fb918bf421b299feaee1006e82921d7d381f18 ]
[Why] Some pixel clock values could cause HDMI TMDS SSCPs to be misaligned between different HDMI lanes when using YCbCr420 10-bit pixel format.
BIOS functions for transmitter/encoder control take the pixel clock in kHz increments, whereas the function for setting the pixel clock works in 100 Hz increments. Setting the pixel clock to a value that is not on a kHz boundary will cause the issue.
[How] Round pixel clock down to nearest kHz in 10/12-bpc cases.
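A small stand-alone sketch of that rounding (numbers are illustrative; the clock is kept in 100 Hz units, so dropping the remainder modulo 10 lands it on a kHz boundary):

#include <stdio.h>

static unsigned int scale_and_round_100hz(unsigned int pix_clk_100hz, int bpc)
{
    unsigned int clk = pix_clk_100hz;

    if (bpc == 10)
        clk = (clk * 5) >> 2;      /* x1.25 scaling for 10-bpc */
    else if (bpc == 12)
        clk = (clk * 6) >> 2;      /* x1.5 scaling for 12-bpc */

    /* 10 units of 100 Hz == 1 kHz; drop the sub-kHz remainder */
    clk -= clk % 10;
    return clk;
}

int main(void)
{
    unsigned int in = 2970001;     /* deliberately off a kHz boundary */

    printf("10-bpc: %u -> %u (100 Hz units)\n", in, scale_and_round_100hz(in, 10));
    printf("12-bpc: %u -> %u (100 Hz units)\n", in, scale_and_round_100hz(in, 12));
    return 0;
}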
Reviewed-by: Aric Cyr Aric.Cyr@amd.com Acked-by: Brian Chang Brian.Chang@amd.com Signed-off-by: Ilya Bakoulin Ilya.Bakoulin@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c b/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c index 054823d12403d..5f1b735da5063 100644 --- a/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c +++ b/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c @@ -545,9 +545,11 @@ static void dce112_get_pix_clk_dividers_helper ( switch (pix_clk_params->color_depth) { case COLOR_DEPTH_101010: actual_pixel_clock_100hz = (actual_pixel_clock_100hz * 5) >> 2; + actual_pixel_clock_100hz -= actual_pixel_clock_100hz % 10; break; case COLOR_DEPTH_121212: actual_pixel_clock_100hz = (actual_pixel_clock_100hz * 6) >> 2; + actual_pixel_clock_100hz -= actual_pixel_clock_100hz % 10; break; case COLOR_DEPTH_161616: actual_pixel_clock_100hz = actual_pixel_clock_100hz * 2;
From: Dusica Milinkovic Dusica.Milinkovic@amd.com
[ Upstream commit 373008bfc9cdb0f050258947fa5a095f0657e1bc ]
[Why] When multiple VFs run a benchmark (Luxmark), a KIQ error timeout is observed. It happens because all of the VFs issue the TLB invalidation at the same time. Although each VF has its own invalidate register set, on the hardware side the invalidate requests are queued for execution.
[How] In the case of 12 VFs, increase the timeout to 12 * 100 ms.
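The selection itself is just a conditional on whether the device is an SR-IOV VF; a tiny stand-alone model (values mirror the patch, names are simplified):

#include <stdio.h>
#include <stdbool.h>

#define SRIOV_USEC_TIMEOUT 1200000U   /* 12 VFs * 100 ms, in microseconds */

static unsigned int pick_kiq_timeout(bool is_sriov_vf, unsigned int default_usec)
{
    return is_sriov_vf ? SRIOV_USEC_TIMEOUT : default_usec;
}

int main(void)
{
    printf("bare metal: %u us\n", pick_kiq_timeout(false, 100000));
    printf("SR-IOV VF : %u us\n", pick_kiq_timeout(true, 100000));
    return 0;
}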
Signed-off-by: Dusica Milinkovic Dusica.Milinkovic@amd.com Acked-by: Shaoyun Liu shaoyun.liu@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++- 3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 5f95d03fd46a0..4f62f422bcb78 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -312,7 +312,7 @@ enum amdgpu_kiq_irq { AMDGPU_CP_KIQ_IRQ_DRIVER0 = 0, AMDGPU_CP_KIQ_IRQ_LAST }; - +#define SRIOV_USEC_TIMEOUT 1200000 /* wait 12 * 100ms for SRIOV */ #define MAX_KIQ_REG_WAIT 5000 /* in usecs, 5ms */ #define MAX_KIQ_REG_BAILOUT_INTERVAL 5 /* in msecs, 5ms */ #define MAX_KIQ_REG_TRY 1000 diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c index 93a4da4284ede..9c07ec8b97327 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c @@ -414,6 +414,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint32_t seq; uint16_t queried_pasid; bool ret; + u32 usec_timeout = amdgpu_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT : adev->usec_timeout; struct amdgpu_ring *ring = &adev->gfx.kiq.ring; struct amdgpu_kiq *kiq = &adev->gfx.kiq;
@@ -432,7 +433,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
amdgpu_ring_commit(ring); spin_unlock(&adev->gfx.kiq.ring_lock); - r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout); + r = amdgpu_fence_wait_polling(ring, seq, usec_timeout); if (r < 1) { dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r); return -ETIME; diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c index 0e731016921be..70d24b522df8d 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c @@ -863,6 +863,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint32_t seq; uint16_t queried_pasid; bool ret; + u32 usec_timeout = amdgpu_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT : adev->usec_timeout; struct amdgpu_ring *ring = &adev->gfx.kiq.ring; struct amdgpu_kiq *kiq = &adev->gfx.kiq;
@@ -902,7 +903,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
amdgpu_ring_commit(ring); spin_unlock(&adev->gfx.kiq.ring_lock); - r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout); + r = amdgpu_fence_wait_polling(ring, seq, usec_timeout); if (r < 1) { dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r); up_read(&adev->reset_sem);
From: Charlene Liu Charlene.Liu@amd.com
[ Upstream commit 5544a7b5a07480192eb5fd3536462faed2c21528 ]
[why] This is to ensure that the driver will not reprogram hvm_prefetch_req again if it has already completed.
Reviewed-by: Martin Leung Martin.Leung@amd.com Acked-by: Brian Chang Brian.Chang@amd.com Signed-off-by: Charlene Liu Charlene.Liu@amd.com Tested-by: Daniel Wheeler daniel.wheeler@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c index 36044cb8ec834..1c0f56d8ba8bb 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c +++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_hubbub.c @@ -67,9 +67,15 @@ static uint32_t convert_and_clamp( void dcn21_dchvm_init(struct hubbub *hubbub) { struct dcn20_hubbub *hubbub1 = TO_DCN20_HUBBUB(hubbub); - uint32_t riommu_active; + uint32_t riommu_active, prefetch_done; int i;
+ REG_GET(DCHVM_RIOMMU_STAT0, HOSTVM_PREFETCH_DONE, &prefetch_done); + + if (prefetch_done) { + hubbub->riommu_active = true; + return; + } //Init DCHVM block REG_UPDATE(DCHVM_CTRL0, HOSTVM_INIT_REQ, 1);
From: Geert Uytterhoeven geert@linux-m68k.org
[ Upstream commit aa5762c34213aba7a72dc58e70601370805fa794 ]
NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca68557b09 ("netfilter: provide config option to disable ancient procfs parts") in v3.3.
Signed-off-by: Geert Uytterhoeven geert@linux-m68k.org Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/Kconfig | 1 - 1 file changed, 1 deletion(-)
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 92a747896f808..4f645d51c2573 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -133,7 +133,6 @@ config NF_CONNTRACK_ZONES
config NF_CONNTRACK_PROCFS bool "Supply CT list in procfs (OBSOLETE)" - default y depends on PROC_FS help This option enables for the list of known conntrack entries
From: Florian Westphal fw@strlen.de
[ Upstream commit b71b7bfeac38c7a21c423ddafb29aa6258949df8 ]
"ns1" is a too generic name, use a random suffix to avoid errors when such a netns exists. Also allows to run multiple instances of the script in parallel.
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Sasha Levin sashal@kernel.org --- .../selftests/netfilter/nft_flowtable.sh | 246 +++++++++--------- 1 file changed, 128 insertions(+), 118 deletions(-)
diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh index d4ffebb989f88..c336e6c148d1f 100755 --- a/tools/testing/selftests/netfilter/nft_flowtable.sh +++ b/tools/testing/selftests/netfilter/nft_flowtable.sh @@ -14,6 +14,11 @@ # nft_flowtable.sh -o8000 -l1500 -r2000 #
+sfx=$(mktemp -u "XXXXXXXX") +ns1="ns1-$sfx" +ns2="ns2-$sfx" +nsr1="nsr1-$sfx" +nsr2="nsr2-$sfx"
# Kselftest framework requirement - SKIP code is 4. ksft_skip=4 @@ -36,18 +41,17 @@ checktool (){ checktool "nft --version" "run test without nft tool" checktool "ip -Version" "run test without ip tool" checktool "which nc" "run test without nc (netcat)" -checktool "ip netns add nsr1" "create net namespace" +checktool "ip netns add $nsr1" "create net namespace $nsr1"
-ip netns add ns1 -ip netns add ns2 - -ip netns add nsr2 +ip netns add $ns1 +ip netns add $ns2 +ip netns add $nsr2
cleanup() { - for i in 1 2; do - ip netns del ns$i - ip netns del nsr$i - done + ip netns del $ns1 + ip netns del $ns2 + ip netns del $nsr1 + ip netns del $nsr2
rm -f "$ns1in" "$ns1out" rm -f "$ns2in" "$ns2out" @@ -59,22 +63,21 @@ trap cleanup EXIT
sysctl -q net.netfilter.nf_log_all_netns=1
-ip link add veth0 netns nsr1 type veth peer name eth0 netns ns1 -ip link add veth1 netns nsr1 type veth peer name veth0 netns nsr2 +ip link add veth0 netns $nsr1 type veth peer name eth0 netns $ns1 +ip link add veth1 netns $nsr1 type veth peer name veth0 netns $nsr2
-ip link add veth1 netns nsr2 type veth peer name eth0 netns ns2 +ip link add veth1 netns $nsr2 type veth peer name eth0 netns $ns2
for dev in lo veth0 veth1; do - for i in 1 2; do - ip -net nsr$i link set $dev up - done + ip -net $nsr1 link set $dev up + ip -net $nsr2 link set $dev up done
-ip -net nsr1 addr add 10.0.1.1/24 dev veth0 -ip -net nsr1 addr add dead:1::1/64 dev veth0 +ip -net $nsr1 addr add 10.0.1.1/24 dev veth0 +ip -net $nsr1 addr add dead:1::1/64 dev veth0
-ip -net nsr2 addr add 10.0.2.1/24 dev veth1 -ip -net nsr2 addr add dead:2::1/64 dev veth1 +ip -net $nsr2 addr add 10.0.2.1/24 dev veth1 +ip -net $nsr2 addr add dead:2::1/64 dev veth1
# set different MTUs so we need to push packets coming from ns1 (large MTU) # to ns2 (smaller MTU) to stack either to perform fragmentation (ip_no_pmtu_disc=1), @@ -106,49 +109,56 @@ do esac done
-if ! ip -net nsr1 link set veth0 mtu $omtu; then +if ! ip -net $nsr1 link set veth0 mtu $omtu; then exit 1 fi
-ip -net ns1 link set eth0 mtu $omtu +ip -net $ns1 link set eth0 mtu $omtu
-if ! ip -net nsr2 link set veth1 mtu $rmtu; then +if ! ip -net $nsr2 link set veth1 mtu $rmtu; then exit 1 fi
-ip -net ns2 link set eth0 mtu $rmtu +ip -net $ns2 link set eth0 mtu $rmtu
# transfer-net between nsr1 and nsr2. # these addresses are not used for connections. -ip -net nsr1 addr add 192.168.10.1/24 dev veth1 -ip -net nsr1 addr add fee1:2::1/64 dev veth1 - -ip -net nsr2 addr add 192.168.10.2/24 dev veth0 -ip -net nsr2 addr add fee1:2::2/64 dev veth0 - -for i in 1 2; do - ip netns exec nsr$i sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null - ip netns exec nsr$i sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null - - ip -net ns$i link set lo up - ip -net ns$i link set eth0 up - ip -net ns$i addr add 10.0.$i.99/24 dev eth0 - ip -net ns$i route add default via 10.0.$i.1 - ip -net ns$i addr add dead:$i::99/64 dev eth0 - ip -net ns$i route add default via dead:$i::1 - if ! ip netns exec ns$i sysctl net.ipv4.tcp_no_metrics_save=1 > /dev/null; then +ip -net $nsr1 addr add 192.168.10.1/24 dev veth1 +ip -net $nsr1 addr add fee1:2::1/64 dev veth1 + +ip -net $nsr2 addr add 192.168.10.2/24 dev veth0 +ip -net $nsr2 addr add fee1:2::2/64 dev veth0 + +for i in 0 1; do + ip netns exec $nsr1 sysctl net.ipv4.conf.veth$i.forwarding=1 > /dev/null + ip netns exec $nsr2 sysctl net.ipv4.conf.veth$i.forwarding=1 > /dev/null +done + +for ns in $ns1 $ns2;do + ip -net $ns link set lo up + ip -net $ns link set eth0 up + + if ! ip netns exec $ns sysctl net.ipv4.tcp_no_metrics_save=1 > /dev/null; then echo "ERROR: Check Originator/Responder values (problem during address addition)" exit 1 fi - # don't set ip DF bit for first two tests - ip netns exec ns$i sysctl net.ipv4.ip_no_pmtu_disc=1 > /dev/null + ip netns exec $ns sysctl net.ipv4.ip_no_pmtu_disc=1 > /dev/null done
-ip -net nsr1 route add default via 192.168.10.2 -ip -net nsr2 route add default via 192.168.10.1 +ip -net $ns1 addr add 10.0.1.99/24 dev eth0 +ip -net $ns2 addr add 10.0.2.99/24 dev eth0 +ip -net $ns1 route add default via 10.0.1.1 +ip -net $ns2 route add default via 10.0.2.1 +ip -net $ns1 addr add dead:1::99/64 dev eth0 +ip -net $ns2 addr add dead:2::99/64 dev eth0 +ip -net $ns1 route add default via dead:1::1 +ip -net $ns2 route add default via dead:2::1 + +ip -net $nsr1 route add default via 192.168.10.2 +ip -net $nsr2 route add default via 192.168.10.1
-ip netns exec nsr1 nft -f - <<EOF +ip netns exec $nsr1 nft -f - <<EOF table inet filter { flowtable f1 { hook ingress priority 0 @@ -197,18 +207,18 @@ if [ $? -ne 0 ]; then fi
# test basic connectivity -if ! ip netns exec ns1 ping -c 1 -q 10.0.2.99 > /dev/null; then - echo "ERROR: ns1 cannot reach ns2" 1>&2 +if ! ip netns exec $ns1 ping -c 1 -q 10.0.2.99 > /dev/null; then + echo "ERROR: $ns1 cannot reach ns2" 1>&2 exit 1 fi
-if ! ip netns exec ns2 ping -c 1 -q 10.0.1.99 > /dev/null; then - echo "ERROR: ns2 cannot reach ns1" 1>&2 +if ! ip netns exec $ns2 ping -c 1 -q 10.0.1.99 > /dev/null; then + echo "ERROR: $ns2 cannot reach $ns1" 1>&2 exit 1 fi
if [ $ret -eq 0 ];then - echo "PASS: netns routing/connectivity: ns1 can reach ns2" + echo "PASS: netns routing/connectivity: $ns1 can reach $ns2" fi
ns1in=$(mktemp) @@ -312,24 +322,24 @@ make_file "$ns2in"
# First test: # No PMTU discovery, nsr1 is expected to fragment packets from ns1 to ns2 as needed. -if test_tcp_forwarding ns1 ns2; then +if test_tcp_forwarding $ns1 $ns2; then echo "PASS: flow offloaded for ns1/ns2" else echo "FAIL: flow offload for ns1/ns2:" 1>&2 - ip netns exec nsr1 nft list ruleset + ip netns exec $nsr1 nft list ruleset ret=1 fi
# delete default route, i.e. ns2 won't be able to reach ns1 and # will depend on ns1 being masqueraded in nsr1. # expect ns1 has nsr1 address. -ip -net ns2 route del default via 10.0.2.1 -ip -net ns2 route del default via dead:2::1 -ip -net ns2 route add 192.168.10.1 via 10.0.2.1 +ip -net $ns2 route del default via 10.0.2.1 +ip -net $ns2 route del default via dead:2::1 +ip -net $ns2 route add 192.168.10.1 via 10.0.2.1
# Second test: # Same, but with NAT enabled. -ip netns exec nsr1 nft -f - <<EOF +ip netns exec $nsr1 nft -f - <<EOF table ip nat { chain prerouting { type nat hook prerouting priority 0; policy accept; @@ -343,47 +353,47 @@ table ip nat { } EOF
-if test_tcp_forwarding_nat ns1 ns2; then +if test_tcp_forwarding_nat $ns1 $ns2; then echo "PASS: flow offloaded for ns1/ns2 with NAT" else echo "FAIL: flow offload for ns1/ns2 with NAT" 1>&2 - ip netns exec nsr1 nft list ruleset + ip netns exec $nsr1 nft list ruleset ret=1 fi
# Third test: # Same as second test, but with PMTU discovery enabled. -handle=$(ip netns exec nsr1 nft -a list table inet filter | grep something-to-grep-for | cut -d # -f 2) +handle=$(ip netns exec $nsr1 nft -a list table inet filter | grep something-to-grep-for | cut -d # -f 2)
-if ! ip netns exec nsr1 nft delete rule inet filter forward $handle; then +if ! ip netns exec $nsr1 nft delete rule inet filter forward $handle; then echo "FAIL: Could not delete large-packet accept rule" exit 1 fi
-ip netns exec ns1 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null -ip netns exec ns2 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null +ip netns exec $ns1 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null +ip netns exec $ns2 sysctl net.ipv4.ip_no_pmtu_disc=0 > /dev/null
-if test_tcp_forwarding_nat ns1 ns2; then +if test_tcp_forwarding_nat $ns1 $ns2; then echo "PASS: flow offloaded for ns1/ns2 with NAT and pmtu discovery" else echo "FAIL: flow offload for ns1/ns2 with NAT and pmtu discovery" 1>&2 - ip netns exec nsr1 nft list ruleset + ip netns exec $nsr1 nft list ruleset fi
# Another test: # Add bridge interface br0 to Router1, with NAT enabled. -ip -net nsr1 link add name br0 type bridge -ip -net nsr1 addr flush dev veth0 -ip -net nsr1 link set up dev veth0 -ip -net nsr1 link set veth0 master br0 -ip -net nsr1 addr add 10.0.1.1/24 dev br0 -ip -net nsr1 addr add dead:1::1/64 dev br0 -ip -net nsr1 link set up dev br0 +ip -net $nsr1 link add name br0 type bridge +ip -net $nsr1 addr flush dev veth0 +ip -net $nsr1 link set up dev veth0 +ip -net $nsr1 link set veth0 master br0 +ip -net $nsr1 addr add 10.0.1.1/24 dev br0 +ip -net $nsr1 addr add dead:1::1/64 dev br0 +ip -net $nsr1 link set up dev br0
-ip netns exec nsr1 sysctl net.ipv4.conf.br0.forwarding=1 > /dev/null +ip netns exec $nsr1 sysctl net.ipv4.conf.br0.forwarding=1 > /dev/null
# br0 with NAT enabled. -ip netns exec nsr1 nft -f - <<EOF +ip netns exec $nsr1 nft -f - <<EOF flush table ip nat table ip nat { chain prerouting { @@ -398,59 +408,59 @@ table ip nat { } EOF
-if test_tcp_forwarding_nat ns1 ns2; then +if test_tcp_forwarding_nat $ns1 $ns2; then echo "PASS: flow offloaded for ns1/ns2 with bridge NAT" else echo "FAIL: flow offload for ns1/ns2 with bridge NAT" 1>&2 - ip netns exec nsr1 nft list ruleset + ip netns exec $nsr1 nft list ruleset ret=1 fi
# Another test: # Add bridge interface br0 to Router1, with NAT and VLAN. -ip -net nsr1 link set veth0 nomaster -ip -net nsr1 link set down dev veth0 -ip -net nsr1 link add link veth0 name veth0.10 type vlan id 10 -ip -net nsr1 link set up dev veth0 -ip -net nsr1 link set up dev veth0.10 -ip -net nsr1 link set veth0.10 master br0 - -ip -net ns1 addr flush dev eth0 -ip -net ns1 link add link eth0 name eth0.10 type vlan id 10 -ip -net ns1 link set eth0 up -ip -net ns1 link set eth0.10 up -ip -net ns1 addr add 10.0.1.99/24 dev eth0.10 -ip -net ns1 route add default via 10.0.1.1 -ip -net ns1 addr add dead:1::99/64 dev eth0.10 - -if test_tcp_forwarding_nat ns1 ns2; then +ip -net $nsr1 link set veth0 nomaster +ip -net $nsr1 link set down dev veth0 +ip -net $nsr1 link add link veth0 name veth0.10 type vlan id 10 +ip -net $nsr1 link set up dev veth0 +ip -net $nsr1 link set up dev veth0.10 +ip -net $nsr1 link set veth0.10 master br0 + +ip -net $ns1 addr flush dev eth0 +ip -net $ns1 link add link eth0 name eth0.10 type vlan id 10 +ip -net $ns1 link set eth0 up +ip -net $ns1 link set eth0.10 up +ip -net $ns1 addr add 10.0.1.99/24 dev eth0.10 +ip -net $ns1 route add default via 10.0.1.1 +ip -net $ns1 addr add dead:1::99/64 dev eth0.10 + +if test_tcp_forwarding_nat $ns1 $ns2; then echo "PASS: flow offloaded for ns1/ns2 with bridge NAT and VLAN" else echo "FAIL: flow offload for ns1/ns2 with bridge NAT and VLAN" 1>&2 - ip netns exec nsr1 nft list ruleset + ip netns exec $nsr1 nft list ruleset ret=1 fi
# restore test topology (remove bridge and VLAN) -ip -net nsr1 link set veth0 nomaster -ip -net nsr1 link set veth0 down -ip -net nsr1 link set veth0.10 down -ip -net nsr1 link delete veth0.10 type vlan -ip -net nsr1 link delete br0 type bridge -ip -net ns1 addr flush dev eth0.10 -ip -net ns1 link set eth0.10 down -ip -net ns1 link set eth0 down -ip -net ns1 link delete eth0.10 type vlan +ip -net $nsr1 link set veth0 nomaster +ip -net $nsr1 link set veth0 down +ip -net $nsr1 link set veth0.10 down +ip -net $nsr1 link delete veth0.10 type vlan +ip -net $nsr1 link delete br0 type bridge +ip -net $ns1 addr flush dev eth0.10 +ip -net $ns1 link set eth0.10 down +ip -net $ns1 link set eth0 down +ip -net $ns1 link delete eth0.10 type vlan
# restore address in ns1 and nsr1 -ip -net ns1 link set eth0 up -ip -net ns1 addr add 10.0.1.99/24 dev eth0 -ip -net ns1 route add default via 10.0.1.1 -ip -net ns1 addr add dead:1::99/64 dev eth0 -ip -net ns1 route add default via dead:1::1 -ip -net nsr1 addr add 10.0.1.1/24 dev veth0 -ip -net nsr1 addr add dead:1::1/64 dev veth0 -ip -net nsr1 link set up dev veth0 +ip -net $ns1 link set eth0 up +ip -net $ns1 addr add 10.0.1.99/24 dev eth0 +ip -net $ns1 route add default via 10.0.1.1 +ip -net $ns1 addr add dead:1::99/64 dev eth0 +ip -net $ns1 route add default via dead:1::1 +ip -net $nsr1 addr add 10.0.1.1/24 dev veth0 +ip -net $nsr1 addr add dead:1::1/64 dev veth0 +ip -net $nsr1 link set up dev veth0
KEY_SHA="0x"$(ps -xaf | sha1sum | cut -d " " -f 1) KEY_AES="0x"$(ps -xaf | md5sum | cut -d " " -f 1) @@ -480,23 +490,23 @@ do_esp() {
}
-do_esp nsr1 192.168.10.1 192.168.10.2 10.0.1.0/24 10.0.2.0/24 $SPI1 $SPI2 +do_esp $nsr1 192.168.10.1 192.168.10.2 10.0.1.0/24 10.0.2.0/24 $SPI1 $SPI2
-do_esp nsr2 192.168.10.2 192.168.10.1 10.0.2.0/24 10.0.1.0/24 $SPI2 $SPI1 +do_esp $nsr2 192.168.10.2 192.168.10.1 10.0.2.0/24 10.0.1.0/24 $SPI2 $SPI1
-ip netns exec nsr1 nft delete table ip nat +ip netns exec $nsr1 nft delete table ip nat
# restore default routes -ip -net ns2 route del 192.168.10.1 via 10.0.2.1 -ip -net ns2 route add default via 10.0.2.1 -ip -net ns2 route add default via dead:2::1 +ip -net $ns2 route del 192.168.10.1 via 10.0.2.1 +ip -net $ns2 route add default via 10.0.2.1 +ip -net $ns2 route add default via dead:2::1
-if test_tcp_forwarding ns1 ns2; then +if test_tcp_forwarding $ns1 $ns2; then echo "PASS: ipsec tunnel mode for ns1/ns2" else echo "FAIL: ipsec tunnel mode for ns1/ns2" - ip netns exec nsr1 nft list ruleset 1>&2 - ip netns exec nsr1 cat /proc/net/xfrm_stat 1>&2 + ip netns exec $nsr1 nft list ruleset 1>&2 + ip netns exec $nsr1 cat /proc/net/xfrm_stat 1>&2 fi
exit $ret
From: Josef Bacik josef@toxicpanda.com
[ Upstream commit 0a27a0474d146eb79e09ec88bf0d4229f4cfc1b8 ]
These definitions exist in disk-io.c, which is not related to locking. Move them over to locking.h/.c where they make more sense.
Reviewed-by: Johannes Thumshirn johannes.thumshirn@wdc.com Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/disk-io.c | 82 ---------------------------------------------- fs/btrfs/disk-io.h | 10 ------ fs/btrfs/locking.c | 80 ++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/locking.h | 9 +++++ 4 files changed, 89 insertions(+), 92 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 247d7f9ced3b0..c76c360bece59 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -121,88 +121,6 @@ struct async_submit_bio { blk_status_t status; };
-/* - * Lockdep class keys for extent_buffer->lock's in this root. For a given - * eb, the lockdep key is determined by the btrfs_root it belongs to and - * the level the eb occupies in the tree. - * - * Different roots are used for different purposes and may nest inside each - * other and they require separate keysets. As lockdep keys should be - * static, assign keysets according to the purpose of the root as indicated - * by btrfs_root->root_key.objectid. This ensures that all special purpose - * roots have separate keysets. - * - * Lock-nesting across peer nodes is always done with the immediate parent - * node locked thus preventing deadlock. As lockdep doesn't know this, use - * subclass to avoid triggering lockdep warning in such cases. - * - * The key is set by the readpage_end_io_hook after the buffer has passed - * csum validation but before the pages are unlocked. It is also set by - * btrfs_init_new_buffer on freshly allocated blocks. - * - * We also add a check to make sure the highest level of the tree is the - * same as our lockdep setup here. If BTRFS_MAX_LEVEL changes, this code - * needs update as well. - */ -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# if BTRFS_MAX_LEVEL != 8 -# error -# endif - -#define DEFINE_LEVEL(stem, level) \ - .names[level] = "btrfs-" stem "-0" #level, - -#define DEFINE_NAME(stem) \ - DEFINE_LEVEL(stem, 0) \ - DEFINE_LEVEL(stem, 1) \ - DEFINE_LEVEL(stem, 2) \ - DEFINE_LEVEL(stem, 3) \ - DEFINE_LEVEL(stem, 4) \ - DEFINE_LEVEL(stem, 5) \ - DEFINE_LEVEL(stem, 6) \ - DEFINE_LEVEL(stem, 7) - -static struct btrfs_lockdep_keyset { - u64 id; /* root objectid */ - /* Longest entry: btrfs-free-space-00 */ - char names[BTRFS_MAX_LEVEL][20]; - struct lock_class_key keys[BTRFS_MAX_LEVEL]; -} btrfs_lockdep_keysets[] = { - { .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") }, - { .id = BTRFS_EXTENT_TREE_OBJECTID, DEFINE_NAME("extent") }, - { .id = BTRFS_CHUNK_TREE_OBJECTID, DEFINE_NAME("chunk") }, - { .id = BTRFS_DEV_TREE_OBJECTID, DEFINE_NAME("dev") }, - { .id = BTRFS_CSUM_TREE_OBJECTID, DEFINE_NAME("csum") }, - { .id = BTRFS_QUOTA_TREE_OBJECTID, DEFINE_NAME("quota") }, - { .id = BTRFS_TREE_LOG_OBJECTID, DEFINE_NAME("log") }, - { .id = BTRFS_TREE_RELOC_OBJECTID, DEFINE_NAME("treloc") }, - { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, DEFINE_NAME("dreloc") }, - { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") }, - { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") }, - { .id = 0, DEFINE_NAME("tree") }, -}; - -#undef DEFINE_LEVEL -#undef DEFINE_NAME - -void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, - int level) -{ - struct btrfs_lockdep_keyset *ks; - - BUG_ON(level >= ARRAY_SIZE(ks->keys)); - - /* find the matching keyset, id 0 is the default entry */ - for (ks = btrfs_lockdep_keysets; ks->id; ks++) - if (ks->id == objectid) - break; - - lockdep_set_class_and_name(&eb->lock, - &ks->keys[level], ks->names[level]); -} - -#endif - /* * Compute the csum of a btree block and store the result to provided buffer. */ diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 0e7e9526b6a83..1b8fd3deafc92 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -140,14 +140,4 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root); int __init btrfs_end_io_wq_init(void); void __cold btrfs_end_io_wq_exit(void);
-#ifdef CONFIG_DEBUG_LOCK_ALLOC -void btrfs_set_buffer_lockdep_class(u64 objectid, - struct extent_buffer *eb, int level); -#else -static inline void btrfs_set_buffer_lockdep_class(u64 objectid, - struct extent_buffer *eb, int level) -{ -} -#endif - #endif diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index 33461b4f9c8b5..5747c63929df7 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -13,6 +13,86 @@ #include "extent_io.h" #include "locking.h"
+/* + * Lockdep class keys for extent_buffer->lock's in this root. For a given + * eb, the lockdep key is determined by the btrfs_root it belongs to and + * the level the eb occupies in the tree. + * + * Different roots are used for different purposes and may nest inside each + * other and they require separate keysets. As lockdep keys should be + * static, assign keysets according to the purpose of the root as indicated + * by btrfs_root->root_key.objectid. This ensures that all special purpose + * roots have separate keysets. + * + * Lock-nesting across peer nodes is always done with the immediate parent + * node locked thus preventing deadlock. As lockdep doesn't know this, use + * subclass to avoid triggering lockdep warning in such cases. + * + * The key is set by the readpage_end_io_hook after the buffer has passed + * csum validation but before the pages are unlocked. It is also set by + * btrfs_init_new_buffer on freshly allocated blocks. + * + * We also add a check to make sure the highest level of the tree is the + * same as our lockdep setup here. If BTRFS_MAX_LEVEL changes, this code + * needs update as well. + */ +#ifdef CONFIG_DEBUG_LOCK_ALLOC +#if BTRFS_MAX_LEVEL != 8 +#error +#endif + +#define DEFINE_LEVEL(stem, level) \ + .names[level] = "btrfs-" stem "-0" #level, + +#define DEFINE_NAME(stem) \ + DEFINE_LEVEL(stem, 0) \ + DEFINE_LEVEL(stem, 1) \ + DEFINE_LEVEL(stem, 2) \ + DEFINE_LEVEL(stem, 3) \ + DEFINE_LEVEL(stem, 4) \ + DEFINE_LEVEL(stem, 5) \ + DEFINE_LEVEL(stem, 6) \ + DEFINE_LEVEL(stem, 7) + +static struct btrfs_lockdep_keyset { + u64 id; /* root objectid */ + /* Longest entry: btrfs-free-space-00 */ + char names[BTRFS_MAX_LEVEL][20]; + struct lock_class_key keys[BTRFS_MAX_LEVEL]; +} btrfs_lockdep_keysets[] = { + { .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") }, + { .id = BTRFS_EXTENT_TREE_OBJECTID, DEFINE_NAME("extent") }, + { .id = BTRFS_CHUNK_TREE_OBJECTID, DEFINE_NAME("chunk") }, + { .id = BTRFS_DEV_TREE_OBJECTID, DEFINE_NAME("dev") }, + { .id = BTRFS_CSUM_TREE_OBJECTID, DEFINE_NAME("csum") }, + { .id = BTRFS_QUOTA_TREE_OBJECTID, DEFINE_NAME("quota") }, + { .id = BTRFS_TREE_LOG_OBJECTID, DEFINE_NAME("log") }, + { .id = BTRFS_TREE_RELOC_OBJECTID, DEFINE_NAME("treloc") }, + { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, DEFINE_NAME("dreloc") }, + { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") }, + { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") }, + { .id = 0, DEFINE_NAME("tree") }, +}; + +#undef DEFINE_LEVEL +#undef DEFINE_NAME + +void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, int level) +{ + struct btrfs_lockdep_keyset *ks; + + BUG_ON(level >= ARRAY_SIZE(ks->keys)); + + /* Find the matching keyset, id 0 is the default entry */ + for (ks = btrfs_lockdep_keysets; ks->id; ks++) + if (ks->id == objectid) + break; + + lockdep_set_class_and_name(&eb->lock, &ks->keys[level], ks->names[level]); +} + +#endif + /* * Extent buffer locking * ===================== diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index a2e1f1f5c6e34..97370ec0cd297 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -130,4 +130,13 @@ void btrfs_drew_write_unlock(struct btrfs_drew_lock *lock); void btrfs_drew_read_lock(struct btrfs_drew_lock *lock); void btrfs_drew_read_unlock(struct btrfs_drew_lock *lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC +void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, int level); +#else +static inline void btrfs_set_buffer_lockdep_class(u64 objectid, + struct extent_buffer *eb, int level) +{ +} +#endif + #endif
From: Josef Bacik josef@toxicpanda.com
[ Upstream commit b40130b23ca4a08c5785d5a3559805916bddba3c ]
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace:
dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally, even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting a block that has an owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root and will create a lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merge.
This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
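A compact user-space model of the owner remapping described above (names and objectid values are simplified stand-ins for the btrfs identifiers): reloc-tree buffers normally get the plain fs-tree lockdep owner, and only while the root carries the temporary reset flag that replace_path() sets around its search do they keep the reloc owner.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative objectids; the real constants live in the btrfs headers. */
#define FS_TREE_OBJECTID     5ULL
#define TREE_RELOC_OBJECTID  ((uint64_t)-8)

struct root_model {
    uint64_t objectid;
    bool reset_lockdep_class;   /* BTRFS_ROOT_RESET_LOCKDEP_CLASS stand-in */
};

/* Owner used for the lockdep class of a newly initialized buffer. */
static uint64_t lockdep_owner(const struct root_model *root, uint64_t owner)
{
    if (owner == TREE_RELOC_OBJECTID && !root->reset_lockdep_class)
        return FS_TREE_OBJECTID;    /* treat reloc snapshots like fs trees */
    return owner;
}

int main(void)
{
    struct root_model reloc = { .objectid = TREE_RELOC_OBJECTID };

    printf("normal path : owner %llu\n",
           (unsigned long long)lockdep_owner(&reloc, TREE_RELOC_OBJECTID));

    reloc.reset_lockdep_class = true;   /* the replace_path() window */
    printf("replace_path: owner %llu\n",
           (unsigned long long)lockdep_owner(&reloc, TREE_RELOC_OBJECTID));
    return 0;
}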
Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/ctree.c | 3 +++ fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 18 +++++++++++++++++- fs/btrfs/extent_io.c | 11 ++++++++++- fs/btrfs/locking.c | 11 +++++++++++ fs/btrfs/locking.h | 5 +++++ fs/btrfs/relocation.c | 2 ++ 7 files changed, 50 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 341ce90d24b15..fb7e331b69756 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1938,6 +1938,9 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
if (!p->skip_locking) { level = btrfs_header_level(b); + + btrfs_maybe_reset_lockdep_class(root, b); + if (level <= write_lock_level) { btrfs_tree_lock(b); p->locks[level] = BTRFS_WRITE_LOCK; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index cd72570d11f65..319af4c4fffbe 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1105,6 +1105,8 @@ enum { BTRFS_ROOT_QGROUP_FLUSHING, /* This root has a drop operation that was started previously. */ BTRFS_ROOT_UNFINISHED_DROP, + /* This reloc root needs to have its buffers lockdep class reset. */ + BTRFS_ROOT_RESET_LOCKDEP_CLASS, };
static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 248ea15c97346..c71f6480d4d4c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4781,6 +4781,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct extent_buffer *buf; + u64 lockdep_owner = owner;
buf = btrfs_find_create_tree_block(fs_info, bytenr, owner, level); if (IS_ERR(buf)) @@ -4799,12 +4800,27 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, return ERR_PTR(-EUCLEAN); }
+ /* + * The reloc trees are just snapshots, so we need them to appear to be + * just like any other fs tree WRT lockdep. + * + * The exception however is in replace_path() in relocation, where we + * hold the lock on the original fs root and then search for the reloc + * root. At that point we need to make sure any reloc root buffers are + * set to the BTRFS_TREE_RELOC_OBJECTID lockdep class in order to make + * lockdep happy. + */ + if (lockdep_owner == BTRFS_TREE_RELOC_OBJECTID && + !test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state)) + lockdep_owner = BTRFS_FS_TREE_OBJECTID; + /* * This needs to stay, because we could allocate a freed block from an * old tree into a new tree, so we need to make sure this new block is * set to the appropriate level and owner. */ - btrfs_set_buffer_lockdep_class(owner, buf, level); + btrfs_set_buffer_lockdep_class(lockdep_owner, buf, level); + __btrfs_tree_lock(buf, nest); btrfs_clean_tree_block(buf); clear_bit(EXTENT_BUFFER_STALE, &buf->bflags); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index a72a8d4d4a72e..7bd704779a99b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -6109,6 +6109,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, struct extent_buffer *exists = NULL; struct page *p; struct address_space *mapping = fs_info->btree_inode->i_mapping; + u64 lockdep_owner = owner_root; int uptodate = 1; int ret;
@@ -6143,7 +6144,15 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, eb = __alloc_extent_buffer(fs_info, start, len); if (!eb) return ERR_PTR(-ENOMEM); - btrfs_set_buffer_lockdep_class(owner_root, eb, level); + + /* + * The reloc trees are just snapshots, so we need them to appear to be + * just like any other fs tree WRT lockdep. + */ + if (lockdep_owner == BTRFS_TREE_RELOC_OBJECTID) + lockdep_owner = BTRFS_FS_TREE_OBJECTID; + + btrfs_set_buffer_lockdep_class(lockdep_owner, eb, level);
num_pages = num_extent_pages(eb); for (i = 0; i < num_pages; i++, index++) { diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index 5747c63929df7..9063072b399bd 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -91,6 +91,13 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, int lockdep_set_class_and_name(&eb->lock, &ks->keys[level], ks->names[level]); }
+void btrfs_maybe_reset_lockdep_class(struct btrfs_root *root, struct extent_buffer *eb) +{ + if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state)) + btrfs_set_buffer_lockdep_class(root->root_key.objectid, + eb, btrfs_header_level(eb)); +} + #endif
/* @@ -244,6 +251,8 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
while (1) { eb = btrfs_root_node(root); + + btrfs_maybe_reset_lockdep_class(root, eb); btrfs_tree_lock(eb); if (eb == root->node) break; @@ -265,6 +274,8 @@ struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
while (1) { eb = btrfs_root_node(root); + + btrfs_maybe_reset_lockdep_class(root, eb); btrfs_tree_read_lock(eb); if (eb == root->node) break; diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index 97370ec0cd297..26a2f962c268e 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -132,11 +132,16 @@ void btrfs_drew_read_unlock(struct btrfs_drew_lock *lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, int level); +void btrfs_maybe_reset_lockdep_class(struct btrfs_root *root, struct extent_buffer *eb); #else static inline void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb, int level) { } +static inline void btrfs_maybe_reset_lockdep_class(struct btrfs_root *root, + struct extent_buffer *eb) +{ +} #endif
#endif diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 673e11fcf3fc9..becf3396d533d 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1326,7 +1326,9 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc, btrfs_release_path(path);
path->lowest_level = level; + set_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &src->state); ret = btrfs_search_slot(trans, src, &key, path, 0, 1); + clear_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &src->state); path->lowest_level = 0; if (ret) { if (ret > 0)
From: Josef Bacik josef@toxicpanda.com
[ Upstream commit 899b7f69f244e539ea5df1b4d756046337de44a5 ]
We're seeing a weird problem in production where we have overlapping extent items in the extent tree. It's unclear where these are coming from, and in debugging we realized there's no check in the tree checker for this sort of problem. Add a check to the tree-checker to make sure that the extents do not overlap each other.
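A hedged stand-alone model of the check (key-type values are illustrative; the real constants and error path live in the btrfs headers and tree-checker): the previous extent ends at objectid plus nodesize for a metadata item or plus its offset for an extent item, and an overlap exists when that end runs past the current item's start.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define EXTENT_ITEM_KEY   168
#define METADATA_ITEM_KEY 169

struct item_key { uint64_t objectid; uint8_t type; uint64_t offset; };

static bool extents_overlap(const struct item_key *prev, const struct item_key *cur,
                            uint64_t nodesize)
{
    uint64_t prev_end = prev->objectid;

    if (prev->type == METADATA_ITEM_KEY)
        prev_end += nodesize;       /* metadata items span one node */
    else
        prev_end += prev->offset;   /* extent items record their length */

    return prev_end > cur->objectid;
}

int main(void)
{
    struct item_key prev = { .objectid = 1048576, .type = EXTENT_ITEM_KEY, .offset = 16384 };
    struct item_key cur  = { .objectid = 1056768, .type = EXTENT_ITEM_KEY, .offset = 16384 };

    /* prev ends at 1064960 > 1056768, so this pair would be rejected. */
    printf("overlap: %s\n", extents_overlap(&prev, &cur, 16384) ? "yes" : "no");
    return 0;
}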
Reviewed-by: Qu Wenruo wqu@suse.com Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/tree-checker.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c index 51382d2be3d44..a84d2d4895104 100644 --- a/fs/btrfs/tree-checker.c +++ b/fs/btrfs/tree-checker.c @@ -1216,7 +1216,8 @@ static void extent_err(const struct extent_buffer *eb, int slot, }
static int check_extent_item(struct extent_buffer *leaf, - struct btrfs_key *key, int slot) + struct btrfs_key *key, int slot, + struct btrfs_key *prev_key) { struct btrfs_fs_info *fs_info = leaf->fs_info; struct btrfs_extent_item *ei; @@ -1436,6 +1437,26 @@ static int check_extent_item(struct extent_buffer *leaf, total_refs, inline_refs); return -EUCLEAN; } + + if ((prev_key->type == BTRFS_EXTENT_ITEM_KEY) || + (prev_key->type == BTRFS_METADATA_ITEM_KEY)) { + u64 prev_end = prev_key->objectid; + + if (prev_key->type == BTRFS_METADATA_ITEM_KEY) + prev_end += fs_info->nodesize; + else + prev_end += prev_key->offset; + + if (unlikely(prev_end > key->objectid)) { + extent_err(leaf, slot, + "previous extent [%llu %u %llu] overlaps current extent [%llu %u %llu]", + prev_key->objectid, prev_key->type, + prev_key->offset, key->objectid, key->type, + key->offset); + return -EUCLEAN; + } + } + return 0; }
@@ -1604,7 +1625,7 @@ static int check_leaf_item(struct extent_buffer *leaf, break; case BTRFS_EXTENT_ITEM_KEY: case BTRFS_METADATA_ITEM_KEY: - ret = check_extent_item(leaf, key, slot); + ret = check_extent_item(leaf, key, slot, prev_key); break; case BTRFS_TREE_BLOCK_REF_KEY: case BTRFS_SHARED_DATA_REF_KEY:
From: Kuniyuki Iwashima kuniyu@amazon.com
commit 9c80e79906b4ca440d09e7f116609262bb747909 upstream.
The assumption in __disable_kprobe() is wrong, and it could try to disarm an already disarmed kprobe and fire the WARN_ONCE() below. [0] We can easily reproduce this issue.
1. Write 0 to /sys/kernel/debug/kprobes/enabled.
# echo 0 > /sys/kernel/debug/kprobes/enabled
2. Run execsnoop. At this time, one kprobe is disabled.
# /usr/share/bcc/tools/execsnoop & [1] 2460 PCOMM PID PPID RET ARGS
# cat /sys/kernel/debug/kprobes/list ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]
3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes kprobes_all_disarmed to false but does not arm the disabled kprobe.
# echo 1 > /sys/kernel/debug/kprobes/enabled
# cat /sys/kernel/debug/kprobes/list ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]
4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().
# fg /usr/share/bcc/tools/execsnoop ^C
Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses some cleanups and leaves the aggregated kprobe in the hash table. Then, __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an infinite loop like this.
aggregated kprobe.list -> kprobe.list -. ^ | '.__.'
In this situation, these commands fall into the infinite loop and result in RCU stall or soft lockup.
cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the infinite loop with RCU.
/usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex, and __get_valid_kprobe() is stuck in the loop.
To avoid the issue, make sure we don't call disarm_kprobe() for disabled kprobes.
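A minimal model of the corrected condition (stand-alone C, not the kernel function): disarming is attempted only when the global disarm flag is clear and the probe itself is not already disabled, since a disabled probe was never re-armed by arm_all_kprobes().

#include <stdio.h>
#include <stdbool.h>

/* Should __disable_kprobe() try to disarm this probe? */
static bool should_disarm(bool kprobes_all_disarmed, bool kprobe_disabled)
{
    return !kprobes_all_disarmed && !kprobe_disabled;
}

int main(void)
{
    /* The execsnoop scenario from the changelog: enabled was toggled 0->1
     * while the probe stayed DISABLED, so it must not be disarmed again. */
    printf("all_disarmed=0, disabled=1 -> disarm? %d\n", should_disarm(false, true));
    printf("all_disarmed=0, disabled=0 -> disarm? %d\n", should_disarm(false, false));
    return 0;
}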
[0] Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2) WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) Modules linked in: ena CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28 Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017 RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94 RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001 RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40 R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000 FS: 00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> __disable_kprobe (kernel/kprobes.c:1716) disable_kprobe (kernel/kprobes.c:2392) __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340) disable_trace_kprobe (kernel/trace/trace_kprobe.c:429) perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168) perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295) _free_event (kernel/events/core.c:4971) perf_event_release_kernel (kernel/events/core.c:5176) perf_release (kernel/events/core.c:5186) __fput (fs/file_table.c:321) task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1)) exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201) syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7fe7ff210654 Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654 RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008 RBP: 0000000000000000 R08: 94ae31d6fda838a4 R0900007fe8001c9d30 R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600 R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560 </TASK>
Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com
Reported-by: Ayushman Dutta ayudutta@amazon.com
Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com
Cc: Anil S Keshavamurthy anil.s.keshavamurthy@intel.com
Cc: "David S. Miller" davem@davemloft.net
Cc: Masami Hiramatsu mhiramat@kernel.org
Cc: Wang Nan wangnan0@huawei.com
Cc: Kuniyuki Iwashima kuniyu@amazon.com
Cc: Kuniyuki Iwashima kuni1840@gmail.com
Cc: Ayushman Dutta ayudutta@amazon.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/kprobes.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1705,11 +1705,12 @@ static struct kprobe *__disable_kprobe(s
 		/* Try to disarm and disable this/parent probe */
 		if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
 			/*
-			 * If kprobes_all_disarmed is set, orig_p
-			 * should have already been disarmed, so
-			 * skip unneed disarming process.
+			 * Don't be lazy here. Even if 'kprobes_all_disarmed'
+			 * is false, 'orig_p' might not have been armed yet.
+			 * Note arm_all_kprobes() __tries__ to arm all kprobes
+			 * on the best effort basis.
 			 */
-			if (!kprobes_all_disarmed) {
+			if (!kprobes_all_disarmed && !kprobe_disabled(orig_p)) {
 				ret = disarm_kprobe(orig_p, true);
 				if (ret) {
 					p->flags &= ~KPROBE_FLAG_DISABLED;
From: Omar Sandoval osandov@fb.com
commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream.
When testing space_cache v2 on a large set of machines, we encountered a few symptoms:
1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree.
All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions.
A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore.
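To make steps 2, 4 and 5 concrete, the caller-side pattern changes roughly as below (an illustrative fragment, not a hunk from this patch; error handling abbreviated):

	/* Before: kick off caching, then wait in a separate call. */
	btrfs_cache_block_group(cache, 1);		/* load_cache_only */
	ret = btrfs_wait_block_group_cache_done(cache);
	if (ret)
		goto out;

	/* After: a single call that always waits for caching to finish. */
	ret = btrfs_cache_block_group(cache, true);	/* wait == true */
	if (ret)
		goto out;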
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana fdmanana@suse.com
Signed-off-by: Omar Sandoval osandov@fb.com
Signed-off-by: David Sterba dsterba@suse.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/btrfs/block-group.c | 47 +++++++++++++++--------------------------------
 fs/btrfs/block-group.h |  4 +---
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/extent-tree.c | 30 ++++++------------------------
 4 files changed, 22 insertions(+), 60 deletions(-)
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -418,39 +418,26 @@ void btrfs_wait_block_group_cache_progre
 	btrfs_put_caching_control(caching_ctl);
 }
 
-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
+static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
+				       struct btrfs_caching_control *caching_ctl)
+{
+	wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
+	return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
+}
+
+static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
 {
 	struct btrfs_caching_control *caching_ctl;
-	int ret = 0;
+	int ret;
 
 	caching_ctl = btrfs_get_caching_control(cache);
 	if (!caching_ctl)
 		return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0;
-
-	wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
-	if (cache->cached == BTRFS_CACHE_ERROR)
-		ret = -EIO;
+	ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
 	btrfs_put_caching_control(caching_ctl);
 	return ret;
 }
 
-static bool space_cache_v1_done(struct btrfs_block_group *cache)
-{
-	bool ret;
-
-	spin_lock(&cache->lock);
-	ret = cache->cached != BTRFS_CACHE_FAST;
-	spin_unlock(&cache->lock);
-
-	return ret;
-}
-
-void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
-					struct btrfs_caching_control *caching_ctl)
-{
-	wait_event(caching_ctl->wait, space_cache_v1_done(cache));
-}
-
 #ifdef CONFIG_BTRFS_DEBUG
 static void fragment_free_space(struct btrfs_block_group *block_group)
 {
@@ -727,9 +714,8 @@ done:
 	btrfs_put_block_group(block_group);
 }
 
-int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only)
+int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
 {
-	DEFINE_WAIT(wait);
 	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	int ret = 0;
@@ -762,10 +748,7 @@ int btrfs_cache_block_group(struct btrfs
 	}
 	WARN_ON(cache->caching_ctl);
 	cache->caching_ctl = caching_ctl;
-	if (btrfs_test_opt(fs_info, SPACE_CACHE))
-		cache->cached = BTRFS_CACHE_FAST;
-	else
-		cache->cached = BTRFS_CACHE_STARTED;
+	cache->cached = BTRFS_CACHE_STARTED;
 	cache->has_caching_ctl = 1;
 	spin_unlock(&cache->lock);
 
@@ -778,8 +761,8 @@ int btrfs_cache_block_group(struct btrfs
 
 	btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
 out:
-	if (load_cache_only && caching_ctl)
-		btrfs_wait_space_cache_v1_finished(cache, caching_ctl);
+	if (wait && caching_ctl)
+		ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
 	if (caching_ctl)
 		btrfs_put_caching_control(caching_ctl);
 
@@ -3200,7 +3183,7 @@ int btrfs_update_block_group(struct btrf
 	 * space back to the block group, otherwise we will leak space.
 	 */
 	if (!alloc && !btrfs_block_group_done(cache))
-		btrfs_cache_block_group(cache, 1);
+		btrfs_cache_block_group(cache, true);
 
 	byte_in_group = bytenr - cache->start;
 	WARN_ON(byte_in_group > cache->length);
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -251,9 +251,7 @@ void btrfs_dec_nocow_writers(struct btrf
 void btrfs_wait_nocow_writers(struct btrfs_block_group *bg);
 void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache,
 					   u64 num_bytes);
-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache);
-int btrfs_cache_block_group(struct btrfs_block_group *cache,
-			    int load_cache_only);
+int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait);
 void btrfs_put_caching_control(struct btrfs_caching_control *ctl);
 struct btrfs_caching_control *btrfs_get_caching_control(
 		struct btrfs_block_group *cache);
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -454,7 +454,6 @@ struct btrfs_free_cluster {
 enum btrfs_caching_type {
 	BTRFS_CACHE_NO,
 	BTRFS_CACHE_STARTED,
-	BTRFS_CACHE_FAST,
 	BTRFS_CACHE_FINISHED,
 	BTRFS_CACHE_ERROR,
 };
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2572,17 +2572,10 @@ int btrfs_pin_extent_for_log_replay(stru
 		return -EINVAL;
 
 	/*
-	 * pull in the free space cache (if any) so that our pin
-	 * removes the free space from the cache. We have load_only set
-	 * to one because the slow code to read in the free extents does check
-	 * the pinned extents.
+	 * Fully cache the free space first so that our pin removes the free space
+	 * from the cache.
 	 */
-	btrfs_cache_block_group(cache, 1);
-	/*
-	 * Make sure we wait until the cache is completely built in case it is
-	 * missing or is invalid and therefore needs to be rebuilt.
-	 */
-	ret = btrfs_wait_block_group_cache_done(cache);
+	ret = btrfs_cache_block_group(cache, true);
 	if (ret)
 		goto out;
 
@@ -2605,12 +2598,7 @@ static int __exclude_logged_extent(struc
 	if (!block_group)
 		return -EINVAL;
 
-	btrfs_cache_block_group(block_group, 1);
-	/*
-	 * Make sure we wait until the cache is completely built in case it is
-	 * missing or is invalid and therefore needs to be rebuilt.
-	 */
-	ret = btrfs_wait_block_group_cache_done(block_group);
+	ret = btrfs_cache_block_group(block_group, true);
 	if (ret)
 		goto out;
 
@@ -4324,7 +4312,7 @@ have_block_group:
 		ffe_ctl.cached = btrfs_block_group_done(block_group);
 		if (unlikely(!ffe_ctl.cached)) {
 			ffe_ctl.have_caching_bg = true;
-			ret = btrfs_cache_block_group(block_group, 0);
+			ret = btrfs_cache_block_group(block_group, false);
 
 			/*
 			 * If we get ENOMEM here or something else we want to
@@ -6082,13 +6070,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *
 
 		if (end - start >= range->minlen) {
 			if (!btrfs_block_group_done(cache)) {
-				ret = btrfs_cache_block_group(cache, 0);
-				if (ret) {
-					bg_failed++;
-					bg_ret = ret;
-					continue;
-				}
-				ret = btrfs_wait_block_group_cache_done(cache);
+				ret = btrfs_cache_block_group(cache, true);
 				if (ret) {
 					bg_failed++;
 					bg_ret = ret;
From: Liam Howlett liam.howlett@oracle.com
commit b0cab80ecd54ae3b2356bb081af0bffd538c8265 upstream.
When munmapping a vma, the mmap_lock can be downgraded from a write lock to a read lock before close() is called on the file handle. The binder close() function calls binder_alloc_set_vma() to clear the vma address, which now has a lockdep check requiring the mmap_lock to be held for writing. Change the lockdep check to only require the read lock while clearing the vma address, and keep the write-lock check while setting it.
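A self-contained toy analogue of that rule (user-space stand-ins, not the kernel API): publishing the address requires the write lock, while clearing it only requires that the lock is held at all, which is exactly the state after the munmap path downgrades to a read lock:

#include <assert.h>

struct mm { int readers; int writers; };

static void write_lock(struct mm *mm)      { mm->writers = 1; }
static void write_downgrade(struct mm *mm) { mm->writers = 0; mm->readers = 1; }
static void read_unlock(struct mm *mm)     { mm->readers = 0; }

static unsigned long vma_addr;

/* Mirrors the asymmetric assertion: setting needs the write lock,
 * clearing is fine with either the read or the write lock held. */
static void set_vma(struct mm *mm, unsigned long start)
{
	if (start)
		assert(mm->writers);			/* like mmap_assert_write_locked() */
	else
		assert(mm->writers || mm->readers);	/* like mmap_assert_locked() */
	vma_addr = start;
}

int main(void)
{
	struct mm mm = { 0, 0 };

	write_lock(&mm);
	set_vma(&mm, 0x1000);	/* mmap(): held for write */
	write_downgrade(&mm);	/* munmap() drops to the read lock ... */
	set_vma(&mm, 0);	/* ... then ->close() clears; must not trip */
	read_unlock(&mm);
	return 0;
}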
Link: https://lkml.kernel.org/r/20220627151857.2316964-1-Liam.Howlett@oracle.com
Fixes: 472a68df605b ("android: binder: stop saving a pointer to the VMA")
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Reported-by: syzbot+da54fa8d793ca89c741f@syzkaller.appspotmail.com
Acked-by: Todd Kjos tkjos@google.com
Cc: "Arve Hjønnevåg" arve@android.com
Cc: Christian Brauner (Microsoft) brauner@kernel.org
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Hridya Valsaraju hridya@google.com
Cc: Joel Fernandes joel@joelfernandes.org
Cc: Martijn Coenen maco@android.com
Cc: Suren Baghdasaryan surenb@google.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/android/binder_alloc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -315,12 +315,19 @@ static inline void binder_alloc_set_vma(
 {
 	unsigned long vm_start = 0;
 
+	/*
+	 * Allow clearing the vma with holding just the read lock to allow
+	 * munmapping downgrade of the write lock before freeing and closing the
+	 * file using binder_alloc_vma_close().
+	 */
 	if (vma) {
 		vm_start = vma->vm_start;
 		alloc->vma_vm_mm = vma->vm_mm;
+		mmap_assert_write_locked(alloc->vma_vm_mm);
+	} else {
+		mmap_assert_locked(alloc->vma_vm_mm);
 	}
 
-	mmap_assert_write_locked(alloc->vma_vm_mm);
 	alloc->vma_addr = vm_start;
 }
From: Zhengchao Shao shaozhengchao@huawei.com
commit dc633700f00f726e027846a318c5ffeb8deaaeda upstream.
A user can use an AF_PACKET socket to send packets with a length of 0. When min_header_len equals 0, packet_snd() calls __dev_queue_xmit() to send such packets, and sock->type can be any type. Reject these zero-length packets in packet_snd() with -EINVAL.
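A hypothetical user-space reproducer sketch (the interface name and socket type are placeholders and CAP_NET_RAW is required); on a device that adds no link-layer header (min_header_len == 0), this kind of zero-length send is what the new check rejects:

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int main(void)
{
	struct sockaddr_ll addr = { 0 };
	int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_ALL));

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	addr.sll_family = AF_PACKET;
	addr.sll_protocol = htons(ETH_P_ALL);
	addr.sll_ifindex = if_nametoindex("lo");	/* placeholder device */

	/* Zero-length payload: on a header-less device the fixed packet_snd()
	 * is expected to fail this with EINVAL instead of queueing the skb. */
	if (sendto(fd, "", 0, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		perror("sendto");
	return 0;
}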
Reported-by: syzbot+5ea725c25d06fb9114c4@syzkaller.appspotmail.com
Fixes: fd1894224407 ("bpf: Don't redirect packets with invalid pkt_len")
Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com
Signed-off-by: David S. Miller davem@davemloft.net
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/packet/af_packet.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2989,8 +2989,8 @@ static int packet_snd(struct socket *soc
 	if (err)
 		goto out_free;
 
-	if (sock->type == SOCK_RAW &&
-	    !dev_validate_header(dev, skb->data, len)) {
+	if ((sock->type == SOCK_RAW &&
+	     !dev_validate_header(dev, skb->data, len)) || !skb->len) {
 		err = -EINVAL;
 		goto out_free;
 	}
From: Yang Yingliang yangyingliang@huawei.com
commit d5485d9dd24e1d04e5509916515260186eb1455c upstream.
It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts disabled. So move all skbs to a temporary list, then free them at once after spin_unlock_irqrestore().
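The deferral pattern in isolation (a kernel-style sketch using the same helpers as the hunk below, without the per-netns filtering of the real function): unlink under the IRQ-disabled lock, free only after it is released.

#include <linux/skbuff.h>
#include <linux/spinlock.h>

static void purge_queue_deferred(struct sk_buff_head *list)
{
	struct sk_buff_head tmp;
	unsigned long flags;
	struct sk_buff *skb;

	skb_queue_head_init(&tmp);

	spin_lock_irqsave(&list->lock, flags);
	while ((skb = __skb_dequeue(list)))
		__skb_queue_tail(&tmp, skb);	/* unlink only, no freeing with IRQs off */
	spin_unlock_irqrestore(&list->lock, flags);

	while ((skb = __skb_dequeue(&tmp)))
		kfree_skb(skb);			/* safe: lock dropped, IRQs enabled */
}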
Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop")
Suggested-by: Denis V. Lunev den@openvz.org
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Reviewed-by: Nikolay Aleksandrov razor@blackwall.org
Signed-off-by: David S. Miller davem@davemloft.net
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/core/neighbour.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -281,21 +281,27 @@ static int neigh_del_timer(struct neighb
 
 static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	struct sk_buff_head tmp;
 	unsigned long flags;
 	struct sk_buff *skb;
 
+	skb_queue_head_init(&tmp);
 	spin_lock_irqsave(&list->lock, flags);
 	skb = skb_peek(list);
 	while (skb != NULL) {
 		struct sk_buff *skb_next = skb_peek_next(skb, list);
 		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
 			__skb_unlink(skb, list);
-			dev_put(skb->dev);
-			kfree_skb(skb);
+			__skb_queue_tail(&tmp, skb);
 		}
 		skb = skb_next;
 	}
 	spin_unlock_irqrestore(&list->lock, flags);
+
+	while ((skb = __skb_dequeue(&tmp))) {
+		dev_put(skb->dev);
+		kfree_skb(skb);
+	}
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
On Fri, 02 Sep 2022 14:18:24 +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
All tests passing for Tegra ...
Test results for stable-v5.15:
    10 builds:	10 pass, 0 fail
    28 boots:	28 pass, 0 fail
    114 tests:	114 pass, 0 fail

Linux version:	5.15.65-rc1-gad2e22e028e7
Boards tested:	tegra124-jetson-tk1, tegra186-p2771-0000,
		tegra194-p2972-0000, tegra194-p3509-0000+p3668-0000,
		tegra20-ventana, tegra210-p2371-2180,
		tegra210-p3450-0000, tegra30-cardhu-a04
Tested-by: Jon Hunter jonathanh@nvidia.com
Jon
On 9/2/2022 5:18 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
On ARCH_BRCMSTB using 32-bit and 64-bit ARM kernels, build tested on BMIPS_GENERIC:
Tested-by: Florian Fainelli f.fainelli@gmail.com
On 9/2/22 06:18, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
Tested-by: Shuah Khan skhan@linuxfoundation.org
thanks, -- Shuah
On Fri, Sep 02, 2022 at 02:18:24PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
Build results:
	total: 159 pass: 159 fail: 0
Qemu test results:
	total: 485 pass: 485 fail: 0
Tested-by: Guenter Roeck linux@roeck-us.net
Guenter
On Fri, 2 Sept 2022 at 18:02, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro's test farm. No regressions on arm64, arm, x86_64, and i386.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
## Build
* kernel: 5.15.65-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-5.15.y
* git commit: ad2e22e028e72136a55cb6c301a29aa0178fe7d6
* git describe: v5.15.63-211-gad2e22e028e7
* test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.15.y/build/v5.15....
## No test Regressions (compared to v5.15.63)
## No metric Regressions (compared to v5.15.63)
## No test Fixes (compared to v5.15.63)
## No metric Fixes (compared to v5.15.63)
## Test result summary
total: 107640, pass: 94949, fail: 685, skip: 11690, xfail: 316

## Build Summary
* arc: 10 total, 10 passed, 0 failed
* arm: 306 total, 303 passed, 3 failed
* arm64: 68 total, 65 passed, 3 failed
* i386: 57 total, 51 passed, 6 failed
* mips: 50 total, 47 passed, 3 failed
* parisc: 14 total, 14 passed, 0 failed
* powerpc: 59 total, 56 passed, 3 failed
* riscv: 27 total, 26 passed, 1 failed
* s390: 26 total, 23 passed, 3 failed
* sh: 26 total, 24 passed, 2 failed
* sparc: 14 total, 14 passed, 0 failed
* x86_64: 61 total, 58 passed, 3 failed

## Test suites summary
* fwts
* igt-gpu-tools
* kunit
* kvm-unit-tests
* libgpiod
* libhugetlbfs
* log-parser-boot
* log-parser-test
* ltp-cap_bounds
* ltp-commands
* ltp-containers
* ltp-controllers
* ltp-cpuhotplug
* ltp-crypto
* ltp-cve
* ltp-dio
* ltp-fcntl-locktests
* ltp-filecaps
* ltp-fs
* ltp-fs_bind
* ltp-fs_perms_simple
* ltp-fsx
* ltp-hugetlb
* ltp-io
* ltp-ipc
* ltp-math
* ltp-mm
* ltp-nptl
* ltp-open-posix-tests
* ltp-pty
* ltp-sched
* ltp-securebits
* ltp-syscalls
* ltp-tracing
* network-basic-tests
* packetdrill
* rcutorture
* v4l2-compliance
* vdso
-- Linaro LKFT https://lkft.linaro.org
On Fri, Sep 02, 2022 at 02:18:24PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Successfully cross-compiled for arm64 (bcm2711_defconfig, GCC 10.2.0) and powerpc (ps3_defconfig, GCC 12.1.0).
Tested-by: Bagas Sanjaya bagasdotme@gmail.com
On 9/2/22 5:18 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.65-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
Built and booted successfully on RISC-V RV64 (HiFive Unmatched).
Tested-by: Ron Economos re@w6rz.net
Hi Greg,
On Fri, Sep 02, 2022 at 02:18:24PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.65 release. There are 73 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 04 Sep 2022 12:13:47 +0000. Anything received after that time might be too late.
Build test (gcc version 11.3.1 20220819):
mips: 62 configs -> no failure
arm: 99 configs -> no failure
arm64: 3 configs -> no failure
x86_64: 4 configs -> no failure
alpha allmodconfig -> no failure
csky allmodconfig -> no failure
powerpc allmodconfig -> no failure
riscv allmodconfig -> no failure
s390 allmodconfig -> no failure
xtensa allmodconfig -> no failure
Note: x86_64 allmodconfig failed with gcc-12 with the error:
drivers/net/wwan/iosm/iosm_ipc_protocol_ops.c: In function 'ipc_protocol_dl_td_process':
drivers/net/wwan/iosm/iosm_ipc_protocol_ops.c:406:13: error: the comparison will always evaluate as 'true' for the address of 'cb' will never be NULL [-Werror=address]
  406 |         if (!IPC_CB(skb)) {
      |             ^
In file included from drivers/net/wwan/iosm/iosm_ipc_imem.h:9,
                 from drivers/net/wwan/iosm/iosm_ipc_protocol.h:9,
                 from drivers/net/wwan/iosm/iosm_ipc_protocol_ops.c:6:
./include/linux/skbuff.h:794:33: note: 'cb' declared here
  794 |         char                    cb[48] __aligned(8);
It will need dbbc7d04c549 ("net: wwan: iosm: remove pointless null check").
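For reference, a minimal standalone example (hypothetical struct, not the iosm code) of the construct gcc 12 rejects here: the address of an embedded array member can never be NULL, so the comparison is always true and the check is pointless.

struct sk_buff_like { char cb[48]; };

/* gcc 12 with -Waddress/-Werror=address flags this comparison as always true. */
static int has_cb(struct sk_buff_like *skb)
{
	return skb->cb != (char *)0;
}

int main(void)
{
	struct sk_buff_like skb;
	return has_cb(&skb) ? 0 : 1;
}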
Boot test:
x86_64: Booted on my test laptop. No regression.
x86_64: Booted on qemu. No regression. [1]
arm64: Booted on rpi4b (4GB model). No regression. [2]
mips: Booted on ci20 board. No regression. [3]

[1]. https://openqa.qa.codethink.co.uk/tests/1757
[2]. https://openqa.qa.codethink.co.uk/tests/1760
[3]. https://openqa.qa.codethink.co.uk/tests/1762
Tested-by: Sudip Mukherjee sudip.mukherjee@codethink.co.uk
-- Regards Sudip