This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series; all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at:
	https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc...
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.15.126-rc1
Johan Hovold johan+linaro@kernel.org PM: sleep: wakeirq: fix wake irq arming
Chunfeng Yun chunfeng.yun@mediatek.com PM / wakeirq: support enabling wake-up irq after runtime_suspend called
Johan Hovold johan+linaro@kernel.org soundwire: fix enumeration completion
Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com soundwire: bus: pm_runtime_request_resume on peripheral attachment
Sean Christopherson seanjc@google.com selftests/rseq: Play nice with binaries statically linked against glibc 2.35+
Michael Jeanson mjeanson@efficios.com selftests/rseq: check if libc rseq support is registered
Alexander Stein alexander.stein@ew.tq-group.com drm/imx/ipuv3: Fix front porch adjustment upon hactive aligning
Thomas Zimmermann tzimmermann@suse.de drm/fsl-dcu: Use drm_plane_helper_destroy()
Aneesh Kumar K.V aneesh.kumar@linux.ibm.com powerpc/mm/altmap: Fix altmap boundary check
Christophe JAILLET christophe.jaillet@wanadoo.fr mtd: rawnand: fsl_upm: Fix an off-by one test in fun_exec_op()
Johan Jonker jbx6244@gmail.com mtd: rawnand: rockchip: Align hwecc vs. raw page helper layouts
Johan Jonker jbx6244@gmail.com mtd: rawnand: rockchip: fix oobfree offset and description
Roger Quadros rogerq@kernel.org mtd: rawnand: omap_elm: Fix incorrect type in assignment
Jan Kara jack@suse.cz ext2: Drop fragment support
Jan Kara jack@suse.cz fs: Protect reconfiguration of sb read-write from racing writes
Alan Stern stern@rowland.harvard.edu net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb
Sungwoo Kim iam@sung-woo.kim Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
Prince Kumar Maurya princekumarmaurya06@gmail.com fs/sysv: Null check to prevent null-ptr-deref bug
Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp fs/ntfs3: Use __GFP_NOWARN allocation at ntfs_load_attr_list()
Linus Torvalds torvalds@linux-foundation.org file: reinstate f_pos locking optimization for regular files
Hou Tao houtao1@huawei.com bpf, cpumap: Make sure kthread is running before map update returns
Guchun Chen guchun.chen@amd.com drm/ttm: check null pointer before accessing when swapping
Aleksa Sarai cyphar@cyphar.com open: make RESOLVE_CACHED correctly test for O_TMPFILE
Jiri Olsa jolsa@kernel.org bpf: Disable preemption in bpf_event_output
Ilya Dryomov idryomov@gmail.com rbd: prevent busy loop when requesting exclusive lock
Paul Fertser fercerpav@gmail.com wifi: mt76: mt7615: do not advertise 5 GHz on first phy of MT7615D (DBDC)
Laszlo Ersek lersek@redhat.com net: tap_open(): set sk_uid from current_fsuid()
Laszlo Ersek lersek@redhat.com net: tun_chr_open(): set sk_uid from current_fsuid()
Dinh Nguyen dinguyen@kernel.org arm64: dts: stratix10: fix incorrect I2C property for SCL signal
Arseniy Krasnov AVKrasnov@sberdevices.ru mtd: rawnand: meson: fix OOB available bytes for ECC
Olivier Maignial olivier.maignial@hotmail.fr mtd: spinand: toshiba: Fix ecc_get_status
Sungjong Seo sj1557.seo@samsung.com exfat: release s_lock before calling dir_emit()
gaoming gaoming20@hihonor.com exfat: use kvmalloc_array/kvfree instead of kmalloc_array/kfree
Krzysztof Kozlowski krzysztof.kozlowski@linaro.org firmware: arm_scmi: Drop OF node reference in the transport channel setup
Xiubo Li xiubli@redhat.com ceph: defer stopping mdsc delayed_work
Ross Maynard bids.7405@bigpond.com USB: zaurus: Add ID for A-300/B-500/C-700
Ilya Dryomov idryomov@gmail.com libceph: fix potential hang in ceph_osdc_notify()
Michael Kelley mikelley@microsoft.com scsi: storvsc: Limit max_sectors for virtual Fibre Channel devices
Steffen Maier maier@linux.ibm.com scsi: zfcp: Defer fc_rport blocking until after ADISC response
Eric Dumazet edumazet@google.com tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
Eric Dumazet edumazet@google.com tcp_metrics: annotate data-races around tm->tcpm_net
Eric Dumazet edumazet@google.com tcp_metrics: annotate data-races around tm->tcpm_vals[]
Eric Dumazet edumazet@google.com tcp_metrics: annotate data-races around tm->tcpm_lock
Eric Dumazet edumazet@google.com tcp_metrics: annotate data-races around tm->tcpm_stamp
Eric Dumazet edumazet@google.com tcp_metrics: fix addr_same() helper
Jonas Gorski jonas.gorski@bisdn.de prestera: fix fallback to previous version on same major version
Jianbo Liu jianbol@nvidia.com net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio
Jianbo Liu jianbol@nvidia.com net/mlx5: fs_core: Make find_closest_ft more generic
Benjamin Poirier bpoirier@nvidia.com vxlan: Fix nexthop hash size
Yue Haibing yuehaibing@huawei.com ip6mr: Fix skb_under_panic in ip6mr_cache_report()
Alexandra Winter wintera@linux.ibm.com s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
Lin Ma linma@zju.edu.cn net: dcb: choose correct policy to parse DCB_ATTR_BCN
Mark Brown broonie@kernel.org net: netsec: Ignore 'phy-mode' on SynQuacer in DT mode
Yuanjun Gong ruc_gongyuanjun@163.com net: korina: handle clk prepare error in korina_probe()
Dan Carpenter dan.carpenter@linaro.org net: ll_temac: fix error checking of irq_of_parse_and_map()
Yang Yingliang yangyingliang@huawei.com net: ll_temac: Switch to use dev_err_probe() helper
Tomas Glozar tglozar@redhat.com bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
valis sec@valis.email net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
valis sec@valis.email net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
valis sec@valis.email net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
Hou Tao houtao1@huawei.com bpf, cpumap: Handle skb as well when clean up ptr_ring
Kuniyuki Iwashima kuniyu@amazon.com net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
Eric Dumazet edumazet@google.com net: add missing data-race annotation for sk_ll_usec
Eric Dumazet edumazet@google.com net: add missing data-race annotations around sk->sk_peek_off
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_rcvbuf) annotation
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_sndbuf) annotation
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_rcvlowat) annotation
Eric Dumazet edumazet@google.com net: annotate data-races around sk->sk_max_pacing_rate
Konstantin Khorenko khorenko@virtuozzo.com qed: Fix scheduling in a tasklet while getting stats
Prabhakar Kushwaha pkushwaha@marvell.com qed: Fix kernel-doc warnings
Chengfeng Ye dg573847474@gmail.com mISDN: hfcpci: Fix potential deadlock on &hc->lock
Jamal Hadi Salim jhs@mojatatu.com net: sched: cls_u32: Fix match key mis-addressing
Georg Müller georgmueller@gmx.net perf test uprobe_from_different_cu: Skip if there is no gcc
Yuanjun Gong ruc_gongyuanjun@163.com net: dsa: fix value check in bcm_sf2_sw_probe()
Lin Ma linma@zju.edu.cn rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length
Lin Ma linma@zju.edu.cn bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
Yuanjun Gong ruc_gongyuanjun@163.com net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
Zhengchao Shao shaozhengchao@huawei.com net/mlx5: DR, fix memory leak in mlx5dr_cmd_create_reformat_ctx
Ilan Peer ilan.peer@intel.com wifi: cfg80211: Fix return value in scan logic
Heiko Carstens hca@linux.ibm.com KVM: s390: fix sthyi error handling
ndesaulniers@google.com ndesaulniers@google.com word-at-a-time: use the same return type for has_zero regardless of endianness
Cristian Marussi cristian.marussi@arm.com firmware: arm_scmi: Fix chan_free cleanup on SMC
Hugo Villeneuve hvilleneuve@dimonoff.com arm64: dts: imx8mn-var-som: add missing pull-up for onboard PHY reset pinmux
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Document nesting-related errata
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Add explicit feature for nesting
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Document MMU-700 erratum 2812531
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982
Suzuki K Poulose suzuki.poulose@arm.com arm64: errata: Add detection for TRBE write to out-of-range
Suzuki K Poulose suzuki.poulose@arm.com arm64: errata: Add workaround for TSB flush failures
Shay Drory shayd@nvidia.com net/mlx5: Free irqs only on shutdown callback
Peter Zijlstra peterz@infradead.org perf: Fix function pointer case
Jens Axboe axboe@kernel.dk io_uring: gate iowait schedule on having pending requests
-------------
Diffstat:
Documentation/arm64/silicon-errata.rst | 12 +
Makefile | 4 +-
arch/arm64/Kconfig | 74 ++
.../boot/dts/altera/socfpga_stratix10_socdk.dts | 2 +-
.../dts/altera/socfpga_stratix10_socdk_nand.dts | 2 +-
arch/arm64/boot/dts/freescale/imx8mn-var-som.dtsi | 2 +-
arch/arm64/include/asm/barrier.h | 16 +-
arch/arm64/kernel/cpu_errata.c | 39 +
arch/arm64/tools/cpucaps | 2 +
arch/powerpc/include/asm/word-at-a-time.h | 2 +-
arch/powerpc/mm/init_64.c | 3 +-
arch/s390/kernel/sthyi.c | 6 +-
arch/s390/kvm/intercept.c | 9 +-
drivers/base/power/power.h | 8 +-
drivers/base/power/runtime.c | 6 +-
drivers/base/power/wakeirq.c | 111 ++-
drivers/block/rbd.c | 28 +-
drivers/firmware/arm_scmi/mailbox.c | 4 +-
drivers/firmware/arm_scmi/smc.c | 21 +-
drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c | 8 +-
drivers/gpu/drm/imx/ipuv3-crtc.c | 2 +-
drivers/gpu/drm/ttm/ttm_bo.c | 3 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 +
drivers/isdn/hardware/mISDN/hfcpci.c | 10 +-
drivers/mtd/nand/raw/fsl_upm.c | 2 +-
drivers/mtd/nand/raw/meson_nand.c | 3 +-
drivers/mtd/nand/raw/omap_elm.c | 24 +-
drivers/mtd/nand/raw/rockchip-nand-controller.c | 45 +-
drivers/mtd/nand/spi/toshiba.c | 4 +-
drivers/net/dsa/bcm_sf2.c | 8 +-
drivers/net/ethernet/korina.c | 3 +-
.../net/ethernet/marvell/prestera/prestera_pci.c | 3 +-
.../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 4 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 105 ++-
drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 1 +
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 29 +
.../ethernet/mellanox/mlx5/core/steering/dr_cmd.c | 5 +-
drivers/net/ethernet/qlogic/qed/qed.h | 9 +-
drivers/net/ethernet/qlogic/qed/qed_cxt.h | 138 +--
drivers/net/ethernet/qlogic/qed/qed_dev_api.h | 361 ++++----
drivers/net/ethernet/qlogic/qed/qed_fcoe.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_fcoe.h | 17 +-
drivers/net/ethernet/qlogic/qed/qed_hsi.h | 922 +++++++++++---------
drivers/net/ethernet/qlogic/qed/qed_hw.c | 26 +-
drivers/net/ethernet/qlogic/qed/qed_hw.h | 214 ++---
drivers/net/ethernet/qlogic/qed/qed_init_ops.h | 58 +-
drivers/net/ethernet/qlogic/qed/qed_int.h | 274 +++---
drivers/net/ethernet/qlogic/qed/qed_iscsi.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_iscsi.h | 17 +-
drivers/net/ethernet/qlogic/qed/qed_l2.c | 19 +-
drivers/net/ethernet/qlogic/qed/qed_l2.h | 158 ++--
drivers/net/ethernet/qlogic/qed/qed_ll2.h | 130 +--
drivers/net/ethernet/qlogic/qed/qed_main.c | 6 +-
drivers/net/ethernet/qlogic/qed/qed_mcp.h | 757 +++++++++--------
drivers/net/ethernet/qlogic/qed/qed_selftest.h | 30 +-
drivers/net/ethernet/qlogic/qed/qed_sp.h | 215 +++--
drivers/net/ethernet/qlogic/qed/qed_sriov.h | 99 ++-
drivers/net/ethernet/qlogic/qed/qed_vf.h | 301 ++++---
drivers/net/ethernet/qlogic/qede/qede_main.c | 5 +-
drivers/net/ethernet/socionext/netsec.c | 11 +
drivers/net/ethernet/xilinx/ll_temac_main.c | 16 +-
drivers/net/tap.c | 2 +-
drivers/net/tun.c | 2 +-
drivers/net/usb/cdc_ether.c | 21 +
drivers/net/usb/usbnet.c | 6 +
drivers/net/usb/zaurus.c | 21 +
drivers/net/wireless/mediatek/mt76/mt7615/eeprom.c | 6 +-
drivers/s390/net/qeth_core.h | 1 -
drivers/s390/net/qeth_core_main.c | 2 -
drivers/s390/net/qeth_l2_main.c | 9 +-
drivers/s390/net/qeth_l3_main.c | 8 +-
drivers/s390/scsi/zfcp_fc.c | 6 +-
drivers/scsi/storvsc_drv.c | 4 +
drivers/soundwire/bus.c | 20 +-
fs/ceph/mds_client.c | 4 +-
fs/ceph/mds_client.h | 5 +
fs/ceph/super.c | 10 +
fs/exfat/balloc.c | 6 +-
fs/exfat/dir.c | 27 +-
fs/ext2/ext2.h | 12 -
fs/ext2/super.c | 23 +-
fs/file.c | 18 +-
fs/ntfs3/attrlist.c | 4 +-
fs/open.c | 2 +-
fs/super.c | 11 +-
fs/sysv/itree.c | 4 +
include/asm-generic/word-at-a-time.h | 2 +-
include/linux/pm_wakeirq.h | 9 +-
include/linux/qed/qed_chain.h | 97 ++-
include/linux/qed/qed_if.h | 255 +++---
include/linux/qed/qed_iscsi_if.h | 2 +-
include/linux/qed/qed_ll2_if.h | 42 +-
include/linux/qed/qed_nvmetcp_if.h | 17 +
include/net/vxlan.h | 4 +-
io_uring/io_uring.c | 23 +-
kernel/bpf/cpumap.c | 35 +-
kernel/events/core.c | 8 +-
kernel/trace/bpf_trace.c | 6 +-
net/bluetooth/l2cap_sock.c | 2 +
net/ceph/osd_client.c | 20 +-
net/core/bpf_sk_storage.c | 5 +-
net/core/rtnetlink.c | 8 +-
net/core/sock.c | 21 +-
net/core/sock_map.c | 2 -
net/dcb/dcbnl.c | 2 +-
net/ipv4/tcp_metrics.c | 70 +-
net/ipv6/ip6mr.c | 2 +-
net/sched/cls_fw.c | 1 -
net/sched/cls_route.c | 1 -
net/sched/cls_u32.c | 57 +-
net/sched/sch_taprio.c | 15 +-
net/unix/af_unix.c | 2 +-
net/wireless/scan.c | 2 +-
.../tests/shell/test_uprobe_from_different_cu.sh | 8 +-
tools/testing/selftests/rseq/rseq.c | 31 +-
117 files changed, 3227 insertions(+), 2247 deletions(-)
On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series; all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
> The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
Not necessarily new with 5.15 stable, but 3 of the 19 rcutorture scenarios hang with this -rc: TREE04, TREE07, TASKS03.
5.15 has a known stop machine issue where it hangs after 1.5 hours of CPU hotplug rcutorture testing. tglx and I are continuing to debug this. The issue does not show up on any kernels other than 5.15 stable, including mainline.
I will do some more runs to see if the TASKS03 hang is new, but it could be related to the existing issues.
thanks,
- Joel
On 8/9/23 06:53, Joel Fernandes wrote:
> On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
>> This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series; all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
>> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
>> The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
> Not necessarily new with 5.15 stable, but 3 of the 19 rcutorture scenarios hang with this -rc: TREE04, TREE07, TASKS03.
> 5.15 has a known stop machine issue where it hangs after 1.5 hours of CPU hotplug rcutorture testing. tglx and I are continuing to debug this. The issue does not show up on any kernels other than 5.15 stable, including mainline.
Do you by any chance have a crash pattern that we could use to find the crash in ChromeOS crash logs? No idea if that would help, but it could provide some additional data points.
Thanks,
Guenter
> I will do some more runs to see if the TASKS03 hang is new, but it could be related to the existing issues.
> thanks,
> - Joel
Benjamin Poirier bpoirier@nvidia.com vxlan: Fix nexthop hash size
Yue Haibing yuehaibing@huawei.com ip6mr: Fix skb_under_panic in ip6mr_cache_report()
Alexandra Winter wintera@linux.ibm.com s390/qeth: Don't call dev_close/dev_open (DOWN/UP)
Lin Ma linma@zju.edu.cn net: dcb: choose correct policy to parse DCB_ATTR_BCN
Mark Brown broonie@kernel.org net: netsec: Ignore 'phy-mode' on SynQuacer in DT mode
Yuanjun Gong ruc_gongyuanjun@163.com net: korina: handle clk prepare error in korina_probe()
Dan Carpenter dan.carpenter@linaro.org net: ll_temac: fix error checking of irq_of_parse_and_map()
Yang Yingliang yangyingliang@huawei.com net: ll_temac: Switch to use dev_err_probe() helper
Tomas Glozar tglozar@redhat.com bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire
valis sec@valis.email net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
valis sec@valis.email net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
valis sec@valis.email net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
Hou Tao houtao1@huawei.com bpf, cpumap: Handle skb as well when clean up ptr_ring
Kuniyuki Iwashima kuniyu@amazon.com net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
Eric Dumazet edumazet@google.com net: add missing data-race annotation for sk_ll_usec
Eric Dumazet edumazet@google.com net: add missing data-race annotations around sk->sk_peek_off
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_rcvbuf) annotation
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_sndbuf) annotation
Eric Dumazet edumazet@google.com net: add missing READ_ONCE(sk->sk_rcvlowat) annotation
Eric Dumazet edumazet@google.com net: annotate data-races around sk->sk_max_pacing_rate
Konstantin Khorenko khorenko@virtuozzo.com qed: Fix scheduling in a tasklet while getting stats
Prabhakar Kushwaha pkushwaha@marvell.com qed: Fix kernel-doc warnings
Chengfeng Ye dg573847474@gmail.com mISDN: hfcpci: Fix potential deadlock on &hc->lock
Jamal Hadi Salim jhs@mojatatu.com net: sched: cls_u32: Fix match key mis-addressing
Georg Müller georgmueller@gmx.net perf test uprobe_from_different_cu: Skip if there is no gcc
Yuanjun Gong ruc_gongyuanjun@163.com net: dsa: fix value check in bcm_sf2_sw_probe()
Lin Ma linma@zju.edu.cn rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length
Lin Ma linma@zju.edu.cn bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
Yuanjun Gong ruc_gongyuanjun@163.com net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
Zhengchao Shao shaozhengchao@huawei.com net/mlx5: DR, fix memory leak in mlx5dr_cmd_create_reformat_ctx
Ilan Peer ilan.peer@intel.com wifi: cfg80211: Fix return value in scan logic
Heiko Carstens hca@linux.ibm.com KVM: s390: fix sthyi error handling
ndesaulniers@google.com ndesaulniers@google.com word-at-a-time: use the same return type for has_zero regardless of endianness
Cristian Marussi cristian.marussi@arm.com firmware: arm_scmi: Fix chan_free cleanup on SMC
Hugo Villeneuve hvilleneuve@dimonoff.com arm64: dts: imx8mn-var-som: add missing pull-up for onboard PHY reset pinmux
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Document nesting-related errata
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Add explicit feature for nesting
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Document MMU-700 erratum 2812531
Robin Murphy robin.murphy@arm.com iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982
Suzuki K Poulose suzuki.poulose@arm.com arm64: errata: Add detection for TRBE write to out-of-range
Suzuki K Poulose suzuki.poulose@arm.com arm64: errata: Add workaround for TSB flush failures
Shay Drory shayd@nvidia.com net/mlx5: Free irqs only on shutdown callback
Peter Zijlstra peterz@infradead.org perf: Fix function pointer case
Jens Axboe axboe@kernel.dk io_uring: gate iowait schedule on having pending requests
Diffstat:
 Documentation/arm64/silicon-errata.rst | 12 +
 Makefile | 4 +-
 arch/arm64/Kconfig | 74 ++
 .../boot/dts/altera/socfpga_stratix10_socdk.dts | 2 +-
 .../dts/altera/socfpga_stratix10_socdk_nand.dts | 2 +-
 arch/arm64/boot/dts/freescale/imx8mn-var-som.dtsi | 2 +-
 arch/arm64/include/asm/barrier.h | 16 +-
 arch/arm64/kernel/cpu_errata.c | 39 +
 arch/arm64/tools/cpucaps | 2 +
 arch/powerpc/include/asm/word-at-a-time.h | 2 +-
 arch/powerpc/mm/init_64.c | 3 +-
 arch/s390/kernel/sthyi.c | 6 +-
 arch/s390/kvm/intercept.c | 9 +-
 drivers/base/power/power.h | 8 +-
 drivers/base/power/runtime.c | 6 +-
 drivers/base/power/wakeirq.c | 111 ++-
 drivers/block/rbd.c | 28 +-
 drivers/firmware/arm_scmi/mailbox.c | 4 +-
 drivers/firmware/arm_scmi/smc.c | 21 +-
 drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c | 8 +-
 drivers/gpu/drm/imx/ipuv3-crtc.c | 2 +-
 drivers/gpu/drm/ttm/ttm_bo.c | 3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 +
 drivers/isdn/hardware/mISDN/hfcpci.c | 10 +-
 drivers/mtd/nand/raw/fsl_upm.c | 2 +-
 drivers/mtd/nand/raw/meson_nand.c | 3 +-
 drivers/mtd/nand/raw/omap_elm.c | 24 +-
 drivers/mtd/nand/raw/rockchip-nand-controller.c | 45 +-
 drivers/mtd/nand/spi/toshiba.c | 4 +-
 drivers/net/dsa/bcm_sf2.c | 8 +-
 drivers/net/ethernet/korina.c | 3 +-
 .../net/ethernet/marvell/prestera/prestera_pci.c | 3 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 105 ++-
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 29 +
 .../ethernet/mellanox/mlx5/core/steering/dr_cmd.c | 5 +-
 drivers/net/ethernet/qlogic/qed/qed.h | 9 +-
 drivers/net/ethernet/qlogic/qed/qed_cxt.h | 138 +--
 drivers/net/ethernet/qlogic/qed/qed_dev_api.h | 361 ++++----
 drivers/net/ethernet/qlogic/qed/qed_fcoe.c | 19 +-
 drivers/net/ethernet/qlogic/qed/qed_fcoe.h | 17 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h | 922 +++++++++++---------
 drivers/net/ethernet/qlogic/qed/qed_hw.c | 26 +-
 drivers/net/ethernet/qlogic/qed/qed_hw.h | 214 ++---
 drivers/net/ethernet/qlogic/qed/qed_init_ops.h | 58 +-
 drivers/net/ethernet/qlogic/qed/qed_int.h | 274 +++---
 drivers/net/ethernet/qlogic/qed/qed_iscsi.c | 19 +-
 drivers/net/ethernet/qlogic/qed/qed_iscsi.h | 17 +-
 drivers/net/ethernet/qlogic/qed/qed_l2.c | 19 +-
 drivers/net/ethernet/qlogic/qed/qed_l2.h | 158 ++--
 drivers/net/ethernet/qlogic/qed/qed_ll2.h | 130 +--
 drivers/net/ethernet/qlogic/qed/qed_main.c | 6 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h | 757 +++++++++--------
 drivers/net/ethernet/qlogic/qed/qed_selftest.h | 30 +-
 drivers/net/ethernet/qlogic/qed/qed_sp.h | 215 +++--
 drivers/net/ethernet/qlogic/qed/qed_sriov.h | 99 ++-
 drivers/net/ethernet/qlogic/qed/qed_vf.h | 301 ++++---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 5 +-
 drivers/net/ethernet/socionext/netsec.c | 11 +
 drivers/net/ethernet/xilinx/ll_temac_main.c | 16 +-
 drivers/net/tap.c | 2 +-
 drivers/net/tun.c | 2 +-
 drivers/net/usb/cdc_ether.c | 21 +
 drivers/net/usb/usbnet.c | 6 +
 drivers/net/usb/zaurus.c | 21 +
 drivers/net/wireless/mediatek/mt76/mt7615/eeprom.c | 6 +-
 drivers/s390/net/qeth_core.h | 1 -
 drivers/s390/net/qeth_core_main.c | 2 -
 drivers/s390/net/qeth_l2_main.c | 9 +-
 drivers/s390/net/qeth_l3_main.c | 8 +-
 drivers/s390/scsi/zfcp_fc.c | 6 +-
 drivers/scsi/storvsc_drv.c | 4 +
 drivers/soundwire/bus.c | 20 +-
 fs/ceph/mds_client.c | 4 +-
 fs/ceph/mds_client.h | 5 +
 fs/ceph/super.c | 10 +
 fs/exfat/balloc.c | 6 +-
 fs/exfat/dir.c | 27 +-
 fs/ext2/ext2.h | 12 -
 fs/ext2/super.c | 23 +-
 fs/file.c | 18 +-
 fs/ntfs3/attrlist.c | 4 +-
 fs/open.c | 2 +-
 fs/super.c | 11 +-
 fs/sysv/itree.c | 4 +
 include/asm-generic/word-at-a-time.h | 2 +-
 include/linux/pm_wakeirq.h | 9 +-
 include/linux/qed/qed_chain.h | 97 ++-
 include/linux/qed/qed_if.h | 255 +++---
 include/linux/qed/qed_iscsi_if.h | 2 +-
 include/linux/qed/qed_ll2_if.h | 42 +-
 include/linux/qed/qed_nvmetcp_if.h | 17 +
 include/net/vxlan.h | 4 +-
 io_uring/io_uring.c | 23 +-
 kernel/bpf/cpumap.c | 35 +-
 kernel/events/core.c | 8 +-
 kernel/trace/bpf_trace.c | 6 +-
 net/bluetooth/l2cap_sock.c | 2 +
 net/ceph/osd_client.c | 20 +-
 net/core/bpf_sk_storage.c | 5 +-
 net/core/rtnetlink.c | 8 +-
 net/core/sock.c | 21 +-
 net/core/sock_map.c | 2 -
 net/dcb/dcbnl.c | 2 +-
 net/ipv4/tcp_metrics.c | 70 +-
 net/ipv6/ip6mr.c | 2 +-
 net/sched/cls_fw.c | 1 -
 net/sched/cls_route.c | 1 -
 net/sched/cls_u32.c | 57 +-
 net/sched/sch_taprio.c | 15 +-
 net/unix/af_unix.c | 2 +-
 net/wireless/scan.c | 2 +-
 .../tests/shell/test_uprobe_from_different_cu.sh | 8 +-
 tools/testing/selftests/rseq/rseq.c | 31 +-
 117 files changed, 3227 insertions(+), 2247 deletions(-)
On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 06:53, Joel Fernandes wrote:
On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
Not necessarily new with 5.15 stable, but 3 of the 19 rcutorture scenarios hang with this -rc: TREE04, TREE07, TASKS03.
5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu hotplug rcutorture testing. tglx and I are continuing to debug this. The issue shows up only on 5.15 stable kernels, not on other stable kernels and not on mainline.
Do you by any chance have a crash pattern that we could possibly use to find the crash in ChromeOS crash logs? No idea if that would help, but it could provide some additional data points.
The pattern shows up as a hard hang: the system is unresponsive and all CPUs are stuck in stop_machine. Sometimes it recovers on its own from the hang and then RCU immediately gives stall warnings. It takes about 1.5 hours to reproduce, and sometimes it does not happen for several hours.
It appears related to CPU hotplug since gdb showed me most of the CPUs are spinning in multi_cpu_stop() / stop machine after the hang.
thanks,
- Joel
On Wed, Aug 9, 2023 at 2:35 PM Joel Fernandes joel@joelfernandes.org wrote:
The pattern shows up as a hard hang: the system is unresponsive and all CPUs are stuck in stop_machine. Sometimes it recovers on its own from the hang and then RCU immediately gives stall warnings. It takes about 1.5 hours to reproduce, and sometimes it does not happen for several hours.
It appears related to CPU hotplug since gdb showed me most of the CPUs are spinning in multi_cpu_stop() / stop machine after the hang.
Adding to this, it appears that one of the CPUs is constantly firing and reprogramming hrtimer events every few hundred microseconds for some reason (I see this in gdb). My debug angle right now is to figure out why it does that, but collecting a trace is hard: trace collection itself appears to stop once the system is hung, so the only traces I am getting are from after the hang recovers, not from during the hang. I am also trying to see if multi_cpu_stop() can panic the kernel if it sits there too long.
- Joel
On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
The pattern shows up as a hard hang: the system is unresponsive and all CPUs are stuck in stop_machine. Sometimes it recovers on its own from the hang and then RCU immediately gives stall warnings. It takes about 1.5 hours to reproduce, and sometimes it does not happen for several hours.
It appears related to CPU hotplug since gdb showed me most of the CPUs are spinning in multi_cpu_stop() / stop machine after the hang.
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, though with v5.4.y rather than v5.15.y. The actual hang is in stop_machine_yield(). Example:
<0>[63298.624328] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]
<4>[63298.624331] Modules linked in: 8021q ccm snd_seq_dummy snd_seq snd_seq_device bridge stp llc tun nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp esp6 ah6 ip6t_REJECT ip6t_ipv6header vhost_vsock vhost vmw_vsock_virtio_transport_common vsock veth rfcomm xt_cgroup cmac algif_hash algif_skcipher af_alg xt_MASQUERADE uinput iwlmvm snd_soc_skl_ssp_clk iwl7000_mac80211 btusb snd_soc_kbl_da7219_max98357a btrtl btintel snd_soc_hdac_hdmi btbcm bluetooth snd_soc_dmic snd_soc_skl ecdh_generic ecc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_hdac_hda uvcvideo snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core videobuf2_vmalloc videobuf2_v4l2 videobuf2_common snd_intel_dspcfg videobuf2_memops snd_hda_codec snd_hwdep snd_hda_core iwlwifi snd_soc_da7219 snd_soc_max98357a fuse ip6table_nat cfg80211 lzo_rle lzo_compress zram joydev
<4>[63298.624357] CPU: 0 PID: 11 Comm: migration/0 Tainted: G U W 5.4.180-17902-g44152654f29b #1
<4>[63298.624358] Hardware name: Google Nami/Nami, BIOS Google_Nami.10775.145.0 09/19/2019
<4>[63298.624363] RIP: 0010:stop_machine_yield+0xb/0xd
<4>[63298.624366] Code: ff 74 b6 f0 ff 0f 75 b1 48 83 c7 08 e8 1f cb f9 ff eb a6 e8 a0 20 e3 ff eb bc e8 50 4b f5 ff 0f 1f 44 00 00 55 48 89 e5 f3 90 <5d> c3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 81
<4>[63298.624368] RSP: 0000:ffffbaf90006fe38 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
<4>[63298.624370] RAX: 0000000000000000 RBX: ffffbaf90300bca8 RCX: 0000000000000000
<4>[63298.624371] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffffb0d46920
<4>[63298.624373] RBP: ffffbaf90006fe38 R08: 0000000000000002 R09: 0000398ecf9a0ac5
<4>[63298.624374] R10: 0000000000000171 R11: ffffffffaf9cfb11 R12: 0000000000000001
<4>[63298.624376] R13: ffff9b09baa22201 R14: ffffffffb0d46920 R15: 0000000000000001
<4>[63298.624377] FS: 0000000000000000(0000) GS:ffff9b09baa00000(0000) knlGS:0000000000000000
<4>[63298.624379] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[63298.624380] CR2: 0000153c00724820 CR3: 0000000171ab8005 CR4: 00000000003606f0
<4>[63298.624382] Call Trace:
<4>[63298.624386] multi_cpu_stop+0x89/0x119
<4>[63298.624389] ? stop_two_cpus+0x24d/0x24d
<4>[63298.624391] cpu_stopper_thread+0x8f/0x111
<4>[63298.624394] smpboot_thread_fn+0x174/0x212
<4>[63298.624397] kthread+0x147/0x156
<4>[63298.624399] ? cpu_report_death+0x43/0x43
<4>[63298.624401] ? kthread_blkcg+0x2e/0x2e
<4>[63298.624404] ret_from_fork+0x35/0x40
<0>[63298.624407] Kernel panic - not syncing: softlockup: hung tasks
I guess that is something different?
Guenter
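For anyone who wants to search a pile of crash logs for the signature above, here is a minimal Python sketch; it is not part of the original exchange. It assumes the logs are available as plain text. The function name find_stop_machine_lockups and the regular expression are illustrative assumptions; only the watchdog message format and the multi_cpu_stop symbol are taken from the report.

```python
import re

# Watchdog line as it appears in the report above, e.g.
# "watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]"
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]]+)\]"
)


def find_stop_machine_lockups(log_text):
    """Return (cpu, seconds, task) for each soft-lockup report that
    also mentions multi_cpu_stop before the next report starts.

    Hypothetical helper: the grouping of text into "reports" is a
    simplification, not how the kernel delimits them.
    """
    hits = []
    matches = list(SOFT_LOCKUP.finditer(log_text))
    for i, m in enumerate(matches):
        # A report runs from its watchdog line up to the next watchdog line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        report = log_text[m.start():end]
        if "multi_cpu_stop" in report:
            hits.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return hits
```

Because it only looks for substrings, the sketch works whether the log has one entry per line or has been flattened into a single line. It will not match the 5.15 hang described earlier in the thread, since that one prints nothing while the system is stuck.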
On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, though with v5.4.y rather than v5.15.y. The actual hang is in stop_machine_yield().
Interesting. It looks similar as far as the stack dump in gdb goes. Here are the stacks I dumped during the hang I referred to: https://paste.debian.net/1288308/
But in dmesg, it prints nothing for about 20-30 mins before recovering, then I get RCU stalls. It looks like this:
[ 682.721962] kvm-clock: cpu 7, msr 199981c1, secondary cpu clock
[ 682.736830] kvm-guest: stealtime: cpu 7, msr 1f5db140
[ 684.445875] smpboot: Booting Node 0 Processor 5 APIC 0x5
[ 684.467831] kvm-clock: cpu 5, msr 19998141, secondary cpu clock
[ 684.555766] kvm-guest: stealtime: cpu 5, msr 1f55b140
[ 687.356637] smpboot: Booting Node 0 Processor 4 APIC 0x4
[ 687.377214] kvm-clock: cpu 4, msr 19998101, secondary cpu clock
[ 2885.473742] kvm-guest: stealtime: cpu 4, msr 1f51b140
[ 2886.456408] rcu: INFO: rcu_sched self-detected stall on CPU
[ 2886.457590] rcu_torture_fwd_prog_nr: Duration 15423 cver 170 gps 337
[ 2886.464934] rcu: 0-...!: (2 ticks this GP) idle=7eb/0/0x1 softirq=118271/118271 fqs=0 last_accelerate: e3cd/71c0 dyntick_enabled: 1
[ 2886.490837] (t=2199034 jiffies g=185489 q=4)
[ 2886.497297] rcu: rcu_sched kthread timer wakeup didn't happen for 2199031 jiffies! g185489 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 2886.514201] rcu: Possible timer handling issue on cpu=0 timer-softirq=441616
[ 2886.524593] rcu: rcu_sched kthread starved for 2199034 jiffies! g185489 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 2886.540067] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2886.551967] rcu: RCU grace-period kthread stack dump:
[ 2886.558644] task:rcu_sched state:I stack:14896 pid: 15 ppid: 2 flags:0x00004000
[ 2886.569640] Call Trace:
[ 2886.572940] <TASK>
[ 2886.575902] __schedule+0x284/0x6e0
[ 2886.580969] schedule+0x53/0xa0
[ 2886.585231] schedule_timeout+0x8f/0x130
During that huge gap, I connected gdb and dumped the stacks in the link above.
On 5.15 stable you should be able to reproduce it in about an hour and a half most of the time by running something like:
  tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 48 --duration 60 --configs TREE04
Let me know if you saw anything like this. I am currently trying to panic the kernel when the hang happens so I can get better traces.
thanks,
- Joel
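Since this hang shows up in dmesg only as a long silence (in the excerpt above, the timestamps jump from 687.377214 to 2885.473742), one way to spot it in saved console logs is to look for large jumps between consecutive timestamps. The following is a minimal Python sketch under stated assumptions, not something from the original exchange: the function name find_gaps and the 600-second default threshold are arbitrary illustrative choices, and it assumes the usual "[ seconds.microseconds]" dmesg prefix.

```python
import re

# dmesg timestamp prefix such as "[ 687.377214]" or "[ 2885.473742]"
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def find_gaps(log_text, threshold=600.0):
    """Return (before, after) timestamp pairs, in seconds, for every
    silent gap of at least `threshold` seconds between log entries.

    Hypothetical helper; the threshold default is an assumption.
    """
    stamps = [float(t) for t in TIMESTAMP.findall(log_text)]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a >= threshold]
```

A gap found this way is only a hint: a quiet but healthy system also produces long silences, so it is worth checking whether RCU stall warnings follow right after the gap, as they do in the excerpt.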
On 8/9/23 13:14, Joel Fernandes wrote:
Interesting. It looks similar as far as the stack dump in gdb goes. Here are the stacks I dumped during the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Guenter
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
On 8/9/23 13:39, Joel Fernandes wrote:
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Guenter
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck linux@roeck-us.net wrote: > > On 8/9/23 06:53, Joel Fernandes wrote: > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote: > > > This is the start of the stable review cycle for the 5.15.126 release. > > > There are 92 patches in this series, all will be posted as a response > > > to this one. If anyone has any issues with these being applied, please > > > let me know. > > > > > > Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. > > > Anything received after that time might be too late. > > > > > > The whole patch series can be found in one patch at: > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... > > > or in the git tree and branch at: > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y > > > and the diffstat can be found below. > > > > Not necesscarily new with 5.15 stable but 3 of the 19 rcutorture scenarios > > hang with this -rc: TREE04, TREE07, TASKS03. > > > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The > > issue does not show up on anything but 5.15 stable kernels and neither on > > mainline. > > > > Do you by any have a crash pattern that we could possibly use to find the crash > in ChromeOS crash logs ? No idea if that would help, but it could provide some > additional data points.
The pattern shows as a hard hang, the system is unresponsive and all CPUs are stuck in stop_machine. Sometimes it recovers on its own from the hang and then RCU immediately gives stall warnings. It takes 1.5 hours to reproduce and sometimes never happens for several hours.
It appears related to CPU hotplug since gdb showed me most of the CPUs are spinning in multi_cpu_stop() / stop machine after the hang.
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
This commit fixed something similar for me in v5.16.
https://paulmck.livejournal.com/62071.html
Thanx, Paul
On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
[...]
The pattern shows as a hard hang, the system is unresponsive and all CPUs are stuck in stop_machine. Sometimes it recovers on its own from the hang and then RCU immediately gives stall warnings. It takes 1.5 hours to reproduce and sometimes never happens for several hours.
It appears related to CPU hotplug since gdb showed me most of the CPUs are spinning in multi_cpu_stop() / stop machine after the hang.
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
Ah I was just replying, I have been seeing really good results after applying the following 3 commits since yesterday:
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was seeing.
I did a lengthy test and everything is looking good. I'll send these out to the stable list.
thanks,
- Joel
On Thu, Aug 10, 2023 at 09:54:16PM +0000, Joel Fernandes wrote:
On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
[...]
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
Ah I was just replying, I have been seeing really good results after applying the following 3 commits since yesterday:
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was seeing.
I did a lengthy test and everything is looking good. I'll send these out to the stable list.
I just read your post for the first time. And just to humor you about my debugging which was very similar to yours, I got as far as this statement in your post (before looking for fixes in timer code):
<quote> Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session. </quote>
My detection of this came from gdb, within that 2000 second stall, I broke into the VM with --gdb and kept dumping the stuck CPU's stack with "thread X" and "bt". I noticed that it was always in the timer interrupt. Here were the stacks: https://pastebin.com/raw/L3nv1kH2
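As a rough illustration of that sampling approach (not the actual tooling used here), repeated "bt" dumps can be tallied to see which function shows up in every sample. The sketch below is Python; frame_frequency and the sample data are hypothetical, and timer_interrupt_entry is a made-up placeholder frame name rather than a real kernel symbol.

```python
from collections import Counter


def frame_frequency(samples):
    """Return, for each function name, the fraction of samples it appears in.

    `samples` is a list of backtraces; each backtrace is a list of function
    names as they would be copied out of successive gdb "bt" dumps.
    """
    counts = Counter()
    for frames in samples:
        # Count each function at most once per sample, so recursion or
        # repeated frames do not inflate the tally.
        counts.update(set(frames))
    total = len(samples)
    return {name: n / total for name, n in counts.items()}


# Hypothetical samples of one stuck CPU; only the shape of the data matters.
samples = [
    ["hrtimer_start", "timer_interrupt_entry", "multi_cpu_stop"],
    ["timer_interrupt_entry", "multi_cpu_stop"],
    ["timer_interrupt_entry", "stop_machine_yield", "multi_cpu_stop"],
]

freq = frame_frequency(samples)
# Functions present in every sample are the first suspects.
always_present = sorted(name for name, f in freq.items() if f == 1.0)
print(always_present)
```

A frame that is present in every sample, as the timer interrupt was in the dumps above, is a strong hint about where the CPU is spending its time.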
Then I narrowed my search down to timer events by enabling boot options ftrace_dump_on_oops and panic-on-stall ones, and noticed a storm of hrtimer_start coming out of the long stall. I was all but certain it was a tick storm and noticed it kept programming hrtimer to the same event.
Ah, then I just did a "git diff" in kernel/time/ between v5.15 and v6.1 and noticed the missing patches. ;-)
Though in my experience, I wasn't seeing a KTIME_MAX-type of value like you mentioned in the post. What I noticed is that the tick was never stopped, it just kept firing a bit earlier than was requested and in the interrupt exit path (of the delivered-too-early timer interrupt), it kept re-requesting the tick.
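To make the early-firing behaviour concrete, here is a deliberately simplified toy model in Python. It is a sketch under stated assumptions, not the kernel's tick code and not a reconstruction of the actual bug: count_tick_irqs and all of its parameters are hypothetical. It assumes a timer device that fires early_ns before the programmed deadline, and an interrupt exit path that re-requests the same deadline whenever it has not been reached yet.

```python
def count_tick_irqs(deadline_ns, early_ns, irq_cost_ns, max_irqs=10_000):
    """Count interrupts taken for one tick deadline in a toy model.

    Assumptions (all hypothetical): the device fires `early_ns` before the
    programmed deadline, each interrupt costs `irq_cost_ns` of wall time,
    and the exit path re-arms the same deadline if it has not passed.
    """
    now = 0
    irqs = 0
    while irqs < max_irqs:
        # The device fires early_ns before the deadline, or immediately if
        # that moment is already in the past.
        now = max(now, deadline_ns - early_ns)
        irqs += 1
        now += irq_cost_ns  # time spent in the interrupt handler
        if now >= deadline_ns:
            return irqs  # deadline reached, nothing left to re-arm
        # Otherwise the same deadline is requested again and the loop repeats.
    return irqs  # gave up: the model never made progress


# A well-behaved device needs a single interrupt per tick.
print(count_tick_irqs(1_000_000, early_ns=0, irq_cost_ns=1_000))
# A device that fires early takes several back-to-back interrupts instead.
print(count_tick_irqs(1_000_000, early_ns=5_000, irq_cost_ns=1_000))
```

In this model the number of back-to-back interrupts grows as the handler gets cheaper relative to how early the device fires, and with irq_cost_ns set to 0 the loop only ends at the max_irqs cap, which is the toy equivalent of a storm that never resolves.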
thanks,
- Joel
On Thu, Aug 10, 2023 at 10:14:16PM +0000, Joel Fernandes wrote:
On Thu, Aug 10, 2023 at 09:54:16PM +0000, Joel Fernandes wrote:
On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
[...]
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
Ah I was just replying, I have been seeing really good results after applying the following 3 commits since yesterday:
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was seeing.
I did a lengthy test and everything is looking good. I'll send these out to the stable list.
I just read your post for the first time. And just to humor you about my debugging which was very similar to yours, I got as far as this statement in your post (before looking for fixes in timer code):
<quote> Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session. </quote>
My detection of this came from gdb, within that 2000 second stall, I broke into the VM with --gdb and kept dumping the stuck CPU's stack with "thread X" and "bt". I noticed that it was always in the timer interrupt. Here were the stacks: https://pastebin.com/raw/L3nv1kH2
Then I narrowed my search down to timer events by enabling boot options ftrace_dump_on_oops and panic-on-stall ones, and noticed a storm of hrtimer_start coming out of the long stall. I was all but certain it was a tick storm and noticed it kept programming hrtimer to the same event.
Ah, then I just did a "git diff" in kernel/time/ between v5.15 and v6.1 and noticed the missing patches. ;-)
Though in my experience, I wasn't seeing a KTIME_MAX-type of value like you mentioned in the post. What I noticed is that the tick was never stopped, it just kept firing a bit earlier than was requested and in the interrupt exit path (of the delivered-too-early timer interrupt), it kept re-requesting the tick.
That "git diff" wouldn't have shown me much at the time, but I am very glad that you found it!
Thanx, Paul
On 8/10/23 14:54, Joel Fernandes wrote:
On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
On Wed, Aug 09, 2023 at 12:25:48PM -0700, Guenter Roeck wrote:
[...]
Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace, but not with v5.15.y but with v5.4.y. The actual hang is in stop_machine_yield().
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
Ah I was just replying, I have been seeing really good results after applying the following 3 commits since yesterday:
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
Would those also apply to v5.10.y, or just 5.15.y ?
Thanks, Guenter
5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was seeing.
I did a lengthy test and everything is looking good. I'll send these out to the stable list.
thanks,
- Joel
On Aug 10, 2023, at 6:55 PM, Guenter Roeck linux@roeck-us.net wrote:
On 8/10/23 14:54, Joel Fernandes wrote:
On Thu, Aug 10, 2023 at 10:55:16AM -0700, Paul E. McKenney wrote:
On Wed, Aug 09, 2023 at 02:45:44PM -0700, Guenter Roeck wrote:
On 8/9/23 13:39, Joel Fernandes wrote:
On Wed, Aug 9, 2023 at 4:38 PM Guenter Roeck linux@roeck-us.net wrote:
On 8/9/23 13:14, Joel Fernandes wrote:
[...]
Interesting. It looks similar as far as the stack dump in gdb goes, here are the stacks I dumped with the hang I referred to: https://paste.debian.net/1288308/
That link gives me "Entry not found".
Yeah that was weird. Here it is again: https://pastebin.com/raw/L3nv1kH2
I found a couple of crash reports from chromeos-5.10, one of them complaining about RCU issues. I sent you links via IM. Nothing from 5.15 or later, though.
Is the crash showing the eternally refiring timer fixed by this commit?
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
Ah I was just replying, I have been seeing really good results after applying the following 3 commits since yesterday:
53e87e3cdc15 ("timers/nohz: Last resort update jiffies on nohz_full IRQ entry")
5417ddc1cf1f ("timers/nohz: Switch to ONESHOT_STOPPED in the low-res handler when the tick is stopped")
a1ff03cd6fb9 ("tick: Detect and fix jiffies update stall")
Would those also apply to v5.10.y, or just 5.15.y ?
All apply to 5.10 but one. I am currently testing with it more and will post to stable for 5.10 as well.
Thanks,
- Joel
Thanks, Guenter
5417ddc1cf1f also mentioned a "tick storm" which is exactly what I was seeing.
I did a lengthy test and everything is looking good. I'll send these out to the stable list.
thanks,
- Joel
Hello,
On 2023-08-09T12:40:36+02:00 Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
This rc kernel passes DAMON functionality test[1] on my test machine. Attaching the test results summary below. Please note that I retrieved the kernel from linux-stable-rc tree[2].
Tested-by: SeongJae Park sj@kernel.org
[1] https://github.com/awslabs/damon-tests/tree/next/corr
[2] ae7f23cbf199 ("Linux 5.15.126-rc1")
Thanks, SJ
[...]
---
ok 1 selftests: damon: debugfs_attrs.sh
ok 1 selftests: damon-tests: kunit.sh
ok 2 selftests: damon-tests: huge_count_read_write.sh
ok 3 selftests: damon-tests: buffer_overflow.sh
ok 4 selftests: damon-tests: rm_contexts.sh
ok 5 selftests: damon-tests: record_null_deref.sh
ok 6 selftests: damon-tests: dbgfs_target_ids_read_before_terminate_race.sh
ok 7 selftests: damon-tests: dbgfs_target_ids_pid_leak.sh
ok 8 selftests: damon-tests: damo_tests.sh
ok 9 selftests: damon-tests: masim-record.sh
ok 10 selftests: damon-tests: build_i386.sh
ok 11 selftests: damon-tests: build_m68k.sh
ok 12 selftests: damon-tests: build_arm64.sh
ok 13 selftests: damon-tests: build_i386_idle_flag.sh
ok 14 selftests: damon-tests: build_i386_highpte.sh
ok 15 selftests: damon-tests: build_nomemcg.sh
PASS
On 8/9/23 03:40, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
SCMI with the SMC transport fails to build because of "[PATCH 5.15 11/92] firmware: arm_scmi: Fix chan_free cleanup on SMC"; the specific details have been reported in reply to that patch. Here is the build failure FWIW:
drivers/firmware/arm_scmi/smc.c:39:6: error: duplicate member 'irq'
  int irq;
      ^~~
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:20: error: 'irq' undeclared (first use in this function); did you mean 'rq'?
  scmi_info->irq = irq;
                   ^~~
                   rq
drivers/firmware/arm_scmi/smc.c:118:20: note: each undeclared identifier is reported only once for each function it appears in
  CC      drivers/mmc/core/slot-gpio.o
host-make[5]: *** [scripts/Makefile.build:289: drivers/firmware/arm_scmi/smc.o] Error 1
host-make[5]: *** Waiting for unfinished jobs....
Hi Greg,
On 09/08/23 4:10 pm, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
No problems seen on x86_64 and aarch64.
Tested-by: Harshit Mogalapalli harshit.m.mogalapalli@oracle.com
Thanks, Harshit
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
On 8/10/23 03:16, Harshit Mogalapalli wrote:
Hi Greg,
On 09/08/23 4:10 pm, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
No problems seen on x86_64 and aarch64.
fwiw, aarch64:allmodconfig doesn't compile.
Guenter
Tested-by: Harshit Mogalapalli harshit.m.mogalapalli@oracle.com
Thanks, Harshit
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y and the diffstat can be found below.
thanks,
greg k-h
On 8/9/23 03:40, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
Building arm:allmodconfig ... failed
--------------
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
Building arm64:defconfig ... failed
--------------
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan free cleanup on SMC") is applied without its dependent commit(s).
Guenter
On 8/10/23 03:24, Guenter Roeck wrote:
On 8/9/23 03:40, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
Building arm:allmodconfig ... failed
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
Building arm64:defconfig ... failed
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan free cleanup on SMC") is applied without its dependent commit(s).
Indeed, we discussed this here: https://lore.kernel.org/all/20230810084529.53thk6dmlejbma3t@bogus/
On Thu, Aug 10, 2023 at 09:25:53AM -0700, Florian Fainelli wrote:
On 8/10/23 03:24, Guenter Roeck wrote:
On 8/9/23 03:40, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
Building arm:allmodconfig ... failed
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
Building arm64:defconfig ... failed
Error log:
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
That is because commit d80e159dbdbb ("firmware: arm_scmi: Fix chan free cleanup on SMC") is applied without its dependent commit(s).
Indeed, we discussed this here: https://lore.kernel.org/all/20230810084529.53thk6dmlejbma3t@bogus/
Offending commit should now be dropped, thanks.
greg k-h
On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
Build results:
    total: 160 pass: 157 fail: 3
Failed builds:
    arm:allmodconfig
    arm64:defconfig
    arm64:allmodconfig
Qemu test results:
    total: 501 pass: 423 fail: 78
Failed tests:
    <most arm>
    <all arm64/arm64be>
As already reported, plus:
Error log:
drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c:176:20: error: 'drm_plane_helper_destroy' undeclared here
for arm:multi_v7_defconfig
Side note: I am surprised about successful arm64 tests/builds since arm64:defconfig fails to build with obvious code errors.
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
Guenter
On Thu, Aug 10, 2023 at 09:06:01AM -0700, Guenter Roeck wrote:
On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000. Anything received after that time might be too late.
Build results:
    total: 160 pass: 157 fail: 3
Failed builds:
    arm:allmodconfig
    arm64:defconfig
    arm64:allmodconfig
Qemu test results:
    total: 501 pass: 423 fail: 78
Failed tests:
    <most arm>
    <all arm64/arm64be>
As already reported, plus:
Error log:
drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_plane.c:176:20: error: 'drm_plane_helper_destroy' undeclared here
Offending commit now dropped, Sasha's dep-bot went a little crazy there, and this wasn't needed, sorry for not catching that sooner.
for arm:multi_v7_defconfig
Side note: I am surprised about successful arm64 tests/builds since arm64:defconfig fails to build with obvious code errors.
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
Should now be fixed, thanks.
greg k-h
On 8/9/23 3:40 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Built and booted successfully on RISC-V RV64 (HiFive Unmatched).
Tested-by: Ron Economos re@w6rz.net
Hello!
On Wed, 9 Aug 2023 at 04:57, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.15.126 release. There are 92 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
We are also seeing build failures on Arm and Arm64, with Clang 17 and GCC 8:
* arm, build
  - clang-17-defconfig
  - clang-17-lkftconfig
  - clang-17-lkftconfig-no-kselftest-frag
  - clang-lkftconfig
  - clang-nightly-lkftconfig-kselftest
  - gcc-8-defconfig

* arm64, build
  - clang-17-defconfig
  - clang-17-defconfig-40bc7ee5
  - clang-17-lkftconfig
  - clang-17-lkftconfig-no-kselftest-frag
  - clang-lkftconfig
  - clang-nightly-lkftconfig-kselftest
  - gcc-8-defconfig
  - gcc-8-defconfig-40bc7ee5
Failure is:
-----8<-----
/builds/linux/drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
   39 |         int irq;
      |             ^~~
/builds/linux/drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
/builds/linux/drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared (first use in this function); did you mean 'rq'?
  118 |                 scmi_info->irq = irq;
      |                                  ^~~
      |                                  rq
----->8-----
(Funnily enough, this was reported by Naresh [1] before this RC round, but we chalked it up to GCC-13 on an older branch.)
Greetings!
Daniel Díaz daniel.diaz@linaro.org
[1] https://lore.kernel.org/stable/CA+G9fYvTjm2oa6mXR=HUe6gYuVaS2nFb_otuvPfmPeKH...
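Several reports in this thread quote the same two compiler errors from different build systems, with and without a /builds/linux/ prefix and with or without GCC's "did you mean" hint. As a rough illustration only, the sketch below shows one way such reports could be collapsed into a set of unique diagnostics. The function name parse_diagnostics, the SAMPLE_LOG string, and the choice to strip the /builds/linux/ prefix are assumptions made for this example, not part of any tool used by the reporters.

```python
import re

# Matches GCC-style "path:line:col: error: message" diagnostics.
# "In function ..." context lines carry no "error:" and are skipped.
DIAG_RE = re.compile(
    r"(?P<path>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)"
)

# Sample text assembled from the error lines quoted in this thread.
SAMPLE_LOG = """\
drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
drivers/firmware/arm_scmi/smc.c: In function 'smc_chan_setup':
drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared
/builds/linux/drivers/firmware/arm_scmi/smc.c:39:13: error: duplicate member 'irq'
/builds/linux/drivers/firmware/arm_scmi/smc.c:118:34: error: 'irq' undeclared (first use in this function); did you mean 'rq'?
"""


def parse_diagnostics(log, strip_prefix="/builds/linux/"):
    """Return {(path, line, col, message): count} for each unique error.

    strip_prefix is an assumed build-root prefix; it is removed so that
    reports from different build systems compare equal.
    """
    counts = {}
    for raw in log.splitlines():
        match = DIAG_RE.search(raw)
        if match is None:
            continue
        path = match.group("path")
        if path.startswith(strip_prefix):
            path = path[len(strip_prefix):]
        # Drop trailing compiler hints such as "(first use in this function)".
        msg = match.group("msg").split(" (")[0].strip()
        key = (path, int(match.group("line")), int(match.group("col")), msg)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    for (path, line, col, msg), n in parse_diagnostics(SAMPLE_LOG).items():
        print(f"{path}:{line}:{col}: {msg} (reported {n}x)")
```

With the sample text, both errors in drivers/firmware/arm_scmi/smc.c collapse to one entry each, which matches the conclusion in the thread that the reports share a single cause.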