I've found a regression somewhere between 6.7.4 and 6.8.0-rc0 that causes my Framework 7840 AMD laptop to freeze after waking it from a suspend. I can reliably trigger the issue, but unfortunately cannot provide useful logs for now and hope to get some help with doing so.
How to reproduce
----------------
The last working kernel release was 6.7.4, the issue appeared in all 6.8 release candidates from 1-4.
1. normally boot the system with a 6.8 kernel
2. suspend and resume the system 2 to 4 times. Usually, the freeze already occurs at the 2nd resume.
2.1 graphical approach: I've reproduced this directly from the "Suspend" button in my SDDM display manager (X11)
2.2 TTY approach: You can also directly swith to a tty after boot, log in, and then issue `systemctl suspend` 2 to 4 times
3. After resume number 2 or later, the system freezes while resuming and cannot be used anymore. The screen is switched on and displays something, but no inputs are processed and the image is static.
3.1 graphical approach: When suspending and resuming by closing and opening the laptop lid, the screen is black with the cursor displayed. When doing so while keeping the lid open, the display manager's background image and a cursor are displayed, but only statically frozen. No keyboard or touchpad input is processed.
3.2 on a TTY: When suspending and resuming from a tty, a few kernel messages still manage to be printed, but after that no new information is displayed. E.g. when keeping a `journalctl -f` session open in another tmux pane, that journal output does not update anymore. Keyboard inputs are not processed anymore. Nonetheless, the cursor continues to blink regulary.
I've attached two screenshots showing the situation with kernels 6.7.4 (working) and 6.8 (broken).
Detailed Description
--------------------
In most cases, the freeze already occurs after the 2nd suspend-resume-cycle.
Unfortunately, this freeze also appears to block IO, as after a forced hard reboot I cannot retrieve relevant information from `journalctl -b "-1"`. The last retrievable log messages are from the successful suspend action.
I welcome any recommendation on how to retrieve valuable information. I have not yet played around with cmdline parameters relevant for debug.
System Details
--------------
Distro: NixOS 23.11, kernel 6.8 rc1-4 compiled manually
kernel: Linux version 6.8.0-rc4 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Sun Feb 11 20:18:13 UTC 2024
hardware: Framework 13 laptop, 7840 AMD series, CPU AMD Ryzen 7 7840U x86_64
In the attachments you find:
- the used kernel config
- my cmdline params
- 2 screenshots from the bug occuring when triggered from a tty
Next Steps
----------
I intend to bisect the issue, but this can take a while due to the need for manual testing and a kernel compile cycle requiring >30min for now.
I am mostly reporting this now already to still raise awareness in the RC phase and to have something where I can point other Framework laptop users towards for reproduction of the bug.
All the best
schmittlauch
Disclaimer:
I read the kernel docs on reporting issues, IMHO these were a bit unclear on whether RC kernels are in the scope of the stable and regressions lists.
If they aren't please point me to where to do this. So far I had only reported bugs via the bugzilla.
In current scenario if Plug-out and Plug-In performed continuously
there could be a chance while checking for dwc->gadget_driver in
dwc3_gadget_suspend, a NULL pointer dereference may occur.
Call Stack:
CPU1: CPU2:
gadget_unbind_driver dwc3_suspend_common
dwc3_gadget_stop dwc3_gadget_suspend
dwc3_disconnect_gadget
CPU1 basically clears the variable and CPU2 checks the variable.
Consider CPU1 is running and right before gadget_driver is cleared
and in parallel CPU2 executes dwc3_gadget_suspend where it finds
dwc->gadget_driver which is not NULL and resumes execution and then
CPU1 completes execution. CPU2 executes dwc3_disconnect_gadget where
it checks dwc->gadget_driver is already NULL because of which the
NULL pointer deference occur.
Cc: <stable(a)vger.kernel.org>
Fixes: 9772b47a4c29 ("usb: dwc3: gadget: Fix suspend/resume during device mode")
Acked-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa(a)quicinc.com>
---
changes in v3:
Corrected fixes tag and typo mistake in commit message dw3_gadget_stop -> dwc3_gadget_stop.
Link to v2:
https://lore.kernel.org/linux-usb/CAKzKK0r8RUqgXy1o5dndU21KuTKtyZ5rn5Fb9sZq…
Changes in v2:
Added cc and fixes tag missing in v1.
Link to v1:
https://lore.kernel.org/linux-usb/20240110095532.4776-1-quic_uaggarwa@quici…
drivers/usb/dwc3/gadget.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 019368f8e9c4..564976b3e2b9 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4709,15 +4709,13 @@ int dwc3_gadget_suspend(struct dwc3 *dwc)
unsigned long flags;
int ret;
- if (!dwc->gadget_driver)
- return 0;
-
ret = dwc3_gadget_soft_disconnect(dwc);
if (ret)
goto err;
spin_lock_irqsave(&dwc->lock, flags);
- dwc3_disconnect_gadget(dwc);
+ if (dwc->gadget_driver)
+ dwc3_disconnect_gadget(dwc);
spin_unlock_irqrestore(&dwc->lock, flags);
return 0;
--
2.17.1
This is the start of the stable review cycle for the 6.1.78 release.
There are 65 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri, 16 Feb 2024 14:28:54 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.78-rc2…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.78-rc2
Furong Xu <0x1207(a)gmail.com>
net: stmmac: xgmac: fix a typo of register name in DPP safety handling
Takashi Iwai <tiwai(a)suse.de>
ALSA: usb-audio: Sort quirk table entries
Simon Horman <horms(a)kernel.org>
net: stmmac: xgmac: use #define for string constants
Jiri Wiesner <jwiesner(a)suse.de>
clocksource: Skip watchdog check for large watchdog intervals
Jens Axboe <axboe(a)kernel.dk>
block: treat poll queue enter similarly to timeouts
Sheng Yong <shengyong(a)oppo.com>
f2fs: add helper to check compression level
Mike Marciniszyn <mike.marciniszyn(a)intel.com>
RDMA/irdma: Fix support for 64k pages
Prathu Baronia <prathubaronia2011(a)gmail.com>
vhost: use kzalloc() instead of kmalloc() followed by memset()
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "ASoC: amd: Add new dmi entries for acp5x platform"
Jens Axboe <axboe(a)kernel.dk>
io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffers
Hans de Goede <hdegoede(a)redhat.com>
Input: atkbd - skip ATKBD_CMD_SETLEDS when skipping ATKBD_CMD_GETID
Werner Sembach <wse(a)tuxedocomputers.com>
Input: i8042 - fix strange behavior of touchpad on Clevo NS70PU
Frederic Weisbecker <frederic(a)kernel.org>
hrtimer: Report offline hrtimer enqueue
Prashanth K <quic_prashk(a)quicinc.com>
usb: host: xhci-plat: Add support for XHCI_SG_TRB_CACHE_SIZE_QUIRK
Prashanth K <quic_prashk(a)quicinc.com>
usb: dwc3: host: Set XHCI_SG_TRB_CACHE_SIZE_QUIRK
Leonard Dallmayr <leonard.dallmayr(a)mailbox.org>
USB: serial: cp210x: add ID for IMST iM871A-USB
Puliang Lu <puliang.lu(a)fibocom.com>
USB: serial: option: add Fibocom FM101-GL variant
JackBB Wu <wojackbb(a)gmail.com>
USB: serial: qcserial: add new usb-id for Dell Wireless DW5826e
Sean Young <sean(a)mess.org>
ALSA: usb-audio: add quirk for RODE NT-USB+
Julian Sikorski <belegdol+github(a)gmail.com>
ALSA: usb-audio: Add a quirk for Yamaha YIT-W12TX transmitter
Alexander Tsoy <alexander(a)tsoy.me>
ALSA: usb-audio: Add delay quirk for MOTU M Series 2nd revision
Francesco Dolcini <francesco.dolcini(a)toradex.com>
mtd: parsers: ofpart: add workaround for #size-cells 0
Alexander Aring <aahringo(a)redhat.com>
fs: dlm: don't put dlm_local_addrs on heap
Tejun Heo <tj(a)kernel.org>
blk-iocost: Fix an UBSAN shift-out-of-bounds warning
Ming Lei <ming.lei(a)redhat.com>
scsi: core: Move scsi_host_busy() out of host lock if it is for per-command
Dan Carpenter <dan.carpenter(a)linaro.org>
fs/ntfs3: Fix an NULL dereference bug
Florian Westphal <fw(a)strlen.de>
netfilter: nft_set_pipapo: remove scratch_aligned pointer
Florian Westphal <fw(a)strlen.de>
netfilter: nft_set_pipapo: add helper to release pcpu scratch area
Florian Westphal <fw(a)strlen.de>
netfilter: nft_set_pipapo: store index in scratch maps
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_ct: reject direction for ct id
Srinivasan Shanmugam <srinivasan.shanmugam(a)amd.com>
drm/amd/display: Implement bounds check for stream encoder creation in DCN301
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_compat: restrict match/target protocol to u16
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_compat: reject unused compat flag
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_compat: narrow down revision to unsigned 8-bits
Jakub Kicinski <kuba(a)kernel.org>
selftests: cmsg_ipv6: repeat the exact packet
Eric Dumazet <edumazet(a)google.com>
ppp_async: limit MRU to 64K
Kuniyuki Iwashima <kuniyu(a)amazon.com>
af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.
Shigeru Yoshida <syoshida(a)redhat.com>
tipc: Check the bearer type before calling tipc_udp_nl_bearer_add()
David Howells <dhowells(a)redhat.com>
rxrpc: Fix response to PING RESPONSE ACKs to a dead call
Dan Carpenter <dan.carpenter(a)linaro.org>
drm/i915/gvt: Fix uninitialized variable in handle_mmio()
Eric Dumazet <edumazet(a)google.com>
inet: read sk->sk_family once in inet_recv_error()
Zhang Rui <rui.zhang(a)intel.com>
hwmon: (coretemp) Fix bogus core_id to attr name mapping
Zhang Rui <rui.zhang(a)intel.com>
hwmon: (coretemp) Fix out-of-bounds memory access
Loic Prylli <lprylli(a)netflix.com>
hwmon: (aspeed-pwm-tacho) mutex for tach reading
Zhipeng Lu <alexious(a)zju.edu.cn>
octeontx2-pf: Fix a memleak otx2_sq_init
Zhipeng Lu <alexious(a)zju.edu.cn>
atm: idt77252: fix a memleak in open_card_ubr0
Antoine Tenart <atenart(a)kernel.org>
tunnels: fix out of bounds access when building IPv6 PMTU error
Paolo Abeni <pabeni(a)redhat.com>
selftests: net: avoid just another constant wait
Paolo Abeni <pabeni(a)redhat.com>
selftests: net: cut more slack for gro fwd tests.
Ivan Vecera <ivecera(a)redhat.com>
net: atlantic: Fix DMA mapping for PTP hwts ring
Eric Dumazet <edumazet(a)google.com>
netdevsim: avoid potential loop in nsim_dev_trap_report_work()
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix waiting for beacons logic
Furong Xu <0x1207(a)gmail.com>
net: stmmac: xgmac: fix handling of DPP safety error for DMA channels
Abhinav Kumar <quic_abhinavk(a)quicinc.com>
drm/msm/dpu: check for valid hw_pp in dpu_encoder_helper_phys_cleanup
Kuogee Hsieh <quic_khsieh(a)quicinc.com>
drm/msm/dp: return correct Colorimetry for DP_TEST_DYNAMIC_RANGE_CEA case
Kuogee Hsieh <quic_khsieh(a)quicinc.com>
drm/msms/dp: fixed link clock divider bits be over written in BPC unknown case
Shyam Prasad N <sprasad(a)microsoft.com>
cifs: failure to add channel on iface should bump up weight
Tony Lindgren <tony(a)atomide.com>
phy: ti: phy-omap-usb2: Fix NULL pointer dereference for SRP
Frank Li <Frank.Li(a)nxp.com>
dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEV
Yoshihiro Shimoda <yoshihiro.shimoda.uh(a)renesas.com>
phy: renesas: rcar-gen3-usb2: Fix returning wrong error code
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
dmaengine: fsl-qdma: Fix a memory leak related to the queue command DMA
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
dmaengine: fsl-qdma: Fix a memory leak related to the status queue DMA
Jai Luthra <j-luthra(a)ti.com>
dmaengine: ti: k3-udma: Report short packet errors
Guanhua Gao <guanhua.gao(a)nxp.com>
dmaengine: fsl-dpaa2-qdma: Fix the size of dma pools
Baokun Li <libaokun1(a)huawei.com>
ext4: regenerate buddy after block freeing failed if under fc replay
-------------
Diffstat:
Makefile | 4 +-
block/blk-core.c | 11 ++-
block/blk-iocost.c | 7 ++
drivers/atm/idt77252.c | 2 +
drivers/dma/fsl-dpaa2-qdma/dpaa2-qdma.c | 10 +-
drivers/dma/fsl-qdma.c | 27 ++----
drivers/dma/ti/k3-udma.c | 10 +-
.../drm/amd/display/dc/dcn301/dcn301_resource.c | 2 +-
drivers/gpu/drm/i915/gvt/handlers.c | 3 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c | 4 +-
drivers/gpu/drm/msm/dp/dp_ctrl.c | 5 -
drivers/gpu/drm/msm/dp/dp_link.c | 22 +++--
drivers/gpu/drm/msm/dp/dp_reg.h | 3 +
drivers/hwmon/aspeed-pwm-tacho.c | 7 ++
drivers/hwmon/coretemp.c | 40 ++++----
drivers/infiniband/hw/irdma/verbs.c | 2 +-
drivers/input/keyboard/atkbd.c | 13 ++-
drivers/input/serio/i8042-acpipnpio.h | 6 ++
drivers/mtd/parsers/ofpart_core.c | 19 ++++
drivers/net/ethernet/aquantia/atlantic/aq_ptp.c | 4 +-
drivers/net/ethernet/aquantia/atlantic/aq_ring.c | 13 +++
drivers/net/ethernet/aquantia/atlantic/aq_ring.h | 1 +
.../ethernet/marvell/octeontx2/nic/otx2_common.c | 14 ++-
drivers/net/ethernet/stmicro/stmmac/common.h | 1 +
drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h | 3 +
.../net/ethernet/stmicro/stmmac/dwxgmac2_core.c | 58 ++++++++++-
drivers/net/netdevsim/dev.c | 8 +-
drivers/net/ppp/ppp_async.c | 4 +
drivers/phy/renesas/phy-rcar-gen3-usb2.c | 4 -
drivers/phy/ti/phy-omap-usb2.c | 4 +-
drivers/scsi/scsi_error.c | 3 +-
drivers/scsi/scsi_lib.c | 4 +-
drivers/usb/dwc3/host.c | 4 +-
drivers/usb/host/xhci-plat.c | 3 +
drivers/usb/serial/cp210x.c | 1 +
drivers/usb/serial/option.c | 1 +
drivers/usb/serial/qcserial.c | 2 +
drivers/vhost/vhost.c | 5 +-
fs/dlm/lowcomms.c | 38 +++-----
fs/ext4/mballoc.c | 20 ++++
fs/f2fs/compress.c | 27 ++++++
fs/f2fs/f2fs.h | 2 +
fs/f2fs/super.c | 4 +-
fs/ntfs3/ntfs_fs.h | 2 +-
fs/smb/client/sess.c | 2 +
include/linux/dmaengine.h | 3 +-
include/linux/hrtimer.h | 4 +-
include/uapi/linux/netfilter/nf_tables.h | 2 +
io_uring/net.c | 1 +
kernel/time/clocksource.c | 25 ++++-
kernel/time/hrtimer.c | 3 +
net/ipv4/af_inet.c | 6 +-
net/ipv4/ip_tunnel_core.c | 2 +-
net/mac80211/mlme.c | 3 +-
net/netfilter/nft_compat.c | 17 +++-
net/netfilter/nft_ct.c | 3 +
net/netfilter/nft_set_pipapo.c | 108 ++++++++++-----------
net/netfilter/nft_set_pipapo.h | 18 +++-
net/netfilter/nft_set_pipapo_avx2.c | 17 ++--
net/rxrpc/conn_event.c | 8 ++
net/tipc/bearer.c | 6 ++
net/unix/garbage.c | 11 +++
sound/soc/amd/acp-config.c | 15 +--
sound/usb/quirks.c | 38 +++++---
tools/testing/selftests/net/cmsg_ipv6.sh | 4 +-
tools/testing/selftests/net/pmtu.sh | 18 +++-
tools/testing/selftests/net/udpgro_fwd.sh | 14 ++-
tools/testing/selftests/net/udpgso_bench_rx.c | 2 +-
68 files changed, 517 insertions(+), 240 deletions(-)
I have a Framework 13 with a 7840U and started having massive GPU
driver issues a few weeks ago (including system freezes).
Unfortunately the information of when exactly this started to happen
is gone, but It should be somewhere in between 6.6.0 and 6.7.4.
I got many different and random dmesg-errors and system behaviors, but
I currently can only reproduce one, so let's focus on that for now.
First some basic info:
I'm on Arch Linux using the `linux` kernel package.(currently at 6.7.4).
I have an external monitor connected via a thinkpad thunderbolt 4 dock.
I am using amdgpu.sg_display=0 and VRAM sharing is configured to
UMA_GAME_OPTIMIZED in the firmware settings.
If I start playing a youtube video in firefox with hardware
acceleration enabled, it stutters until it stops playing after a few
seconds. I can see this in the kernel log. I see this multiple times
for many different addresses.
[ 5641.070540] amdgpu 0000:c1:00.0: amdgpu: [mmhub] page fault
(src_id:0 ring:40 vmid:1 pasid:32786, for process RDD Process pid 3680
thread firefox-bi:cs0 pid 3852)
[ 5641.070549] amdgpu 0000:c1:00.0: amdgpu: in page starting at
address 0x0000000000020000 from client 18
[ 5641.070553] amdgpu 0000:c1:00.0: amdgpu:
MMVM_L2_PROTECTION_FAULT_STATUS:0x00143A51
[ 5641.070556] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client
ID: unknown (0x1d)
[ 5641.070559] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x1
[ 5641.070561] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[ 5641.070563] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 5641.070565] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 5641.070567] amdgpu 0000:c1:00.0: amdgpu: RW: 0x1
Thanks
Michael
There is a corner case here where start/end is after/before the block
range we are currently checking. If so we need to be sure that splitting
the block will eventually give use the block size we need. To do that we
should adjust the block range to account for the start/end, and only
continue with the split if the size/alignment will fit the requested
size. Not doing so can result in leaving split blocks unmerged when it
eventually fails.
Fixes: afea229fe102 ("drm: improve drm_buddy_alloc function")
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam(a)amd.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: <stable(a)vger.kernel.org> # v5.18+
---
drivers/gpu/drm/drm_buddy.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index c1a99bf4dffd..d09540d4065b 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -332,6 +332,7 @@ alloc_range_bias(struct drm_buddy *mm,
u64 start, u64 end,
unsigned int order)
{
+ u64 req_size = mm->chunk_size << order;
struct drm_buddy_block *block;
struct drm_buddy_block *buddy;
LIST_HEAD(dfs);
@@ -367,6 +368,15 @@ alloc_range_bias(struct drm_buddy *mm,
if (drm_buddy_block_is_allocated(block))
continue;
+ if (block_start < start || block_end > end) {
+ u64 adjusted_start = max(block_start, start);
+ u64 adjusted_end = min(block_end, end);
+
+ if (round_down(adjusted_end + 1, req_size) <=
+ round_up(adjusted_start, req_size))
+ continue;
+ }
+
if (contains(start, end, block_start, block_end) &&
order == drm_buddy_block_order(block)) {
/*
--
2.43.0