I've been experiencing a rather strange-looking bug on the P50 I've got
for work. After a number of reboots, nouveau will fail to properly
initialize the dedicated GPU on the system at boot. Things start off
with this disp mthd failure:
...
[ 2.088505] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand
[ 2.088516] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2)
[ 2.088620] nouveau 0000:01:00.0: disp: init completed in 329us
[ 2.088957] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002
the failure ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[ 2.151517] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 2.151517] [drm] Driver supports precise vblank timestamp query.
[ 2.151521] 0088 1 core507d_init
[ 2.151522] f0000000
After the error happens, parts of the card start timing out and
eventually the GR fails to hold its golden context and starts timing
out:
[ 10.163137] ------------[ cut here ]------------
[ 10.163169] nouveau 0000:01:00.0: timeout
[ 10.163218] WARNING: CPU: 4 PID: 98 at drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c:181 gf119_disp_core_fini+0xe6/0x140 [nouveau]
[ 10.163246] Modules linked in: joydev vfat fat intel_rapl iTCO_wdt x86_pkg_temp_thermal coretemp crc32_pclmul psmouse wmi_bmof i2c_i801 mei_me tpm_tis mei tpm_tis_core tpm thinkpad_acpi pcc_cpufreq ax88179_178a usbnet mii nouveau mxm_wmi i915 ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel serio_raw xhci_pci drm xhci_hcd i2c_core wmi video
[ 10.163330] CPU: 4 PID: 98 Comm: kworker/4:1 Kdump: loaded Not tainted 4.18.0-rc8Lyude-Test+ #7
[ 10.163349] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
[ 10.163370] Workqueue: pm pm_runtime_work
[ 10.163404] RIP: 0010:gf119_disp_core_fini+0xe6/0x140 [nouveau]
[ 10.163418] Code: 5e 41 5f 5d c3 49 8b 7c 24 10 48 8b 5f 50 48 85 db 74 5f e8 1c 5b 0f e1 48 89 da 48 c7 c7 b3 b2 4e a0 48 89 c6 e8 5c bf c8 e0 <0f> 0b 41 8b 47 50 85 c0 74 c6 49 8b 7c 24 78 48 81 c7 90 04 61 00
[ 10.163476] RSP: 0018:ffffc90000a83b00 EFLAGS: 00010286
[ 10.163489] RAX: 0000000000000000 RBX: ffff8808773c6bd0 RCX: 0000000000000006
[ 10.163506] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88089b515570
[ 10.163523] RBP: ffffc90000a83b28 R08: 0000000000000000 R09: 0000000000aaaaaa
[ 10.163539] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8808715b2c00
[ 10.163556] R13: ffff88087779d780 R14: 00000001e68f0200 R15: ffff88086f91b000
[ 10.163573] FS: 0000000000000000(0000) GS:ffff88089b500000(0000) knlGS:0000000000000000
[ 10.163591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.163605] CR2: 00007f3d7953d180 CR3: 000000000200a003 CR4: 00000000003606e0
[ 10.163622] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.163639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 10.163655] Call Trace:
[ 10.163686] nv50_disp_chan_fini+0x23/0x40 [nouveau]
[ 10.163711] nvkm_object_fini+0xbf/0x150 [nouveau]
[ 10.163735] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163759] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163783] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163807] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163840] nvkm_client_suspend+0x13/0x20 [nouveau]
[ 10.163864] nvif_client_suspend+0x1d/0x20 [nouveau]
[ 10.163898] nouveau_do_suspend+0x113/0x310 [nouveau]
[ 10.163931] nouveau_pmops_runtime_suspend+0x57/0xe0 [nouveau]
[ 10.163947] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163960] pci_pm_runtime_suspend+0x6b/0x180
[ 10.163972] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163985] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163997] __rpm_callback+0xcc/0x1e0
[ 10.164009] ? __switch_to_asm+0x40/0x70
[ 10.164020] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.164033] rpm_callback+0x24/0x80
[ 10.164043] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.164055] rpm_suspend+0x142/0x600
[ 10.164066] ? __switch_to_asm+0x40/0x70
[ 10.164100] pm_runtime_work+0x79/0x90
[ 10.164112] process_one_work+0x1b2/0x370
[ 10.164140] worker_thread+0x37/0x3a0
[ 10.164150] kthread+0x120/0x140
[ 10.164160] ? wq_update_unbound_numa+0x10/0x10
[ 10.164172] ? kthread_create_worker_on_cpu+0x70/0x70
[ 10.164186] ret_from_fork+0x35/0x40
[ 10.164196] ---[ end trace d5c556c207f0c26b ]---
You'll notice from those traces that the very first evo kick happens
/after/ the mthd failure on the display channel, not before.
Additionally, at no point in this part of the initialization process do
we actually call mthd 0000 from nouveau.
Upon closer inspection, I discovered that this mysterious phantom disp
failure seems to be the result of someone else (probably the VBIOS or
the BIOS of the P50) leaving the disp core channel enabled by the time
nouveau begins initializing it. This was confirmed by observing
that the 0x610490 register holds a value of 0x490a009b when the card is
in this broken state, as opposed to the usual 0x48070088 or 0x48000088
observed on most cards pre-init.
It appears we can fix this by checking for the unknown mask 0x000a0000,
and simply shutting down the channel like we normally would on suspend
or driver unload before we start trying to initialize it. This appears
to be close to what nouveau does for older cards, as a similar
workaround can be seen in nv50_disp_core_init().
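
As a rough illustration of the check described above, here is a minimal
userspace sketch of the mask test applied to the 0x610490 values quoted
earlier. The macro and function names are made up for this example and
are not part of nouveau; only the mask and the register values come from
this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the value is the unknown mask from this patch. */
#define CORE_CHAN_STUCK_MASK 0x000a0000u

/*
 * Returns true only when every bit of the mask is set in the value read
 * from 0x610490. Testing for "any bit set" would not be enough here:
 * the normal pre-init value 0x48070088 has one of the two mask bits set.
 */
bool core_chan_looks_stuck(uint32_t ctrl)
{
	return (ctrl & CORE_CHAN_STUCK_MASK) == CORE_CHAN_STUCK_MASK;
}
```

With the values above, 0x490a009b matches the mask, while 0x48070088
and 0x48000088 do not.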
Unfortunately, I'm still not entirely clear on what conditions actually
cause this problem to be reproduced. Everyone else I've talked to so far
with a P50 doesn't report ever having hit this issue. As well, I haven't
managed to find a clear reproducer for this besides rebooting the
machine until the bug happens, while alternating between booting while
docked and while on battery every so often.
This fixes most random initialization errors on my ThinkPad P50 with a
GM107 GPU.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: stable(a)vger.kernel.org
---
.../drm/nouveau/nvkm/engine/disp/coregf119.c | 21 +++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
index d162b9cf4eac..7534b5e9246f 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
@@ -166,8 +166,8 @@ gf119_disp_core_mthd = {
}
};
-void
-gf119_disp_core_fini(struct nv50_disp_chan *chan)
+static bool
+gf119_disp_core_deactivate(struct nv50_disp_chan *chan)
{
struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
struct nvkm_device *device = subdev->device;
@@ -181,7 +181,16 @@ gf119_disp_core_fini(struct nv50_disp_chan *chan)
) < 0) {
nvkm_error(subdev, "core fini: %08x\n",
nvkm_rd32(device, 0x610490));
+ return false;
}
+
+ return true;
+}
+
+void
+gf119_disp_core_fini(struct nv50_disp_chan *chan)
+{
+ gf119_disp_core_deactivate(chan);
}
static int
@@ -190,6 +199,14 @@ gf119_disp_core_init(struct nv50_disp_chan *chan)
struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
struct nvkm_device *device = subdev->device;
+ /* attempt to unstick the channel from some unknown state */
+ if ((nvkm_rd32(device, 0x610490) & 0x000a0000) == 0x000a0000 &&
+ WARN_ON(!gf119_disp_core_deactivate(chan))) {
+
+ nvkm_error(subdev, "core won't shut down, aborting\n");
+ return -EBUSY;
+ }
+
/* initialise channel for dma command submission */
nvkm_wr32(device, 0x610494, chan->push);
nvkm_wr32(device, 0x610498, 0x00010000);
--
2.17.1
Hi Greg, hi Thomas,
I noticed that the /sys/devices/system/cpu/smt directory is missing on
4.4.148 and 4.14.63 with the default settings.
I also tried the stable/master branch at 31130a16d459 ("Merge tag
'for-linus-4.19-rc1-tag' of
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip")
It's the same there.
When booting with the 'nosmt' kernel parameter, kernel 4.14.63 panics
during boot, while 4.4.148 boots fine.
The call trace seems to be IRQ-related. Is this a known bug?
Thanks
--
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH
acpi_gsb_i2c_write_bytes() returns i2c_transfer()'s return value, which
is the number of transfers executed on success, so 1.
The ACPI code expects us to store 0 in gsb->status for success, not 1.
Specifically this breaks the following code in the Thinkpad 8 DSDT:
ECWR = I2CW = ECWR /* \_SB_.I2C1.BAT0.ECWR */
If ((ECST == Zero))
{
ECRD = I2CR /* \_SB_.I2C1.I2CR */
}
Before this commit we set ECST to 1, causing the read to never happen,
which breaks battery monitoring on the Thinkpad 8.
This commit makes acpi_gsb_i2c_write_bytes() return 0 when i2c_transfer()
returns 1, meaning the single write transfer completed successfully, and
makes it return -EIO for other (unexpected) return values >= 0.
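
The return-value mapping described above can be sketched as a standalone
helper. This is only an illustration: gsb_write_status() is a
hypothetical name, and the real change lives inside
acpi_gsb_i2c_write_bytes() as shown in the diff.

```c
#include <errno.h>

/*
 * Map the return value of a single-message i2c_transfer() call to the
 * status the ACPI code expects: 0 on success, a negative errno otherwise.
 * xfer_ret is the number of messages transferred, or a negative errno.
 */
int gsb_write_status(int xfer_ret)
{
	if (xfer_ret < 0)
		return xfer_ret;	/* pass the transfer error through */

	/* exactly one message must have been transferred */
	return (xfer_ret == 1) ? 0 : -EIO;
}
```

So a return of 1 becomes 0, a negative errno is passed through
unchanged, and any other non-negative count is reported as -EIO.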
Cc: stable(a)vger.kernel.org
Signed-off-by: Hans de Goede <hdegoede(a)redhat.com>
---
Changes in v2:
-Modify the value which acpi_gsb_i2c_write_bytes() returns instead of
checking + modifying the return value in its caller
---
drivers/i2c/i2c-core-acpi.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/i2c/i2c-core-acpi.c b/drivers/i2c/i2c-core-acpi.c
index 7c3b4740b94b..b8f303dea305 100644
--- a/drivers/i2c/i2c-core-acpi.c
+++ b/drivers/i2c/i2c-core-acpi.c
@@ -482,11 +482,16 @@ static int acpi_gsb_i2c_write_bytes(struct i2c_client *client,
msgs[0].buf = buffer;
ret = i2c_transfer(client->adapter, msgs, ARRAY_SIZE(msgs));
- if (ret < 0)
- dev_err(&client->adapter->dev, "i2c write failed\n");
kfree(buffer);
- return ret;
+
+ if (ret < 0) {
+ dev_err(&client->adapter->dev, "i2c write failed: %d\n", ret);
+ return ret;
+ }
+
+ /* 1 transfer must have completed successfully */
+ return (ret == 1) ? 0 : -EIO;
}
static acpi_status
--
2.18.0
100 ms is not enough time for the LSPCON adapter on Intel NUC devices to
settle. This causes dropped display modes at boot or screen reconfiguration.
Empirical testing can reproduce the error with timeouts of up to 190 ms. Basic
boot and stress testing at 200 ms has not (yet) failed.
Increase timeout to 400 ms to get some margin of error.
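
To illustrate why the timeout value matters, here is a small simulation
of polling a slow-settling adapter. It is not the i915 wait_for() macro:
the struct, the function names and the 195 ms settle time are all
hypothetical, and time is simulated instead of actually sleeping.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an adapter that needs settle_ms to settle. */
struct fake_lspcon {
	unsigned int settle_ms;
};

/* True once the simulated adapter reports the expected mode. */
bool fake_lspcon_settled(const struct fake_lspcon *dev, unsigned int now_ms)
{
	return now_ms >= dev->settle_ms;
}

/*
 * Poll every step_ms of simulated time, up to and including timeout_ms.
 * Returns true if the adapter settled before the timeout expired.
 */
bool wait_for_settle(const struct fake_lspcon *dev,
		     unsigned int timeout_ms, unsigned int step_ms)
{
	unsigned int now_ms;

	if (step_ms == 0)
		step_ms = 1;	/* avoid an endless loop on a zero step */

	for (now_ms = 0; now_ms <= timeout_ms; now_ms += step_ms) {
		if (fake_lspcon_settled(dev, now_ms))
			return true;
	}
	return false;
}
```

With an assumed settle time of 195 ms, a 100 ms or 190 ms timeout gives
up too early, while 400 ms succeeds with margin to spare.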
Changes from v1:
The initial suggestion of 1000 ms was lowered due to concerns about delaying
valid timeout cases.
Update patch metadata.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107503
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1570392
Fixes: 357c0ae9198a ("drm/i915/lspcon: Wait for expected LSPCON mode to settle")
Cc: Shashank Sharma <shashank.sharma(a)intel.com>
Cc: Imre Deak <imre.deak(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v4.11+
Reviewed-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Signed-off-by: Fredrik Schön <fredrik.schon(a)gmail.com>
---
drivers/gpu/drm/i915/intel_lspcon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_lspcon.c b/drivers/gpu/drm/i915/intel_lspcon.c
index 8ae8f42f430a..6b6758419fb3 100644
--- a/drivers/gpu/drm/i915/intel_lspcon.c
+++ b/drivers/gpu/drm/i915/intel_lspcon.c
@@ -74,7 +74,7 @@ static enum drm_lspcon_mode lspcon_wait_mode(struct intel_lspcon *lspcon,
DRM_DEBUG_KMS("Waiting for LSPCON mode %s to settle\n",
lspcon_mode_name(mode));
- wait_for((current_mode = lspcon_get_current_mode(lspcon)) == mode, 100);
+ wait_for((current_mode = lspcon_get_current_mode(lspcon)) == mode, 400);
if (current_mode != mode)
DRM_ERROR("LSPCON mode hasn't settled\n");
--
2.17.1
I changed the way mac80211 updates the PM state of the peer.
I forgot that we could also have multicast frames from the
peer and that those frames should of course not change the
PM state of the peer: A peer goes to power save when it
needs to scan, but it won't send the broadcast Probe Request
with the PM bit set.
This made us mark the peer as awake when it wasn't and then
Intel's firmware would fail to transmit because the peer is
asleep according to its database. The driver warned about
this and it looked like this:
WARNING: CPU: 0 PID: 184 at /usr/src/linux-4.16.14/drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1369 iwl_mvm_rx_tx_cmd+0x53b/0x860
CPU: 0 PID: 184 Comm: irq/124-iwlwifi Not tainted 4.16.14 #1
RIP: 0010:iwl_mvm_rx_tx_cmd+0x53b/0x860
Call Trace:
iwl_pcie_rx_handle+0x220/0x880
iwl_pcie_irq_handler+0x6c9/0xa20
? irq_forced_thread_fn+0x60/0x60
? irq_thread_dtor+0x90/0x90
The relevant code that triggers the WARNING is:
case TX_STATUS_FAIL_DEST_PS:
/* the FW should have stopped the queue and not
* return this status
*/
WARN_ON(1);
info->flags |= IEEE80211_TX_STAT_TX_FILTERED;
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199967.
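
For readers unfamiliar with the multicast test used in the fix, here is
a minimal userspace sketch of the idea. addr_is_multicast() is a
hypothetical stand-in written for this example, not the kernel helper
itself; it relies on the fact that group (multicast and broadcast) MAC
addresses have the least significant bit of the first octet set.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true for group addresses (multicast or broadcast), i.e. when
 * the least significant bit of the first octet is set.
 */
bool addr_is_multicast(const uint8_t addr[6])
{
	return (addr[0] & 0x01) != 0;
}
```

A broadcast Probe Request is addressed to ff:ff:ff:ff:ff:ff, so a check
of this kind excludes it from the PM state update.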
Fixes: 9fef65443388 ("mac80211: always update the PM state of a peer on MGMT / DATA frames")
Cc: <stable(a)vger.kernel.org> #4.16+
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
---
net/mac80211/rx.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index a16ba56..3cf6027 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1728,6 +1728,7 @@ ieee80211_rx_h_sta_process(struct ieee80211_rx_data *rx)
*/
if (!ieee80211_hw_check(&sta->local->hw, AP_LINK_PS) &&
!ieee80211_has_morefrags(hdr->frame_control) &&
+ !is_multicast_ether_addr(hdr->addr1) &&
(ieee80211_is_mgmt(hdr->frame_control) ||
ieee80211_is_data(hdr->frame_control)) &&
!(status->rx_flags & IEEE80211_RX_DEFERRED_RELEASE) &&
--
2.7.4