There are two major types of uncorrected error (UC) :
- Action Required: The error is detected and the processor already consumes the
memory. OS requires to take action (for example, offline failure page/kill
failure thread) to recover this uncorrectable error.
- Action Optional: The error is detected out of processor execution context.
Some data in the memory are corrupted. But the data have not been consumed.
OS is optional to take action to recover this uncorrectable error.
For X86 platforms, we can easily distinguish between these two types
based on the MCA Bank. While for arm64 platform, the memory failure
flags for all UCs which severity are GHES_SEV_RECOVERABLE are set as 0,
a.k.a, Action Optional now.
If UC is detected by a background scrubber, it is obviously an Action
Optional error. For other errors, we should conservatively regard them
as Action Required.
cper_sec_mem_err::error_type identifies the type of error that occurred
if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
flags as MF_ACTION_REQUIRED.
Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
---
drivers/acpi/apei/ghes.c | 10 ++++++++--
include/linux/cper.h | 3 +++
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
if (sec_sev == GHES_SEV_CORRECTED &&
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
- if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+ if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+ flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+ 0 :
+ MF_ACTION_REQUIRED;
+ else
+ flags = MF_ACTION_REQUIRED;
+ }
if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@ enum {
#define CPER_MEM_VALID_BANK_ADDRESS 0x100000
#define CPER_MEM_VALID_CHIP_ID 0x200000
+#define CPER_MEM_SCRUB_CE 13
+#define CPER_MEM_SCRUB_UC 14
+
#define CPER_MEM_EXT_ROW_MASK 0x3
#define CPER_MEM_EXT_ROW_SHIFT 16
--
2.20.1.9.gb50a0d7
Add the necessary definitions to the qcom-cpufreq-nvmem driver to
support basic cpufreq scaling on the Qualcomm MSM8909 SoC. In practice
the necessary power domains vary depending on the actual PMIC the SoC
was combined with. With PM8909 the VDD_APC power domain is shared with
VDD_CX so the RPM firmware handles all voltage adjustments, while with
PM8916 and PM660 Linux is responsible to do adaptive voltage scaling
of a dedicated CPU regulator using CPR.
Signed-off-by: Stephan Gerhold <stephan.gerhold(a)kernkonzept.com>
---
Stephan Gerhold (4):
cpufreq: qcom-nvmem: Enable virtual power domain devices
cpufreq: dt: platdev: Add MSM8909 to blocklist
dt-bindings: cpufreq: qcom-nvmem: Document MSM8909
cpufreq: qcom-nvmem: Add MSM8909
.../bindings/cpufreq/qcom-cpufreq-nvmem.yaml | 1 +
drivers/cpufreq/cpufreq-dt-platdev.c | 1 +
drivers/cpufreq/qcom-cpufreq-nvmem.c | 47 +++++++++++++++++++++-
3 files changed, 48 insertions(+), 1 deletion(-)
---
base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
change-id: 20230906-msm8909-cpufreq-dff238de9ff3
Best regards,
--
Stephan Gerhold <stephan.gerhold(a)kernkonzept.com>
Kernkonzept GmbH at Dresden, Germany, HRB 31129, CEO Dr.-Ing. Michael Hohmuth
Hi,
I notice a stable-specific regression on Bugzilla [1]. Quoting from it:
> The backport commit to 5.15 branch:
> 9d4f84a15f9c9727bc07f59d9dafc89e65aadb34 "arm64: dts: imx8mp: Add snps,gfladj-refclk-lpm-sel quirk to USB nodes" (from upstream commit 5c3d5ecf48ab06c709c012bf1e8f0c91e1fcd7ad)
> switched from "snps,dis-u2-freeclk-exists-quirk" to "snps,gfladj-refclk-lpm-sel-quirk".
>
> The problem is that the gfladj-refclk-lpm-sel-quirk quirk is not implemented / backported to 5.15 branch.
>
> This commit should be either reverted, or the commit introducing gfladj-refclk-lpm-sel-quirk needs to be merged to 5.15 kernel branch.
>
> As a result of this patch, on Gateworks Venice GW7400 revB board the USB 3.x devices which are connected to the USB Type C port does not enumerate and the following errors are generated:
>
> [ 14.906302] xhci-hcd xhci-hcd.0.auto: Timeout while waiting for setup device command
> [ 15.122383] usb 2-1: device not accepting address 2, error -62
> [ 25.282195] xhci-hcd xhci-hcd.0.auto: Abort failed to stop command ring: -110
> [ 25.297408] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
> [ 25.305345] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
> [ 25.311058] xhci-hcd xhci-hcd.0.auto: Timeout while waiting for stop endpoint command
> [ 25.334361] usb usb2-port1: couldn't allocate usb_device
>
> When the commit is reverted the USB 3.x drives works fine.
See Bugzilla for the full thread and attach dmesgs.
Anyway, I'm adding it to regzbot:
#regzbot introduced: 9d4f84a15f9c97 https://bugzilla.kernel.org/show_bug.cgi?id=217670
#regzbot title: regression in USB DWC3 driver due to missing gfladj-refclk-lpm-sel-quirk quirk
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217670
--
An old man doll... just what I always wanted! - Clara
Hi!
The Tegra20 requires an enabled VDE power domain during startup. As the
VDE is currently not used, it is disabled during runtime.
Since 8f0c714ad9be, there is a workaround for the "normal restart path"
which enables the VDE before doing PMC's warm reboot. This workaround is
not executed in the "emergency restart path", leading to a hang-up
during start.
This series implements and registers a new pmic-based restart handler
for boards with tps6586x. This cold reboot ensures that the VDE power
domain is enabled during startup on tegra20-based boards.
Since bae1d3a05a8b, i2c transfers are non-atomic while preemption is
disabled (which is e.g. done during panic()). This could lead to
warnings ("Voluntary context switch within RCU") in i2c-based restart
handlers during emergency restart. The state of preemption should be
detected by i2c_in_atomic_xfer_mode() to use atomic i2c xfer when
required. Beside the new system_state check, the check is the same as
the one pre v5.2.
---
v7:
- 5/5: drop mode check (suggested by Dmitry)
- Link to v6: https://lore.kernel.org/r/20230327-tegra-pmic-reboot-v6-0-af44a4cd82e9@skid…
v6:
- drop 4/6 to abort restart on unexpected failure (suggested by Dmitry)
- 4,5: fix comments in handlers (suggested by Lee)
- 4,5: same delay for both handlers (suggested by Lee)
v5:
- introduce new 3 & 4, therefore 3 -> 5, 4 -> 6
- 3: provide dev to sys_off handler, if it is known
- 4: return NOTIFY_DONE from sys_off_notify, to avoid skipping
- 5: drop Reviewed-by from Dmitry, add poweroff timeout
- 5,6: return notifier value instead of direct errno from handler
- 5,6: use new dev field instead of passing dev as cb_data
- 5,6: increase timeout values based on error observations
- 6: skip unsupported reboot modes in restart handler
---
Benjamin Bara (5):
kernel/reboot: emergency_restart: set correct system_state
i2c: core: run atomic i2c xfer when !preemptible
kernel/reboot: add device to sys_off_handler
mfd: tps6586x: use devm-based power off handler
mfd: tps6586x: register restart handler
drivers/i2c/i2c-core.h | 2 +-
drivers/mfd/tps6586x.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
include/linux/reboot.h | 3 +++
kernel/reboot.c | 4 ++++
4 files changed, 50 insertions(+), 9 deletions(-)
---
base-commit: 197b6b60ae7bc51dd0814953c562833143b292aa
change-id: 20230327-tegra-pmic-reboot-4175ff814a4b
Best regards,
--
Benjamin Bara <benjamin.bara(a)skidata.com>
cros_ec_sensors_push_data() reads `indio_dev->active_scan_mask` and
calls iio_push_to_buffers_with_timestamp() without making sure the
`indio_dev` stays in buffer mode. There is a race if `indio_dev` exits
buffer mode right before cros_ec_sensors_push_data() accesses them.
An use-after-free on `indio_dev->active_scan_mask` was observed. The
call trace:
[...]
_find_next_bit
cros_ec_sensors_push_data
cros_ec_sensorhub_event
blocking_notifier_call_chain
cros_ec_irq_thread
It was caused by a race condition: one thread just freed
`active_scan_mask` at [1]; while another thread tried to access the
memory at [2].
Fix it by calling iio_device_claim_buffer_mode() to ensure the
`indio_dev` can't exit buffer mode during cros_ec_sensors_push_data().
[1]: https://elixir.bootlin.com/linux/v6.5/source/drivers/iio/industrialio-buffe…
[2]: https://elixir.bootlin.com/linux/v6.5/source/drivers/iio/common/cros_ec_sen…
Cc: stable(a)vger.kernel.org
Fixes: aa984f1ba4a4 ("iio: cros_ec: Register to cros_ec_sensorhub when EC supports FIFO")
Signed-off-by: Tzung-Bi Shih <tzungbi(a)kernel.org>
---
Changes from v1(https://patchwork.kernel.org/project/linux-iio/patch/20230828094339.1248…:
- Use iio_device_{claim|release}_buffer_mode() instead of accessing `mlock`.
drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c b/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
index b72d39fc2434..6bfe5d6847e7 100644
--- a/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
+++ b/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
@@ -190,8 +190,11 @@ int cros_ec_sensors_push_data(struct iio_dev *indio_dev,
/*
* Ignore samples if the buffer is not set: it is needed if the ODR is
* set but the buffer is not enabled yet.
+ *
+ * Note: iio_device_claim_buffer_mode() returns -EBUSY if the buffer
+ * is not enabled.
*/
- if (!iio_buffer_enabled(indio_dev))
+ if (iio_device_claim_buffer_mode(indio_dev) < 0)
return 0;
out = (s16 *)st->samples;
@@ -210,6 +213,7 @@ int cros_ec_sensors_push_data(struct iio_dev *indio_dev,
iio_push_to_buffers_with_timestamp(indio_dev, st->samples,
timestamp + delta);
+ iio_device_release_buffer_mode(indio_dev);
return 0;
}
EXPORT_SYMBOL_GPL(cros_ec_sensors_push_data);
--
2.42.0.rc2.253.gd59a3bf2b4-goog
Hi,
I notice a regression report on Bugzilla [1]. Quoting from it:
> Kernel 6.4.6 compiled from source worked AOK on my desktop with Intel Xeon cpu and Nvidia graphics - see below for system specs.
>
> Kernels 6.4.7 & 6.4.8 also compiled from source with identical configs hang with a frozen boot terminal screen after a significant way through the boot sequence (e.g. whilst running /etc/profile). The system may still be running as a sound is emitted when the power button is pressed (only way to escape from the system hang).
>
> The issue seems to be specific to the hardware of this desktop as the problem kernels do boot through to completion on other machines.
>
> A test was done with a different build (from Porteus) of kernel 6.5-RC4 and that did not hang - but kernel 6.4.7 from the same builder hung just like my build.
>
> I apologise that I cannot provide any detailed diagnostics - but I can put diagnostics into /etc/profile and provide screenshots if requested.
>
> Forum thread with more details and screenshots:
> https://forum.puppylinux.com/viewtopic.php?p=95733#p95733
>
> Computer Profile:
> Machine Dell Inc. Precision WorkStation T5400 (version: Not Specified)
> Mainboard Dell Inc. 0RW203 (version: NA)
> • BIOS Dell Inc. A11 | Date: 04/30/2012 | Type: Legacy
> • CPU Intel(R) Xeon(R) CPU E5450 @ 3.00GHz (4 cores)
> • RAM Total: 7955 MB | Used: 1555 MB (19.5%) | Actual Used: 775 MB (9.7%)
> Graphics Resolution: 1366x768 pixels | Display Server: X.Org 21.1.8
> • device-0 NVIDIA Corporation GT218 [NVS 300] [10de:10d8] (rev a2)
> Audio ALSA
> • device-0 Intel Corporation 631xESB/632xESB High Definition Audio Controller [8086:269a] (rev 09)
> • device-1 NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)
> Network wlan1
> • device-0 Ethernet: Broadcom Inc. and subsidiaries NetXtreme BCM5754 Gigabit Ethernet PCI Express [14e4:167a] (rev 02)
See Bugzilla for the full thread.
FYI, this is stable-specific regression since it doesn't appear on mainline.
Also, I have asked the reporter to also open the issue on gitlab.freedesktop.org
tracker (as it is the standard for DRM subsystem).
To the reporter (on To: list): It'd been great if you also have netconsole
output on your Bugzilla report, providing that you have another machine
connecting to your problematic one.
Anyway, I'm adding this regression to be tracked by regzbot:
#regzbot introduced: v6.4.6..v6.4.7 https://bugzilla.kernel.org/show_bug.cgi?id=217776
#regzbot link: https://forum.puppylinux.com/viewtopic.php?p=95733#p95733
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217776
--
An old man doll... just what I always wanted! - Clara