On Tue, Mar 12, 2024 at 07:04:10AM -0400, Eric Hagberg wrote:
> On Thu, Mar 7, 2024 at 11:33 AM Steve Wahl <steve.wahl(a)hpe.com> wrote:
> > What Linux Distribution are you running on that machine? My guess
> > would be that this is not distro related; if you are running something
> > quite different from Pavin that would confirm this.
>
> Distro in use is Rocky 8, so it’s pretty clear not to be distro-specific.
>
> > I found an AMD based system to try to reproduce this on.
>
> yeah, it probably requires either a specific cpu or set or devices plus cpu
> to trigger… found that it also affects Dell R7625 servers in addition to
> the R6615s
I agree that it's likely the CPU or particular set of surrounding
devices that trigger the problem.
I have not succeeded in reproducing the problem yet. I tried an AMD
based system lent to me, but it's probably the wrong generation (AMD
EPYC 7251) and I didn't see the problem. I have a line on a system
that's more in line with the systems the bug was reported on that I
should be able to try tomorrow.
I would love to have some direction from the community at large on
this. The fact that nogbpages on the command line causes the same
problem without my patch suggests it's not bad code directly in my
patch, but something in the way kexec reacts to the resulting identity
map. One quick solution would be a kernel command line parameter to
select between the previous identity map creation behavior and the new
behavior. E.g. in addition to "nogbpages", we could have
"somegbpages" and "allgbpages" -- or gbpages=[all, some, none] with
nogbpages a synonym for backwards compatibility.
But I don't want to introduce a new command line parameter if the
actual problem can be understood and fixed. The question is how much
time do I have to persue a direct fix before some other action needs
to be taken?
Thanks,
--> Steve Wahl
--
Steve Wahl, Hewlett Packard Enterprise
Stuart Hayhurst has found that both at bootup and fullscreen VA-API video
is leading to black screens for around 1 second and kernel WARNING [1] traces
when calling dmub_psr_enable() with Parade 08-01 TCON.
These symptoms all go away with PSR-SU disabled for this TCON, so disable
it for now while DMUB traces [2] from the failure can be analyzed and the failure
state properly root caused.
Cc: stable(a)vger.kernel.org
Cc: Marc Rossi <Marc.Rossi(a)amd.com>
Cc: Hamza Mahfooz <Hamza.Mahfooz(a)amd.com>
Link: https://gitlab.freedesktop.org/drm/amd/uploads/a832dd515b571ee171b3e3b566e9… [1]
Link: https://gitlab.freedesktop.org/drm/amd/uploads/8f13ff3b00963c833e23e68aa811… [2]
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2645
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
---
drivers/gpu/drm/amd/display/modules/power/power_helpers.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
index e304e8435fb8..477289846a0a 100644
--- a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
+++ b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
@@ -841,6 +841,8 @@ bool is_psr_su_specific_panel(struct dc_link *link)
isPSRSUSupported = false;
else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && dpcd_caps->sink_dev_id_str[0] == 0x03)
isPSRSUSupported = false;
+ else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && dpcd_caps->sink_dev_id_str[0] == 0x01)
+ isPSRSUSupported = false;
else if (dpcd_caps->psr_info.force_psrsu_cap == 0x1)
isPSRSUSupported = true;
}
--
2.34.1
img_info->mhi_buf should be freed on error path in mhi_alloc_bhie_table().
This error case is rare but still needs to be fixed.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 3000f85b8f47 ("bus: mhi: core: Add support for basic PM operations")
Cc: stable(a)vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
v2: add missing Cc: stable, as Greg Kroah-Hartman's bot reported
drivers/bus/mhi/host/boot.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/bus/mhi/host/boot.c b/drivers/bus/mhi/host/boot.c
index edc0ec5a0933..738dcd11b66f 100644
--- a/drivers/bus/mhi/host/boot.c
+++ b/drivers/bus/mhi/host/boot.c
@@ -357,6 +357,7 @@ int mhi_alloc_bhie_table(struct mhi_controller *mhi_cntrl,
for (--i, --mhi_buf; i >= 0; i--, mhi_buf--)
dma_free_coherent(mhi_cntrl->cntrl_dev, mhi_buf->len,
mhi_buf->buf, mhi_buf->dma_addr);
+ kfree(img_info->mhi_buf);
error_alloc_mhi_buf:
kfree(img_info);
--
2.39.2
I upgraded from kernel 6.1.94 to 6.1.99 on one of my machines and noticed that
the dmesg line "Incomplete global flushes, disabling PCID" had disappeared from
the log.
That message comes from commit c26b9e193172f48cd0ccc64285337106fb8aa804, which
disables PCID support on some broken hardware in arch/x86/mm/init.c:
#define INTEL_MATCH(_model) { .vendor = X86_VENDOR_INTEL, \
.family = 6, \
.model = _model, \
}
/*
* INVLPG may not properly flush Global entries
* on these CPUs when PCIDs are enabled.
*/
static const struct x86_cpu_id invlpg_miss_ids[] = {
INTEL_MATCH(INTEL_FAM6_ALDERLAKE ),
INTEL_MATCH(INTEL_FAM6_ALDERLAKE_L ),
INTEL_MATCH(INTEL_FAM6_ALDERLAKE_N ),
INTEL_MATCH(INTEL_FAM6_RAPTORLAKE ),
INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_P),
INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_S),
{}
...
if (x86_match_cpu(invlpg_miss_ids)) {
pr_info("Incomplete global flushes, disabling PCID");
setup_clear_cpu_cap(X86_FEATURE_PCID);
return;
}
arch/x86/mm/init.c, which has that code, hasn't changed in 6.1.94 -> 6.1.99.
However I found a commit changing how x86_match_cpu() behaves in 6.1.96:
commit 8ab1361b2eae44077fef4adea16228d44ffb860c
Author: Tony Luck <tony.luck(a)intel.com>
Date: Mon May 20 15:45:33 2024 -0700
x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL
I suspect this broke the PCID disabling code in arch/x86/mm/init.c.
The commit message says:
"Add a new flags field to struct x86_cpu_id that has a bit set to indicate that
this entry in the array is valid. Update X86_MATCH*() macros to set that bit.
Change the end-marker check in x86_match_cpu() to just check the flags field
for this bit."
But the PCID disabling code in 6.1.99 does not make use of the
X86_MATCH*() macros; instead, it defines a new INTEL_MATCH() macro without the
X86_CPU_ID_FLAG_ENTRY_VALID flag.
I looked in upstream git and found an existing fix:
commit 2eda374e883ad297bd9fe575a16c1dc850346075
Author: Tony Luck <tony.luck(a)intel.com>
Date: Wed Apr 24 11:15:18 2024 -0700
x86/mm: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.
[ dhansen: vertically align 0's in invlpg_miss_ids[] ]
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Link: https://lore.kernel.org/all/20240424181518.41946-1-tony.luck%40intel.com
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 679893ea5e68..6b43b6480354 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -261,21 +261,17 @@ static void __init probe_page_size_mask(void)
}
}
-#define INTEL_MATCH(_model) { .vendor = X86_VENDOR_INTEL, \
- .family = 6, \
- .model = _model, \
- }
/*
* INVLPG may not properly flush Global entries
* on these CPUs when PCIDs are enabled.
*/
static const struct x86_cpu_id invlpg_miss_ids[] = {
- INTEL_MATCH(INTEL_FAM6_ALDERLAKE ),
- INTEL_MATCH(INTEL_FAM6_ALDERLAKE_L ),
- INTEL_MATCH(INTEL_FAM6_ATOM_GRACEMONT ),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE ),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_P),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_S),
+ X86_MATCH_VFM(INTEL_ALDERLAKE, 0),
+ X86_MATCH_VFM(INTEL_ALDERLAKE_L, 0),
+ X86_MATCH_VFM(INTEL_ATOM_GRACEMONT, 0),
+ X86_MATCH_VFM(INTEL_RAPTORLAKE, 0),
+ X86_MATCH_VFM(INTEL_RAPTORLAKE_P, 0),
+ X86_MATCH_VFM(INTEL_RAPTORLAKE_S, 0),
{}
};
The fix removed the custom INTEL_MATCH macro and uses the X86_MATCH*() macros
with X86_CPU_ID_FLAG_ENTRY_VALID. This fixed commit was never backported to 6.1,
so it looks like a stable series regression due to a missing backport.
If I apply the fix patch on 6.1.99, the PCID disabling code activates again.
I had to change all the INTEL_* definitions to the old definitions to make it
build:
static const struct x86_cpu_id invlpg_miss_ids[] = {
- INTEL_MATCH(INTEL_FAM6_ALDERLAKE ),
- INTEL_MATCH(INTEL_FAM6_ALDERLAKE_L ),
- INTEL_MATCH(INTEL_FAM6_ALDERLAKE_N ),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE ),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_P),
- INTEL_MATCH(INTEL_FAM6_RAPTORLAKE_S),
+ X86_MATCH_VFM(INTEL_FAM6_ALDERLAKE, 0),
+ X86_MATCH_VFM(INTEL_FAM6_ALDERLAKE_L, 0),
+ X86_MATCH_VFM(INTEL_FAM6_ALDERLAKE_N, 0),
+ X86_MATCH_VFM(INTEL_FAM6_RAPTORLAKE, 0),
+ X86_MATCH_VFM(INTEL_FAM6_RAPTORLAKE_P, 0),
+ X86_MATCH_VFM(INTEL_FAM6_RAPTORLAKE_S, 0),
{}
};
I only looked at the code in arch/x86/mm/init.c, so there may be other uses of
x86_match_cpu() in the kernel that are also broken in 6.1.99.
This email is meant as a bug report, not a pull request. Someone else should
confirm the problem and submit the appropriate fix.
Otherwise when the tracer changes syscall number to -1, the kernel fails
to initialize a0 with -ENOSYS and subsequently fails to return the error
code of the failed syscall to userspace. For example, it will break
strace syscall tampering.
Fixes: 52449c17bdd1 ("riscv: entry: set a0 = -ENOSYS only when syscall != -1")
Reported-by: "Dmitry V. Levin" <ldv(a)strace.io>
Reviewed-by: Björn Töpel <bjorn(a)rivosinc.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Celeste Liu <CoelacanthusHex(a)gmail.com>
---
arch/riscv/kernel/traps.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 05a16b1f0aee..51ebfd23e007 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
regs->epc += 4;
regs->orig_a0 = regs->a0;
+ regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
@@ -328,8 +329,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
if (syscall >= 0 && syscall < NR_syscalls)
syscall_handler(regs, syscall);
- else if (syscall != -1)
- regs->a0 = -ENOSYS;
+
/*
* Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
* so the maximum stack offset is 1k bytes (10 bits).
--
2.45.2
This reverts commit ad6bcdad2b6724e113f191a12f859a9e8456b26d. I had
nak'd it, and Greg said on the thread that it links that he wasn't going
to take it either, especially since it's not his code or his tree, but
then, seemingly accidentally, it got pushed up some months later, in
what looks like a mistake, with no further discussion in the linked
thread. So revert it, since it's clearly not intended.
Fixes: ad6bcdad2b67 ("vmgenid: emit uevent when VMGENID updates")
Cc: stable(a)vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Link: https://lore.kernel.org/r/20230531095119.11202-2-bchalios@amazon.es
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
drivers/virt/vmgenid.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
index b67a28da4702..a1c467a0e9f7 100644
--- a/drivers/virt/vmgenid.c
+++ b/drivers/virt/vmgenid.c
@@ -68,7 +68,6 @@ static int vmgenid_add(struct acpi_device *device)
static void vmgenid_notify(struct acpi_device *device, u32 event)
{
struct vmgenid_state *state = acpi_driver_data(device);
- char *envp[] = { "NEW_VMGENID=1", NULL };
u8 old_id[VMGENID_SIZE];
memcpy(old_id, state->this_id, sizeof(old_id));
@@ -76,7 +75,6 @@ static void vmgenid_notify(struct acpi_device *device, u32 event)
if (!memcmp(old_id, state->this_id, sizeof(old_id)))
return;
add_vmfork_randomness(state->this_id, sizeof(state->this_id));
- kobject_uevent_env(&device->dev.kobj, KOBJ_CHANGE, envp);
}
static const struct acpi_device_id vmgenid_ids[] = {
--
2.44.0
Call work_on_cpu(cpu, fn, arg) in pci_call_probe() while the argument
@cpu is a offline cpu would cause system stuck forever.
This can be happen if a node is online while all its CPUs are
offline (We can use "maxcpus=1" without "nr_cpus=1" to reproduce it).
So, in the above case, let pci_call_probe() call local_pci_probe()
instead of work_on_cpu() when the best selected cpu is offline.
Fixes: 69a18b18699b ("PCI: Restrict probe functions to housekeeping CPUs")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
Signed-off-by: Hongchen Zhang <zhanghongchen(a)loongson.cn>
---
v2 -> v3: Modify commit message according to Markus's suggestion
v1 -> v2: Add a method to reproduce the problem
---
drivers/pci/pci-driver.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index af2996d0d17f..32a99828e6a3 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -386,7 +386,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
free_cpumask_var(wq_domain_mask);
}
- if (cpu < nr_cpu_ids)
+ if ((cpu < nr_cpu_ids) && cpu_online(cpu))
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);
--
2.33.0
It is possible that the usb power_supply is registered after the probe
of dwc3. In this case, trying to get the usb power_supply during the
probe will fail and there is no chance to try again. Also the usb
power_supply might be unregistered at anytime so that the handle of it
in dwc3 would become invalid. To fix this, get the handle right before
calling to power_supply functions and put it afterward.
Fixes: 6f0764b5adea ("usb: dwc3: add a power supply for current control")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kyle Tso <kyletso(a)google.com>
---
drivers/usb/dwc3/core.c | 25 +++++--------------------
drivers/usb/dwc3/core.h | 4 ++--
drivers/usb/dwc3/gadget.c | 19 ++++++++++++++-----
3 files changed, 21 insertions(+), 27 deletions(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 734de2a8bd21..ab563edd9b4c 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1631,8 +1631,6 @@ static void dwc3_get_properties(struct dwc3 *dwc)
u8 tx_thr_num_pkt_prd = 0;
u8 tx_max_burst_prd = 0;
u8 tx_fifo_resize_max_num;
- const char *usb_psy_name;
- int ret;
/* default to highest possible threshold */
lpm_nyet_threshold = 0xf;
@@ -1667,12 +1665,7 @@ static void dwc3_get_properties(struct dwc3 *dwc)
dwc->sys_wakeup = device_may_wakeup(dwc->sysdev);
- ret = device_property_read_string(dev, "usb-psy-name", &usb_psy_name);
- if (ret >= 0) {
- dwc->usb_psy = power_supply_get_by_name(usb_psy_name);
- if (!dwc->usb_psy)
- dev_err(dev, "couldn't get usb power supply\n");
- }
+ device_property_read_string(dev, "usb-psy-name", &dwc->usb_psy_name);
dwc->has_lpm_erratum = device_property_read_bool(dev,
"snps,has-lpm-erratum");
@@ -2133,18 +2126,16 @@ static int dwc3_probe(struct platform_device *pdev)
dwc3_get_software_properties(dwc);
dwc->reset = devm_reset_control_array_get_optional_shared(dev);
- if (IS_ERR(dwc->reset)) {
- ret = PTR_ERR(dwc->reset);
- goto err_put_psy;
- }
+ if (IS_ERR(dwc->reset))
+ return PTR_ERR(dwc->reset);
ret = dwc3_get_clocks(dwc);
if (ret)
- goto err_put_psy;
+ return ret;
ret = reset_control_deassert(dwc->reset);
if (ret)
- goto err_put_psy;
+ return ret;
ret = dwc3_clk_enable(dwc);
if (ret)
@@ -2245,9 +2236,6 @@ static int dwc3_probe(struct platform_device *pdev)
dwc3_clk_disable(dwc);
err_assert_reset:
reset_control_assert(dwc->reset);
-err_put_psy:
- if (dwc->usb_psy)
- power_supply_put(dwc->usb_psy);
return ret;
}
@@ -2276,9 +2264,6 @@ static void dwc3_remove(struct platform_device *pdev)
pm_runtime_set_suspended(&pdev->dev);
dwc3_free_event_buffers(dwc);
-
- if (dwc->usb_psy)
- power_supply_put(dwc->usb_psy);
}
#ifdef CONFIG_PM
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index 1e561fd8b86e..ecfe2cc224f7 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -1045,7 +1045,7 @@ struct dwc3_scratchpad_array {
* @role_sw: usb_role_switch handle
* @role_switch_default_mode: default operation mode of controller while
* usb role is USB_ROLE_NONE.
- * @usb_psy: pointer to power supply interface.
+ * @usb_psy_name: name of the usb power supply interface.
* @usb2_phy: pointer to USB2 PHY
* @usb3_phy: pointer to USB3 PHY
* @usb2_generic_phy: pointer to array of USB2 PHYs
@@ -1223,7 +1223,7 @@ struct dwc3 {
struct usb_role_switch *role_sw;
enum usb_dr_mode role_switch_default_mode;
- struct power_supply *usb_psy;
+ const char *usb_psy_name;
u32 fladj;
u32 ref_clk_per;
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..c89b5b5a64cf 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -3049,20 +3049,29 @@ static void dwc3_gadget_set_ssp_rate(struct usb_gadget *g,
static int dwc3_gadget_vbus_draw(struct usb_gadget *g, unsigned int mA)
{
- struct dwc3 *dwc = gadget_to_dwc(g);
+ struct dwc3 *dwc = gadget_to_dwc(g);
+ struct power_supply *usb_psy;
union power_supply_propval val = {0};
int ret;
if (dwc->usb2_phy)
return usb_phy_set_power(dwc->usb2_phy, mA);
- if (!dwc->usb_psy)
+ if (!dwc->usb_psy_name)
return -EOPNOTSUPP;
- val.intval = 1000 * mA;
- ret = power_supply_set_property(dwc->usb_psy, POWER_SUPPLY_PROP_INPUT_CURRENT_LIMIT, &val);
+ usb_psy = power_supply_get_by_name(dwc->usb_psy_name);
+ if (usb_psy) {
+ val.intval = 1000 * mA;
+ ret = power_supply_set_property(usb_psy, POWER_SUPPLY_PROP_INPUT_CURRENT_LIMIT,
+ &val);
+ power_supply_put(usb_psy);
+ return ret;
+ }
- return ret;
+ dev_err(dwc->dev, "couldn't get usb power supply\n");
+
+ return -EOPNOTSUPP;
}
/**
--
2.45.2.993.g49e7a77208-goog