The orig_a0 is missing in struct user_regs_struct of riscv, and there is
no way to add it without breaking UAPI. (See Link tag below)
Like NT_ARM_SYSTEM_CALL do, we add a new regset name NT_RISCV_ORIG_A0 to
access original a0 register from userspace via ptrace API.
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Signed-off-by: Celeste Liu <uwu(a)coelacanthus.name>
---
Changes in v6:
- Fix obsolute comment.
- Copy include/linux/stddef.h to tools/include to use offsetofend in
selftests.
- Link to v5: https://lore.kernel.org/r/20250115-riscv-new-regset-v5-0-d0e6ec031a23@coela…
Changes in v5:
- Fix wrong usage in selftests.
- Link to v4: https://lore.kernel.org/r/20241226-riscv-new-regset-v4-0-4496a29d0436@coela…
Changes in v4:
- Fix a copy paste error in selftest. (Forget to commit...)
- Link to v3: https://lore.kernel.org/r/20241226-riscv-new-regset-v3-0-f5b96465826b@coela…
Changes in v3:
- Use return 0 directly for readability.
- Fix test for modify a0.
- Add Fixes: tag
- Remove useless Cc: stable.
- Selftest will check both a0 and orig_a0, but depends on the
correctness of PTRACE_GET_SYSCALL_INFO.
- Link to v2: https://lore.kernel.org/r/20241203-riscv-new-regset-v2-0-d37da8c0cba6@coela…
Changes in v2:
- Fix integer width.
- Add selftest.
- Link to v1: https://lore.kernel.org/r/20241201-riscv-new-regset-v1-1-c83c58abcc7b@coela…
---
Celeste Liu (3):
riscv/ptrace: add new regset to access original a0 register
tools: copy include/linux/stddef.h to tools/include
riscv: selftests: Add a ptrace test to verify a0 and orig_a0 access
arch/riscv/kernel/ptrace.c | 32 +++++
include/uapi/linux/elf.h | 1 +
tools/include/linux/stddef.h | 85 ++++++++++++
tools/include/uapi/linux/stddef.h | 6 +-
tools/testing/selftests/riscv/abi/.gitignore | 1 +
tools/testing/selftests/riscv/abi/Makefile | 6 +-
tools/testing/selftests/riscv/abi/ptrace.c | 193 +++++++++++++++++++++++++++
7 files changed, 319 insertions(+), 5 deletions(-)
---
base-commit: 0e287d31b62bb53ad81d5e59778384a40f8b6f56
change-id: 20241201-riscv-new-regset-d529b952ad0d
Best regards,
--
Celeste Liu <uwu(a)coelacanthus.name>
The quilt patch titled
Subject: mm/vmscan: fix pgdemote_* accounting with lru_gen_enabled
has been removed from the -mm tree. Its filename was
mm-vmscan-fix-pgdemote_-accounting-with-lru_gen_enabled.patch
This patch was dropped because it is obsolete
------------------------------------------------------
From: Li Zhijian <lizhijian(a)fujitsu.com>
Subject: mm/vmscan: fix pgdemote_* accounting with lru_gen_enabled
Date: Fri, 10 Jan 2025 20:21:33 +0800
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the accounting of PGDEMOTE_* statistics to
shrink_inactive_list(). However, shrink_inactive_list() is not called
when lrugen_enabled is true, leading to incorrect demotion statistics
despite actual demotion events occurring.
Add the PGDEMOTE_* accounting in evict_folios(), ensuring that demotion
statistics are correctly updated regardless of the lru_gen_enabled state.
This fix is crucial for systems that rely on accurate NUMA balancing
metrics for performance tuning and resource management.
Link: https://lkml.kernel.org/r/20250110122133.423481-2-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Cc: Kaiyang Zhao <kaiyang2(a)cs.cmu.edu>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/vmscan.c~mm-vmscan-fix-pgdemote_-accounting-with-lru_gen_enabled
+++ a/mm/vmscan.c
@@ -4650,6 +4650,8 @@ retry:
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
stat.nr_demoted);
+ __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+ stat.nr_demoted);
item = PGSTEAL_KSWAPD + reclaimer_offset();
if (!cgroup_reclaim(sc))
__count_vm_events(item, reclaimed);
_
Patches currently in -mm which might be from lizhijian(a)fujitsu.com are
mm-vmscan-accumulate-nr_demoted-for-accurate-demotion-statistics.patch
mm-vmscan-accumulate-nr_demoted-for-accurate-demotion-statistics-v2.patch
When porting librseq commit:
commit c7b45750fa85 ("Adapt to glibc __rseq_size feature detection")
from librseq to the kernel selftests, the following line was missed
at the end of rseq_init():
rseq_size = get_rseq_kernel_feature_size();
which effectively leaves rseq_size initialized to -1U when glibc does not
have rseq support. glibc supports rseq from version 2.35 onwards.
In a following librseq commit
commit c67d198627c2 ("Only set 'rseq_size' on first thread registration")
to mimic the libc behavior, a new approach is taken: don't set the
feature size in 'rseq_size' until at least one thread has successfully
registered. This allows using 'rseq_size' in fast-paths to test for both
registration status and available features. The caveat is that on libc
either all threads are registered or none are, while with bare librseq
it is the responsability of the user to register all threads using rseq.
This combines the changes from the following librseq commits:
commit c7b45750fa85 ("Adapt to glibc __rseq_size feature detection")
commit c67d198627c2 ("Only set 'rseq_size' on first thread registration")
Fixes: 73a4f5a704a2 ("selftests/rseq: Fix mm_cid test failure")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Raghavendra Rao Ananta <rananta(a)google.com>
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Boqun Feng <boqun.feng(a)gmail.com>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Carlos O'Donell <carlos(a)redhat.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Michael Jeanson <mjeanson(a)efficios.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
---
tools/testing/selftests/rseq/rseq.c | 32 ++++++++++++++++++++++-------
tools/testing/selftests/rseq/rseq.h | 9 +++++++-
2 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 5b9772cdf265..f6156790c3b4 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -61,7 +61,6 @@ unsigned int rseq_size = -1U;
unsigned int rseq_flags;
static int rseq_ownership;
-static int rseq_reg_success; /* At least one rseq registration has succeded. */
/* Allocate a large area for the TLS. */
#define RSEQ_THREAD_AREA_ALLOC_SIZE 1024
@@ -152,14 +151,27 @@ int rseq_register_current_thread(void)
}
rc = sys_rseq(&__rseq_abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG);
if (rc) {
- if (RSEQ_READ_ONCE(rseq_reg_success)) {
+ /*
+ * After at least one thread has registered successfully
+ * (rseq_size > 0), the registration of other threads should
+ * never fail.
+ */
+ if (RSEQ_READ_ONCE(rseq_size) > 0) {
/* Incoherent success/failure within process. */
abort();
}
return -1;
}
assert(rseq_current_cpu_raw() >= 0);
- RSEQ_WRITE_ONCE(rseq_reg_success, 1);
+
+ /*
+ * The first thread to register sets the rseq_size to mimic the libc
+ * behavior.
+ */
+ if (RSEQ_READ_ONCE(rseq_size) == 0) {
+ RSEQ_WRITE_ONCE(rseq_size, get_rseq_kernel_feature_size());
+ }
+
return 0;
}
@@ -235,12 +247,18 @@ void rseq_init(void)
return;
}
rseq_ownership = 1;
- if (!rseq_available()) {
- rseq_size = 0;
- return;
- }
+
+ /* Calculate the offset of the rseq area from the thread pointer. */
rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
+
+ /* rseq flags are deprecated, always set to 0. */
rseq_flags = 0;
+
+ /*
+ * Set the size to 0 until at least one thread registers to mimic the
+ * libc behavior.
+ */
+ rseq_size = 0;
}
static __attribute__((destructor))
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 4e217b620e0c..062d10925a10 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -60,7 +60,14 @@
extern ptrdiff_t rseq_offset;
/*
- * Size of the registered rseq area. 0 if the registration was
+ * The rseq ABI is composed of extensible feature fields. The extensions
+ * are done by appending additional fields at the end of the structure.
+ * The rseq_size defines the size of the active feature set which can be
+ * used by the application for the current rseq registration. Features
+ * starting at offset >= rseq_size are inactive and should not be used.
+ *
+ * The rseq_size is the intersection between the available allocation
+ * size for the rseq area and the feature size supported by the kernel.
* unsuccessful.
*/
extern unsigned int rseq_size;
--
2.39.5
On removal of the device or unloading of the kernel module a potential
NULL pointer dereference occurs.
The following sequence deletes the interface:
brcmf_detach()
brcmf_remove_interface()
brcmf_del_if()
Inside the brcmf_del_if() function the drvr->if2bss[ifidx] is updated to
BRCMF_BSSIDX_INVALID (-1) if the bsscfgidx matches.
After brcmf_remove_interface() call the brcmf_proto_detach() function is
called providing the following sequence:
brcmf_detach()
brcmf_proto_detach()
brcmf_proto_msgbuf_detach()
brcmf_flowring_detach()
brcmf_msgbuf_delete_flowring()
brcmf_msgbuf_remove_flowring()
brcmf_flowring_delete()
brcmf_get_ifp()
brcmf_txfinalize()
Since brcmf_get_ip() can and actually will return NULL in this case the
call to brcmf_txfinalize() will result in a NULL pointer dereference
inside brcmf_txfinalize() when trying to update
ifp->ndev->stats.tx_errors.
This will only happen if a flowring still has an skb.
Although the NULL pointer dereference has only been seen when trying to update
the tx statistic, all other uses of the ifp pointer have been guarded as well.
Cc: stable(a)vger.kernel.org
Signed-off-by: Marcel Hamer <marcel.hamer(a)windriver.com>
Link: https://lore.kernel.org/all/b519e746-ddfd-421f-d897-7620d229e4b2@gmail.com/
---
v1 -> v2: guard all uses of the ifp pointer inside brcmf_txfinalize()
---
drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index c3a57e30c855..791757a3ec13 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -543,13 +543,13 @@ void brcmf_txfinalize(struct brcmf_if *ifp, struct sk_buff *txp, bool success)
eh = (struct ethhdr *)(txp->data);
type = ntohs(eh->h_proto);
- if (type == ETH_P_PAE) {
+ if (type == ETH_P_PAE && ifp) {
atomic_dec(&ifp->pend_8021x_cnt);
if (waitqueue_active(&ifp->pend_8021x_wait))
wake_up(&ifp->pend_8021x_wait);
}
- if (!success)
+ if (!success && ifp)
ifp->ndev->stats.tx_errors++;
brcmu_pkt_buf_free_skb(txp);
--
2.34.1
The PE Reset State "0" returned by RTAS calls
"ibm_read_slot_reset_[state|state2]" indicates that the reset is
deactivated and the PE is in a state where MMIO and DMA are allowed.
However, the current implementation of "pseries_eeh_get_state()" does
not reflect this, causing drivers to incorrectly assume that MMIO and
DMA operations cannot be resumed.
The userspace drivers as a part of EEH recovery using VFIO ioctls fail
to detect when the recovery process is complete. The VFIO_EEH_PE_GET_STATE
ioctl does not report the expected EEH_PE_STATE_NORMAL state, preventing
userspace drivers from functioning properly on pseries systems.
The patch addresses this issue by updating 'pseries_eeh_get_state()'
to include "EEH_STATE_MMIO_ENABLED" and "EEH_STATE_DMA_ENABLED" in
the result mask for PE Reset State "0". This ensures correct state
reporting to the callers, aligning the behavior with the PAPR specification
and fixing the bug in EEH recovery for VFIO user workflows.
Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Narayana Murty N <nnmlinux(a)linux.ibm.com>
---
Changelog:
V1:https://lore.kernel.org/all/20241107042027.338065-1-nnmlinux@linux.ibm.c…
--added Fixes tag for "powerpc/pseries: Cleanup on
pseries_eeh_get_state()".
V2:https://lore.kernel.org/stable/20241212075044.10563-1-nnmlinux%40linux.i…
--Updated the patch description to include it in the stable kernel tree.
V3:https://lore.kernel.org/all/87v7vm8pwz.fsf@gmail.com/
--Updated commit description.
---
arch/powerpc/platforms/pseries/eeh_pseries.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 1893f66371fa..b12ef382fec7 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -580,8 +580,10 @@ static int pseries_eeh_get_state(struct eeh_pe *pe, int *delay)
switch(rets[0]) {
case 0:
- result = EEH_STATE_MMIO_ACTIVE |
- EEH_STATE_DMA_ACTIVE;
+ result = EEH_STATE_MMIO_ACTIVE |
+ EEH_STATE_DMA_ACTIVE |
+ EEH_STATE_MMIO_ENABLED |
+ EEH_STATE_DMA_ENABLED;
break;
case 1:
result = EEH_STATE_RESET_ACTIVE |
--
2.47.1
On 1/14/25 09:06, Tom Lendacky wrote:
> On 1/14/25 08:44, Kirill A. Shutemov wrote:
>> On Tue, Jan 14, 2025 at 08:33:39AM -0600, Tom Lendacky wrote:
>>> On 1/14/25 01:27, Kirill A. Shutemov wrote:
>>>> On Mon, Jan 13, 2025 at 02:47:56PM -0600, Tom Lendacky wrote:
>>>>> On 1/13/25 07:14, Kirill A. Shutemov wrote:
>>>>>> Currently memremap(MEMREMAP_WB) can produce decrypted/shared mapping:
>>>>>>
>>>>>> memremap(MEMREMAP_WB)
>>>>>> arch_memremap_wb()
>>>>>> ioremap_cache()
>>>>>> __ioremap_caller(.encrytped = false)
>>>>>>
>>>>>> In such cases, the IORES_MAP_ENCRYPTED flag on the memory will determine
>>>>>> if the resulting mapping is encrypted or decrypted.
>>>>>>
>>>>>> Creating a decrypted mapping without explicit request from the caller is
>>>>>> risky:
>>>>>>
>>>>>> - It can inadvertently expose the guest's data and compromise the
>>>>>> guest.
>>>>>>
>>>>>> - Accessing private memory via shared/decrypted mapping on TDX will
>>>>>> either trigger implicit conversion to shared or #VE (depending on
>>>>>> VMM implementation).
>>>>>>
>>>>>> Implicit conversion is destructive: subsequent access to the same
>>>>>> memory via private mapping will trigger a hard-to-debug #VE crash.
>>>>>>
>>>>>> The kernel already provides a way to request decrypted mapping
>>>>>> explicitly via the MEMREMAP_DEC flag.
>>>>>>
>>>>>> Modify memremap(MEMREMAP_WB) to produce encrypted/private mapping by
>>>>>> default unless MEMREMAP_DEC is specified.
>>>>>>
>>>>>> Fix the crash due to #VE on kexec in TDX guests if CONFIG_EISA is enabled.
>>>>>
>>>>> This patch causes my bare-metal system to crash during boot when using
>>>>> mem_encrypt=on:
>>>>>
>>>>> [ 2.392934] efi: memattr: Entry type should be RuntimeServiceCode/Data
>>>>> [ 2.393731] efi: memattr: ! 0x214c42f01f1162a-0xee70ac7bd1a9c629 [type=2028324321|attr=0x6590648fa4209879]
>>>>
>>>> Could you try if this helps?
>>>>
>>>> diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c
>>>> index c38b1a335590..b5051dcb7c1d 100644
>>>> --- a/drivers/firmware/efi/memattr.c
>>>> +++ b/drivers/firmware/efi/memattr.c
>>>> @@ -160,7 +160,7 @@ int __init efi_memattr_apply_permissions(struct mm_struct *mm,
>>>> if (WARN_ON(!efi_enabled(EFI_MEMMAP)))
>>>> return 0;
>>>>
>>>> - tbl = memremap(efi_mem_attr_table, tbl_size, MEMREMAP_WB);
>>>> + tbl = memremap(efi_mem_attr_table, tbl_size, MEMREMAP_WB | MEMREMAP_DEC);
>>>
>>> Well that would work for SME where EFI tables/data are not encrypted,
>>> but will break for SEV where EFI tables/data are encrypted.
>>
>> Hm. Why would it break for SEV? It brings the situation back to what it
>> was before the patch.
>
> Ah, true. I can try it and see how much further SME gets. Hopefully it
> doesn't turn into a whack-a-mole thing.
Unfortunately, it is turning into a whack-a-mole thing.
But it looks the following works for SME:
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 3c36f3f5e688..ff3cd5fc8508 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -505,7 +505,7 @@ EXPORT_SYMBOL(iounmap);
void *arch_memremap_wb(phys_addr_t phys_addr, size_t size, unsigned long flags)
{
- if (flags & MEMREMAP_DEC)
+ if (flags & MEMREMAP_DEC || cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
return (void __force *)ioremap_cache(phys_addr, size);
return (void __force *)ioremap_encrypted(phys_addr, size);
I haven't had a chance to test the series on SEV, yet.
Thanks,
Tom
>
> Thanks,
> Tom
>
>>
>> Note that that __ioremap_caller() would still check io_desc.flags before
>> mapping it as decrypted.
>>
The following commit has been merged into the irq/urgent branch of tip:
Commit-ID: 9322d1915f9d976ee48c09d800fbd5169bc2ddcc
Gitweb: https://git.kernel.org/tip/9322d1915f9d976ee48c09d800fbd5169bc2ddcc
Author: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
AuthorDate: Sun, 15 Dec 2024 12:39:45 +09:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Wed, 15 Jan 2025 10:38:43 +01:00
irqchip: Plug a OF node reference leak in platform_irqchip_probe()
platform_irqchip_probe() leaks a OF node when irq_init_cb() fails. Fix it
by declaring par_np with the __free(device_node) cleanup construct.
This bug was found by an experimental static analysis tool that I am
developing.
Fixes: f8410e626569 ("irqchip: Add IRQCHIP_PLATFORM_DRIVER_BEGIN/END and IRQCHIP_MATCH helper macros")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20241215033945.3414223-1-joe@pf.is.s.u-tokyo.ac…
---
drivers/irqchip/irqchip.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/irqchip/irqchip.c b/drivers/irqchip/irqchip.c
index 1eeb0d0..0ee7b6b 100644
--- a/drivers/irqchip/irqchip.c
+++ b/drivers/irqchip/irqchip.c
@@ -35,11 +35,10 @@ void __init irqchip_init(void)
int platform_irqchip_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
- struct device_node *par_np = of_irq_find_parent(np);
+ struct device_node *par_np __free(device_node) = of_irq_find_parent(np);
of_irq_init_cb_t irq_init_cb = of_device_get_match_data(&pdev->dev);
if (!irq_init_cb) {
- of_node_put(par_np);
return -EINVAL;
}
@@ -55,7 +54,6 @@ int platform_irqchip_probe(struct platform_device *pdev)
* interrupt controller can check for specific domains as necessary.
*/
if (par_np && !irq_find_matching_host(par_np, DOMAIN_BUS_ANY)) {
- of_node_put(par_np);
return -EPROBE_DEFER;
}
Fixed a possible UAF problem in driver_override_show() in
drivers/bus/fsl-mc/fsl-mc-bus.c
This function driver_override_show() is part of DEVICE_ATTR_RW, which
includes both driver_override_show() and driver_override_store().
These functions can be executed concurrently in sysfs.
The driver_override_store() function uses driver_set_override() to
update the driver_override value, and driver_set_override() internally
locks the device (device_lock(dev)). If driver_override_show() reads
cdx_dev->driver_override without locking, it could potentially access
a freed pointer if driver_override_store() frees the string
concurrently. This could lead to printing a kernel address, which is a
security risk since DEVICE_ATTR can be read by all users.
Additionally, a similar pattern is used in drivers/amba/bus.c, as well
as many other bus drivers, where device_lock() is taken in the show
function, and it has been working without issues.
This potential bug was detected by our experimental static analysis tool,
which analyzes locking APIs and paired functions to identify data races
and atomicity violations.
Fixes: 1f86a00c1159 ("bus/fsl-mc: add support for 'driver_override' in the mc-bus")
Cc: stable(a)vger.kernel.org
Signed-off-by: Qiu-ji Chen <chenqiuji666(a)gmail.com>
---
V2:
Revised the description to reduce the confusion it caused.
---
drivers/bus/fsl-mc/fsl-mc-bus.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
index 930d8a3ba722..62a9da88b4c9 100644
--- a/drivers/bus/fsl-mc/fsl-mc-bus.c
+++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
@@ -201,8 +201,12 @@ static ssize_t driver_override_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
+ ssize_t len;
- return snprintf(buf, PAGE_SIZE, "%s\n", mc_dev->driver_override);
+ device_lock(dev);
+ len = snprintf(buf, PAGE_SIZE, "%s\n", mc_dev->driver_override);
+ device_unlock(dev);
+ return len;
}
static DEVICE_ATTR_RW(driver_override);
--
2.34.1
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: de31b3cd706347044e1a57d68c3a683d58e8cca4
Gitweb: https://git.kernel.org/tip/de31b3cd706347044e1a57d68c3a683d58e8cca4
Author: Xin Li (Intel) <xin(a)zytor.com>
AuthorDate: Fri, 10 Jan 2025 09:46:39 -08:00
Committer: Dave Hansen <dave.hansen(a)linux.intel.com>
CommitterDate: Tue, 14 Jan 2025 14:16:36 -08:00
x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache
The FRED RSP0 MSR is only used for delivering events when running
userspace. Linux leverages this property to reduce expensive MSR
writes and optimize context switches. The kernel only writes the
MSR when about to run userspace *and* when the MSR has actually
changed since the last time userspace ran.
This optimization is implemented by maintaining a per-CPU cache of
FRED RSP0 and then checking that against the value for the top of
current task stack before running userspace.
However cpu_init_fred_exceptions() writes the MSR without updating
the per-CPU cache. This means that the kernel might return to
userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the
top of current task stack. This would induce a double fault (#DF),
which is bad.
A context switch after cpu_init_fred_exceptions() can paper over
the issue since it updates the cached value. That evidently
happens most of the time explaining how this bug got through.
Fix the bug through resynchronizing the FRED RSP0 MSR with its
per-CPU cache in cpu_init_fred_exceptions().
Fixes: fe85ee391966 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch")
Signed-off-by: Xin Li (Intel) <xin(a)zytor.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
---
arch/x86/kernel/fred.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 8d32c3f..5e2cd10 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -50,7 +50,13 @@ void cpu_init_fred_exceptions(void)
FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
wrmsrl(MSR_IA32_FRED_STKLVLS, 0);
- wrmsrl(MSR_IA32_FRED_RSP0, 0);
+
+ /*
+ * Ater a CPU offline/online cycle, the FRED RSP0 MSR should be
+ * resynchronized with its per-CPU cache.
+ */
+ wrmsrl(MSR_IA32_FRED_RSP0, __this_cpu_read(fred_rsp0));
+
wrmsrl(MSR_IA32_FRED_RSP1, 0);
wrmsrl(MSR_IA32_FRED_RSP2, 0);
wrmsrl(MSR_IA32_FRED_RSP3, 0);