The patch titled
Subject: x86/kfence: avoid writing L1TF-vulnerable PTEs
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
x86-kfence-avoid-writing-l1tf-vulnerable-ptes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via various
branches at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there most days
------------------------------------------------------
From: Andrew Cooper <andrew.cooper3(a)citrix.com>
Subject: x86/kfence: avoid writing L1TF-vulnerable PTEs
Date: Tue, 6 Jan 2026 18:04:26 +0000
For native, the choice of PTE is fine. There's real memory backing the
non-present PTE. However, for XenPV, Xen complains:
(XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
To explain, some background on XenPV pagetables:
Xen PV guests control their own pagetables; they choose the new
PTE value and use hypercalls to make changes, so that Xen can audit
each update for safety.
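As a hedged illustration (not part of the patch; the wrapper name is
ours, but the hypercall pattern follows arch/x86/xen/mmu_pv.c), a PV
guest replaces a direct PTE store with an audited mmu_update hypercall:

    /*
     * Sketch only: a PV guest cannot write ptep directly; it
     * describes the update and lets Xen audit the new value.
     */
    static int pv_update_pte(pte_t *ptep, pte_t pteval)
    {
            struct mmu_update u;

            u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
            u.val = pte_val_ma(pteval);

            /* Xen rejects the update if the new value is unsafe. */
            return HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
    }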
In addition to a regular reference count, Xen also maintains a type
reference count. e.g. SegDesc (referenced by vGDT/vLDT), Writable
(referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower
pagetable level). This is in order to prevent e.g. a page being
inserted into the pagetables for which the guest has a writable mapping.
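A rough sketch of the idea (names approximate Xen's PGT_* constants
rather than quoting them exactly):

    enum pg_type {
            PGT_none,               /* no special type references     */
            PGT_l1_page_table,      /* referenced by an L2 entry/vCR3 */
            PGT_seg_desc_page,      /* referenced by the vGDT/vLDT    */
            PGT_writable_page,      /* mapped with _PAGE_RW somewhere */
    };
    /* A frame validates as at most one type at a time, so it cannot
     * e.g. be a live L1 pagetable while any writable mapping of it
     * (a PGT_writable_page reference) still exists. */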
For non-present mappings, all other bits become software accessible,
and typically contain metadata rather than a real frame address. There is
nothing that a reference count could sensibly be tied to. As such, even
if Xen could recognise the address as currently safe, nothing would
prevent that frame from changing owner to another VM in the future.
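Concretely (a hedged illustration; the predicate name and the cutoff
parameter are ours, not Xen's), a not-present PTE is dangerous when its
address bits still select cacheable memory, which L1TF lets the guest
read speculatively from the L1D:

    static bool pte_is_l1tf_vulnerable(pteval_t val, u64 max_cached_paddr)
    {
            if (val & _PAGE_PRESENT)
                    return false;   /* present PTEs are audited normally */

            /* Not present, but the address bits still point below the
             * highest cacheable physical address: exploitable via L1TF. */
            return (val & PTE_PFN_MASK) < max_cached_paddr;
    }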
When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by
activating shadow paging. This is normally only used for the live phase
of migration, and comes with a reasonable overhead.
KFENCE only cares about getting #PF to catch wild accesses; it doesn't
care about the value for non-present mappings. Use a fully inverted PTE,
to avoid hitting the slow path when running under Xen.
While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, halve the number of PVOps callouts, and
skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases
too.
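To make the inversion concrete, a hedged walk-through (the present value
is inferred from the L1e in Xen's log above by setting its P bit; values
are illustrative):

    val          = 0x8010000018200067   /* present PTE (P bit set)       */
    old protect  = 0x8010000018200066   /* P cleared only: address bits
                                           unchanged => L1TF-vulnerable  */
    new protect  = 0x7fefffffe7dfff98   /* ~val: P clear and address bits
                                           inverted, pointing at no real
                                           memory                        */
    unprotect    = ~(~val) == val       /* inversion is self-inverse, so
                                           the original PTE is restored
                                           with no metadata to track     */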
Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com
Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Tested-by: Marco Elver <elver(a)google.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/include/asm/kfence.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
--- a/arch/x86/include/asm/kfence.h~x86-kfence-avoid-writing-l1tf-vulnerable-ptes
+++ a/arch/x86/include/asm/kfence.h
@@ -42,10 +42,34 @@ static inline bool kfence_protect_page(u
{
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
+ pteval_t val;
if (WARN_ON(!pte || level != PG_LEVEL_4K))
return false;
+ val = pte_val(*pte);
+
+ /*
+ * protect requires making the page not-present. If the PTE is
+ * already in the right state, there's nothing to do.
+ */
+ if (protect != !!(val & _PAGE_PRESENT))
+ return true;
+
+ /*
+ * Otherwise, invert the entire PTE. This avoids writing out an
+ * L1TF-vulnerable PTE (not present, without the high address bits
+ * set).
+ */
+ set_pte(pte, __pte(~val));
+
+ /*
+ * If the page was protected (non-present) and we're making it
+ * present, there is no need to flush the TLB at all.
+ */
+ if (!protect)
+ return true;
+
/*
* We need to avoid IPIs, as we may get KFENCE allocations or faults
* with interrupts disabled. Therefore, the below is best-effort, and
@@ -53,11 +77,6 @@ static inline bool kfence_protect_page(u
* lazy fault handling takes care of faults after the page is PRESENT.
*/
- if (protect)
- set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
- else
- set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
-
/*
* Flush this CPU's TLB, assuming whoever did the allocation/free is
* likely to continue running on this CPU.
_
Patches currently in -mm which might be from andrew.cooper3(a)citrix.com are
x86-kfence-avoid-writing-l1tf-vulnerable-ptes.patch
When a device is matched via PRP0001, the driver's OF (DT) match table
must be used to obtain the device match data. If a driver provides both
an acpi_match_table and an of_match_table, the current
acpi_device_get_match_data() path consults the driver's acpi_match_table
and returns NULL (no ACPI ID matches).
Explicitly detect PRP0001 and fetch match data from the driver's
of_match_table via acpi_of_device_get_match_data().
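For context, a hedged example of the affected pattern (driver, type and
compatible names are hypothetical): a driver carrying both tables, probed
via an ACPI node with _HID PRP0001 and a "compatible" property, must have
its match data resolved against the OF table:

    struct foo_pdata { int variant; };
    static const struct foo_pdata foo_of_data   = { .variant = 1 };
    static const struct foo_pdata foo_acpi_data = { .variant = 2 };

    static const struct of_device_id foo_of_match[] = {
            { .compatible = "vendor,foo", .data = &foo_of_data },
            { /* sentinel */ }
    };

    static const struct acpi_device_id foo_acpi_match[] = {
            { "VNDR0001", (kernel_ulong_t)&foo_acpi_data },
            { /* sentinel */ }
    };

    /* With _HID == PRP0001, device_get_match_data() must return
     * foo_of_data, not NULL (and not foo_acpi_data). */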
Fixes: 886ca88be6b3 ("ACPI / bus: Respect PRP0001 when retrieving device match data")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kartik Rajput <kkartik(a)nvidia.com>
---
drivers/acpi/bus.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 5e110badac7b..4cd425fffa97 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1031,8 +1031,9 @@ const void *acpi_device_get_match_data(const struct device *dev)
{
const struct acpi_device_id *acpi_ids = dev->driver->acpi_match_table;
const struct acpi_device_id *match;
+ struct acpi_device *adev = ACPI_COMPANION(dev);
- if (!acpi_ids)
+ if (!acpi_ids || !strcmp(ACPI_DT_NAMESPACE_HID, acpi_device_hid(adev)))
return acpi_of_device_get_match_data(dev);
match = acpi_match_device(acpi_ids, dev);
--
2.43.0
When the CPU frequency policy is set to CPUFREQ_POLICY_PERFORMANCE
(which occurs when EPP hint is set to "performance"), the driver
incorrectly sets the MinPerf field in CPPC request MSR to nominal_perf
instead of lowest_nonlinear_perf.
According to the AMD64 Architecture Programmer's Manual Volume 2 [1],
section "17.6.4.1 CPPC_CAPABILITY_1", lowest_nonlinear_perf represents
the most energy efficient performance level (in terms of performance per
watt). The MinPerf field should be set to this value even in performance
mode to maintain proper power/performance characteristics.
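For reference, the ordering of the CPPC performance levels involved (per
the manual; the diagram itself is ours):

    lowest_perf <= lowest_nonlinear_perf <= nominal_perf <= highest_perf
                   ^
                   most energy-efficient level; below this point,
                   performance per watt degrades non-linearly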
This fixes a regression introduced by commit 0c411b39e4f4c ("amd-pstate: Set
min_perf to nominal_perf for active mode performance gov"), which correctly
identified that highest_perf was too high but chose nominal_perf as an
intermediate value instead of lowest_nonlinear_perf.
The fix changes amd_pstate_update_min_max_limit() to use lowest_nonlinear_perf
instead of nominal_perf when the policy is CPUFREQ_POLICY_PERFORMANCE.
[1] https://docs.amd.com/v/u/en-US/24593_3.43
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Section 17.6.4.1 CPPC_CAPABILITY_1
(Referenced in commit 5d9a354cf839a)
Fixes: 0c411b39e4f4c ("amd-pstate: Set min_perf to nominal_perf for active mode performance gov")
Tested-by: Kaushik Reddy S <kaushik.reddys(a)amd.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Juan Martinez <juan.martinez(a)amd.com>
---
drivers/cpufreq/amd-pstate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
index e4f1933dd7d47..de0bb5b325502 100644
--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -634,8 +634,8 @@ static void amd_pstate_update_min_max_limit(struct cpufreq_policy *policy)
WRITE_ONCE(cpudata->max_limit_freq, policy->max);
if (cpudata->policy == CPUFREQ_POLICY_PERFORMANCE) {
- perf.min_limit_perf = min(perf.nominal_perf, perf.max_limit_perf);
- WRITE_ONCE(cpudata->min_limit_freq, min(cpudata->nominal_freq, cpudata->max_limit_freq));
+ perf.min_limit_perf = min(perf.lowest_nonlinear_perf, perf.max_limit_perf);
+ WRITE_ONCE(cpudata->min_limit_freq, min(cpudata->lowest_nonlinear_freq, cpudata->max_limit_freq));
} else {
perf.min_limit_perf = freq_to_perf(perf, cpudata->nominal_freq, policy->min);
WRITE_ONCE(cpudata->min_limit_freq, policy->min);
--
2.34.1
The kernel test robot has reported:
BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
Call Trace:
<IRQ>
__dump_stack (lib/dump_stack.c:95)
dump_stack_lvl (lib/dump_stack.c:123)
dump_stack (lib/dump_stack.c:130)
spin_dump (kernel/locking/spinlock_debug.c:71)
do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
_raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
__free_frozen_pages (mm/page_alloc.c:2973)
___free_pages (mm/page_alloc.c:5295)
__free_pages (mm/page_alloc.c:5334)
tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
? rcu_core (kernel/rcu/tree.c:?)
rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
rcu_core_si (kernel/rcu/tree.c:2879)
handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
__irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
irq_exit_rcu (kernel/softirq.c:741)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
</IRQ>
<TASK>
RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
free_pcppages_bulk (mm/page_alloc.c:1494)
drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
__drain_all_pages (mm/page_alloc.c:2731)
drain_all_pages (mm/page_alloc.c:2747)
kcompactd (mm/compaction.c:3115)
kthread (kernel/kthread.c:465)
? __cfi_kcompactd (mm/compaction.c:3166)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork (arch/x86/kernel/process.c:164)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
</TASK>
Matthew has analyzed the report and identified that in drain_pages_zone()
we are in a section protected by spin_lock(&pcp->lock) and then get an
interrupt that attempts spin_trylock() on the same lock. The code is
designed to work this way without disabling IRQs and occasionally fail
the trylock with a fallback. However, the SMP=n spinlock implementation
assumes spin_trylock() can never fail, and thus implements it as a
no-op. Here the enabled lock debugging catches the problem, but
otherwise it could cause a corruption of the pcp structure.
The problem has been introduced by commit 574907741599 ("mm/page_alloc:
leave IRQs enabled for per-cpu page allocations"). The pcp locking
scheme recognizes the need for disabling IRQs to prevent nesting
spin_trylock() sections on SMP=n, but the need to prevent the nesting in
spin_lock() has not been recognized. Fix it by introducing local
wrappers that change the spin_lock() to spin_lock_irqsave() with SMP=n
and use them in all places that do spin_lock(&pcp->lock).
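Schematically, the failing SMP=n sequence (annotated; simplified from
the trace above):

    drain_pages_zone()
      spin_lock(&pcp->lock)            /* IRQs remain enabled */
        <IRQ>
          ...
            __free_frozen_pages()
              spin_trylock(&pcp->lock) /* UP stub cannot fail: both
                                          contexts now "hold" pcp->lock */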
Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations")
Cc: stable(a)vger.kernel.org
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com
Analyzed-by: Matthew Wilcox <willy(a)infradead.org>
Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
---
This fix is intentionally made self-contained and not trying to expand
upon the existing pcp[u]_spin() helpers. This is to make stable
backports easier, given the recent cleanups to those helpers.
We could follow up with a proper integration into the helpers going
forward.
However I think the SMP=n assumptions of the UP spinlock implementation
are just wrong. It should be valid to do a spin_lock() without disabling
IRQs and rely on a nested spin_trylock() to fail. I will thus try
proposing removal of the UP implementation first. It should be within
the current trend of removing stuff that's optimized for a minority
configuration if it makes maintainability of the majority worse.
(c.f. recent scheduler SMP=n removal)
---
mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..ec3551d56cde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,31 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
pcp_trylock_finish(UP_flags); \
})
+/*
+ * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (e.g.
+ * for a potentially remote cpu drain) and get interrupted by an operation that
+ * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
+ * spinlock assumptions making the trylock a no-op. So we have to turn that
+ * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
+ * remote CPUs, so we can only be locking the single local one.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+static inline void __flags_noop(unsigned long *flags) { }
+#define spin_lock_maybe_irqsave(lock, flags) \
+({ \
+ __flags_noop(&(flags)); \
+ spin_lock(lock); \
+})
+#define spin_unlock_maybe_irqrestore(lock, flags) \
+({ \
+ spin_unlock(lock); \
+ __flags_noop(&(flags)); \
+})
+#else
+#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
+#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
+#endif
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2556,6 +2581,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, to_drain_batched, batch;
+ unsigned long UP_flags;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2575,9 +2601,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->count - pcp->high;
while (to_drain > 0) {
to_drain_batched = min(to_drain, batch);
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
todo = true;
to_drain -= to_drain_batched;
@@ -2594,14 +2620,15 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
+ unsigned long UP_flags;
int to_drain, batch;
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
}
}
#endif
@@ -2612,10 +2639,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ unsigned long UP_flags;
int count;
do {
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
count = pcp->count;
if (count) {
int to_drain = min(count,
@@ -2624,7 +2652,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
free_pcppages_bulk(zone, to_drain, pcp, 0);
count -= to_drain;
}
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
} while (count);
}
@@ -6109,6 +6137,7 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
{
struct per_cpu_pages *pcp;
struct cpu_cacheinfo *cci;
+ unsigned long UP_flags;
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
cci = get_cpu_cacheinfo(cpu);
@@ -6119,12 +6148,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
* This can reduce zone lock contention without hurting
* cache-hot pages sharing.
*/
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
pcp->flags |= PCPF_FREE_HIGH_BATCH;
else
pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
}
void setup_pcp_cacheinfo(unsigned int cpu)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20260105-fix-pcp-up-3c88c09752ec
Best regards,
--
Vlastimil Babka <vbabka(a)suse.cz>
CephFS stores file data across multiple RADOS objects. An object is the
atomic unit of storage, so the writeback code must clean only folios
that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled,
each object stores multiple unbroken "stripe units" covering different
portions of the file; if disabled, a "stripe unit" is simply the whole
object. The stripe unit is (usually) reported as the inode's block size.
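For example (hedged; the layout values are illustrative, not from the
patch), with object_size = 4 MiB, stripe_unit = 1 MiB and
stripe_count = 4, file data maps to objects like this:

    file offset:  [0,1M) [1M,2M) [2M,3M) [3M,4M) [4M,5M) [5M,6M) ...
    object:       obj0   obj1    obj2    obj3    obj0    obj1    ...

so each object eventually holds four non-contiguous 1 MiB stripe units.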
Though the writeback logic could, in principle, lock all dirty folios
belonging to the same object, its current design is to lock only a
single stripe unit at a time. Ever since this code was first written,
it has determined this size by checking the inode's block size.
However, the relatively-new fscrypt support needed to reduce the block
size for encrypted inodes to the crypto block size (see 'fixes' commit),
which causes an unnecessarily high number of write operations (~1024x as
many, with 4MiB objects) and correspondingly degraded performance.
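The arithmetic behind the ~1024x figure (assuming the common 4 KiB
fscrypt data unit and 4 MiB objects with striping disabled):

    fscrypt block size (i_blocksize)  =  4 KiB
    stripe unit (= object size here)  =  4 MiB
    write ops per object              =  4 MiB / 4 KiB = 1024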
Fix this (and clarify intent) by using i_layout.stripe_unit directly in
ceph_define_write_size() so that encrypted inodes are written back with
the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks(a)gmail.com>
---
fs/ceph/addr.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f2db05b51a3b..b97a6120d4b9 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
{
struct inode *inode = mapping->host;
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
- unsigned int wsize = i_blocksize(inode);
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ unsigned int wsize = ci->i_layout.stripe_unit;
if (fsc->mount_options->wsize < wsize)
wsize = fsc->mount_options->wsize;
--
2.51.2
If `locked_pages` is zero, the page array must not still be allocated:
ceph_process_folio_batch() uses `locked_pages` to decide when to
allocate `pages`, and a redundant allocation triggers
ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
writeback stall) or even a kernel panic. Consequently, the main loop in
ceph_writepages_start() assumes that the lifetime of `pages` is confined
to a single iteration.
The ceph_submit_write() function claims ownership of the page array on
success. On failure, however, it only redirties and unlocks the pages
without freeing the array, making the failure case in ceph_submit_write()
fatal.
Free the page array (and reset locked_pages) in ceph_submit_write()'s
error-handling 'if' block so that the caller's invariant (that the array
does not outlive the iteration) is maintained unconditionally, making
failures in ceph_submit_write() recoverable as originally intended.
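Schematically (a simplified, non-compilable sketch of the caller, not
actual code):

    /* shape of the main loop in ceph_writepages_start() */
    while (more_dirty_folios) {
            ceph_process_folio_batch(...); /* allocates ceph_wbc->pages
                                              only when locked_pages == 0 */
            ceph_submit_write(...);        /* must consume the array on
                                              success AND failure, or the
                                              next allocation BUG_ON()s */
    }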
Fixes: 1551ec61dc55 ("ceph: introduce ceph_submit_write() method")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks(a)gmail.com>
---
fs/ceph/addr.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2b722916fb9b..467aa7242b49 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1466,6 +1466,14 @@ int ceph_submit_write(struct address_space *mapping,
unlock_page(page);
}
+ if (ceph_wbc->from_pool) {
+ mempool_free(ceph_wbc->pages, ceph_wb_pagevec_pool);
+ ceph_wbc->from_pool = false;
+ } else
+ kfree(ceph_wbc->pages);
+ ceph_wbc->pages = NULL;
+ ceph_wbc->locked_pages = 0;
+
ceph_osdc_put_request(req);
return -EIO;
}
--
2.51.2