From: Nadav Amit <namit(a)vmware.com>
The text poke mechanism claims to have RCU-like behavior, but there
does not appear to be any quiescent state that ensures nobody holds a
reference to desc. As a result, the following race appears to be
possible and can lead to memory corruption:
   CPU0                                  CPU1
   ----                                  ----
   text_poke_bp_batch()
   -> smp_store_release(&bp_desc, &desc)

   [ notice that desc is on
     the stack ]

                                         poke_int3_handler()

                                         [ int3 might be kprobe's,
                                           so sync events do not help ]

                                         -> try_get_desc(descp=&bp_desc)
                                            desc = __READ_ONCE(bp_desc)

                                            if (!desc) [false, success]

   WRITE_ONCE(bp_desc, NULL);
   atomic_dec_and_test(&desc.refs)

   [ success, desc space on the stack
     is being reused and might have
     a non-zero value ]

                                         arch_atomic_inc_not_zero(&desc->refs)

                                         [ might succeed since desc points to
                                           stack memory that was freed and
                                           might be reused ]
I encountered occasional crashes in poke_int3_handler() while accessing
desc->vec when kprobes were set. The analysis was done offline and I did
not fully corroborate the cause of the crashes, but this race seems to
be the likely root cause.
Fix this issue with a small, backportable patch. Instead of trying to
implement RCU-like behavior for bp_desc, just eliminate the unnecessary
level of indirection of bp_desc and hold the whole descriptor in a
static variable. There is only a single descriptor at any given moment
anyway.
Fixes: 1f676247f36a4 ("x86/alternatives: Implement a better poke_int3_handler() completion scheme")
Cc: stable(a)vger.kernel.org
Cc: "Peter Zijlstra (Intel)" <peterz(a)infradead.org>
Cc: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Signed-off-by: Nadav Amit <namit(a)vmware.com>
---
arch/x86/kernel/alternative.c | 42 +++++++++++++++++------------------
1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 62f6b8b7c4a5..4265c9426374 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1319,18 +1319,15 @@ struct bp_patching_desc {
atomic_t refs;
};
-static struct bp_patching_desc *bp_desc;
+static struct bp_patching_desc bp_desc;
static __always_inline
-struct bp_patching_desc *try_get_desc(struct bp_patching_desc **descp)
+struct bp_patching_desc *try_get_desc(void)
{
- /* rcu_dereference */
- struct bp_patching_desc *desc = __READ_ONCE(*descp);
-
- if (!desc || !arch_atomic_inc_not_zero(&desc->refs))
+ if (!arch_atomic_inc_not_zero(&bp_desc.refs))
return NULL;
- return desc;
+ return &bp_desc;
}
static __always_inline void put_desc(struct bp_patching_desc *desc)
@@ -1367,15 +1364,15 @@ noinstr int poke_int3_handler(struct pt_regs *regs)
/*
* Having observed our INT3 instruction, we now must observe
- * bp_desc:
+ * bp_desc with non-zero refcount:
*
- * bp_desc = desc INT3
+ * bp_desc.refs = 1 INT3
* WMB RMB
- * write INT3 if (desc)
+ * write INT3 if (bp_desc.refs != 0)
*/
smp_rmb();
- desc = try_get_desc(&bp_desc);
+ desc = try_get_desc();
if (!desc)
return 0;
@@ -1460,18 +1457,21 @@ static int tp_vec_nr;
*/
static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
{
- struct bp_patching_desc desc = {
- .vec = tp,
- .nr_entries = nr_entries,
- .refs = ATOMIC_INIT(1),
- };
unsigned char int3 = INT3_INSN_OPCODE;
unsigned int i;
int do_sync;
lockdep_assert_held(&text_mutex);
- smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */
+ bp_desc.vec = tp;
+ bp_desc.nr_entries = nr_entries;
+
+ /*
+ * Corresponds to the implicit memory barrier in try_get_desc() to
+ * ensure reading a non-zero refcount provides up to date bp_desc data.
+ */
+ smp_wmb();
+ atomic_set(&bp_desc.refs, 1);
/*
* Corresponding read barrier in int3 notifier for making sure the
@@ -1559,12 +1559,10 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
text_poke_sync();
/*
- * Remove and synchronize_rcu(), except we have a very primitive
- * refcount based completion.
+ * Remove and wait for refs to be zero.
*/
- WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */
- if (!atomic_dec_and_test(&desc.refs))
- atomic_cond_read_acquire(&desc.refs, !VAL);
+ if (!atomic_dec_and_test(&bp_desc.refs))
+ atomic_cond_read_acquire(&bp_desc.refs, !VAL);
}
static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
--
2.34.1
Hi Greg! As you likely heard already, 5.19.9 introduced a regression
that breaks Thunderbolt and USB-C docks (and apparently Wifi in some
cases as well) on quite a few (many?) modern systems. It's one of those
"we ideally should fix this in stable as fast as possible" problems we
briefly talked about last week in the LPC hallways. That made me wonder
how to actually achieve that in this particular case while keeping all
involved parties happy and not skipping any CI testing queues considered
important.
FWIW, here are a few reports about the issue (I assume there are some
for Arch Linux and openSUSE Tumbleweed as well, but didn't check):

https://lore.kernel.org/linux-iommu/485A6EA5-6D58-42EA-B298-8571E97422DE@ge…
https://bugzilla.kernel.org/show_bug.cgi?id=216497
https://bugzilla.redhat.com/show_bug.cgi?id=2128458
https://bugzilla.redhat.com/show_bug.cgi?id=2127753
A revert of the culprit (9cd4f1434479f ("iommu/vt-d: Fix possible
recursive locking in intel_iommu_init()"); in 5.19.y it's 9516acba29e3)
for mainline is here:
https://lore.kernel.org/lkml/20220920081701.3453504-1-baolu.lu@linux.intel.…
A few hours ago the revert was queued to be sent to Joerg:
https://lore.kernel.org/linux-iommu/20220921024054.3570256-1-baolu.lu@linux…
I fear it could easily take another week to get this fixed in stable
depending on how fast the patch makes it to mainline and the timing of
the next 5.19.y release and its -rc phase. That to me sounds like way
too long for a problem like this that apparently plagues quite a few
people.
That made me wonder: would you in cases like this be willing to start
the -rc phase for an interim 5.19.y release with just that revert while
it's still heading towards mainline? Then the CI systems that test
stable -rcs could chew on things already; and the new stable release
could go out right after the revert landed in mainline (unless the
testing finds any problems, of course).
Ciao, Thorsten