----- On Dec 2, 2020, at 10:35 AM, Andy Lutomirski luto@kernel.org wrote:
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as syncing the core on all sibling threads but not necessarily the calling thread. This behavior is fundamentally buggy and cannot be used safely. Suppose a user program has two threads. Thread A is on CPU 0 and thread B is on CPU 1. Thread A modifies some text and calls membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE). Then thread B executes the modified code. If, at any point after membarrier() decides which CPUs to target, thread A could be preempted and replaced by thread B on CPU 0. This could even happen on exit from the membarrier() syscall. If this happens, thread B will end up running on CPU 0 without having synced.
Indeed, good catch! We only have sync core in the scheduler when switching between mm, so indeed we cannot rely on the scheduler to issue a sync core for us when switching between threads with the same mm.
In principle, this could be fixed by arranging for the scheduler to sync_core_before_usermode() whenever switching between two threads in the same mm if there is any possibility of a concurrent membarrier() call, but this would have considerable overhead. Instead, make membarrier() sync the calling CPU as well.
Yes, I agree that sync core on self is the right approach here.
As an optimization, this avoids an extra smp_mb() in the default barrier-only mode.
Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski luto@kernel.org
kernel/sched/membarrier.c | 49 +++++++++++++++++++++++++-------------- 1 file changed, 32 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 01538b31f27e..7df7c0e60647 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -352,8 +352,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
There is one small optimization you will want to adapt here:
if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) return 0;
should become:
if (flags != MEMBARRIER_FLAG_SYNC_CORE && atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) return 0;
So we issue a core sync for self in single-threaded applications, to make things consistent. We can then document that membarrier sync core issues a core sync on all thread siblings including the caller thread.
if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id)) goto out;
if (cpu_id == raw_smp_processor_id())
rcu_read_lock(); p = rcu_dereference(cpu_rq(cpu_id)->curr); if (!p || p->mm != mm) {goto out;
@@ -368,16 +366,6 @@ static int membarrier_private_expedited(int flags, int cpu_id) for_each_online_cpu(cpu) { struct task_struct *p;
/*
* Skipping the current CPU is OK even through we can be
* migrated at any point. The current CPU, at the point
* where we read raw_smp_processor_id(), is ensured to
* be in program order with respect to the caller
* thread. Therefore, we can skip this CPU from the
* iteration.
*/
if (cpu == raw_smp_processor_id())
continue; p = rcu_dereference(cpu_rq(cpu)->curr); if (p && p->mm == mm) __cpumask_set_cpu(cpu, tmpmask);
@@ -385,12 +373,39 @@ static int membarrier_private_expedited(int flags, int cpu_id) rcu_read_unlock(); }
- preempt_disable();
- if (cpu_id >= 0)
- if (cpu_id >= 0) {
/*
* smp_call_function_single() will call ipi_func() if cpu_id
* is the calling CPU.
smp_call_function_single(cpu_id, ipi_func, NULL, 1);*/
- else
smp_call_function_many(tmpmask, ipi_func, NULL, 1);
- preempt_enable();
- } else {
/*
* For regular membarrier, we can save a few cycles by
* skipping the current cpu -- we're about to do smp_mb()
* below, and if we migrate to a different cpu, this cpu
* and the new cpu will execute a full barrier in the
* scheduler.
*
* For CORE_SYNC, we do need a barrier on the current cpu --
* otherwise, if we are migrated and replaced by a different
* task in the same mm just before, during, or after
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
* For RSEQ, it seems polite to target the calling thread
* as well, although it's not clear it makes much difference
* either way. Users aren't supposed to run syscalls in an
* rseq critical section.
Considering that we want a consistent behavior between single and multi-threaded programs (as I pointed out above wrt the optimization change), I think it would be better to skip self for the rseq ipi, in the same way we'd want to return early for a membarrier-rseq-private on a single-threaded mm. Users are _really_ not supposed to run system calls in rseq critical sections. The kernel even kills the offending applications when run on kernels with CONFIG_DEBUG_RSEQ=y. So it seems rather pointless to waste cycles doing a rseq fence on self considering this.
Thanks,
Mathieu
*/
if (ipi_func == ipi_mb) {
preempt_disable();
smp_call_function_many(tmpmask, ipi_func, NULL, true);
preempt_enable();
} else {
on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
}
- }
out: if (cpu_id < 0) -- 2.28.0