From: Ovidiu Panait <ovidiu.panait@windriver.com>
When booting the 5.10-stable kernel on the zcu102 board with
CONFIG_PROVE_RCU=y, the following warning is present:

=============================
WARNING: suspicious RCU usage
5.10.194-yocto-standard #1 Not tainted
-----------------------------
include/linux/cgroup.h:495 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
9 locks held by kworker/2:2/106:
 #0: ffffff8800048948 ((wq_completion)events){..}-{0:0}, at: process_one_work+0x1f8/0x634
 #1: ffffffc014c4bda8 (deferred_probe_work){..}-{0:0}, at: process_one_work+0x1f8/0x634
 #2: ffffff880005d9a0 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x1c0
 #3: ffffffc011c70cb0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x18/0x24
 #4: ffffff8800b10928 (subsys mutex#5){..}-{3:3}, at: subsys_interface_register+0x58/0x120
 #5: ffffff8805e78c00 (&policy->rwsem){..}-{3:3}, at: cpufreq_online+0x590/0x960
 #6: ffffffc012a9e770 (cpuset_mutex){..}-{3:3}, at: cpuset_lock+0x24/0x30
 #7: ffffff880567bd80 (&p->pi_lock){..}-{2:2}, at: task_rq_lock+0x44/0xf0
 #8: ffffff887aff50d8 (&rq->lock){..}-{2:2}, at: task_rq_lock+0x5c/0xf0
stack backtrace:
CPU: 2 PID: 106 Comm: kworker/2:2 Not tainted 5.10.194-yocto-standard #1
Hardware name: ZynqMP ZCU102 Rev1.0 (DT)
Workqueue: events deferred_probe_work_func
Call trace:
 dump_backtrace+0x0/0x1a4
 show_stack+0x20/0x2c
 dump_stack+0xf0/0x13c
 lockdep_rcu_suspicious+0xe4/0xf8
 inc_dl_tasks_cs+0xb8/0xbc
 switched_to_dl+0x38/0x280
 __sched_setscheduler+0x204/0x860
 sched_setattr_nocheck+0x20/0x30
 sugov_init+0x1b8/0x380
 cpufreq_init_governor.part.0+0x60/0xe0
 cpufreq_set_policy+0x1d0/0x33c
 cpufreq_online+0x35c/0x960
 cpufreq_add_dev+0x8c/0xa0
 subsys_interface_register+0x10c/0x120
 cpufreq_register_driver+0x148/0x2a4
 dt_cpufreq_probe+0x288/0x3d0
 platform_drv_probe+0x5c/0xb0
 really_probe+0xe0/0x4ac
 driver_probe_device+0x60/0xf4
 __device_attach_driver+0xc0/0x12c
 bus_for_each_drv+0x80/0xe0
 __device_attach+0xb0/0x1c0
 device_initial_probe+0x1c/0x30
 bus_probe_device+0xa8/0xb0
 deferred_probe_work_func+0x94/0xd0
 process_one_work+0x2b8/0x634
 worker_thread+0x7c/0x474
 kthread+0x154/0x160
 ret_from_fork+0x10/0x34
The warning was introduced in v5.10.193 by commit 5ac05ce56843 ("sched/cpuset:
Keep track of SCHED_DEADLINE task in cpusets").
This issue was also reported for 5.15 here: https://lore.kernel.org/lkml/CA+G9fYv9xTu4bKJGy=e=KZSG5pZ+tJAmfZr=0dbuKNs=9O...
Backport commit f2aa197e4794 ("cgroup: Fix suspicious rcu_dereference_check() usage warning") and its dependencies to get rid of this warning.
Andrey Ryabinin (1):
  sched/cpuacct: Fix user/system in shown cpuacct.usage*

Chengming Zhou (3):
  sched/cpuacct: Fix charge percpu cpuusage
  sched/cpuacct: Optimize away RCU read lock
  cgroup: Fix suspicious rcu_dereference_check() usage warning

 include/linux/cgroup.h |  3 +-
 kernel/sched/cpuacct.c | 84 +++++++++++++++++-------------------
 2 files changed, 35 insertions(+), 52 deletions(-)
From: Andrey Ryabinin <arbn@yandex-team.com>
commit dd02d4234c9a2214a81c57a16484304a1a51872a upstream.
cpuacct has 2 different ways of accounting and showing user and system times.
The first one uses cpuacct_account_field() to account times and the cpuacct.stat file to expose them, and this one seems to work fine.
The second one uses the cpuacct_charge() function for accounting and a set of cpuacct.usage* files to show times. Despite some attempts to fix it in the past, it still doesn't work: sometimes, while running a KVM guest, cpuacct_charge() accounts most of the guest time as system time. This doesn't match the user/system times shown in cpuacct.stat or /proc/<pid>/stat.
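For reference, here is a rough sketch of the pre-patch cpuacct_charge() (the lines removed by the diff below): the user/system split is guessed from the interrupted register state rather than taken from the cpustat fields maintained by cpuacct_account_field():

  void cpuacct_charge(struct task_struct *tsk, u64 cputime)
  {
          struct cpuacct *ca;
          int index = CPUACCT_STAT_SYSTEM;
          struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);

          /* Guess user vs. system from the interrupted registers; guest
           * execution is often not recognized as user mode here, so it
           * ends up charged as system time. */
          if (regs && user_mode(regs))
                  index = CPUACCT_STAT_USER;

          rcu_read_lock();
          for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
                  __this_cpu_add(ca->cpuusage->usages[index], cputime);
          rcu_read_unlock();
  }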
Demonstration:
 # git clone https://github.com/aryabinin/kvmsample
 # make
 # mkdir /sys/fs/cgroup/cpuacct/test
 # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
 # ./kvmsample &
 # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
 1976535645
 2979839428
 3979832704
 4983603153
 5983604157
Use the cpustats accounted in cpuacct_account_field() as the source of user/sys times for the cpuacct.usage* files, and make cpuacct_charge() account only the summary execution time.
Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com
[OP: adjusted context for v5.10]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
 kernel/sched/cpuacct.c | 79 +++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 47 deletions(-)
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..8a260115a137 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -21,15 +21,11 @@ static const char * const cpuacct_stat_desc[] = {
 	[CPUACCT_STAT_SYSTEM]	= "system",
 };
 
-struct cpuacct_usage {
-	u64	usages[CPUACCT_STAT_NSTATS];
-};
-
 /* track CPU usage of a group of tasks and its child groups */
 struct cpuacct {
 	struct cgroup_subsys_state	css;
 	/* cpuusage holds pointer to a u64-type object on every CPU */
-	struct cpuacct_usage __percpu	*cpuusage;
+	u64 __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
 };
 
@@ -49,7 +45,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +64,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+	ca->cpuusage = alloc_percpu(u64);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -99,7 +95,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 				 enum cpuacct_stat_index index)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 	u64 data;
 
 	/*
@@ -115,14 +112,17 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
 
-	if (index == CPUACCT_STAT_NSTATS) {
-		int i = 0;
-
-		data = 0;
-		for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-			data += cpuusage->usages[i];
-	} else {
-		data = cpuusage->usages[index];
+	switch (index) {
+	case CPUACCT_STAT_USER:
+		data = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
+		break;
+	case CPUACCT_STAT_SYSTEM:
+		data = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
+			cpustat[CPUTIME_SOFTIRQ];
+		break;
+	case CPUACCT_STAT_NSTATS:
+		data = *cpuusage;
+		break;
 	}
 
 #ifndef CONFIG_64BIT
@@ -132,10 +132,14 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	int i;
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
+
+	/* Don't allow to reset global kernel_cpustat */
+	if (ca == &root_cpuacct)
+		return;
 
 #ifndef CONFIG_64BIT
 	/*
@@ -143,9 +147,10 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
-
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		cpuusage->usages[i] = val;
+	*cpuusage = 0;
+	cpustat[CPUTIME_USER] = cpustat[CPUTIME_NICE] = 0;
+	cpustat[CPUTIME_SYSTEM] = cpustat[CPUTIME_IRQ] = 0;
+	cpustat[CPUTIME_SOFTIRQ] = 0;
 
 #ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
@@ -196,7 +201,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		return -EINVAL;
 
 	for_each_possible_cpu(cpu)
-		cpuacct_cpuusage_write(ca, cpu, 0);
+		cpuacct_cpuusage_write(ca, cpu);
 
 	return 0;
 }
@@ -243,25 +248,10 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	seq_puts(m, "\n");
 
 	for_each_possible_cpu(cpu) {
-		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
 		seq_printf(m, "%d", cpu);
-
-		for (index = 0; index < CPUACCT_STAT_NSTATS; index++) {
-#ifndef CONFIG_64BIT
-			/*
-			 * Take rq->lock to make 64-bit read safe on 32-bit
-			 * platforms.
-			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-#endif
-
-			seq_printf(m, " %llu", cpuusage->usages[index]);
-
-#ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#endif
-		}
+		for (index = 0; index < CPUACCT_STAT_NSTATS; index++)
+			seq_printf(m, " %llu",
+				   cpuacct_cpuusage_read(ca, cpu, index));
 		seq_puts(m, "\n");
 	}
 	return 0;
@@ -339,16 +329,11 @@ static struct cftype files[] = {
 void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct cpuacct *ca;
-	int index = CPUACCT_STAT_SYSTEM;
-	struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);
-
-	if (regs && user_mode(regs))
-		index = CPUACCT_STAT_USER;
 
 	rcu_read_lock();
 
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
-		__this_cpu_add(ca->cpuusage->usages[index], cputime);
+		__this_cpu_add(*ca->cpuusage, cputime);
 
 	rcu_read_unlock();
 }
From: Chengming Zhou <zhouchengming@bytedance.com>
commit 248cc9993d1cc12b8e9ed716cc3fc09f6c3517dd upstream.
cpuacct_account_field() is always called by the current task itself, so it's OK to use __this_cpu_add() to charge the tick time.
But cpuacct_charge() may be called by update_curr() during load_balance() on a random CPU, different from the CPU the task is running on, so __this_cpu_add() charges that cputime to the wrong CPU.
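A minimal sketch of the difference (illustrative only; the surrounding code is the loop shown in the diff below). __this_cpu_add() always updates the per-CPU slot of the CPU executing the accounting code, so the charge has to be directed explicitly at the CPU the task is running on:

  unsigned int cpu = task_cpu(tsk);       /* CPU the task runs on */

  /* Buggy: charges the CPU running this code (e.g. the load balancer). */
  __this_cpu_add(*ca->cpuusage, cputime);

  /* Fixed: charge the per-CPU counter of the task's own CPU. */
  *per_cpu_ptr(ca->cpuusage, cpu) += cputime;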
Fixes: 73e6aafd9ea8 ("sched/cpuacct: Simplify the cpuacct code")
Reported-by: Minye Zhu <zhuminye@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
 kernel/sched/cpuacct.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 8a260115a137..3c59c541dd31 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -328,12 +328,13 @@ static struct cftype files[] = {
  */
 void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
+	unsigned int cpu = task_cpu(tsk);
 	struct cpuacct *ca;
 
 	rcu_read_lock();
 
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
-		__this_cpu_add(*ca->cpuusage, cputime);
+		*per_cpu_ptr(ca->cpuusage, cpu) += cputime;
 
 	rcu_read_unlock();
 }
From: Chengming Zhou <zhouchengming@bytedance.com>
commit dc6e0818bc9a0336d9accf3ea35d146d72aa7a18 upstream.
Since cpuacct_charge() is called from the scheduler's update_curr(), the rq lock must already be held, so the RCU read lock can be optimized away.
Do the same thing in its wrapper cgroup_account_cputime(); however, we can't use lockdep_assert_rq_held() there, since it is defined in kernel/sched/sched.h.
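To make the assumed locking context concrete, here is a rough illustration (not the actual scheduler code): update_curr() is reached with the runqueue lock held, and holding that raw spinlock disables preemption, which already constitutes an RCU-sched read-side critical section, so an explicit rcu_read_lock()/rcu_read_unlock() pair around the cgroup walk adds nothing:

  /* Illustrative sketch of the caller's locking context only. */
  static void update_curr_sketch(struct rq *rq, struct task_struct *curr,
                                 u64 delta_exec)
  {
          lockdep_assert_held(&rq->lock);   /* rq lock held by the caller */
          /* preemption disabled here => RCU-sched read-side critical section */
          cgroup_account_cputime(curr, delta_exec);  /* -> cpuacct_charge() */
  }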
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220220051426.5274-2-zhouchengming@bytedance.com
[OP: adjusted lockdep_assert_rq_held() -> lockdep_assert_held()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
 include/linux/cgroup.h | 2 --
 kernel/sched/cpuacct.c | 4 +---
 2 files changed, 1 insertion(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 959b370733f0..7653f5418950 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -779,11 +779,9 @@ static inline void cgroup_account_cputime(struct task_struct *task,
 
 	cpuacct_charge(task, delta_exec);
 
-	rcu_read_lock();
 	cgrp = task_dfl_cgroup(task);
 	if (cgroup_parent(cgrp))
 		__cgroup_account_cputime(cgrp, delta_exec);
-	rcu_read_unlock();
 }
 
 static inline void cgroup_account_cputime_field(struct task_struct *task,
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3c59c541dd31..8ee298321d78 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -331,12 +331,10 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 	unsigned int cpu = task_cpu(tsk);
 	struct cpuacct *ca;
 
-	rcu_read_lock();
+	lockdep_assert_held(&cpu_rq(cpu)->lock);
 
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
 		*per_cpu_ptr(ca->cpuusage, cpu) += cputime;
-
-	rcu_read_unlock();
 }
 
 /*
From: Chengming Zhou <zhouchengming@bytedance.com>
commit f2aa197e4794bf4c2c0c9570684f86e6fa103e8b upstream.
task_css_set_check() uses rcu_dereference_check() to check for rcu_read_lock_held() on the read side, which no longer holds after commit dc6e0818bc9a ("sched/cpuacct: Optimize away RCU read lock"): that commit dropped the explicit rcu_read_lock() and switched to an RCU-sched read-side critical section. So fix the RCU warning by adding a check for rcu_read_lock_sched_held().
Fixes: dc6e0818bc9a ("sched/cpuacct: Optimize away RCU read lock")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: syzbot+16e3f2c77e7c5a0113f9@syzkaller.appspotmail.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20220305034103.57123-1-zhouchengming@bytedance.com
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
 include/linux/cgroup.h | 1 +
 1 file changed, 1 insertion(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7653f5418950..c9c430712d47 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -451,6 +451,7 @@ extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
+			rcu_read_lock_sched_held() ||			\
 			lockdep_is_held(&cgroup_mutex) ||		\
 			lockdep_is_held(&css_set_lock) ||		\
 			((task)->flags & PF_EXITING) || (__c))
On Fri, Sep 29, 2023 at 04:14:14PM +0300, ovidiu.panait@windriver.com wrote:
> Backport commit f2aa197e4794 ("cgroup: Fix suspicious rcu_dereference_check() usage warning") and its dependencies to get rid of this warning.
> 
> Andrey Ryabinin (1):
>   sched/cpuacct: Fix user/system in shown cpuacct.usage*
> 
> Chengming Zhou (3):
>   sched/cpuacct: Fix charge percpu cpuusage
>   sched/cpuacct: Optimize away RCU read lock
>   cgroup: Fix suspicious rcu_dereference_check() usage warning
I've queued up this and the 5.15 fix, thanks!