From: Wei Li <liwei391(a)huawei.com>
There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:
```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
<TASK>
? __warn+0x7c/0x110
? debug_print_object+0x7d/0xb0
? report_bug+0xf1/0x1d0
? prb_read_valid+0x17/0x20
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? debug_print_object+0x7d/0xb0
? debug_print_object+0x7d/0xb0
? __pfx_timerlat_irq+0x10/0x10
__debug_object_init+0x110/0x150
hrtimer_init+0x1d/0x60
timerlat_main+0xab/0x2d0
? __pfx_timerlat_main+0x10/0x10
kthread+0xb7/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
```
After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:
T1 | T2
[CPUHP_ONLINE] | cpu_device_down()
osnoise_hotplug_workfn() |
| cpus_write_lock()
| takedown_cpu(1)
| cpus_write_unlock()
[CPUHP_OFFLINE] |
cpus_read_lock() |
start_kthread(1) |
cpus_read_unlock() |
To fix this, skip online processing if the CPU is already offline.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_osnoise.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index e22567174dd3..a50ed23bee77 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2097,6 +2097,8 @@ static void osnoise_hotplug_workfn(struct work_struct *dummy)
mutex_lock(&interface_lock);
cpus_read_lock();
+ if (!cpu_online(cpu))
+ goto out_unlock;
if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
goto out_unlock;
--
2.45.2
From: Wei Li <liwei391(a)huawei.com>
osnoise_hotplug_workfn() is the asynchronous online callback for
"trace/osnoise:online". It may be congested when a CPU goes online and
offline repeatedly and is invoked for multiple times after a certain
online.
This will lead to kthread leak and timer corruption. Add a check
in start_kthread() to prevent this situation.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391(a)huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_osnoise.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 1439064f65d6..d1a539913a5f 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2007,6 +2007,10 @@ static int start_kthread(unsigned int cpu)
void *main = osnoise_main;
char comm[24];
+ /* Do not start a new thread if it is already running */
+ if (per_cpu(per_cpu_osnoise_var, cpu).kthread)
+ return 0;
+
if (timerlat_enabled()) {
snprintf(comm, 24, "timerlat/%d", cpu);
main = timerlat_main;
@@ -2061,11 +2065,10 @@ static int start_per_cpu_kthreads(void)
if (cpumask_test_and_clear_cpu(cpu, &kthread_cpumask)) {
struct task_struct *kthread;
- kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
+ kthread = xchg_relaxed(&(per_cpu(per_cpu_osnoise_var, cpu).kthread), NULL);
if (!WARN_ON(!kthread))
kthread_stop(kthread);
}
- per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
}
for_each_cpu(cpu, current_mask) {
--
2.45.2
From: Eder Zulian <ezulian(a)redhat.com>
The help text in osnoise top and timerlat top had some minor errors
and omissions. The -d option was missing the 's' (second) abbreviation and
the error message for '-d' used '-D'.
Cc: stable(a)vger.kernel.org
Fixes: 1eceb2fc2ca54 ("rtla/osnoise: Add osnoise top mode")
Fixes: a828cd18bc4ad ("rtla: Add timerlat tool and timelart top mode")
Link: https://lore.kernel.org/20240813155831.384446-1-ezulian@redhat.com
Suggested-by: Tomas Glozar <tglozar(a)redhat.com>
Reviewed-by: Tomas Glozar <tglozar(a)redhat.com>
Signed-off-by: Eder Zulian <ezulian(a)redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
tools/tracing/rtla/src/osnoise_top.c | 2 +-
tools/tracing/rtla/src/timerlat_top.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/tracing/rtla/src/osnoise_top.c b/tools/tracing/rtla/src/osnoise_top.c
index 2f756628613d..30e3853076a0 100644
--- a/tools/tracing/rtla/src/osnoise_top.c
+++ b/tools/tracing/rtla/src/osnoise_top.c
@@ -442,7 +442,7 @@ struct osnoise_top_params *osnoise_top_parse_args(int argc, char **argv)
case 'd':
params->duration = parse_seconds_duration(optarg);
if (!params->duration)
- osnoise_top_usage(params, "Invalid -D duration\n");
+ osnoise_top_usage(params, "Invalid -d duration\n");
break;
case 'e':
tevent = trace_event_alloc(optarg);
diff --git a/tools/tracing/rtla/src/timerlat_top.c b/tools/tracing/rtla/src/timerlat_top.c
index 8c16419fe22a..210b0f533534 100644
--- a/tools/tracing/rtla/src/timerlat_top.c
+++ b/tools/tracing/rtla/src/timerlat_top.c
@@ -459,7 +459,7 @@ static void timerlat_top_usage(char *usage)
" -c/--cpus cpus: run the tracer only on the given cpus",
" -H/--house-keeping cpus: run rtla control threads only on the given cpus",
" -C/--cgroup[=cgroup_name]: set cgroup, if no cgroup_name is passed, the rtla's cgroup will be inherited",
- " -d/--duration time[m|h|d]: duration of the session in seconds",
+ " -d/--duration time[s|m|h|d]: duration of the session",
" -D/--debug: print debug info",
" --dump-tasks: prints the task running on all CPUs if stop conditions are met (depends on !--no-aa)",
" -t/--trace[file]: save the stopped trace to [file|timerlat_trace.txt]",
@@ -613,7 +613,7 @@ static struct timerlat_top_params
case 'd':
params->duration = parse_seconds_duration(optarg);
if (!params->duration)
- timerlat_top_usage("Invalid -D duration\n");
+ timerlat_top_usage("Invalid -d duration\n");
break;
case 'e':
tevent = trace_event_alloc(optarg);
--
2.45.2
Hi Vladimir!
On Fri, 2024-10-04 at 14:57 +0300, Vladimir Oltean wrote:
> On Fri, Oct 04, 2024 at 01:36:54PM +0200, A. Sverdlin wrote:
> > From: Anatolij Gustschin <agust(a)denx.de>
> >
> > Accessing device registers seems to be not reliable, the chip
> > revision is sometimes detected wrongly (0 instead of expected 1).
> >
> > Ensure that the chip reset is performed via reset GPIO and then
> > wait for 'Device Ready' status in HW_CFG register before doing
> > any register initializations.
> >
> > Signed-off-by: Anatolij Gustschin <agust(a)denx.de>
> > [alex: reworked using read_poll_timeout()]
> > Signed-off-by: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
> > ---
>
> Reviewed-by: Vladimir Oltean <olteanv(a)gmail.com>
>
> I see you're missing a Fixes: tag (meaning in this case: backport down
> to all stable tree containing this commit). I think you can just post it
> as a reply to this email, without resending, and it should get picked up
> by maintainers through the same mechanism as Reviewed-by: tags.
Well, the only meaningful would be the very first commit for lan9303:
Fixes: a1292595e006 ("net: dsa: add new DSA switch driver for the SMSC-LAN9303")
Cc: <stable(a)vger.kernel.org>
--
Alexander Sverdlin
Siemens AG
www.siemens.com
The value of an arithmetic expression period_ns * 1000 is subject
to overflow due to a failure to cast operands to a larger data
type before performing arithmetic
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 3e90b1c7ebe9 ("staging: comedi: ni_tio: tidy up ni_tio_set_clock_src() and helpers")
Cc: <stable(a)vger.kernel.org> # v5.15+
Reviewed-by: Ian Abbott <abbotti(a)mev.co.uk>
Signed-off-by: Denis Arefev <arefev(a)swemel.ru>
Signed-off-by: Ian Abbott <abbotti(a)mev.co.uk>
---
V1 -> V2: Oh, good point. It should be 1000ULL.
drivers/comedi/drivers/ni_tio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/comedi/drivers/ni_tio.c b/drivers/comedi/drivers/ni_tio.c
index da6826d77e60..acc914903c70 100644
--- a/drivers/comedi/drivers/ni_tio.c
+++ b/drivers/comedi/drivers/ni_tio.c
@@ -800,7 +800,7 @@ static int ni_tio_set_clock_src(struct ni_gpct *counter,
GI_PRESCALE_X2(counter_dev->variant) |
GI_PRESCALE_X8(counter_dev->variant), bits);
}
- counter->clock_period_ps = period_ns * 1000;
+ counter->clock_period_ps = period_ns * 1000ULL;
ni_tio_set_sync_mode(counter);
return 0;
}
--
2.25.1