Hi all,
I am sending this patch series here before posting it for public review. Please have a look and share your thoughts on its correctness.
This set of patches captures latency events caused by interrupts and preemption being disabled in the kernel. The patches are based on the hist trigger feature developed by Tom Zanussi.
Git-commit: 7ef224d1d0e3a1ade02d02c01ce1dcffb736d2c3
As mentioned by Daniel, there is also a good write up in the following blog by Brendan Gregg: http://www.brendangregg.com/blog/2016-06-08/linux-hist-triggers.html
The perf interface for this has not been developed yet. Related efforts: https://patchwork.kernel.org/patch/8439401
hwlat_detector tracer: https://lkml.org/lkml/2016/8/4/348 https://lkml.org/lkml/2016/8/4/346
The patch series also contains histogram triggers for missed hrtimer offsets.
Dependencies: CONFIG_IRQSOFF_TRACER, CONFIG_PREEMPT_TRACER, CONFIG_PROVE_LOCKING, CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
Usage of triggers to generate histograms:
mount -t debugfs nodev /sys/kernel/debug
echo 'hist:key=latency.log2:val=hitcount:sort=latency' > /sys/kernel/debug/tracing/events/latency/latency_irqs/trigger
echo 'hist:key=common_pid.execname' > /sys/kernel/debug/tracing/events/latency/latency_hrtimer_interrupt/trigger
CPU specific breakdown of events:
echo 'hist:key=cpu,latency:val=hitcount:sort=latency' > /sys/kernel/debug/tracing/events/latency/latency_preempt/trigger
echo 'hist:key=cpu,latency:val=hitcount:sort=latency if cpu==1' > /sys/kernel/debug/tracing/events/latency/latency_preempt/trigger
Histogram output:
cat /sys/kernel/debug/tracing/events/latency/latency_irqs/hist
cat /sys/kernel/debug/tracing/events/latency/latency_preempt/hist
cat /sys/kernel/debug/tracing/events/latency/latency_critical_timings/hist
cat /sys/kernel/debug/tracing/events/latency/latency_hrtimer_interrupt/hist
Disable a trigger:
echo '!hist:key=latency.log2' > /sys/kernel/debug/tracing/events/latency/latency_irqs/trigger
TODO:
1. perf interface. Not sure if this is needed.
2. Effective wakeup latencies of individual processes due to irqs/preemption being disabled or due to missed timers.
Excerpt from Carsten Emde's patches: Effective latencies occur during the wakeup of a process. To determine effective latencies, the time stamp when a process is scheduled to be woken up has to be stored, and it determines the duration of the wakeup shortly before control is passed over to this process. Note that the apparent latency in user space may be somewhat longer, since the process may be interrupted after control is passed over to it but before execution in user space takes place. Simply measuring the interval between enqueuing and wakeup may also not be appropriate in cases when a process is scheduled as a result of a timer expiration. The timer may have missed its deadline, e.g. due to disabled interrupts, but this latency would not be registered. Therefore, the offsets of missed timers are recorded in a separate histogram. If both wakeup latency and missed timer offsets are configured and enabled, a third histogram may be enabled that records the overall latency as a sum of the timer latency, if any, and the wakeup latency. This histogram is called "timerandwakeup".
Thanks, Binoy
Binoy Jayan (1):
  latency: Histogram for missed timer offsets

Daniel Wagner (2):
  tracing: Dereference pointers without RCU checks
  tracing: Add trace_irqsoff tracepoints

 include/linux/hrtimer.h             |  1 +
 include/linux/rculist.h             | 36 ++++++++++++++++++
 include/linux/tracepoint.h          |  4 +-
 include/trace/events/latency.h      | 73 +++++++++++++++++++++++++++++++++++++
 kernel/time/hrtimer.c               | 27 ++++++++++++++
 kernel/trace/trace_events_trigger.c |  6 +--
 kernel/trace/trace_irqsoff.c        | 47 +++++++++++++++++++++++-
 7 files changed, 188 insertions(+), 6 deletions(-)
 create mode 100644 include/trace/events/latency.h
From: Daniel Wagner daniel.wagner@bmw-carit.de
The tracepoint can't be used in a code section where we are in the middle of a state transition.
For example, if we place a tracepoint inside start/stop_critical_section(), lockdep complains with:
[    0.035589] WARNING: CPU: 0 PID: 3 at kernel/locking/lockdep.c:3560 check_flags.part.36+0x1bc/0x210()
[    0.036000] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[    0.036000] Kernel panic - not syncing: panic_on_warn set ...
[    0.036000]
[    0.036000] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc7+ #460
[    0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[    0.036000]  ffffffff81f2463a ffff88007c93bb98 ffffffff81afb317 0000000000000001
[    0.036000]  ffffffff81f212b3 ffff88007c93bc18 ffffffff81af7bc2 ffff88007c93bbb8
[    0.036000]  ffffffff00000008 ffff88007c93bc28 ffff88007c93bbc8 000000000093bbd8
[    0.036000] Call Trace:
[    0.036000]  [<ffffffff81afb317>] dump_stack+0x4f/0x7b
[    0.036000]  [<ffffffff81af7bc2>] panic+0xc0/0x1e9
[    0.036000]  [<ffffffff81b07188>] ? _raw_spin_unlock_irqrestore+0x38/0x80
[    0.036000]  [<ffffffff81052a60>] warn_slowpath_common+0xc0/0xc0
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff81052aa6>] warn_slowpath_fmt+0x46/0x50
[    0.036000]  [<ffffffff810a468c>] check_flags.part.36+0x1bc/0x210
[    0.036000]  [<ffffffff810a5168>] lock_is_held+0x78/0x90
[    0.036000]  [<ffffffff81057116>] ? __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] ? __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff81134495>] trace_preempt_on+0x255/0x260
[    0.036000]  [<ffffffff81081c0b>] preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81057116>] __local_bh_enable+0x36/0x70
[    0.036000]  [<ffffffff810588ab>] __do_softirq+0x3db/0x500
[    0.036000]  [<ffffffff810589ef>] run_ksoftirqd+0x1f/0x60
[    0.036000]  [<ffffffff81079653>] smpboot_thread_fn+0x193/0x2a0
[    0.036000]  [<ffffffff810794c0>] ? SyS_setgroups+0x150/0x150
[    0.036000]  [<ffffffff81075312>] kthread+0xf2/0x110
[    0.036000]  [<ffffffff81b026b3>] ? wait_for_completion+0xc3/0x120
[    0.036000]  [<ffffffff81081c0b>] ? preempt_count_sub+0xab/0xf0
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000]  [<ffffffff81b07cc8>] ret_from_fork+0x58/0x90
[    0.036000]  [<ffffffff81075220>] ? kthread_create_on_node+0x240/0x240
[    0.036000] ---[ end Kernel panic - not syncing: panic_on_warn set ...
PeterZ was so kind to explain it to me what is happening:
"__local_bh_enable() tests if this is the last SOFTIRQ_OFFSET; if so, it tells lockdep that softirqs are enabled with trace_softirqs_on(). After that we go and actually modify the preempt_count with preempt_count_sub(). Then in preempt_count_sub() you call into trace_preempt_on() if this was the last preempt_count increment, but you do that _before_ you actually change the preempt_count with __preempt_count_sub(). At this point lockdep and preempt_count think the world differs and *boom*"
So the simplest way to avoid this is by disabling the consistency checks.
We also need to take care of the iteration in trace_events_trigger.c to avoid a lockdep splat in conjunction with the hist trigger.
Not for inclusion!
Not-Signed-off-by: Daniel Wagner daniel.wagner@bmw-carit.de
Not-Signed-off-by: Binoy Jayan binoy.jayan@linaro.org
---
 include/linux/rculist.h             | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/tracepoint.h          |  4 ++--
 kernel/trace/trace_events_trigger.c |  6 +++---
 3 files changed, 41 insertions(+), 5 deletions(-)
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8beb98d..bee836b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -279,6 +279,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	container_of(lockless_dereference(ptr), type, member)
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry (for tracing)
+ * @ptr:	the &struct list_head pointer.
+ * @type:	the type of the struct this is embedded in.
+ * @member:	the name of the list_head within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+	typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+	container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -391,6 +409,24 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace - iterate over rcu list of given type (for tracing)
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_head within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+	for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+		&pos->member != (head); \
+		pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index be586c6..d0e2a82 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -138,7 +138,7 @@ extern void syscall_unregfunc(void);
 			return;						\
 		prercu;							\
 		rcu_read_lock_sched_notrace();				\
-		it_func_ptr = rcu_dereference_sched((tp)->funcs);	\
+		it_func_ptr = rcu_dereference_raw_notrace((tp)->funcs);	\
 		if (it_func_ptr) {					\
 			do {						\
 				it_func = (it_func_ptr)->func;		\
@@ -189,7 +189,7 @@ extern void syscall_unregfunc(void);
 				TP_CONDITION(cond),,);			\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			rcu_read_lock_sched_notrace();			\
-			rcu_dereference_sched(__tracepoint_##name.funcs);\
+			rcu_dereference_raw_notrace(__tracepoint_##name.funcs);\
 			rcu_read_unlock_sched_notrace();		\
 		}							\
 	}								\
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index a975571..4489fb4 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -71,14 +71,14 @@ event_triggers_call(struct trace_event_file *file, void *rec)
 	if (list_empty(&file->triggers))
 		return tt;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (!rec) {
 			data->ops->func(data, rec);
 			continue;
 		}
-		filter = rcu_dereference_sched(data->filter);
+		filter = rcu_dereference_raw_notrace(data->filter);
 		if (filter && !filter_match_preds(filter, rec))
 			continue;
 		if (event_command_post_trigger(data->cmd_ops)) {
@@ -111,7 +111,7 @@ event_triggers_post_call(struct trace_event_file *file,
 {
 	struct event_trigger_data *data;
 
-	list_for_each_entry_rcu(data, &file->triggers, list) {
+	list_for_each_entry_rcu_notrace(data, &file->triggers, list) {
 		if (data->paused)
 			continue;
 		if (data->cmd_ops->trigger_type & tt)
From: Daniel Wagner daniel.wagner@bmw-carit.de
Finally, we place a few tracepoints at the end of the critical sections. With the hist trigger in place we can generate the plots.
There are a few drawbacks compared to latency_hist.patch [1]:
The latency plots contain the values from all CPUs. In theory you can also filter with something like
'hist:key=latency.bucket:val=hitcount:sort=latency if cpu==0'
but I haven't got this working. I didn't spend much time figuring out why it doesn't work. Even if the above worked, you still wouldn't get the per-CPU breakdown of the events. I don't know if this is a must-have feature.
Another point is the need for placing at least one tracepoint so that the hist code is able to do something. Also, I think it would be a good idea to reuse the time diff from the tracer instead; some refactoring would be necessary for this. For simplicity I just added a hack for getting the time diff. And I am not really sure if it is okay to use the *_enabled() helpers this way, splitting the time stamping and the tracepoint into two sections.
Steven was tossing around the idea of introducing a 'variable' to the tracepoints which can be used for timestamping etc. I'd like to avoid placing two tracepoints and computing the time diff later; that sounds like too much overhead.
Not for inclusion!
Not-Signed-off-by: Daniel Wagner daniel.wagner@bmw-carit.de
[1] https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?...
Other changes:
1. Added the field 'cpu' to the trace event entry struct so as to capture per-cpu breakdown of events.
2. Trigger for CPU specific breakdown of events:
'hist:key=cpu,latency:val=hitcount:sort=latency'
'hist:key=cpu,latency:val=hitcount:sort=latency if cpu==1'
Not-Signed-off-by: Binoy Jayan binoy.jayan@linaro.org
---
 include/trace/events/latency.h | 42 +++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_irqsoff.c   | 47 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/latency.h
diff --git a/include/trace/events/latency.h b/include/trace/events/latency.h
new file mode 100644
index 0000000..b279e2d
--- /dev/null
+++ b/include/trace/events/latency.h
@@ -0,0 +1,42 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM latency
+
+#if !defined(_TRACE_HIST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HIST_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(latency_template,
+	TP_PROTO(int cpu, cycles_t latency),
+
+	TP_ARGS(cpu, latency),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu)
+		__field(cycles_t,	latency)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->latency = latency;
+	),
+
+	TP_printk("cpu=%d, latency=%lu", __entry->cpu, __entry->latency)
+);
+
+DEFINE_EVENT(latency_template, latency_irqs,
+	    TP_PROTO(int cpu, cycles_t latency),
+	    TP_ARGS(cpu, latency));
+
+DEFINE_EVENT(latency_template, latency_preempt,
+	    TP_PROTO(int cpu, cycles_t latency),
+	    TP_ARGS(cpu, latency));
+
+DEFINE_EVENT(latency_template, latency_critical_timings,
+	    TP_PROTO(int cpu, cycles_t latency),
+	    TP_ARGS(cpu, latency));
+
+#endif /* _TRACE_HIST_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03cdff8..857c86f 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -14,13 +14,20 @@
 #include <linux/module.h>
 #include <linux/ftrace.h>
 
+#include <trace/events/sched.h>
+
 #include "trace.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/latency.h>
+
 static struct trace_array		*irqsoff_trace __read_mostly;
 static int				tracer_enabled __read_mostly;
 
 static DEFINE_PER_CPU(int, tracing_cpu);
-
+static DEFINE_PER_CPU(cycle_t, ts_irqs);
+static DEFINE_PER_CPU(cycle_t, ts_preempt);
+static DEFINE_PER_CPU(cycle_t, ts_critical_timings);
 static DEFINE_RAW_SPINLOCK(max_trace_lock);
 
 enum {
@@ -422,6 +429,11 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 /* start and stop critical timings used to for stoppage (in idle) */
 void start_critical_timings(void)
 {
+	if (trace_latency_critical_timings_enabled()) {
+		int cpu = raw_smp_processor_id();
+		per_cpu(ts_critical_timings, cpu) = ftrace_now(cpu);
+	}
+
 	if (preempt_trace() || irq_trace())
 		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
 }
@@ -431,6 +443,14 @@ void stop_critical_timings(void)
 {
 	if (preempt_trace() || irq_trace())
 		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+
+	if (trace_latency_critical_timings_enabled()) {
+		int cpu = raw_smp_processor_id();
+		trace_latency_critical_timings(cpu,
+			ftrace_now(cpu) -
+			per_cpu(ts_critical_timings, cpu));
+	}
+
 }
 EXPORT_SYMBOL_GPL(stop_critical_timings);
 
@@ -438,6 +458,12 @@ EXPORT_SYMBOL_GPL(stop_critical_timings);
 #ifdef CONFIG_PROVE_LOCKING
 void time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
+	if (trace_latency_irqs_enabled()) {
+		int cpu = raw_smp_processor_id();
+		trace_latency_irqs(cpu,
+			ftrace_now(cpu) -
+			per_cpu(ts_irqs, cpu));
+	}
 	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(a0, a1);
 }
@@ -446,6 +472,11 @@ void time_hardirqs_off(unsigned long a0, unsigned long a1)
 {
 	if (!preempt_trace() && irq_trace())
 		start_critical_timing(a0, a1);
+
+	if (trace_latency_irqs_enabled()) {
+		int cpu = raw_smp_processor_id();
+		per_cpu(ts_irqs, cpu) = ftrace_now(cpu);
+	}
 }
 
 #else /* !CONFIG_PROVE_LOCKING */
@@ -500,9 +531,18 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
 #endif /* CONFIG_PROVE_LOCKING */
 #endif /* CONFIG_IRQSOFF_TRACER */
 
+int count = 0;
+
 #ifdef CONFIG_PREEMPT_TRACER
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
+	if (trace_latency_preempt_enabled()) {
+		int cpu = raw_smp_processor_id();
+		trace_latency_preempt(cpu,
+			ftrace_now(cpu) -
+			per_cpu(ts_preempt, cpu));
+	}
+
 	if (preempt_trace() && !irq_trace())
 		stop_critical_timing(a0, a1);
 }
@@ -511,6 +551,11 @@ void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
 	if (preempt_trace() && !irq_trace())
 		start_critical_timing(a0, a1);
+
+	if (trace_latency_preempt_enabled()) {
+		int cpu = raw_smp_processor_id();
+		per_cpu(ts_preempt, cpu) = ftrace_now(cpu);
+	}
 }
 #endif /* CONFIG_PREEMPT_TRACER */
On 08/12/2016 12:25 PM, Binoy Jayan wrote:
I think you need to update the commit message where I am babbling about what is not working. I don't mind if you completely take over this patch (becoming the author); just mention in the commit message that it was based on my work. That should do the trick.
@@ -422,6 +429,11 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
 /* start and stop critical timings used to for stoppage (in idle) */
 void start_critical_timings(void)
 {
+	if (trace_latency_critical_timings_enabled()) {
+		int cpu = raw_smp_processor_id();
+		per_cpu(ts_critical_timings, cpu) = ftrace_now(cpu);
+	}
+
 	if (preempt_trace() || irq_trace())
 		start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
 }
@@ -431,6 +443,14 @@ void stop_critical_timings(void)
 {
 	if (preempt_trace() || irq_trace())
 		stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
+
+	if (trace_latency_critical_timings_enabled()) {
+		int cpu = raw_smp_processor_id();
+		trace_latency_critical_timings(cpu,
+			ftrace_now(cpu) -
+			per_cpu(ts_critical_timings, cpu));
+	}
 }
 EXPORT_SYMBOL_GPL(stop_critical_timings);
Hmm, maybe this is just my bike shedding speaking here, but I suggest you copy the style of start_critical_timing() and stop_critical_timing() and move the implementation into a small inline function. IMO this clutters the reading flow too much. The same is true for the following hunks.
cheers, daniel
On 16 August 2016 at 17:02, Daniel Wagner daniel.wagner@bmw-carit.de wrote:
Hmm, maybe this is just my bike shedding speaking here, but I suggest you copy the style of start_critical_timing() and stop_critical_timing() and move the implementation into a small inline function. IMO this clutters the reading flow too much. The same is true for the following hunks.
cheers, daniel
Hi Daniel,
I don't think it is. :) Doing it the better way always helps, at least by not getting rejected by the kernel community. Thanks a lot for reviewing the code.
Arnd,
Do you think I should add the perf interface too? Masami's mail server was giving me bounced email replies.
- Binoy
Latencies of missed timer offsets: generate a histogram of missed timer offsets in microseconds.
Signed-off-by: Binoy Jayan binoy.jayan@linaro.org
---
 include/linux/hrtimer.h        |  1 +
 include/trace/events/latency.h | 31 +++++++++++++++++++++++++++++++
 kernel/time/hrtimer.c          | 27 +++++++++++++++++++++++++++
 3 files changed, 59 insertions(+)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 5e00f80..19cb76c 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -104,6 +104,7 @@ struct hrtimer {
 	struct hrtimer_clock_base	*base;
 	u8				state;
 	u8				is_rel;
+	ktime_t				praecox;
 #ifdef CONFIG_TIMER_STATS
 	int				start_pid;
 	void				*start_site;
diff --git a/include/trace/events/latency.h b/include/trace/events/latency.h
index b279e2d..32317a7 100644
--- a/include/trace/events/latency.h
+++ b/include/trace/events/latency.h
@@ -36,6 +36,37 @@ DEFINE_EVENT(latency_template, latency_critical_timings,
 	    TP_PROTO(int cpu, cycles_t latency),
 	    TP_ARGS(cpu, latency));
 
+TRACE_EVENT(latency_hrtimer_interrupt,
+
+	TP_PROTO(int cpu, long long toffset, struct task_struct *curr,
+		struct task_struct *task),
+
+	TP_ARGS(cpu, toffset, curr, task),
+
+	TP_STRUCT__entry(
+		__field(int,		cpu)
+		__field(long long,	toffset)
+		__array(char,		ccomm,	TASK_COMM_LEN)
+		__field(int,		cprio)
+		__array(char,		tcomm,	TASK_COMM_LEN)
+		__field(int,		tprio)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->toffset = toffset;
+		memcpy(__entry->ccomm, curr->comm, TASK_COMM_LEN);
+		__entry->cprio = curr->prio;
+		memcpy(__entry->tcomm, task != NULL ? task->comm : "<none>",
+			task != NULL ? TASK_COMM_LEN : 7);
+		__entry->tprio = task != NULL ? task->prio : -1;
+	),
+
+	TP_printk("cpu=%d toffset=%lld curr=%s[%d] thread=%s[%d]",
+		__entry->cpu, __entry->toffset, __entry->ccomm,
+		__entry->cprio, __entry->tcomm, __entry->tprio)
+);
+
 #endif /* _TRACE_HIST_H */
 
 /* This part must be outside protection */
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 9ba7c82..71e6fcb 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -53,9 +53,12 @@
 #include <asm/uaccess.h>
 
 #include <trace/events/timer.h>
+#include <trace/events/latency.h>
 
 #include "tick-internal.h"
 
+static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer);
+
 /*
  * The timer bases:
  *
@@ -992,6 +995,15 @@ void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 
 	timer_stats_hrtimer_set_start_info(timer);
 
+	if (trace_latency_hrtimer_interrupt_enabled()) {
+		ktime_t now = new_base->get_time();
+
+		if (ktime_to_ns(tim) < ktime_to_ns(now))
+			timer->praecox = now;
+		else
+			timer->praecox = ktime_set(0, 0);
+	}
+
 	leftmost = enqueue_hrtimer(timer, new_base);
 	if (!leftmost)
 		goto unlock;
@@ -1265,6 +1277,19 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	cpu_base->running = NULL;
 }
 
+static inline void trace_missed_hrtimer(struct hrtimer *timer, ktime_t basenow)
+{
+	if (trace_latency_hrtimer_interrupt_enabled())
+		trace_latency_hrtimer_interrupt(raw_smp_processor_id(),
+			ktime_to_ns(ktime_sub(ktime_to_ns(timer->praecox) ?
+				timer->praecox : hrtimer_get_expires(timer),
+				basenow)),
+			current,
+			timer->function == hrtimer_wakeup ?
+				container_of(timer, struct hrtimer_sleeper,
+					timer)->task : NULL);
+
+}
 static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 {
 	struct hrtimer_clock_base *base = cpu_base->clock_base;
@@ -1284,6 +1309,8 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
 
 		timer = container_of(node, struct hrtimer, node);
 
+		trace_missed_hrtimer(timer, basenow);
+
 		/*
 		 * The immediate goal for using the softexpires is
 		 * minimizing wakeups, not running timers at the
Hi Binoy,
On 08/12/2016 12:25 PM, Binoy Jayan wrote:
Thanks for picking this up. I suggest you also send these patches for review to lkml and probably linux-rt-users.
I am fine with dropping my 'not for inclusion' tag and adding my SOB.
cheers, daniel