Hi,
Here is an RFC patch series to support polling on an event's 'hist' file.
There has been interest in allowing user programs to monitor kernel events in real time. Ftrace provides the `trace_pipe` interface to wait on events in the ring buffer, but a reader has to wait until a page of the ring buffer is filled with events. We can also peek at the `trace` file periodically, but that is an inefficient way to monitor an event that happens at random times.
This patch set allows the user to `poll` (or `select`, `epoll`) on the event histogram interface. As you know, each event has its own `hist` file, which shows the histograms generated by trigger actions. So the user can set a new hist trigger on any event they want to monitor, and poll on the `hist` file until it is updated.
Two poll events are supported, POLLIN and POLLPRI. POLLIN means that there is a readable update on the `hist` file, and this event is cleared only when you call read(). So it is useful if you want to read the histogram periodically. The other event, POLLPRI, is for monitoring the trace event. Like POLLIN, it is returned when the histogram is updated, but you don't need to read() the file before calling poll() again.
Note that this waits for a histogram update (not event arrival), so you must set a histogram on the event first.
Here is an example usage:
----
TRACEFS=/sys/kernel/tracing
EVENT=$TRACEFS/events/sched/sched_process_free

# setup histogram trigger and enable event
echo "hist:key=comm" >> $EVENT/trigger
echo 1 > $EVENT/enable

# Wait for update
poll $EVENT/hist

# Event arrived.
echo "process free event is coming"
tail $TRACEFS/trace
----
The 'poll' command is in the selftest patch.
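For a program that wants the same wait with a timeout, poll() could be wrapped roughly as below. This is only a sketch, not the selftest helper: `wait_for_events()` and its return convention are hypothetical names made up for illustration, and it assumes the caller has already opened the `hist` file read-only.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of this series): wait until fd reports one
 * of the requested poll events, or until timeout_ms expires (-1 waits
 * forever). Returns the revents mask, 0 on timeout, or -1 on error with
 * errno set.
 */
static int wait_for_events(int fd, short events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = events };
	int ret;

	/* Retry if a signal interrupts the wait; the timeout restarts. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;	/* timed out, no update seen */

	/* May also carry POLLERR/POLLHUP/POLLNVAL, which poll() always reports. */
	return pfd.revents;
}
```

With an fd opened on $EVENT/hist, `wait_for_events(fd, POLLPRI, 5000)` would wait up to five seconds for the next histogram update.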
You can also get this series from here:
https://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux.git/log/?h=to...
Thank you,
---
Masami Hiramatsu (Google) (3):
      tracing/hist: Add poll(POLLIN) support on hist file
      tracing/hist: Support POLLPRI event for poll on histogram
      selftests/tracing: Add hist poll() support test

 include/linux/trace_events.h                       |    5 +
 kernel/trace/trace_events.c                        |   18 ++++
 kernel/trace/trace_events_hist.c                   |  101 +++++++++++++++++++-
 tools/testing/selftests/ftrace/Makefile            |    3 +
 tools/testing/selftests/ftrace/poll.c              |   34 +++++++
 .../ftrace/test.d/trigger/trigger-hist-poll.tc     |   46 +++++++++
 6 files changed, 204 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/poll.c
 create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc

--
Masami Hiramatsu (Google) mhiramat@kernel.org
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Add poll syscall support on the `hist` file. The waiter will be woken up with POLLIN when the histogram is updated.
Currently, there is no way to wait for a specific event in userspace. So the user needs to peek at the `trace` file periodically, or wait on `trace_pipe`. But peeking at `trace` is not a good way to catch an event that happens at random times, and `trace_pipe` does not return until a page is filled with events.
This allows the user to wait for a specific event on the `hist` file. The user can set a histogram trigger on the event they want to monitor, and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return immediately unless you read() the hist file.
NOTE: To read the hist file again, you must set the file offset to 0, but if you only want to monitor the event, you may not need to read the histogram.
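To illustrate the NOTE above, a reader that wants the current histogram text after each POLLIN could rewind before reading, roughly as below. This is a sketch under assumptions: `reread_from_start()` is a hypothetical name, and it uses a caller-supplied fixed-size buffer, so a histogram longer than the buffer is truncated.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: rewind fd to offset 0 and read up to len - 1 bytes
 * into buf, NUL-terminating the result. Returns the number of bytes read,
 * or -1 on error with errno set.
 */
static ssize_t reread_from_start(int fd, char *buf, size_t len)
{
	size_t total = 0;
	ssize_t n;

	if (len == 0) {
		errno = EINVAL;
		return -1;
	}
	/* The hist file has to be read again from the beginning. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;

	while (total < len - 1) {
		n = read(fd, buf + total, len - 1 - total);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		if (n == 0)
			break;			/* EOF */
		total += (size_t)n;
	}
	buf[total] = '\0';
	return (ssize_t)total;
}
```

A POLLIN consumer would call this after each wakeup and then poll() again.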
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 include/linux/trace_events.h     |  5 +++
 kernel/trace/trace_events.c      | 18 +++++++++
 kernel/trace/trace_events_hist.c | 76 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 96 insertions(+), 3 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 9df3e2973626..0d496e2d1064 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -663,6 +663,11 @@ struct trace_event_file {
 	struct trace_subsystem_dir	*system;
 	struct list_head		triggers;

+#ifdef CONFIG_HIST_TRIGGERS
+	struct irq_work			hist_work;
+	wait_queue_head_t		hist_wq;
+#endif
+
 	/*
 	 * 32 bit flags:
 	 *   bit 0:		enabled
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 6ef29eba90ce..07ce5b024dd9 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2965,6 +2965,20 @@ static bool event_in_systems(struct trace_event_call *call,
 	return !*p || isspace(*p) || *p == ',';
 }

+#ifdef CONFIG_HIST_TRIGGERS
+/*
+ * Wake up waiter on the hist_wq from irq_work because the hist trigger
+ * may happen in any context.
+ */
+static void hist_event_irq_work(struct irq_work *work)
+{
+	struct trace_event_file *event_file;
+
+	event_file = container_of(work, struct trace_event_file, hist_work);
+	wake_up_all(&event_file->hist_wq);
+}
+#endif
+
 static struct trace_event_file *
 trace_create_new_event(struct trace_event_call *call,
 		       struct trace_array *tr)
@@ -2996,6 +3010,10 @@ trace_create_new_event(struct trace_event_call *call,
 	atomic_set(&file->tm_ref, 0);
 	INIT_LIST_HEAD(&file->triggers);
 	list_add(&file->list, &tr->events);
+#ifdef CONFIG_HIST_TRIGGERS
+	init_irq_work(&file->hist_work, hist_event_irq_work);
+	init_waitqueue_head(&file->hist_wq);
+#endif
 	event_file_get(file);

 	return file;
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 6ece1308d36a..136d91139949 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5314,6 +5314,9 @@ static void event_hist_trigger(struct event_trigger_data *data,

 	if (resolve_var_refs(hist_data, key, var_ref_vals, true))
 		hist_trigger_actions(hist_data, elt, buffer, rec, rbe, key, var_ref_vals);
+
+	if (hist_data->event_file && wq_has_sleeper(&hist_data->event_file->hist_wq))
+		irq_work_queue(&hist_data->event_file->hist_work);
 }

 static void hist_trigger_stacktrace_print(struct seq_file *m,
@@ -5593,15 +5596,36 @@ static void hist_trigger_show(struct seq_file *m,
 		   n_entries, (u64)atomic64_read(&hist_data->map->drops));
 }

+struct hist_file_data {
+	struct file *file;
+	u64 last_read;
+};
+
+static u64 get_hist_hit_count(struct trace_event_file *event_file)
+{
+	struct hist_trigger_data *hist_data;
+	struct event_trigger_data *data;
+	u64 ret = 0;
+
+	list_for_each_entry(data, &event_file->triggers, list) {
+		if (data->cmd_ops->trigger_type == ETT_EVENT_HIST) {
+			hist_data = data->private_data;
+			ret += atomic64_read(&hist_data->map->hits);
+		}
+	}
+	return ret;
+}
+
 static int hist_show(struct seq_file *m, void *v)
 {
+	struct hist_file_data *hist_file = m->private;
 	struct event_trigger_data *data;
 	struct trace_event_file *event_file;
 	int n = 0, ret = 0;

 	mutex_lock(&event_mutex);

-	event_file = event_file_data(m->private);
+	event_file = event_file_data(hist_file->file);
 	if (unlikely(!event_file)) {
 		ret = -ENODEV;
 		goto out_unlock;
@@ -5611,6 +5635,7 @@ static int hist_show(struct seq_file *m, void *v)
 		if (data->cmd_ops->trigger_type == ETT_EVENT_HIST)
 			hist_trigger_show(m, data, n++);
 	}
+	hist_file->last_read = get_hist_hit_count(event_file);

 out_unlock:
 	mutex_unlock(&event_mutex);
@@ -5618,24 +5643,69 @@ static int hist_show(struct seq_file *m, void *v)
 	return ret;
 }

+static __poll_t event_hist_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct trace_event_file *event_file;
+	struct seq_file *m = file->private_data;
+	struct hist_file_data *hist_file = m->private;
+	__poll_t ret = 0;
+
+	mutex_lock(&event_mutex);
+
+	event_file = event_file_data(file);
+	if (!event_file) {
+		ret = EPOLLERR;
+		goto out_unlock;
+	}
+
+	poll_wait(file, &event_file->hist_wq, wait);
+
+	if (hist_file->last_read != get_hist_hit_count(event_file))
+		ret = EPOLLIN | EPOLLRDNORM;
+
+out_unlock:
+	mutex_unlock(&event_mutex);
+
+	return ret;
+}
+
+static int event_hist_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct hist_file_data *hist_file = m->private;
+
+	kfree(hist_file);
+	return tracing_single_release_file_tr(inode, file);
+}
+
 static int event_hist_open(struct inode *inode, struct file *file)
 {
+	struct hist_file_data *hist_file;
 	int ret;

 	ret = tracing_open_file_tr(inode, file);
 	if (ret)
 		return ret;

+	hist_file = kzalloc(sizeof(*hist_file), GFP_KERNEL);
+	if (!hist_file)
+		return -ENOMEM;
+	hist_file->file = file;
+
 	/* Clear private_data to avoid warning in single_open() */
 	file->private_data = NULL;
-	return single_open(file, hist_show, file);
+	ret = single_open(file, hist_show, hist_file);
+	if (ret)
+		kfree(hist_file);
+	return ret;
 }

 const struct file_operations event_hist_fops = {
 	.open = event_hist_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = tracing_single_release_file_tr,
+	.release = event_hist_release,
+	.poll = event_hist_poll,
 };

 #ifdef CONFIG_HIST_TRIGGERS_DEBUG
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Since POLLIN is not cleared until the hist file is read, the user needs to repeat read() and poll() on the hist file to monitor the event continuously. But the read() is redundant if the user only wants to monitor events.
This adds a POLLPRI poll event on the hist file. This event is returned when the histogram has been updated since the last open(), poll() or read(). Thus it is possible to wait for the next event without read().
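A monitoring loop built on POLLPRI could look roughly like the sketch below, which counts wakeups without ever reading the file. `count_pri_updates()` is a hypothetical name, and the stop conditions (a maximum count, a timeout, or any wakeup that is not POLLPRI) are assumptions chosen to keep the example bounded.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: count POLLPRI wakeups on fd without reading it.
 * Stops after max_events wakeups, when timeout_ms passes with no update,
 * or when poll() reports only error conditions. Returns the number of
 * wakeups seen, or -1 if poll() itself fails.
 */
static int count_pri_updates(int fd, int max_events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int count = 0;

	while (count < max_events) {
		int ret = poll(&pfd, 1, timeout_ms);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, wait again */
			return -1;
		}
		if (ret == 0)
			break;			/* timeout: no update arrived */
		if (!(pfd.revents & POLLPRI))
			break;			/* POLLERR, POLLHUP or POLLNVAL */
		count++;			/* one update seen, no read() needed */
	}
	return count;
}
```

Several events that arrive between two poll() calls show up as one wakeup, so the return value counts wakeups rather than events.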
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 kernel/trace/trace_events_hist.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 136d91139949..a22454ad36d8 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5599,6 +5599,7 @@ static void hist_trigger_show(struct seq_file *m,
 struct hist_file_data {
 	struct file *file;
 	u64 last_read;
+	u64 last_act;
 };

 static u64 get_hist_hit_count(struct trace_event_file *event_file)
@@ -5636,6 +5637,11 @@ static int hist_show(struct seq_file *m, void *v)
 			hist_trigger_show(m, data, n++);
 	}
 	hist_file->last_read = get_hist_hit_count(event_file);
+	/*
+	 * Update last_act too so that poll()/POLLPRI can wait for the next
+	 * event after any syscall on hist file.
+	 */
+	hist_file->last_act = hist_file->last_read;

 out_unlock:
 	mutex_unlock(&event_mutex);
@@ -5649,6 +5655,7 @@ static __poll_t event_hist_poll(struct file *file, struct poll_table_struct *wai
 	struct seq_file *m = file->private_data;
 	struct hist_file_data *hist_file = m->private;
 	__poll_t ret = 0;
+	u64 cnt;

 	mutex_lock(&event_mutex);

@@ -5660,8 +5667,13 @@ static __poll_t event_hist_poll(struct file *file, struct poll_table_struct *wai

 	poll_wait(file, &event_file->hist_wq, wait);

-	if (hist_file->last_read != get_hist_hit_count(event_file))
-		ret = EPOLLIN | EPOLLRDNORM;
+	cnt = get_hist_hit_count(event_file);
+	if (hist_file->last_read != cnt)
+		ret |= EPOLLIN | EPOLLRDNORM;
+	if (hist_file->last_act != cnt) {
+		hist_file->last_act = cnt;
+		ret |= EPOLLPRI;
+	}

 out_unlock:
 	mutex_unlock(&event_mutex);
@@ -5680,6 +5692,7 @@ static int event_hist_release(struct inode *inode, struct file *file)

 static int event_hist_open(struct inode *inode, struct file *file)
 {
+	struct trace_event_file *event_file;
 	struct hist_file_data *hist_file;
 	int ret;

@@ -5690,13 +5703,25 @@ static int event_hist_open(struct inode *inode, struct file *file)
 	hist_file = kzalloc(sizeof(*hist_file), GFP_KERNEL);
 	if (!hist_file)
 		return -ENOMEM;
+
+	mutex_lock(&event_mutex);
+	event_file = event_file_data(file);
+	if (!event_file) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
 	hist_file->file = file;
+	hist_file->last_act = get_hist_hit_count(event_file);

 	/* Clear private_data to avoid warning in single_open() */
 	file->private_data = NULL;
 	ret = single_open(file, hist_show, hist_file);
+
+out_unlock:
 	if (ret)
 		kfree(hist_file);
+	mutex_unlock(&event_mutex);
 	return ret;
 }
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Add a test case for poll() on the hist file. This introduces a helper binary to ftracetest, because there is no good way to reliably execute poll() on the hist file otherwise.
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 tools/testing/selftests/ftrace/Makefile            |  3 +
 tools/testing/selftests/ftrace/poll.c              | 34 +++++++++++++++
 .../ftrace/test.d/trigger/trigger-hist-poll.tc     | 46 ++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/poll.c
 create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
diff --git a/tools/testing/selftests/ftrace/Makefile b/tools/testing/selftests/ftrace/Makefile
index a1e955d2de4c..830b2299ea27 100644
--- a/tools/testing/selftests/ftrace/Makefile
+++ b/tools/testing/selftests/ftrace/Makefile
@@ -6,4 +6,7 @@ TEST_PROGS := ftracetest-ktap
 TEST_FILES := test.d settings
 EXTRA_CLEAN := $(OUTPUT)/logs/*

+LDFLAGS += -static
+TEST_GEN_PROGS = poll
+
 include ../lib.mk
diff --git a/tools/testing/selftests/ftrace/poll.c b/tools/testing/selftests/ftrace/poll.c
new file mode 100644
index 000000000000..800d1114629c
--- /dev/null
+++ b/tools/testing/selftests/ftrace/poll.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple poll on a file.
+ *
+ * Copyright (c) 2024 Google LLC.
+ */
+
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define BUFSIZE 4096
+
+int main(int argc, char *argv[])
+{
+	struct pollfd pfd = {.events = POLLPRI};
+	char buf[BUFSIZE];
+
+	if (argc < 2)
+		return -1;
+	pfd.fd = open(argv[1], O_RDONLY);
+	if (pfd.fd < 0) {
+		perror("open");
+		return -1;
+	}
+
+	if (poll(&pfd, 1, -1) < 0) {
+		perror("poll");
+		return -1;
+	}
+	close(pfd.fd);
+	return 0;
+}
diff --git a/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
new file mode 100644
index 000000000000..ac3999dee40b
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
@@ -0,0 +1,46 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: event trigger - test poll wait on histogram
+# requires: set_event events/sched/sched_process_free/trigger events/sched/sched_process_free/hist
+# flags: instance
+
+POLL=${FTRACETEST_ROOT}/poll
+
+if [ ! -x ${POLL} ]; then
+  echo "poll program is not compiled!"
+  exit_unresolved
+fi
+
+EVENT=events/sched/sched_process_free/
+
+echo "hist:key=comm" > ${EVENT}/trigger
+echo 1 > ${EVENT}/enable
+
+# setting timeout
+PID=$$
+SIG_FAIL=37
+trap exit_fail $SIG_FAIL
+sleep 10 && (kill -s ${SIG_FAIL} ${PID} ||:) &
+TOPID=$!
+
+# sleep command will exit after 2 seconds
+sleep 2 &
+BGPID=$!
+${POLL} ${EVENT}/hist
+# stop timeout
+kill -KILL ${TOPID} ||:
+
+cat trace > ${TMPDIR}/trace
+
+if [ -d /proc/${BGPID} ]; then
+  echo "poll exits too soon"
+  kill -KILL ${BGPID} ||:
+  exit_fail
+fi
+
+if ! grep -qw "sleep" ${TMPDIR}/trace; then
+  echo "poll exits before event happens"
+  exit_fail
+fi
+
+exit_pass
Hi Masami,
On Wed, 2024-06-26 at 00:16 +0900, Masami Hiramatsu (Google) wrote:
I think this is a clever use of the histogram files, and will be very useful for real-time monitoring apps. I'm looking forward to using it myself - thanks for doing this.
For the whole series,
Reviewed-by: Tom Zanussi zanussi@kernel.org
On Sun, 30 Jun 2024 16:07:43 -0500 Tom Zanussi zanussi@kernel.org wrote:
Thanks Tom!
I found an issue in the selftest (it cannot support old stable kernels), so let me update it.
Thank you,