Hi,

These patches make it possible to attach BPF programs directly to tracepoints using ftrace (/sys/kernel/debug/tracing) without needing the process doing the attach to be alive. This has the following benefits:
1. Simplified Security: In Android, we have finer-grained security controls over specific ftrace trace events using SELinux labels. We already control precisely who is allowed to enable an ftrace event. By adding a node to ftrace for attaching BPF programs, we can use the same mechanism to further control who is allowed to attach to a trace event.
2. Process lifetime: In Android we are adding use cases where a tracing program needs to be attached to a tracepoint all the time, for the full lifetime of the system, such as gathering statistics where there is no need to detach for the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this means keeping a process alive all the time. However, in Android our BPF loader currently (for hardened security) involves just starting a process at boot time, doing the BPF program loading, and then pinning the programs to /sys/fs/bpf. We don't keep this process alive all the time. It is more suitable to do a one-shot attach of the program using ftrace and no longer need a process alive all the time for this. Such a process also needs elevated privileges, since tracepoint program loading currently requires CAP_SYS_ADMIN anyway, so by design Android's bpfloader runs once at init and exits.
This series adds a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf. The following commands can be written into it:
  attach:<fd>    Attaches BPF prog fd to tracepoint
  detach:<fd>    Detaches BPF prog fd from tracepoint
Reading the bpf file will show all the programs attached to the tracepoint.
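For illustration, a one-shot attach from C would look roughly like the sketch below (the sched/sched_switch event name and the error handling are illustrative only; the file path and command format are the ones this series introduces):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Sketch: attach an already-loaded BPF prog fd to sched_switch. */
  int attach_once(int prog_fd)
  {
          char cmd[32];
          int fd, len;

          fd = open("/sys/kernel/debug/tracing/events/sched/sched_switch/bpf",
                    O_WRONLY);
          if (fd < 0)
                  return -1;

          len = snprintf(cmd, sizeof(cmd), "attach:%d", prog_fd);
          if (write(fd, cmd, len) != len) {
                  close(fd);
                  return -1;
          }

          /* The attachment persists after this fd (and process) goes away. */
          close(fd);
          return 0;
  }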
Joel Fernandes (Google) (4):
  Move bpf_raw_tracepoint functionality into bpf_trace.c
  trace/bpf: Add support for attach/detach of ftrace events to BPF
  lib/bpf: Add support for ftrace event attach and detach
  selftests/bpf: Add test for ftrace-based BPF attach/detach
 include/linux/bpf_trace.h             |  16 ++
 include/linux/trace_events.h          |   1 +
 kernel/bpf/syscall.c                  |  69 +-----
 kernel/trace/bpf_trace.c              | 225 ++++++++++++++++++
 kernel/trace/trace.h                  |   1 +
 kernel/trace/trace_events.c           |   8 +
 tools/lib/bpf/bpf.c                   |  53 +++++
 tools/lib/bpf/bpf.h                   |   4 +
 tools/lib/bpf/libbpf.map              |   2 +
 .../raw_tp_writable_test_ftrace_run.c |  89 +++++++
 10 files changed, 410 insertions(+), 58 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
--
2.22.0.410.gd8fdbe21b5-goog
In preparation to use raw tracepoints for BPF directly from ftrace, move the bpf_raw_tracepoint functionality into bpf_trace.c
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/bpf_trace.h   | 10 ++++++
 kernel/bpf/syscall.c        | 69 ++++++-------------------------------
 kernel/trace/bpf_trace.c    | 56 ++++++++++++++++++++++++++++++
 kernel/trace/trace_events.c |  3 ++
 4 files changed, 80 insertions(+), 58 deletions(-)
diff --git a/include/linux/bpf_trace.h b/include/linux/bpf_trace.h
index ddf896abcfb6..4a593827fd87 100644
--- a/include/linux/bpf_trace.h
+++ b/include/linux/bpf_trace.h
@@ -4,4 +4,14 @@
 
 #include <trace/events/xdp.h>
 
+#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
+
+struct bpf_raw_tracepoint {
+        struct bpf_raw_event_map *btp;
+        struct bpf_prog *prog;
+};
+
+struct bpf_raw_tracepoint *bpf_raw_tracepoint_open(char *tp_name, int prog_fd);
+void bpf_raw_tracepoint_close(struct bpf_raw_tracepoint *tp);
+
 #endif /* __LINUX_BPF_TRACE_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 42d17f730780..2001949b33f1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1737,21 +1737,11 @@ static int bpf_obj_get(const union bpf_attr *attr)
                            attr->file_flags);
 }
 
-struct bpf_raw_tracepoint {
-        struct bpf_raw_event_map *btp;
-        struct bpf_prog *prog;
-};
-
 static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
 {
         struct bpf_raw_tracepoint *raw_tp = filp->private_data;
 
-        if (raw_tp->prog) {
-                bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
-                bpf_prog_put(raw_tp->prog);
-        }
-        bpf_put_raw_tracepoint(raw_tp->btp);
-        kfree(raw_tp);
+        bpf_raw_tracepoint_close(raw_tp);
         return 0;
 }
 
@@ -1761,64 +1751,27 @@ static const struct file_operations bpf_raw_tp_fops = {
         .write          = bpf_dummy_write,
 };
 
-#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
-
-static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
+static int bpf_raw_tracepoint_open_syscall(const union bpf_attr *attr)
 {
-        struct bpf_raw_tracepoint *raw_tp;
-        struct bpf_raw_event_map *btp;
-        struct bpf_prog *prog;
+        int tp_fd;
         char tp_name[128];
-        int tp_fd, err;
+        struct bpf_raw_tracepoint *raw_tp;
 
         if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
                               sizeof(tp_name) - 1) < 0)
                 return -EFAULT;
         tp_name[sizeof(tp_name) - 1] = 0;
 
-        btp = bpf_get_raw_tracepoint(tp_name);
-        if (!btp)
-                return -ENOENT;
-
-        raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
-        if (!raw_tp) {
-                err = -ENOMEM;
-                goto out_put_btp;
-        }
-        raw_tp->btp = btp;
-
-        prog = bpf_prog_get(attr->raw_tracepoint.prog_fd);
-        if (IS_ERR(prog)) {
-                err = PTR_ERR(prog);
-                goto out_free_tp;
-        }
-        if (prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT &&
-            prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE) {
-                err = -EINVAL;
-                goto out_put_prog;
-        }
-
-        err = bpf_probe_register(raw_tp->btp, prog);
-        if (err)
-                goto out_put_prog;
+        raw_tp = bpf_raw_tracepoint_open(tp_name, attr->raw_tracepoint.prog_fd);
+        if (IS_ERR(raw_tp))
+                return PTR_ERR(raw_tp);
 
-        raw_tp->prog = prog;
         tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
                                  O_CLOEXEC);
-        if (tp_fd < 0) {
-                bpf_probe_unregister(raw_tp->btp, prog);
-                err = tp_fd;
-                goto out_put_prog;
-        }
-        return tp_fd;
+        if (tp_fd < 0)
+                bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
 
-out_put_prog:
-        bpf_prog_put(prog);
-out_free_tp:
-        kfree(raw_tp);
-out_put_btp:
-        bpf_put_raw_tracepoint(btp);
-        return err;
+        return tp_fd;
 }
 
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
@@ -2848,7 +2801,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
                 err = bpf_obj_get_info_by_fd(&attr, uattr);
                 break;
         case BPF_RAW_TRACEPOINT_OPEN:
-                err = bpf_raw_tracepoint_open(&attr);
+                err = bpf_raw_tracepoint_open_syscall(&attr);
                 break;
         case BPF_BTF_LOAD:
                 err = bpf_btf_load(&attr);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 1c9a4745e596..c4b543bc617f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h>
 #include <linux/bpf.h>
 #include <linux/bpf_perf_event.h>
+#include <linux/bpf_trace.h>
 #include <linux/filter.h>
 #include <linux/uaccess.h>
 #include <linux/ctype.h>
@@ -1413,3 +1414,58 @@ static int __init bpf_event_init(void)
 
 fs_initcall(bpf_event_init);
 #endif /* CONFIG_MODULES */
+
+void bpf_raw_tracepoint_close(struct bpf_raw_tracepoint *raw_tp)
+{
+        if (raw_tp->prog) {
+                bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
+                bpf_prog_put(raw_tp->prog);
+        }
+        bpf_put_raw_tracepoint(raw_tp->btp);
+        kfree(raw_tp);
+}
+
+struct bpf_raw_tracepoint *bpf_raw_tracepoint_open(char *tp_name, int prog_fd)
+{
+        struct bpf_raw_tracepoint *raw_tp;
+        struct bpf_raw_event_map *btp;
+        struct bpf_prog *prog;
+        int err;
+
+        btp = bpf_get_raw_tracepoint(tp_name);
+        if (!btp)
+                return ERR_PTR(-ENOENT);
+
+        raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
+        if (!raw_tp) {
+                err = -ENOMEM;
+                goto out_put_btp;
+        }
+        raw_tp->btp = btp;
+
+        prog = bpf_prog_get(prog_fd);
+        if (IS_ERR(prog)) {
+                err = PTR_ERR(prog);
+                goto out_free_tp;
+        }
+        if (prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT &&
+            prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE) {
+                err = -EINVAL;
+                goto out_put_prog;
+        }
+
+        err = bpf_probe_register(raw_tp->btp, prog);
+        if (err)
+                goto out_put_prog;
+
+        raw_tp->prog = prog;
+        return raw_tp;
+
+out_put_prog:
+        bpf_prog_put(prog);
+out_free_tp:
+        kfree(raw_tp);
+out_put_btp:
+        bpf_put_raw_tracepoint(btp);
+        return ERR_PTR(err);
+}
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 0ce3db67f556..67851fb66b6b 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2017,6 +2017,9 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file)
         trace_create_file("trigger", 0644, file->dir, file,
                           &event_trigger_fops);
+
+        trace_create_file("bpf_attach", 0644, file->dir, file,
+                          &bpf_attach_trigger_fops);
 }
 
 #ifdef CONFIG_HIST_TRIGGERS
Add a new bpf file to each trace event. The following commands can be written into it:
  attach:<fd>    Attaches BPF prog fd to tracepoint
  detach:<fd>    Detaches BPF prog fd from tracepoint
Reading the bpf file will show all the programs attached to the tracepoint.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/bpf_trace.h    |   6 ++
 include/linux/trace_events.h |   1 +
 kernel/trace/bpf_trace.c     | 169 +++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         |   1 +
 kernel/trace/trace_events.c  |   9 +-
 5 files changed, 184 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf_trace.h b/include/linux/bpf_trace.h
index 4a593827fd87..1fe73501809c 100644
--- a/include/linux/bpf_trace.h
+++ b/include/linux/bpf_trace.h
@@ -9,6 +9,12 @@
 struct bpf_raw_tracepoint {
         struct bpf_raw_event_map *btp;
         struct bpf_prog *prog;
+        /*
+         * Multiple programs can be attached to a tracepoint. All of
+         * these are linked to each other and can be reached from the
+         * event's bpf file in tracefs.
+         */
+        struct list_head event_attached;
 };
 
 struct bpf_raw_tracepoint *bpf_raw_tracepoint_open(char *tp_name, int prog_fd);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 8a62731673f7..525f2ac44aa3 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -371,6 +371,7 @@ struct trace_event_file {
         struct trace_array              *tr;
         struct trace_subsystem_dir      *system;
         struct list_head                triggers;
+        struct list_head                bpf_attached;
 
         /*
          * 32 bit flags:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c4b543bc617f..28621ad88c12 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1469,3 +1469,172 @@ struct bpf_raw_tracepoint *bpf_raw_tracepoint_open(char *tp_name, int prog_fd)
         bpf_put_raw_tracepoint(btp);
         return ERR_PTR(err);
 }
+
+enum event_bpf_cmd { BPF_ATTACH, BPF_DETACH };
+#define BPF_CMD_BUF_LEN 32
+
+static ssize_t
+event_bpf_attach_write(struct file *filp, const char __user *ubuf,
+                       size_t cnt, loff_t *ppos)
+{
+        int err, prog_fd, cmd_num, len;
+        struct trace_event_call *call;
+        struct trace_event_file *file;
+        struct bpf_raw_tracepoint *raw_tp, *next;
+        char buf[BPF_CMD_BUF_LEN], *end, *tok;
+        enum event_bpf_cmd cmd;
+        struct bpf_prog *prog;
+        bool prog_put = true;
+
+        len = min((int)cnt, BPF_CMD_BUF_LEN - 1);
+
+        /* copy_from_user() returns the number of bytes left uncopied */
+        if (copy_from_user(buf, ubuf, len))
+                return -EFAULT;
+        buf[len] = 0;
+
+        /* Parse 2 arguments of format: <cmd>:<fd> */
+        end = &buf[0];
+        cmd_num = 1;
+        while (cmd_num < 3) {
+                tok = strsep(&end, ":");
+                if (!tok)
+                        return -EINVAL;
+
+                switch (cmd_num) {
+                case 1:
+                        if (!strncmp(tok, "attach", 6))
+                                cmd = BPF_ATTACH;
+                        else if (!strncmp(tok, "detach", 6))
+                                cmd = BPF_DETACH;
+                        else
+                                return -EINVAL;
+                        break;
+                case 2:
+                        err = kstrtoint(tok, 10, &prog_fd);
+                        if (err)
+                                return err;
+                        break;
+                }
+                cmd_num++;
+        }
+        if (cmd_num != 3)
+                return -EINVAL;
+
+        file = event_file_data(filp);
+        /* Command is to attach fd to tracepoint */
+        if (cmd == BPF_ATTACH) {
+                mutex_lock(&event_mutex);
+                call = file->event_call;
+
+                raw_tp = bpf_raw_tracepoint_open((char *)call->tp->name,
+                                                 prog_fd);
+                if (IS_ERR(raw_tp)) {
+                        mutex_unlock(&event_mutex);
+                        return PTR_ERR(raw_tp);
+                }
+
+                list_add(&raw_tp->event_attached, &file->bpf_attached);
+                mutex_unlock(&event_mutex);
+                *ppos += cnt;
+                return cnt;
+        }
+
+        /* Command is to detach fd from tracepoint */
+        prog = bpf_prog_get(prog_fd);
+        if (IS_ERR(prog))
+                return PTR_ERR(prog);
+
+        mutex_lock(&event_mutex);
+        list_for_each_entry_safe(raw_tp, next, &file->bpf_attached,
+                                 event_attached) {
+                if (raw_tp->prog == prog) {
+                        list_del(&raw_tp->event_attached);
+                        bpf_raw_tracepoint_close(raw_tp);
+                        prog_put = false;
+                        break;
+                }
+        }
+        mutex_unlock(&event_mutex);
+
+        if (prog_put)
+                bpf_prog_put(prog);
+        *ppos += cnt;
+        return cnt;
+}
+
+static void *event_bpf_attach_next(struct seq_file *m, void *t, loff_t *pos)
+{
+        struct trace_event_file *file = event_file_data(m->private);
+
+        return seq_list_next(t, &file->bpf_attached, pos);
+}
+
+static void *event_bpf_attach_start(struct seq_file *m, loff_t *pos)
+{
+        struct trace_event_file *event_file;
+
+        /* ->stop() is called even if ->start() fails */
+        mutex_lock(&event_mutex);
+        event_file = event_file_data(m->private);
+        if (unlikely(!event_file))
+                return ERR_PTR(-ENODEV);
+
+        if (list_empty(&event_file->bpf_attached))
+                return NULL;
+
+        return seq_list_start(&event_file->bpf_attached, *pos);
+}
+
+static void event_bpf_attach_stop(struct seq_file *m, void *t)
+{
+        mutex_unlock(&event_mutex);
+}
+
+static int event_bpf_attach_show(struct seq_file *m, void *v)
+{
+        struct bpf_raw_tracepoint *raw_tp;
+
+        raw_tp = list_entry(v, struct bpf_raw_tracepoint, event_attached);
+        seq_printf(m, "prog id: %u\n", raw_tp->prog->aux->id);
+        return 0;
+}
+
+static const struct seq_operations event_bpf_attach_seq_ops = {
+        .start = event_bpf_attach_start,
+        .next = event_bpf_attach_next,
+        .stop = event_bpf_attach_stop,
+        .show = event_bpf_attach_show,
+};
+
+static int event_bpf_attach_open(struct inode *inode, struct file *file)
+{
+        int ret = 0;
+
+        mutex_lock(&event_mutex);
+
+        if (unlikely(!event_file_data(file))) {
+                mutex_unlock(&event_mutex);
+                return -ENODEV;
+        }
+
+        if (file->f_mode & FMODE_READ) {
+                ret = seq_open(file, &event_bpf_attach_seq_ops);
+                if (!ret) {
+                        struct seq_file *m = file->private_data;
+
+                        m->private = file;
+                }
+        }
+
+        mutex_unlock(&event_mutex);
+
+        return ret;
+}
+
+const struct file_operations event_bpf_attach_fops = {
+        .open = event_bpf_attach_open,
+        .read = seq_read,
+        .write = event_bpf_attach_write,
+        .llseek = default_llseek,
+};
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 005f08629b8b..e33828d24eb2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1582,6 +1582,7 @@ extern struct list_head ftrace_events;
 
 extern const struct file_operations event_trigger_fops;
 extern const struct file_operations event_hist_fops;
+extern const struct file_operations event_bpf_attach_fops;
 
 #ifdef CONFIG_HIST_TRIGGERS
 extern int register_trigger_hist_cmd(void);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 67851fb66b6b..79420d5efaef 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2018,8 +2018,10 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file)
         trace_create_file("trigger", 0644, file->dir, file,
                           &event_trigger_fops);
 
-        trace_create_file("bpf_attach", 0644, file->dir, file,
-                          &bpf_attach_trigger_fops);
+#ifdef CONFIG_BPF_EVENTS
+        trace_create_file("bpf", 0644, file->dir, file,
+                          &event_bpf_attach_fops);
+#endif
 }
 
 #ifdef CONFIG_HIST_TRIGGERS
@@ -2267,6 +2269,9 @@ trace_create_new_event(struct trace_event_call *call,
         atomic_set(&file->sm_ref, 0);
         atomic_set(&file->tm_ref, 0);
         INIT_LIST_HEAD(&file->triggers);
+#ifdef CONFIG_BPF_EVENTS
+        INIT_LIST_HEAD(&file->bpf_attached);
+#endif
         list_add(&file->list, &tr->events);
 
         return file;
Add the needed library support for attaching and detaching BPF programs to and from trace events through the new tracefs bpf file.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/lib/bpf/bpf.c      | 53 ++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h      |  4 +++
 tools/lib/bpf/libbpf.map |  2 ++
 3 files changed, 59 insertions(+)
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c4a48086dc9a..28c5a7d00d14 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -24,6 +24,9 @@
 #include <stdlib.h>
 #include <string.h>
 #include <memory.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
 #include <unistd.h>
 #include <asm/unistd.h>
 #include <linux/bpf.h>
@@ -57,6 +60,8 @@
 #define min(x, y) ((x) < (y) ? (x) : (y))
 #endif
 
+#define TRACEFS "/sys/kernel/debug/tracing"
+
 static inline __u64 ptr_to_u64(const void *ptr)
 {
         return (__u64) (unsigned long) ptr;
@@ -658,6 +663,54 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
         return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
 }
 
+int bpf_raw_tracepoint_ftrace_attach(const char *subsys, const char *name,
+                                     int prog_fd)
+{
+        char buf[256];
+        int len, ret, tfd;
+
+        sprintf(buf, "%s/events/%s/%s/bpf", TRACEFS, subsys, name);
+        tfd = open(buf, O_WRONLY);
+        if (tfd < 0)
+                return tfd;
+
+        sprintf(buf, "attach:%d", prog_fd);
+        len = strlen(buf);
+        ret = write(tfd, buf, len);
+
+        if (ret < 0)
+                goto err;
+        if (ret != len)
+                ret = -1;
+err:
+        close(tfd);
+        return ret;
+}
+
+int bpf_raw_tracepoint_ftrace_detach(const char *subsys, const char *name,
+                                     int prog_fd)
+{
+        char buf[256];
+        int len, ret, tfd;
+
+        sprintf(buf, "%s/events/%s/%s/bpf", TRACEFS, subsys, name);
+        tfd = open(buf, O_WRONLY);
+        if (tfd < 0)
+                return tfd;
+
+        sprintf(buf, "detach:%d", prog_fd);
+        len = strlen(buf);
+        ret = write(tfd, buf, len);
+
+        if (ret < 0)
+                goto err;
+        if (ret != len)
+                ret = -1;
+err:
+        close(tfd);
+        return ret;
+}
+
 int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
                  bool do_log)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9593fec75652..5b9c44658037 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -163,6 +163,10 @@ LIBBPF_API int bpf_prog_query(int target_fd, enum bpf_attach_type type,
                               __u32 query_flags, __u32 *attach_flags,
                               __u32 *prog_ids, __u32 *prog_cnt);
 LIBBPF_API int bpf_raw_tracepoint_open(const char *name, int prog_fd);
+LIBBPF_API int bpf_raw_tracepoint_ftrace_attach(const char *subsys,
+                                                const char *name, int prog_fd);
+LIBBPF_API int bpf_raw_tracepoint_ftrace_detach(const char *subsys,
+                                                const char *name, int prog_fd);
 LIBBPF_API int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf,
                             __u32 log_buf_size, bool do_log);
 LIBBPF_API int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 673001787cba..fca377b688c2 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -163,4 +163,6 @@ LIBBPF_0.0.3 {
                 bpf_map__is_internal;
                 bpf_map_freeze;
                 btf__finalize_data;
+                bpf_raw_tracepoint_ftrace_attach;
+                bpf_raw_tracepoint_ftrace_detach;
 } LIBBPF_0.0.2;
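For illustration, a caller of the new helpers might look like the sketch below (load_prog() is a hypothetical stand-in for whatever loads the program and returns its fd; error handling is mostly elided):

  #include <bpf/bpf.h>

  extern int load_prog(void); /* hypothetical: returns a raw_tp prog fd */

  int main(void)
  {
          int prog_fd = load_prog();

          /* Writes "attach:<fd>" into
           * /sys/kernel/debug/tracing/events/sched/sched_switch/bpf. */
          if (bpf_raw_tracepoint_ftrace_attach("sched", "sched_switch",
                                               prog_fd) < 0)
                  return 1;

          /* ... later, from the same process (the fd identifies the prog): */
          return bpf_raw_tracepoint_ftrace_detach("sched", "sched_switch",
                                                  prog_fd) < 0;
  }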
Here we add support for testing the attach and detach of a BPF program to and from a tracepoint through tracefs.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../raw_tp_writable_test_ftrace_run.c | 89 +++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
new file mode 100644
index 000000000000..7b42e3a69b71
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <linux/nbd.h>
+
+void test_raw_tp_writable_test_ftrace_run(void)
+{
+        __u32 duration = 0;
+        char error[4096];
+        int ret;
+
+        const struct bpf_insn trace_program[] = {
+                BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+                BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+                BPF_MOV64_IMM(BPF_REG_0, 42),
+                BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0),
+                BPF_EXIT_INSN(),
+        };
+
+        struct bpf_load_program_attr load_attr = {
+                .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+                .license = "GPL v2",
+                .insns = trace_program,
+                .insns_cnt = sizeof(trace_program) / sizeof(struct bpf_insn),
+                .log_level = 2,
+        };
+
+        int bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error));
+
+        if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable loaded",
+                  "failed: %d errno %d\n", bpf_fd, errno))
+                return;
+
+        const struct bpf_insn skb_program[] = {
+                BPF_MOV64_IMM(BPF_REG_0, 0),
+                BPF_EXIT_INSN(),
+        };
+
+        struct bpf_load_program_attr skb_load_attr = {
+                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
+                .license = "GPL v2",
+                .insns = skb_program,
+                .insns_cnt = sizeof(skb_program) / sizeof(struct bpf_insn),
+        };
+
+        int filter_fd =
+                bpf_load_program_xattr(&skb_load_attr, error, sizeof(error));
+        if (CHECK(filter_fd < 0, "test_program_loaded", "failed: %d errno %d\n",
+                  filter_fd, errno))
+                goto out_bpffd;
+
+        ret = bpf_raw_tracepoint_ftrace_attach("bpf_test_run",
+                                               "bpf_test_finish",
+                                               bpf_fd);
+        if (CHECK(ret < 0, "bpf_raw_tracepoint_ftrace_attach",
+                  "failed: %d errno %d\n", ret, errno))
+                goto out_filterfd;
+
+        char test_skb[128] = {
+                0,
+        };
+
+        __u32 prog_ret;
+        int err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0,
+                                    0, &prog_ret, 0);
+        CHECK(err != 42, "test_run",
+              "tracepoint did not modify return value\n");
+        CHECK(prog_ret != 0, "test_run_ret",
+              "socket_filter did not return 0\n");
+
+        ret = bpf_raw_tracepoint_ftrace_detach("bpf_test_run",
+                                               "bpf_test_finish",
+                                               bpf_fd);
+        if (CHECK(ret < 0, "bpf_raw_tracepoint_ftrace_detach",
+                  "failed: %d errno %d\n", ret, errno))
+                goto out_filterfd;
+
+        err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0, 0,
+                                &prog_ret, 0);
+        CHECK(err != 0, "test_run_notrace",
+              "test_run failed with %d errno %d\n", err, errno);
+        CHECK(prog_ret != 0, "test_run_ret_notrace",
+              "socket_filter did not return 0\n");
+
+out_filterfd:
+        close(filter_fd);
+out_bpffd:
+        close(bpf_fd);
+}
On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
Hi,
why are you cc-ing the whole world for this patch set? I'll reply to all as well, but I suspect a bunch of folks consider it spam. Please read Documentation/bpf/bpf_devel_QA.rst
Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, so I'm not sure this set reached public mailing lists.
These patches make it possible to attach BPF programs directly to tracepoints using ftrace (/sys/kernel/debug/tracing) without needing the process doing the attach to be alive. This has the following benefits:
- Simplified Security: In Android, we have finer-grained security controls over specific ftrace trace events using SELinux labels. We already control precisely who is allowed to enable an ftrace event. By adding a node to ftrace for attaching BPF programs, we can use the same mechanism to further control who is allowed to attach to a trace event.
- Process lifetime: In Android we are adding use cases where a tracing program needs to be attached to a tracepoint all the time, for the full lifetime of the system, such as gathering statistics where there is no need to detach for the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this means keeping a process alive all the time. However, in Android our BPF loader currently (for hardened security) involves just starting a process at boot time, doing the BPF program loading, and then pinning the programs to /sys/fs/bpf. We don't keep this process alive all the time. It is more suitable to do a one-shot attach of the program using ftrace and no longer need a process alive all the time for this. Such a process also needs elevated privileges, since tracepoint program loading currently requires CAP_SYS_ADMIN anyway, so by design Android's bpfloader runs once at init and exits.
This series adds a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf. The following commands can be written into it:
  attach:<fd>    Attaches BPF prog fd to tracepoint
  detach:<fd>    Detaches BPF prog fd from tracepoint
Looks like, to detach a program, the user needs to read a text file and parse the bpf prog id from text into binary, then call the fd_from_id bpf syscall, get a binary FD, convert it back to text, and write it as text back into this file. I think this is just one example of why text-based apis are not accepted in bpf anymore.
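Spelled out, the round trip being criticized would look roughly like this sketch (the "prog id: %u" format comes from the series' read handler; bpf_prog_get_fd_by_id() is the existing libbpf wrapper around BPF_PROG_GET_FD_BY_ID):

  #include <bpf/bpf.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Sketch: detach when all you have is the text prog id from the file. */
  int detach_by_id(const char *bpf_file, unsigned int id)
  {
          char cmd[32];
          int prog_fd, tfd, len, ret = -1;

          /* id was parsed from a "prog id: %u" line read from bpf_file */
          prog_fd = bpf_prog_get_fd_by_id(id);   /* id -> binary fd */
          if (prog_fd < 0)
                  return prog_fd;

          tfd = open(bpf_file, O_WRONLY);
          if (tfd >= 0) {
                  /* fd -> text again, written back into the same file */
                  len = snprintf(cmd, sizeof(cmd), "detach:%d", prog_fd);
                  if (write(tfd, cmd, len) == len)
                          ret = 0;
                  close(tfd);
          }
          close(prog_fd);
          return ret;
  }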
Throughout the patch set you call it ftrace. As far as I can see, this set has zero overlap with ftrace. There is no ftrace-bpf connection here at all of the kind we discussed in the past, Steven. It's all quite confusing.
I suggest android solve the sticky raw_tracepoint problem with a user space daemon. The reasons you point out for why a user daemon cannot be used sound weak to me. Another acceptable solution would be to introduce pinning of raw_tp objects. bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would be a natural extension.
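For reference, pinning as it exists today is a single bpf(2) call against an fd; a sketch of the suggestion (the raw_tp half is hypothetical, the prog half is the existing bpffs behavior; paths are illustrative):

  #include <bpf/bpf.h>

  void pin_objects(int prog_fd, int raw_tp_fd)
  {
          /* Existing: the program outlives the loader once pinned. */
          bpf_obj_pin(prog_fd, "/sys/fs/bpf/my_prog");

          /* Suggested (does not exist yet): pin the raw_tp fd returned
           * by BPF_RAW_TRACEPOINT_OPEN the same way, so the attachment
           * also outlives the loader. */
          bpf_obj_pin(raw_tp_fd, "/sys/fs/bpf/my_raw_tp");
  }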
On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote:
On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
Hi,
why are you cc-ing the whole world for this patch set?
Well, the whole world happens to be interested in BPF on Android.
I'll reply to all as well, but I suspect a bunch of folks consider it spam. Please read Documentation/bpf/bpf_devel_QA.rst
Ok, I'll read it.
Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, so I'm not sure this set reached public mailing lists.
Certainly the CC list here does not include folks who consider it spam. All the folks added have been interested in BPF on Android at various points in time. Is this CC list really that large? It has around 24 email addresses or so. I can trim it a bit if needed. Also, you make it sound as if people are screaming at me to stop emailing them; certainly that's not the case, and no one has told me it is spam.
And, it did reach the public archive btw: https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp....
These patches make it possible to attach BPF programs directly to tracepoints using ftrace (/sys/kernel/debug/tracing) without needing the process doing the attach to be alive. This has the following benefits:
- Simplified Security: In Android, we have finer-grained security controls over specific ftrace trace events using SELinux labels. We already control precisely who is allowed to enable an ftrace event. By adding a node to ftrace for attaching BPF programs, we can use the same mechanism to further control who is allowed to attach to a trace event.
- Process lifetime: In Android we are adding use cases where a tracing program needs to be attached to a tracepoint all the time, for the full lifetime of the system, such as gathering statistics where there is no need to detach for the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this means keeping a process alive all the time. However, in Android our BPF loader currently (for hardened security) involves just starting a process at boot time, doing the BPF program loading, and then pinning the programs to /sys/fs/bpf. We don't keep this process alive all the time. It is more suitable to do a one-shot attach of the program using ftrace and no longer need a process alive all the time for this. Such a process also needs elevated privileges, since tracepoint program loading currently requires CAP_SYS_ADMIN anyway, so by design Android's bpfloader runs once at init and exits.
This series adds a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf. The following commands can be written into it:
  attach:<fd>    Attaches BPF prog fd to tracepoint
  detach:<fd>    Detaches BPF prog fd from tracepoint
Looks like, to detach a program, the user needs to read a text file and parse the bpf prog id from text into binary, then call the fd_from_id bpf syscall, get a binary FD, convert it back to text, and write it as text back into this file. I think this is just one example of why text-based apis are not accepted in bpf anymore.
This can also be considered a tracefs API.
And we can certainly change the detach to accept program ids as well if that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'.
By the way, I can also list the set of cumbersome steps needed to attach a BPF program using perf and I bet it will be longer ;-)
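For comparison, the perf-based sequence referred to here is roughly the following sketch (the tracefs id file and the perf UAPI ioctls are the existing interfaces; error handling is elided):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/perf_event.h>

  /* Sketch: attach prog_fd to a tracepoint the perf way. */
  int perf_attach(const char *subsys, const char *name, int prog_fd)
  {
          struct perf_event_attr attr = {};
          char path[256], buf[32] = {};
          int fd, efd;

          /* 1. Read the numeric event id from tracefs. */
          snprintf(path, sizeof(path),
                   "/sys/kernel/debug/tracing/events/%s/%s/id", subsys, name);
          fd = open(path, O_RDONLY);
          read(fd, buf, sizeof(buf) - 1);
          close(fd);

          /* 2. Open a perf event for that tracepoint. */
          attr.type = PERF_TYPE_TRACEPOINT;
          attr.size = sizeof(attr);
          attr.config = atoi(buf);
          attr.sample_period = 1;
          efd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);

          /* 3. Attach the prog and enable; the attachment lasts only as
           * long as efd stays open, i.e. as long as this process. */
          ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
          ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
          return efd;
  }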
Through the patch set you call it ftrace. As far as I can see, this set has zero overlap with ftrace. There is no ftrace-bpf connection here at all that we discussed in the past Steven. It's all quite confusing.
It depends on what you mean by ftrace; maybe I can call it 'trace events' or something if that is less ambiguous. All of this has been collectively called ftrace before.
I am not sure you are making sense, actually; the trace_events mechanism is a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even the documentation file name has the word ftrace in it.
I have also spoken to Steven before about this, I don't think he ever told me there is no connection so again I am a bit lost at your comments.
I suggest android solve the sticky raw_tracepoint problem with a user space daemon. The reasons you point out for why a user daemon cannot be used sound weak to me.
I don't think it is weak. It seems overkill to have a daemon for a trace event that is, say, supposed to be attached all the time for the lifetime of the system. Why should there be a daemon consuming resources just to keep an attachment that is active all the time?
In Android, we are very careful about spawning useless processes and leaving them alive for the lifetime of the system - for no good reason. Our security teams also don't like this, and they can comment more.
Another acceptable solution would be to introduce pinning of raw_tp objects. bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would be a natural extension.
I don't think pinning solves the security problem; it just solves the process lifetime problem. Currently, attaching to trace events through perf requires CAP_SYS_ADMIN. However, with ftrace events, we already control the security of events by labeling the nodes in tracefs and granting access to the labeled context through the selinux policies. Having a 'bpf' node in tracefs for events, and granting access to the labels, is a natural extension.
I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well.
I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well.
thanks,
- Joel
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well.
and they should be done through the pinning as well.
I don't see why a new bpf node for a trace event is a bad idea, really.
See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc.
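For readers: the FD-based kprobe API being referenced creates probes through perf_event_open() rather than by writing text into kprobe_events; roughly, as a sketch (kprobe_pmu_type must be read from /sys/bus/event_source/devices/kprobe/type at runtime):

  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/perf_event.h>

  /* Sketch: FD-based kprobe creation; no global text state to race on. */
  int open_kprobe(int kprobe_pmu_type, const char *func)
  {
          struct perf_event_attr attr = {};

          attr.type = kprobe_pmu_type;
          attr.size = sizeof(attr);
          attr.kprobe_func = (uint64_t)(unsigned long)func; /* config1 */
          attr.probe_offset = 0;                            /* config2 */

          /* The probe exists only while the returned fd is open. */
          return syscall(__NR_perf_event_open, &attr, -1, 0, -1,
                         PERF_FLAG_FD_CLOEXEC);
  }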
tracefs is how we deal with trace events on Android.
android has made mistakes in the past as well.
This is a natural extension to that and fits with the security model well.
I think it's the opposite. I'm absolutely against text based apis.
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well.
and they should be done through the pinning as well.
Hmm ok, I will give it some more thought.
I don't see why a new bpf node for a trace event is a bad idea, really.
See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc.
Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. It's fine-grained per-event control. That's where I was going with the tracefs approach, since we get that granularity using the file system.
Thanks.
On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well.
and they should be done through the pinning as well.
Hmm ok, I will give it some more thought.
I think I can make the new BPF API + pinning approach work; I will try to work on something like this and post it soon.
Also, I had a question below if you don't mind taking a look:
thanks Alexei!
I don't see why a new bpf node for a trace event is a bad idea, really.
See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc.
Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. It's fine-grained per-event control. That's where I was going with the tracefs approach, since we get that granularity using the file system.
Thanks.
On Tue, Jul 16, 2019 at 07:55:00PM -0400, Joel Fernandes wrote:
On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well.
and they should be done through the pinning as well.
Hmm ok, I will give it some more thought.
I think I can make the new BPF API + pinning approach work; I will try to work on something like this and post it soon.
Also, I had a question below if you don't mind taking a look:
thanks Alexei!
I don't see why a new bpf node for a trace event is a bad idea, really.
See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc.
Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. It's fine-grained per-event control. That's where I was going with the tracefs approach, since we get that granularity using the file system.
android's choice of selinux is not a factor in deciding kernel apis. It's a completely separate discussion whether disallowing particular tracepoints for a given user makes sense at all. Just because you can hack it in via selinux, blocking a particular /sys/kernel/debug/tracing/ directory, and convince yourself that it somehow makes android more secure, doesn't mean that all new apis should fit into this model. I think allowing one tracepoint and disallowing another is pointless from a security point of view. A tracing bpf program can do bpf_probe_read of anything.
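To make the bpf_probe_read() point concrete, a sketch of a 2019-era tracing program (the event it hooks is illustrative; any kernel address the program computes can be read):

  #include <linux/bpf.h>
  #include "bpf_helpers.h" /* selftests helper header of the era */

  /* Sketch: whichever event this is attached through, the program can
   * read arbitrary kernel memory with bpf_probe_read(). */
  SEC("raw_tracepoint/sched_switch")
  int read_anything(struct bpf_raw_tracepoint_args *ctx)
  {
          char buf[64];

          bpf_probe_read(buf, sizeof(buf), (void *)ctx->args[0]);
          return 0;
  }

  char _license[] SEC("license") = "GPL";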
On Tue, Jul 16, 2019 at 06:24:07PM -0700, Alexei Starovoitov wrote: [snip]
I don't see why a new bpf node for a trace event is a bad idea, really.
See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc.
Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. It's fine-grained per-event control. That's where I was going with the tracefs approach, since we get that granularity using the file system.
android's choice of selinux is not a factor in deciding kernel apis. It's a completely separate discussion whether disallowing particular tracepoints for a given user makes sense at all. Just because you can hack it in via selinux, blocking a particular /sys/kernel/debug/tracing/ directory, and convince yourself that it somehow makes android more secure, doesn't mean that all new apis should fit into this model.
It's not a hack; it is just control of which tracefs nodes can be accessed and which cannot, since tracing can run on production systems out in the field and there are several concerns to address, like security and privacy. It is not just for debugging use cases. We do collect traces out in the field, where these issues are real and cannot be ignored.
The SELinux model is to deny everything, and then selectively grant access to what is needed. The VFS and security LSM hooks provide this control quite well. I am not sure such control is possible through perf, hence I asked the question.
I think allowing one tracepoint and disallowing another is pointless from a security point of view. A tracing bpf program can do bpf_probe_read of anything.
I think the assumption here is that the user controls the program instructions at runtime, but that's not the case. The BPF program we are loading is not dynamically generated; it is built at build time and loaded from a secure, verified partition. So even though it can do bpf_probe_read, it is still not something the user can change. And we are planning to make it even more secure by having the kernel verify the program at load time as well (you were on some discussions about that a few months ago).
thanks,
- Joel
On Tue, 16 Jul 2019 15:26:52 -0700 Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
I'm absolutely against text based apis.
I guess you don't use /proc ;-)
-- Steve
On Tue, 16 Jul 2019 17:30:50 -0400 Joel Fernandes joel@joelfernandes.org wrote:
I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well.
What I would like to see is a way to have BPF inject data into the ftrace ring buffer directly. There's a bpf_trace_printk() that I find a bit of a hack (especially since it hooks into trace_printk(), which is only for debugging purposes). Having a dedicated bpf ftrace ring buffer event that can be triggered is what I am looking for. Then comes the issue of what ring buffer to place it in, as ftrace can have multiple ring buffer instances. But these instances are defined by the tracefs instances directory. Having a way to associate a bpf program with a specific event in a specific tracefs directory could allow for ways to trigger writing into the correct ftrace buffer.
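For context, the bpf_trace_printk() usage in question looks roughly like the sketch below; its output lands in the global trace_printk() buffer, with no way to pick a tracefs instance or emit a proper trace event:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  /* Sketch: bpf_trace_printk() funnels into the trace_printk() debug
   * path of the top-level ftrace buffer. */
  SEC("raw_tracepoint/sched_switch")
  int log_hit(void *ctx)
  {
          char fmt[] = "hit\n";

          bpf_trace_printk(fmt, sizeof(fmt));
          return 0;
  }

  char _license[] SEC("license") = "GPL";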
But looking over the patches, I see what Alexei means that there's no overlap with ftrace and these patches except for the tracefs directory itself (which is part of the ftrace infrastructure). And the trace events are technically part of the ftrace infrastructure too. I see the tracefs interface being used, but I don't see how the bpf programs being added affect the ftrace ring buffer or other parts of ftrace. And I'm guessing that's what is confusing Alexei.
-- Steve
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
On Tue, 16 Jul 2019 17:30:50 -0400 Joel Fernandes joel@joelfernandes.org wrote:
I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well.
What I would like to see is a way to have BPF inject data into the ftrace ring buffer directly. There's a bpf_trace_printk() that I find a bit of a hack (especially since it hooks into trace_printk(), which is only for debugging purposes). Having a dedicated bpf ftrace ring buffer event that can be triggered is what I am looking for. Then comes the issue of what ring buffer to place it in, as ftrace can have multiple ring buffer instances. But these instances are defined by the tracefs instances directory. Having a way to associate a bpf program with a specific event in a specific tracefs directory could allow for ways to trigger writing into the correct ftrace buffer.
But his problem is with doing the association of a BPF program with tracefs itself. How would you attach a BPF program through tracefs without doing a text-based approach? His problem is with the text-based approach, per his last email.
But looking over the patches, I see what Alexei means that there's no overlap with ftrace and these patches except for the tracefs directory itself (which is part of the ftrace infrastructure). And the trace events are technically part of the ftrace infrastructure too. I see the tracefs interface being used, but I don't see how the bpf programs being added affect the ftrace ring buffer or other parts of ftrace. And I'm guessing that's what is confusing Alexei.
In a follow-up patch, which I am still writing, I am using the trace ring buffer as temporary storage, since I am formatting the trace event into it. This patch you are replying to is just for raw tracepoints, and yes, I agree this one does not use the ring buffer, but a future addition to it does. So I don't think the association of this patch series with ftrace is going to be an issue, IMO.
thanks,
- Joel
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
On Tue, 16 Jul 2019 17:30:50 -0400 Joel Fernandes joel@joelfernandes.org wrote:
I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well.
What I would like to see is a way to have BPF inject data into the ftrace ring buffer directly. There's a bpf_trace_printk() that I find a bit of a hack (especially since it hooks into trace_printk(), which is only for debugging purposes). Having a dedicated bpf ftrace ring buffer event that can be triggered is what I am looking for. Then comes the issue of what ring buffer to place it in, as ftrace can have multiple ring buffer instances. But these instances are defined by the tracefs instances directory. Having a way to associate a bpf program with a specific event in a specific tracefs directory could allow for ways to trigger writing into the correct ftrace buffer.
bpf can write anything (including a full skb) into the perf ring buffer. I don't think there is a use case yet to write into the ftrace ring buffer. But I can be convinced otherwise :)
But looking over the patches, I see what Alexei means that there's no overlap with ftrace and these patches except for the tracefs directory itself (which is part of the ftrace infrastructure). And the trace events are technically part of the ftrace infrastructure too. I see the tracefs interface being used, but I don't see how the bpf programs being added affect the ftrace ring buffer or other parts of ftrace. And I'm guessing that's what is confusing Alexei.
yep. What I'd really like to see some day is proper integration of real ftrace and bpf, so that bpf progs can be invoked from some of the ftrace machinery, and the other way around. Like, I'd love to be able to start ftracing all functions from bpf and stop it from bpf. The use case is kernel debugging: I'd like to examine a bunch of conditions and start ftracing the execution, then later decide whether this collection of ip addresses is interesting to analyze, and post-process it quickly while still inside the bpf program.