This series attempts to provide a simple way for BPF programs (and in future other consumers) to utilize BPF Type Format (BTF) information to display kernel data structures in-kernel. The use case this functionality is applied to here is to support a bpf_trace_printk trace event-based method of rendering type information.
There is already support in kernel/bpf/btf.c for "show" functionality; the changes here generalize that support from seq-file specific verifier display to the more generic case and add another specific use case; rather than seq_printf()ing the show data, it is traced using the bpf_trace_printk trace event. Other future uses of the show functionality could include a bpf_printk_btf() function which printk()ed the data instead. Oops messaging in particular would be an interesting application for such functionality.
The above potential use case hints at a potential reply to a reasonable objection that such typed display should be solved by tracing programs, where the in-kernel tracing records data and the userspace program prints it out. While this is certainly the recommended approach for most cases, I believe having an in-kernel mechanism would be valuable also. Critically in BPF programs it greatly simplifies debugging and tracing of such data to invoking a one-line helper.
One challenge raised in an earlier iteration of this work - where the BTF printing was implemented as a printk() format specifier - was that the amount of data printed per printk() was large, and other format specifiers were far simpler. This patchset tackles this by instead displaying data _as the data structure is traversed_, rathern than copying it to a string for later display. The problem in doing this however is that such output can be intermixed with other bpf_trace_printk events. The solution pursued here is to associate a trace ID with the bpf_trace_printk events. For now, the bpf_trace_printk() helper sets this trace ID to 0, and bpf_trace_btf() can specify non-zero values. This allows a BPF program to coordinate with a user-space program which creates a separate trace instance which filters trace events based on trace ID, allowing for clean display without pollution from other data sources. Future work could enhance bpf_trace_printk() to do this too, either via a new helper or by smuggling a 32-bit trace id value into the "fmt_size" argument (the latter might be problematic for 32-bit platforms though, so a new helper might be preferred).
To aid in display the bpf_trace_btf() helper is passed a "struct btf_ptr *" which specifies the data to be traced (struct sk_buff * say), the BTF id of the type (the BTF id of "struct sk_buff") or a string representation of the type ("struct sk_buff"). A flags field is also present for future use.
Separately a number of flags control how the output is rendered, see patch 3 for more details.
A snippet of output from printing "struct sk_buff *" (see patch 3 for the full output) looks like this:
<idle>-0 [023] d.s. 1825.778400: bpf_trace_printk: (struct sk_buff){ <idle>-0 [023] d.s. 1825.778409: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778410: bpf_trace_printk: (struct){ <idle>-0 [023] d.s. 1825.778412: bpf_trace_printk: .prev = (struct sk_buff *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778413: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778414: bpf_trace_printk: .dev = (struct net_device *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778416: bpf_trace_printk: .dev_scratch = (long unsigned int)18446628460391432192, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778418: bpf_trace_printk: .rbnode = (struct rb_node){ <idle>-0 [023] d.s. 1825.778419: bpf_trace_printk: .rb_right = (struct rb_node *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: .rb_left = (struct rb_node *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778421: bpf_trace_printk: .list = (struct list_head){ <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: .prev = (struct list_head *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778426: bpf_trace_printk: .len = (unsigned int)168, <idle>-0 [023] d.s. 1825.778427: bpf_trace_printk: .mac_len = (__u16)14,
Changes since v3:
- Moved to RFC since the approach is different (and bpf-next is closed) - Rather than using a printk() format specifier as the means of invoking BTF-enabled display, a dedicated BPF helper is used. This solves the issue of printk() having to output large amounts of data using a complex mechanism such as BTF traversal, but still provides a way for the display of such data to be achieved via BPF programs. Future work could include a bpf_printk_btf() function to invoke display via printk() where the elements of a data structure are printk()ed one at a time. Thanks to Petr Mladek, Andy Shevchenko and Rasmus Villemoes who took time to look at the earlier printk() format-specifier-focused version of this and provided feedback clarifying the problems with that approach. - Added trace id to the bpf_trace_printk events as a means of separating output from standard bpf_trace_printk() events, ensuring it can be easily parsed by the reader. - Added bpf_trace_btf() helper tests which do simple verification of the various display options.
Changes since v2:
- Alexei and Yonghong suggested it would be good to use probe_kernel_read() on to-be-shown data to ensure safety during operation. Safe copy via probe_kernel_read() to a buffer object in "struct btf_show" is used to support this. A few different approaches were explored including dynamic allocation and per-cpu buffers. The downside of dynamic allocation is that it would be done during BPF program execution for bpf_trace_printk()s using %pT format specifiers. The problem with per-cpu buffers is we'd have to manage preemption and since the display of an object occurs over an extended period and in printk context where we'd rather not change preemption status, it seemed tricky to manage buffer safety while considering preemption. The approach of utilizing stack buffer space via the "struct btf_show" seemed like the simplest approach. The stack size of the associated functions which have a "struct btf_show" on their stack to support show operation (btf_type_snprintf_show() and btf_type_seq_show()) stays under 500 bytes. The compromise here is the safe buffer we use is small - 256 bytes - and as a result multiple probe_kernel_read()s are needed for larger objects. Most objects of interest are smaller than this (e.g. "struct sk_buff" is 224 bytes), and while task_struct is a notable exception at ~8K, performance is not the priority for BTF-based display. (Alexei and Yonghong, patch 2). - safe buffer use is the default behaviour (and is mandatory for BPF) but unsafe display - meaning no safe copy is done and we operate on the object itself - is supported via a 'u' option. - pointers are prefixed with 0x for clarity (Alexei, patch 2) - added additional comments and explanations around BTF show code, especially around determining whether objects such zeroed. Also tried to comment safe object scheme used. (Yonghong, patch 2) - added late_initcall() to initialize vmlinux BTF so that it would not have to be initialized during printk operation (Alexei, patch 5) - removed CONFIG_BTF_PRINTF config option as it is not needed; CONFIG_DEBUG_INFO_BTF can be used to gate test behaviour and determining behaviour of type-based printk can be done via retrieval of BTF data; if it's not there BTF was unavailable or broken (Alexei, patches 4,6) - fix bpf_trace_printk test to use vmlinux.h and globals via skeleton infrastructure, removing need for perf events (Andrii, patch 8)
Changes since v1:
- changed format to be more drgn-like, rendering indented type info along with type names by default (Alexei) - zeroed values are omitted (Arnaldo) by default unless the '0' modifier is specified (Alexei) - added an option to print pointer values without obfuscation. The reason to do this is the sysctls controlling pointer display are likely to be irrelevant in many if not most tracing contexts. Some questions on this in the outstanding questions section below... - reworked printk format specifer so that we no longer rely on format %pT<type> but instead use a struct * which contains type information (Rasmus). This simplifies the printk parsing, makes use more dynamic and also allows specification by BTF id as well as name. - removed incorrect patch which tried to fix dereferencing of resolved BTF info for vmlinux; instead we skip modifiers for the relevant case (array element type determination) (Alexei). - fixed issues with negative snprintf format length (Rasmus) - added test cases for various data structure formats; base types, typedefs, structs, etc. - tests now iterate through all typedef, enum, struct and unions defined for vmlinux BTF and render a version of the target dummy value which is either all zeros or all 0xff values; the idea is this exercises the "skip if zero" and "print everything" cases. - added support in BPF for using the %pT format specifier in bpf_trace_printk() - added BPF tests which ensure %pT format specifier use works (Alexei).
Alan Maguire (4): bpf: provide function to get vmlinux BTF information bpf: make BTF show support generic, apply to seq files/bpf_trace_printk bpf: add bpf_trace_btf helper selftests/bpf: add bpf_trace_btf helper tests
include/linux/bpf.h | 5 + include/linux/btf.h | 38 + include/uapi/linux/bpf.h | 63 ++ kernel/bpf/btf.c | 962 ++++++++++++++++++--- kernel/bpf/core.c | 5 + kernel/bpf/helpers.c | 4 + kernel/bpf/verifier.c | 18 +- kernel/trace/bpf_trace.c | 121 ++- kernel/trace/bpf_trace.h | 6 +- scripts/bpf_helpers_doc.py | 2 + tools/include/uapi/linux/bpf.h | 63 ++ tools/testing/selftests/bpf/prog_tests/trace_btf.c | 45 + .../selftests/bpf/progs/netif_receive_skb.c | 43 + 13 files changed, 1257 insertions(+), 118 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_btf.c create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
It will be used later for BPF structure display support
Signed-off-by: Alan Maguire alan.maguire@oracle.com --- include/linux/bpf.h | 2 ++ kernel/bpf/verifier.c | 18 ++++++++++++------ 2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index cef4ef0..55eb67d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1290,6 +1290,8 @@ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, union bpf_attr __user *uattr); void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth);
+struct btf *bpf_get_btf_vmlinux(void); + /* Map specifics */ struct xdp_buff; struct sk_buff; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b6ccfce..05dfc41 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11064,6 +11064,17 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) } }
+struct btf *bpf_get_btf_vmlinux(void) +{ + if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) { + mutex_lock(&bpf_verifier_lock); + if (!btf_vmlinux) + btf_vmlinux = btf_parse_vmlinux(); + mutex_unlock(&bpf_verifier_lock); + } + return btf_vmlinux; +} + int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, union bpf_attr __user *uattr) { @@ -11097,12 +11108,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, env->ops = bpf_verifier_ops[env->prog->type]; is_priv = bpf_capable();
- if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) { - mutex_lock(&bpf_verifier_lock); - if (!btf_vmlinux) - btf_vmlinux = btf_parse_vmlinux(); - mutex_unlock(&bpf_verifier_lock); - } + bpf_get_btf_vmlinux();
/* grab the mutex to protect few globals used by verifier */ if (!is_priv)
generalize the "seq_show" seq file support in btf.c to support a generic show callback of which we support three instances;
- the current seq file show - a show which triggers the bpf_trace/bpf_trace_printk tracepoint for each portion of the data displayed
Both classes of show function call btf_type_show() with different targets:
- for seq_show, the seq file is the target - for bpf_trace_printk(), no target is needed.
In the tracing case we need to also track additional data - length of data written specifically for the return value.
By default show will display type information, field members and their types and values etc, and the information is indented based upon structure depth. Zeroed fields are omitted.
Show however supports flags which modify its behaviour:
BTF_SHOW_COMPACT - suppress newline/indent. BTF_SHOW_NONEWLINE - suppress newline only. BTF_SHOW_NONAME - suppress show of type and member names. BTF_SHOW_PTR_RAW - do not obfuscate pointer values. BTF_SHOW_UNSAFE - do not copy data to safe buffer before display. BTF_SHOW_ZERO - show zeroed values (by default they are not shown).
The bpf_trace_printk tracepoint is augmented with a "trace_id" field; it is used to allow tracepoint filtering as typed display information can easily be interspersed with other tracing data, making it hard to read. Specifying a trace_id will allow users to selectively trace data, eliminating noise.
Signed-off-by: Alan Maguire alan.maguire@oracle.com --- include/linux/bpf.h | 2 + include/linux/btf.h | 37 ++ kernel/bpf/btf.c | 962 ++++++++++++++++++++++++++++++++++++++++++----- kernel/trace/bpf_trace.c | 19 +- kernel/trace/bpf_trace.h | 6 +- 5 files changed, 916 insertions(+), 110 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 55eb67d..6143b6e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -946,6 +946,8 @@ typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type, u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
+int bpf_trace_vprintk(__u32 trace_id, const char *fmt, va_list ap); + /* an array of programs to be executed under rcu_lock. * * Typical usage: diff --git a/include/linux/btf.h b/include/linux/btf.h index 8b81fbb..46bf9f4 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -13,6 +13,7 @@ struct btf_member; struct btf_type; union bpf_attr; +struct btf_show;
extern const struct file_operations btf_fops;
@@ -46,8 +47,44 @@ int btf_get_info_by_fd(const struct btf *btf, const struct btf_type *btf_type_id_size(const struct btf *btf, u32 *type_id, u32 *ret_size); + +/* + * Options to control show behaviour. + * - BTF_SHOW_COMPACT: no formatting around type information + * - BTF_SHOW_NONAME: no struct/union member names/types + * - BTF_SHOW_PTR_RAW: show raw (unobfuscated) pointer values; + * equivalent to %px. + * - BTF_SHOW_ZERO: show zero-valued struct/union members; they + * are not displayed by default + * - BTF_SHOW_NONEWLINE: include indent, but suppress newline; + * to be used when a show function implicitly includes a newline. + * - BTF_SHOW_UNSAFE: skip use of bpf_probe_read() to safely read + * data before displaying it. + */ +#define BTF_SHOW_COMPACT (1ULL << 0) +#define BTF_SHOW_NONAME (1ULL << 1) +#define BTF_SHOW_PTR_RAW (1ULL << 2) +#define BTF_SHOW_ZERO (1ULL << 3) +#define BTF_SHOW_NONEWLINE (1ULL << 32) +#define BTF_SHOW_UNSAFE (1ULL << 33) + void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m); + +/* + * Trace string representation of obj of BTF type_id. + * + * @btf: struct btf object + * @type_id: type id of type obj points to + * @obj: pointer to typed data + * @flags: show options (see above) + * + * Return: length that would have been/was copied as per snprintf, or + * negative error. + */ +int btf_type_trace_show(const struct btf *btf, u32 type_id, void *obj, + u32 trace_id, u64 flags); + int btf_get_fd_by_id(u32 id); u32 btf_id(const struct btf *btf); bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 91afdd4..be47304 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -282,6 +282,88 @@ static const char *btf_type_str(const struct btf_type *t) return btf_kind_str[BTF_INFO_KIND(t->info)]; }
+/* Chunk size we use in safe copy of data to be shown. */ +#define BTF_SHOW_OBJ_SAFE_SIZE 256 + +/* + * This is the maximum size of a base type value (equivalent to a + * 128-bit int); if we are at the end of our safe buffer and have + * less than 16 bytes space we can't be assured of being able + * to copy the next type safely, so in such cases we will initiate + * a new copy. + */ +#define BTF_SHOW_OBJ_BASE_TYPE_SIZE 16 + +/* + * Common data to all BTF show operations. Private show functions can add + * their own data to a structure containing a struct btf_show and consult it + * in the show callback. See btf_type_show() below. + * + * One challenge with showing nested data is we want to skip 0-valued + * data, but in order to figure out whether a nested object is all zeros + * we need to walk through it. As a result, we need to make two passes + * when handling structs, unions and arrays; the first path simply looks + * for nonzero data, while the second actually does the display. The first + * pass is signalled by show->state.depth_check being set, and if we + * encounter a non-zero value we set show->state.depth_to_show to + * the depth at which we encountered it. When we have completed the + * first pass, we will know if anything needs to be displayed if + * depth_to_show > depth. See btf_[struct,array]_show() for the + * implementation of this. + * + * Another problem is we want to ensure the data for display is safe to + * access. To support this, the "struct obj" is used to track the data + * object and our safe copy of it. We copy portions of the data needed + * to the object "copy" buffer, but because its size is limited to + * BTF_SHOW_OBJ_COPY_LEN bytes, multiple copies may be required as we + * traverse larger objects for display. + * + * The various data type show functions all start with a call to + * btf_show_start_type() which returns a pointer to the safe copy + * of the data needed (or if BTF_SHOW_UNSAFE is specified, to the + * raw data itself). btf_show_obj_safe() is responsible for + * using copy_from_kernel_nofault() to update the safe data if necessary + * as we traverse the object's data. skbuff-like semantics are + * used: + * + * - obj.head points to the start of the toplevel object for display + * - obj.size is the size of the toplevel object + * - obj.data points to the current point in the original data at + * which our safe data starts. obj.data will advance as we copy + * portions of the data. + * + * In most cases a single copy will suffice, but larger data structures + * such as "struct task_struct" will require many copies. The logic in + * btf_show_obj_safe() handles the logic that determines if a new + * copy_from_kernel_nofault() is needed. + */ +struct btf_show { + u64 flags; + void *target; /* target of show operation (seq file, buffer) */ + void (*showfn)(struct btf_show *show, const char *fmt, ...); + const struct btf *btf; + /* below are used during iteration */ + struct { + u8 depth; + u8 depth_to_show; + u8 depth_check; + u8 array_member:1, + array_terminated:1; + u16 array_encoding; + u32 type_id; + int status; /* non-zero for error */ + const struct btf_type *type; + const struct btf_member *member; + char name[KSYM_NAME_LEN]; /* space for member name/type */ + } state; + struct { + u32 size; + void *head; + void *data; + u8 safe[BTF_SHOW_OBJ_SAFE_SIZE]; + } obj; +}; + struct btf_kind_operations { s32 (*check_meta)(struct btf_verifier_env *env, const struct btf_type *t, @@ -298,9 +380,9 @@ struct btf_kind_operations { const struct btf_type *member_type); void (*log_details)(struct btf_verifier_env *env, const struct btf_type *t); - void (*seq_show)(const struct btf *btf, const struct btf_type *t, + void (*show)(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offsets, - struct seq_file *m); + struct btf_show *show); };
static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS]; @@ -677,6 +759,458 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, return true; }
+/* Similar to btf_type_skip_modifiers() but does not skip typedefs. */ +static inline +const struct btf_type *btf_type_skip_qualifiers(const struct btf *btf, u32 id) +{ + const struct btf_type *t = btf_type_by_id(btf, id); + + while (btf_type_is_modifier(t) && + BTF_INFO_KIND(t->info) != BTF_KIND_TYPEDEF) { + id = t->type; + t = btf_type_by_id(btf, t->type); + } + + return t; +} + +#define BTF_SHOW_MAX_ITER 10 + +#define BTF_KIND_BIT(kind) (1ULL << kind) + +/* + * Populate show->state.name with type name information. + * Format of type name is + * + * [.member_name = ] (type_name) + */ +static inline const char *btf_show_name(struct btf_show *show) +{ + /* BTF_MAX_ITER array suffixes "[]" */ + const char *array_suffixes = "[][][][][][][][][][]"; + const char *array_suffix = &array_suffixes[strlen(array_suffixes)]; + /* BTF_MAX_ITER pointer suffixes "*" */ + const char *ptr_suffixes = "**********"; + const char *ptr_suffix = &ptr_suffixes[strlen(ptr_suffixes)]; + const char *name = NULL, *prefix = "", *parens = ""; + const struct btf_member *m = show->state.member; + const struct btf_type *t = show->state.type; + const struct btf_array *array; + u32 id = show->state.type_id; + const char *member = NULL; + bool show_member = false; + u64 kinds = 0; + int i; + + show->state.name[0] = '\0'; + + /* + * Don't show type name if we're showing an array member; + * in that case we show the array type so don't need to repeat + * ourselves for each member. + */ + if (show->state.array_member) + return ""; + + /* Retrieve member name, if any. */ + if (m) { + member = btf_name_by_offset(show->btf, m->name_off); + show_member = strlen(member) > 0; + id = m->type; + } + + /* + * Start with type_id, as we have have resolved the struct btf_type * + * via btf_modifier_show() past the parent typedef to the child + * struct, int etc it is defined as. In such cases, the type_id + * still represents the starting type while the the struct btf_type * + * in our show->state points at the resolved type of the typedef. + */ + t = btf_type_by_id(show->btf, id); + if (!t) + return ""; + + /* + * The goal here is to build up the right number of pointer and + * array suffixes while ensuring the type name for a typedef + * is represented. Along the way we accumulate a list of + * BTF kinds we have encountered, since these will inform later + * display; for example, pointer types will not require an + * opening "{" for struct, we will just display the pointer value. + * + * We also want to accumulate the right number of pointer or array + * indices in the format string while iterating until we get to + * the typedef/pointee/array member target type. + * + * We start by pointing at the end of pointer and array suffix + * strings; as we accumulate pointers and arrays we move the pointer + * or array string backwards so it will show the expected number of + * '*' or '[]' for the type. BTF_SHOW_MAX_ITER of nesting of pointers + * and/or arrays and typedefs are supported as a precaution. + * + * We also want to get typedef name while proceeding to resolve + * type it points to so that we can add parentheses if it is a + * "typedef struct" etc. + */ + for (i = 0; i < BTF_SHOW_MAX_ITER; i++) { + + switch (BTF_INFO_KIND(t->info)) { + case BTF_KIND_TYPEDEF: + if (!name) + name = btf_name_by_offset(show->btf, + t->name_off); + kinds |= BTF_KIND_BIT(BTF_KIND_TYPEDEF); + id = t->type; + break; + case BTF_KIND_ARRAY: + kinds |= BTF_KIND_BIT(BTF_KIND_ARRAY); + parens = "["; + if (!t) + return ""; + array = btf_type_array(t); + if (array_suffix > array_suffixes) + array_suffix -= 2; + id = array->type; + break; + case BTF_KIND_PTR: + kinds |= BTF_KIND_BIT(BTF_KIND_PTR); + if (ptr_suffix > ptr_suffixes) + ptr_suffix -= 1; + id = t->type; + break; + default: + id = 0; + break; + } + if (!id) + break; + t = btf_type_skip_qualifiers(show->btf, id); + } + /* We may not be able to represent this type; bail to be safe */ + if (i == BTF_SHOW_MAX_ITER) + return ""; + + if (!name) + name = btf_name_by_offset(show->btf, t->name_off); + + switch (BTF_INFO_KIND(t->info)) { + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + prefix = BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT ? + "struct" : "union"; + /* if it's an array of struct/union, parens is already set */ + if (!(kinds & (BTF_KIND_BIT(BTF_KIND_ARRAY)))) + parens = "{"; + break; + case BTF_KIND_ENUM: + prefix = "enum"; + break; + default: + break; + } + + /* pointer does not require parens */ + if (kinds & BTF_KIND_BIT(BTF_KIND_PTR)) + parens = ""; + /* typedef does not require struct/union/enum prefix */ + if (kinds & BTF_KIND_BIT(BTF_KIND_TYPEDEF)) + prefix = ""; + + if (!name) + name = ""; + + /* Even if we don't want type name info, we want parentheses etc */ + if (show->flags & BTF_SHOW_NONAME) + snprintf(show->state.name, sizeof(show->state.name), "%s", + parens); + else + snprintf(show->state.name, sizeof(show->state.name), + "%s%s%s(%s%s%s%s%s%s)%s", + /* first 3 strings comprise ".member = " */ + show_member ? "." : "", + show_member ? member : "", + show_member ? " = " : "", + /* ...next is our prefix (struct, enum, etc) */ + prefix, + strlen(prefix) > 0 && strlen(name) > 0 ? " " : "", + /* ...this is the type name itself */ + name, + /* ...suffixed by the appropriate '*', '[]' suffixes */ + strlen(ptr_suffix) > 0 ? " " : "", ptr_suffix, + array_suffix, parens); + + return show->state.name; +} + +#define btf_show(show, ...) \ + do { \ + if (!show->state.depth_check) \ + show->showfn(show, __VA_ARGS__); \ + } while (0) + +static inline const char *__btf_show_indent(struct btf_show *show) +{ + const char *indents = " "; + const char *indent = &indents[strlen(indents)]; + + if ((indent - show->state.depth) >= indents) + return indent - show->state.depth; + return indents; +} + +#define btf_show_indent(show) \ + ((show->flags & BTF_SHOW_COMPACT) ? "" : __btf_show_indent(show)) + +#define btf_show_newline(show) \ + ((show->flags & (BTF_SHOW_COMPACT | BTF_SHOW_NONEWLINE)) ? "" : "\n") + +#define btf_show_delim(show) \ + (show->state.depth == 0 ? "" : \ + ((show->flags & BTF_SHOW_COMPACT) && show->state.type && \ + BTF_INFO_KIND(show->state.type->info) == BTF_KIND_UNION) ? "|" : ",") + +#define btf_show_type_value(show, fmt, value) \ + do { \ + if ((value) != 0 || (show->flags & BTF_SHOW_ZERO) || \ + show->state.depth == 0) { \ + btf_show(show, "%s%s" fmt "%s%s", \ + btf_show_indent(show), \ + btf_show_name(show), \ + value, btf_show_delim(show), \ + btf_show_newline(show)); \ + if (show->state.depth > show->state.depth_to_show) \ + show->state.depth_to_show = show->state.depth; \ + } \ + } while (0) + +#define btf_show_type_values(show, fmt, ...) \ + do { \ + btf_show(show, "%s%s" fmt "%s%s", btf_show_indent(show), \ + btf_show_name(show), \ + __VA_ARGS__, btf_show_delim(show), \ + btf_show_newline(show)); \ + if (show->state.depth > show->state.depth_to_show) \ + show->state.depth_to_show = show->state.depth; \ + } while (0) + +/* How much is left to copy to safe buffer after @data? */ +#define btf_show_obj_size_left(show, data) \ + (show->obj.head + show->obj.size - data) + +/* Is object pointed to by @data of @size already copied to our safe buffer? */ +#define btf_show_obj_is_safe(show, data, size) \ + (data >= show->obj.data && \ + (data + size) < (show->obj.data + BTF_SHOW_OBJ_SAFE_SIZE)) + +/* + * If object pointed to by @data of @size falls within our safe buffer, return + * the equivalent pointer to the same safe data. Assumes + * copy_from_kernel_nofault() has already happened and our safe buffer is + * populated. + */ +#define __btf_show_obj_safe(show, data, size) \ + (btf_show_obj_is_safe(show, data, size) ? \ + show->obj.safe + (data - show->obj.data) : NULL) + +/* + * Return a safe-to-access version of data pointed to by @data. + * We do this by copying the relevant amount of information + * to the struct btf_show obj.safe buffer using copy_from_kernel_nofault(). + * + * If BTF_SHOW_UNSAFE is specified, just return data as-is; no + * safe copy is needed. + * + * Otherwise we need to determine if we have the required amount + * of data (determined by the @data pointer and the size of the + * largest base type we can encounter (represented by + * BTF_SHOW_OBJ_BASE_TYPE_SIZE). Having that much data ensures + * that we will be able to print some of the current object, + * and if more is needed a copy will be triggered. + * Some objects such as structs will not fit into the buffer; + * in such cases additional copies when we iterate over their + * members may be needed. + * + * btf_show_obj_safe() is used to return a safe buffer for + * btf_show_start_type(); this ensures that as we recurse into + * nested types we always have safe data for the given type. + * This approach is somewhat wasteful; it's possible for example + * that when iterating over a large union we'll end up copying the + * same data repeatedly, but the goal is safety not performance. + * We use stack data as opposed to per-CPU buffers because the + * iteration over a type can take some time, and preemption handling + * would greatly complicate use of the safe buffer. + */ +static inline void *btf_show_obj_safe(struct btf_show *show, + const struct btf_type *t, + void *data) +{ + int size_left, size; + void *safe = NULL; + + if (show->flags & BTF_SHOW_UNSAFE) + return data; + + (void) btf_resolve_size(show->btf, t, &size, NULL, NULL); + + /* + * Is this toplevel object? If so, set total object size and + * initialize pointers. Otherwise check if we still fall within + * our safe object data. + */ + if (show->state.depth == 0) { + show->obj.size = size; + show->obj.head = data; + } else { + /* + * If the size of the current object is > our remaining + * safe buffer we _may_ need to do a new copy. However + * consider the case of a nested struct; it's size pushes + * us over the safe buffer limit, but showing any individual + * struct members does not. In such cases, we don't need + * to initiate a fresh copy yet; however we definitely need + * at least BTF_SHOW_OBJ_BASE_TYPE_SIZE bytes left + * in our buffer, regardless of the current object size. + * The logic here is that as we resolve types we will + * hit a base type at some point, and we need to be sure + * the next chunk of data is safely available to display + * that type info safely. We cannot rely on the size of + * the current object here because it may be much larger + * than our current buffer (e.g. task_struct is 8k). + * All we want to do here is ensure that we can print the + * next basic type, which we can if either + * - the current type size is within the safe buffer; or + * - at least BTF_SHOW_OBJ_BASE_TYPE_SIZE bytes are left in + * the safe buffer. + */ + safe = __btf_show_obj_safe(show, data, + min(size, + BTF_SHOW_OBJ_BASE_TYPE_SIZE)); + } + + /* + * We need a new copy to our safe object, either because we haven't + * yet copied and are intializing safe data, or because the data + * we want falls outside the boundaries of the safe object. + */ + if (!safe) { + size_left = btf_show_obj_size_left(show, data); + if (size_left > BTF_SHOW_OBJ_SAFE_SIZE) + size_left = BTF_SHOW_OBJ_SAFE_SIZE; + show->state.status = copy_from_kernel_nofault(show->obj.safe, + data, + size_left); + if (!show->state.status) { + show->obj.data = data; + safe = show->obj.safe; + } + } + + return safe; +} + +/* + * Set the type we are starting to show and return a safe data pointer + * to be used for showing the associated data. + */ +static inline void *btf_show_start_type(struct btf_show *show, + const struct btf_type *t, + u32 type_id, + void *data) +{ + show->state.type = t; + show->state.type_id = type_id; + show->state.name[0] = '\0'; + + return btf_show_obj_safe(show, t, data); +} + +static inline void btf_show_end_type(struct btf_show *show) +{ + show->state.type = NULL; + show->state.type_id = 0; + show->state.name[0] = '\0'; +} + +static inline void *btf_show_start_aggr_type(struct btf_show *show, + const struct btf_type *t, + u32 type_id, + void *data) +{ + void *safe_data = btf_show_start_type(show, t, type_id, data); + + if (!safe_data) + return safe_data; + + btf_show(show, "%s%s%s", btf_show_indent(show), + btf_show_name(show), + btf_show_newline(show)); + show->state.depth++; + return safe_data; +} + +static inline void btf_show_end_aggr_type(struct btf_show *show, + const char *suffix) +{ + show->state.depth--; + btf_show(show, "%s%s%s%s", btf_show_indent(show), suffix, + btf_show_delim(show), btf_show_newline(show)); + btf_show_end_type(show); +} + +static inline void btf_show_start_member(struct btf_show *show, + const struct btf_member *m) +{ + show->state.member = m; +} + +static inline void btf_show_start_array_member(struct btf_show *show) +{ + show->state.array_member = 1; + btf_show_start_member(show, NULL); +} + +static inline void btf_show_end_member(struct btf_show *show) +{ + show->state.member = NULL; +} + +static inline void btf_show_end_array_member(struct btf_show *show) +{ + show->state.array_member = 0; + btf_show_end_member(show); +} + +static inline void *btf_show_start_array_type(struct btf_show *show, + const struct btf_type *t, + u32 type_id, + u16 array_encoding, + void *data) +{ + show->state.array_encoding = array_encoding; + show->state.array_terminated = 0; + return btf_show_start_aggr_type(show, t, type_id, data); +} + +static inline void btf_show_end_array_type(struct btf_show *show) +{ + show->state.array_encoding = 0; + show->state.array_terminated = 0; + btf_show_end_aggr_type(show, "]"); +} + +static inline void *btf_show_start_struct_type(struct btf_show *show, + const struct btf_type *t, + u32 type_id, + void *data) +{ + return btf_show_start_aggr_type(show, t, type_id, data); +} + +static inline void btf_show_end_struct_type(struct btf_show *show) +{ + btf_show_end_aggr_type(show, "}"); +} + __printf(2, 3) static void __btf_verifier_log(struct bpf_verifier_log *log, const char *fmt, ...) { @@ -1250,11 +1784,11 @@ static int btf_df_resolve(struct btf_verifier_env *env, return -EINVAL; }
-static void btf_df_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offsets, - struct seq_file *m) +static void btf_df_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offsets, + struct btf_show *show) { - seq_printf(m, "<unsupported kind:%u>", BTF_INFO_KIND(t->info)); + btf_show(show, "<unsupported kind:%u>", BTF_INFO_KIND(t->info)); }
static int btf_int_check_member(struct btf_verifier_env *env, @@ -1427,7 +1961,7 @@ static void btf_int_log(struct btf_verifier_env *env, btf_int_encoding_str(BTF_INT_ENCODING(int_data))); }
-static void btf_int128_print(struct seq_file *m, void *data) +static void btf_int128_print(struct btf_show *show, void *data) { /* data points to a __int128 number. * Suppose @@ -1446,9 +1980,10 @@ static void btf_int128_print(struct seq_file *m, void *data) lower_num = *(u64 *)data; #endif if (upper_num == 0) - seq_printf(m, "0x%llx", lower_num); + btf_show_type_value(show, "0x%llx", lower_num); else - seq_printf(m, "0x%llx%016llx", upper_num, lower_num); + btf_show_type_values(show, "0x%llx%016llx", upper_num, + lower_num); }
static void btf_int128_shift(u64 *print_num, u16 left_shift_bits, @@ -1492,8 +2027,8 @@ static void btf_int128_shift(u64 *print_num, u16 left_shift_bits, #endif }
-static void btf_bitfield_seq_show(void *data, u8 bits_offset, - u8 nr_bits, struct seq_file *m) +static void btf_bitfield_show(void *data, u8 bits_offset, + u8 nr_bits, struct btf_show *show) { u16 left_shift_bits, right_shift_bits; u8 nr_copy_bytes; @@ -1513,14 +2048,14 @@ static void btf_bitfield_seq_show(void *data, u8 bits_offset, right_shift_bits = BITS_PER_U128 - nr_bits;
btf_int128_shift(print_num, left_shift_bits, right_shift_bits); - btf_int128_print(m, print_num); + btf_int128_print(show, print_num); }
-static void btf_int_bits_seq_show(const struct btf *btf, - const struct btf_type *t, - void *data, u8 bits_offset, - struct seq_file *m) +static void btf_int_bits_show(const struct btf *btf, + const struct btf_type *t, + void *data, u8 bits_offset, + struct btf_show *show) { u32 int_data = btf_type_int(t); u8 nr_bits = BTF_INT_BITS(int_data); @@ -1533,55 +2068,77 @@ static void btf_int_bits_seq_show(const struct btf *btf, total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); data += BITS_ROUNDDOWN_BYTES(total_bits_offset); bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset); - btf_bitfield_seq_show(data, bits_offset, nr_bits, m); + btf_bitfield_show(data, bits_offset, nr_bits, show); }
-static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void btf_int_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { u32 int_data = btf_type_int(t); u8 encoding = BTF_INT_ENCODING(int_data); bool sign = encoding & BTF_INT_SIGNED; u8 nr_bits = BTF_INT_BITS(int_data); + void *safe_data; + + safe_data = btf_show_start_type(show, t, type_id, data); + if (!safe_data) + return;
if (bits_offset || BTF_INT_OFFSET(int_data) || BITS_PER_BYTE_MASKED(nr_bits)) { - btf_int_bits_seq_show(btf, t, data, bits_offset, m); - return; + btf_int_bits_show(btf, t, safe_data, bits_offset, show); + goto out; }
switch (nr_bits) { case 128: - btf_int128_print(m, data); + btf_int128_print(show, safe_data); break; case 64: if (sign) - seq_printf(m, "%lld", *(s64 *)data); + btf_show_type_value(show, "%lld", *(s64 *)safe_data); else - seq_printf(m, "%llu", *(u64 *)data); + btf_show_type_value(show, "%llu", *(u64 *)safe_data); break; case 32: if (sign) - seq_printf(m, "%d", *(s32 *)data); + btf_show_type_value(show, "%d", *(s32 *)safe_data); else - seq_printf(m, "%u", *(u32 *)data); + btf_show_type_value(show, "%u", *(u32 *)safe_data); break; case 16: if (sign) - seq_printf(m, "%d", *(s16 *)data); + btf_show_type_value(show, "%d", *(s16 *)safe_data); else - seq_printf(m, "%u", *(u16 *)data); + btf_show_type_value(show, "%u", *(u16 *)safe_data); break; case 8: + if (show->state.array_encoding == BTF_INT_CHAR) { + /* check for null terminator */ + if (show->state.array_terminated) + break; + if (*(char *)data == '\0') { + show->state.array_terminated = 1; + break; + } + if (isprint(*(char *)data)) { + btf_show_type_value(show, "'%c'", + *(char *)safe_data); + break; + } + } if (sign) - seq_printf(m, "%d", *(s8 *)data); + btf_show_type_value(show, "%d", *(s8 *)safe_data); else - seq_printf(m, "%u", *(u8 *)data); + btf_show_type_value(show, "%u", *(u8 *)safe_data); break; default: - btf_int_bits_seq_show(btf, t, data, bits_offset, m); + btf_int_bits_show(btf, t, safe_data, bits_offset, show); + break; } +out: + btf_show_end_type(show); }
static const struct btf_kind_operations int_ops = { @@ -1590,7 +2147,7 @@ static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, .check_member = btf_int_check_member, .check_kflag_member = btf_int_check_kflag_member, .log_details = btf_int_log, - .seq_show = btf_int_seq_show, + .show = btf_int_show, };
static int btf_modifier_check_member(struct btf_verifier_env *env, @@ -1854,34 +2411,44 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, return 0; }
-static void btf_modifier_seq_show(const struct btf *btf, - const struct btf_type *t, - u32 type_id, void *data, - u8 bits_offset, struct seq_file *m) +static void btf_modifier_show(const struct btf *btf, + const struct btf_type *t, + u32 type_id, void *data, + u8 bits_offset, struct btf_show *show) { if (btf->resolved_ids) t = btf_type_id_resolve(btf, &type_id); else t = btf_type_skip_modifiers(btf, type_id, NULL);
- btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); + btf_type_ops(t)->show(btf, t, type_id, data, bits_offset, show); }
-static void btf_var_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void btf_var_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { t = btf_type_id_resolve(btf, &type_id);
- btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); + btf_type_ops(t)->show(btf, t, type_id, data, bits_offset, show); }
-static void btf_ptr_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void btf_ptr_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { - /* It is a hashed value */ - seq_printf(m, "%p", *(void **)data); + void *safe_data; + + safe_data = btf_show_start_type(show, t, type_id, data); + if (!safe_data) + return; + + /* It is a hashed value unless BTF_SHOW_PTR_RAW is specified */ + if (show->flags & BTF_SHOW_PTR_RAW) + btf_show_type_value(show, "0x%px", *(void **)safe_data); + else + btf_show_type_value(show, "0x%p", *(void **)safe_data); + btf_show_end_type(show); }
static void btf_ref_type_log(struct btf_verifier_env *env, @@ -1896,7 +2463,7 @@ static void btf_ref_type_log(struct btf_verifier_env *env, .check_member = btf_modifier_check_member, .check_kflag_member = btf_modifier_check_kflag_member, .log_details = btf_ref_type_log, - .seq_show = btf_modifier_seq_show, + .show = btf_modifier_show, };
static struct btf_kind_operations ptr_ops = { @@ -1905,7 +2472,7 @@ static void btf_ref_type_log(struct btf_verifier_env *env, .check_member = btf_ptr_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_ref_type_log, - .seq_show = btf_ptr_seq_show, + .show = btf_ptr_show, };
static s32 btf_fwd_check_meta(struct btf_verifier_env *env, @@ -1946,7 +2513,7 @@ static void btf_fwd_type_log(struct btf_verifier_env *env, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_fwd_type_log, - .seq_show = btf_df_seq_show, + .show = btf_df_show, };
static int btf_array_check_member(struct btf_verifier_env *env, @@ -2105,28 +2672,90 @@ static void btf_array_log(struct btf_verifier_env *env, array->type, array->index_type, array->nelems); }
-static void btf_array_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void __btf_array_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { const struct btf_array *array = btf_type_array(t); const struct btf_kind_operations *elem_ops; const struct btf_type *elem_type; - u32 i, elem_size, elem_type_id; + u32 i, elem_size = 0, elem_type_id; + u16 encoding = 0;
elem_type_id = array->type; - elem_type = btf_type_id_size(btf, &elem_type_id, &elem_size); + elem_type = btf_type_skip_modifiers(btf, elem_type_id, NULL); + if (elem_type && btf_type_has_size(elem_type)) + elem_size = elem_type->size; + + if (elem_type && btf_type_is_int(elem_type)) { + u32 int_type = btf_type_int(elem_type); + + encoding = BTF_INT_ENCODING(int_type); + + /* + * BTF_INT_CHAR encoding never seems to be set for + * char arrays, so if size is 1 and element is + * printable as a char, we'll do that. + */ + if (elem_size == 1) + encoding = BTF_INT_CHAR; + } + + if (!btf_show_start_array_type(show, t, type_id, encoding, data)) + return; + + if (!elem_type) + goto out; elem_ops = btf_type_ops(elem_type); - seq_puts(m, "["); + for (i = 0; i < array->nelems; i++) { - if (i) - seq_puts(m, ",");
- elem_ops->seq_show(btf, elem_type, elem_type_id, data, - bits_offset, m); + btf_show_start_array_member(show); + + elem_ops->show(btf, elem_type, elem_type_id, data, + bits_offset, show); data += elem_size; + + btf_show_end_array_member(show); + + if (show->state.array_terminated) + break; + } +out: + btf_show_end_array_type(show); +} + +static void btf_array_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) +{ + const struct btf_member *m = show->state.member; + + /* + * First check if any members would be shown (are non-zero). + * See comments above "struct btf_show" definition for more + * details on how this works at a high-level. + */ + if (show->state.depth > 0 && !(show->flags & BTF_SHOW_ZERO)) { + if (!show->state.depth_check) { + show->state.depth_check = show->state.depth + 1; + show->state.depth_to_show = 0; + } + __btf_array_show(btf, t, type_id, data, bits_offset, show); + show->state.member = m; + + if (show->state.depth_check != show->state.depth + 1) + return; + show->state.depth_check = 0; + + if (show->state.depth_to_show <= show->state.depth) + return; + /* + * Reaching here indicates we have recursed and found + * non-zero array member(s). + */ } - seq_puts(m, "]"); + __btf_array_show(btf, t, type_id, data, bits_offset, show); }
static struct btf_kind_operations array_ops = { @@ -2135,7 +2764,7 @@ static void btf_array_seq_show(const struct btf *btf, const struct btf_type *t, .check_member = btf_array_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_array_log, - .seq_show = btf_array_seq_show, + .show = btf_array_show, };
static int btf_struct_check_member(struct btf_verifier_env *env, @@ -2358,15 +2987,18 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t) return off; }
-static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void __btf_struct_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { - const char *seq = BTF_INFO_KIND(t->info) == BTF_KIND_UNION ? "|" : ","; const struct btf_member *member; + void *safe_data; u32 i;
- seq_puts(m, "{"); + safe_data = btf_show_start_struct_type(show, t, type_id, data); + if (!safe_data) + return; + for_each_member(i, t, member) { const struct btf_type *member_type = btf_type_by_id(btf, member->type); @@ -2375,23 +3007,60 @@ static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, u32 bytes_offset; u8 bits8_offset;
- if (i) - seq_puts(m, seq); + btf_show_start_member(show, member);
member_offset = btf_member_bit_offset(t, member); bitfield_size = btf_member_bitfield_size(t, member); bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); bits8_offset = BITS_PER_BYTE_MASKED(member_offset); if (bitfield_size) { - btf_bitfield_seq_show(data + bytes_offset, bits8_offset, - bitfield_size, m); + btf_bitfield_show(safe_data + bytes_offset, + bits8_offset, + bitfield_size, show); } else { ops = btf_type_ops(member_type); - ops->seq_show(btf, member_type, member->type, - data + bytes_offset, bits8_offset, m); + ops->show(btf, member_type, member->type, + data + bytes_offset, bits8_offset, show); } + + btf_show_end_member(show); } - seq_puts(m, "}"); + + btf_show_end_struct_type(show); +} + +static void btf_struct_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) +{ + const struct btf_member *m = show->state.member; + + /* + * First check if any members would be shown (are non-zero). + * See comments above "struct btf_show" definition for more + * details on how this works at a high-level. + */ + if (show->state.depth > 0 && !(show->flags & BTF_SHOW_ZERO)) { + if (!show->state.depth_check) { + show->state.depth_check = show->state.depth + 1; + show->state.depth_to_show = 0; + } + __btf_struct_show(btf, t, type_id, data, bits_offset, show); + /* Restore saved member data here */ + show->state.member = m; + if (show->state.depth_check != show->state.depth + 1) + return; + show->state.depth_check = 0; + + if (show->state.depth_to_show <= show->state.depth) + return; + /* + * Reaching here indicates we have recursed and found + * non-zero child values. + */ + } + + __btf_struct_show(btf, t, type_id, data, bits_offset, show); }
static struct btf_kind_operations struct_ops = { @@ -2400,7 +3069,7 @@ static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, .check_member = btf_struct_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_struct_log, - .seq_show = btf_struct_seq_show, + .show = btf_struct_show, };
static int btf_enum_check_member(struct btf_verifier_env *env, @@ -2531,24 +3200,35 @@ static void btf_enum_log(struct btf_verifier_env *env, btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); }
-static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t, - u32 type_id, void *data, u8 bits_offset, - struct seq_file *m) +static void btf_enum_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct btf_show *show) { const struct btf_enum *enums = btf_type_enum(t); u32 i, nr_enums = btf_type_vlen(t); - int v = *(int *)data; + void *safe_data; + int v; + + safe_data = btf_show_start_type(show, t, type_id, data); + if (!safe_data) + return; + + v = *(int *)safe_data;
for (i = 0; i < nr_enums; i++) { - if (v == enums[i].val) { - seq_printf(m, "%s", - __btf_name_by_offset(btf, - enums[i].name_off)); - return; - } + if (v != enums[i].val) + continue; + + btf_show_type_value(show, "%s", + __btf_name_by_offset(btf, + enums[i].name_off)); + + btf_show_end_type(show); + return; }
- seq_printf(m, "%d", v); + btf_show_type_value(show, "%d", v); + btf_show_end_type(show); }
static struct btf_kind_operations enum_ops = { @@ -2557,7 +3237,7 @@ static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t, .check_member = btf_enum_check_member, .check_kflag_member = btf_enum_check_kflag_member, .log_details = btf_enum_log, - .seq_show = btf_enum_seq_show, + .show = btf_enum_show, };
static s32 btf_func_proto_check_meta(struct btf_verifier_env *env, @@ -2644,7 +3324,7 @@ static void btf_func_proto_log(struct btf_verifier_env *env, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_func_proto_log, - .seq_show = btf_df_seq_show, + .show = btf_df_show, };
static s32 btf_func_check_meta(struct btf_verifier_env *env, @@ -2678,7 +3358,7 @@ static s32 btf_func_check_meta(struct btf_verifier_env *env, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_ref_type_log, - .seq_show = btf_df_seq_show, + .show = btf_df_show, };
static s32 btf_var_check_meta(struct btf_verifier_env *env, @@ -2742,7 +3422,7 @@ static void btf_var_log(struct btf_verifier_env *env, const struct btf_type *t) .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_var_log, - .seq_show = btf_var_seq_show, + .show = btf_var_show, };
static s32 btf_datasec_check_meta(struct btf_verifier_env *env, @@ -2868,24 +3548,28 @@ static void btf_datasec_log(struct btf_verifier_env *env, btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); }
-static void btf_datasec_seq_show(const struct btf *btf, - const struct btf_type *t, u32 type_id, - void *data, u8 bits_offset, - struct seq_file *m) +static void btf_datasec_show(const struct btf *btf, + const struct btf_type *t, u32 type_id, + void *data, u8 bits_offset, + struct btf_show *show) { const struct btf_var_secinfo *vsi; const struct btf_type *var; u32 i;
- seq_printf(m, "section ("%s") = {", __btf_name_by_offset(btf, t->name_off)); + if (!btf_show_start_type(show, t, type_id, data)) + return; + + btf_show_type_value(show, "section ("%s") = {", + __btf_name_by_offset(btf, t->name_off)); for_each_vsi(i, t, vsi) { var = btf_type_by_id(btf, vsi->type); if (i) - seq_puts(m, ","); - btf_type_ops(var)->seq_show(btf, var, vsi->type, - data + vsi->offset, bits_offset, m); + btf_show(show, ","); + btf_type_ops(var)->show(btf, var, vsi->type, + data + vsi->offset, bits_offset, show); } - seq_puts(m, "}"); + btf_show_end_type(show); }
static const struct btf_kind_operations datasec_ops = { @@ -2894,7 +3578,7 @@ static void btf_datasec_seq_show(const struct btf *btf, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_datasec_log, - .seq_show = btf_datasec_seq_show, + .show = btf_datasec_show, };
static int btf_func_proto_check(struct btf_verifier_env *env, @@ -4516,12 +5200,84 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog, return 0; }
-void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, - struct seq_file *m) +void btf_type_show(const struct btf *btf, u32 type_id, void *obj, + struct btf_show *show) { const struct btf_type *t = btf_type_by_id(btf, type_id);
- btf_type_ops(t)->seq_show(btf, t, type_id, obj, 0, m); + show->btf = btf; + memset(&show->state, 0, sizeof(show->state)); + memset(&show->obj, 0, sizeof(show->obj)); + + btf_type_ops(t)->show(btf, t, type_id, obj, 0, show); +} + +static void btf_seq_show(struct btf_show *show, const char *fmt, ...) +{ + va_list args; + + va_start(args, fmt); + seq_vprintf((struct seq_file *)show->target, fmt, args); + va_end(args); +} + +void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, + struct seq_file *m) +{ + struct btf_show sseq; + + sseq.target = m; + sseq.showfn = btf_seq_show; + sseq.flags = BTF_SHOW_NONAME | BTF_SHOW_COMPACT | BTF_SHOW_ZERO | + BTF_SHOW_UNSAFE; + + btf_type_show(btf, type_id, obj, &sseq); +} + +/* pass in trace id and record number of bytes written. */ +struct btf_show_trace { + struct btf_show show; + __u32 trace_id; + int len; /* length written */ +}; + +static void btf_trace_show(struct btf_show *show, const char *fmt, ...) +{ + struct btf_show_trace *strace = (struct btf_show_trace *)show; + va_list args; + int len; + + va_start(args, fmt); + len = bpf_trace_vprintk(strace->trace_id, fmt, args); + va_end(args); + + if (len > 0) + strace->len += len; +} + +int btf_type_trace_show(const struct btf *btf, u32 type_id, void *obj, + u32 trace_id, u64 flags) +{ + struct btf_show_trace strace; + + strace.show.showfn = btf_trace_show; + strace.show.target = NULL; + /* Each trace event in object display is already emitted on a different + * line; suppress show newline output to avoid a double newline after + * each event. + */ + strace.show.flags = flags | BTF_SHOW_NONEWLINE; + strace.trace_id = trace_id; + strace.len = 0; + + btf_type_show(btf, type_id, obj, (struct btf_show *)&strace); + + /* If we encontered an error, return it. */ + if (strace.show.state.status) + return strace.show.state.status; + + /* Otherwise return length we would have written */ + return strace.len; }
#ifdef CONFIG_PROC_FS diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index cb91ef9..6453a75 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -383,26 +383,35 @@ static void bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype,
#define BPF_TRACE_PRINTK_SIZE 1024
-static inline __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...) +int bpf_trace_vprintk(__u32 trace_id, const char *fmt, va_list ap) { static char buf[BPF_TRACE_PRINTK_SIZE]; unsigned long flags; - va_list ap; int ret;
raw_spin_lock_irqsave(&trace_printk_lock, flags); - va_start(ap, fmt); ret = vsnprintf(buf, sizeof(buf), fmt, ap); - va_end(ap); /* vsnprintf() will not append null for zero-length strings */ if (ret == 0) buf[0] = '\0'; - trace_bpf_trace_printk(buf); + trace_bpf_trace_printk(buf, trace_id); raw_spin_unlock_irqrestore(&trace_printk_lock, flags);
return ret; }
+static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...) +{ + va_list ap; + int ret; + + va_start(ap, fmt); + ret = bpf_trace_vprintk(0, fmt, ap); + va_end(ap); + + return ret; +} + /* * Only limited trace_printk() conversion specifiers allowed: * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pB %pks %pus %s diff --git a/kernel/trace/bpf_trace.h b/kernel/trace/bpf_trace.h index 9acbc11..5c401b3 100644 --- a/kernel/trace/bpf_trace.h +++ b/kernel/trace/bpf_trace.h @@ -10,16 +10,18 @@
TRACE_EVENT(bpf_trace_printk,
- TP_PROTO(const char *bpf_string), + TP_PROTO(const char *bpf_string, __u32 trace_id),
- TP_ARGS(bpf_string), + TP_ARGS(bpf_string, trace_id),
TP_STRUCT__entry( __string(bpf_string, bpf_string) + __field(__u32, trace_id) ),
TP_fast_assign( __assign_str(bpf_string, bpf_string); + __entry->trace_id = trace_id; ),
TP_printk("%s", __get_str(bpf_string))
On Thu, Aug 06, 2020 at 03:42:23PM +0100, Alan Maguire wrote:
The bpf_trace_printk tracepoint is augmented with a "trace_id" field; it is used to allow tracepoint filtering as typed display information can easily be interspersed with other tracing data, making it hard to read. Specifying a trace_id will allow users to selectively trace data, eliminating noise.
Since trace_id is not seen in trace_pipe, how do you expect users to filter by it? It also feels like workaround. May be let bpf prog print the whole struct in one go with multiple new lines and call trace_bpf_trace_printk(buf) once?
Also please add interface into bpf_seq_printf. BTF enabled struct prints is useful for iterators too and generalization you've done in this patch pretty much takes it there.
+/*
- Options to control show behaviour.
- BTF_SHOW_COMPACT: no formatting around type information
- BTF_SHOW_NONAME: no struct/union member names/types
- BTF_SHOW_PTR_RAW: show raw (unobfuscated) pointer values;
equivalent to %px.
- BTF_SHOW_ZERO: show zero-valued struct/union members; they
are not displayed by default
- BTF_SHOW_NONEWLINE: include indent, but suppress newline;
to be used when a show function implicitly includes a newline.
- BTF_SHOW_UNSAFE: skip use of bpf_probe_read() to safely read
data before displaying it.
- */
+#define BTF_SHOW_COMPACT (1ULL << 0) +#define BTF_SHOW_NONAME (1ULL << 1) +#define BTF_SHOW_PTR_RAW (1ULL << 2) +#define BTF_SHOW_ZERO (1ULL << 3) +#define BTF_SHOW_NONEWLINE (1ULL << 32) +#define BTF_SHOW_UNSAFE (1ULL << 33)
I could have missed it earlier, but what is the motivation to leave the gap in bits? Just do bit 4 and 5 ?
On Wed, 12 Aug 2020, Alexei Starovoitov wrote:
On Thu, Aug 06, 2020 at 03:42:23PM +0100, Alan Maguire wrote:
The bpf_trace_printk tracepoint is augmented with a "trace_id" field; it is used to allow tracepoint filtering as typed display information can easily be interspersed with other tracing data, making it hard to read. Specifying a trace_id will allow users to selectively trace data, eliminating noise.
Since trace_id is not seen in trace_pipe, how do you expect users to filter by it?
Sorry should have specified this. The approach is to use trace instances and filtering such that we only see events associated with a specific trace_id. There's no need for the trace event to actually display the trace_id - it's still usable as a filter. The steps involved are:
1. create a trace instance within which we can specify a fresh set of trace event enablings, filters etc.
mkdir /sys/kernel/debug/tracing/instances/traceid100
2. enable the filter for the specific trace id
echo "trace_id == 100" > /sys/kernel/debug/tracing/instances/traceid100/events/bpf_trace/bpf_trace_printk/filter
3. enable the trace event
echo 1 > /sys/kernel/debug/tracing/instances/events/bpf_trace/bpf_trace_printk/enable
4. ensure the BPF program uses a trace_id 100 when calling bpf_trace_btf()
So the above can be done for multiple programs; output can then be separated for different programs if trace_ids and filtering are used together. The above trace instance only sees bpf_trace_btf() events which specify trace_id 100.
I've attached a tweaked version of the patch 4 in the patchset that ensures that a trace instance with filtering enabled as above sees the bpf_trace_btf events, but _not_ bpf_trace_printk events (since they have trace_id 0 by default).
To me the above provides a simple way to separate BPF program output for simple BPF programs; ringbuf and perf events require a bit more work in both BPF and userspace to support such coordination. What do you think - does this approach seem worth using? If so we could also consider extending it to bpf_trace_printk(), if we can find a way to provide a trace_id there too.
It also feels like workaround. May be let bpf prog print the whole struct in one go with multiple new lines and call trace_bpf_trace_printk(buf) once?
We can do that absolutely, but I'd be interested to get your take on the filtering mechanism before taking that approach. I'll add a description of the above mechanism to the cover letter and patch to be clearer next time too.
Also please add interface into bpf_seq_printf. BTF enabled struct prints is useful for iterators too and generalization you've done in this patch pretty much takes it there.
Sure, I'll try and tackle that next time.
+#define BTF_SHOW_COMPACT (1ULL << 0) +#define BTF_SHOW_NONAME (1ULL << 1) +#define BTF_SHOW_PTR_RAW (1ULL << 2) +#define BTF_SHOW_ZERO (1ULL << 3) +#define BTF_SHOW_NONEWLINE (1ULL << 32) +#define BTF_SHOW_UNSAFE (1ULL << 33)
I could have missed it earlier, but what is the motivation to leave the gap in bits? Just do bit 4 and 5 ?
Patch 3 uses the first 4 as flags to bpf_trace_btf(); the final two are not supported for the helper as flag values so I wanted to leave some space for additional bpf_trace_btf() flags. BTF_SHOW_NONEWLINE is always used for bpf_trace_btf(), since the tracing adds a newline for us and we don't want to double up on newlines, so it's ORed in as an implicit argument for the bpf_trace_btf() case. BTF_SHOW_UNSAFE isn't allowed within BPF so it's not available as a flag for the helper.
Thanks!
Alan
From 10bd268b2585084c8f35d1b6ab0c3df76203f5cc Mon Sep 17 00:00:00 2001
From: Alan Maguire alan.maguire@oracle.com Date: Thu, 6 Aug 2020 14:21:10 +0200 Subject: [PATCH] selftests/bpf: add bpf_trace_btf helper tests
Basic tests verifying various flag combinations for bpf_trace_btf() using a tp_btf program to trace skb data. Also verify that we can create a trace instance to filter trace data, using the trace_id value passed to bpf_trace/bpf_trace_printk events.
trace_id is specifiable for bpf_trace_btf() so the test ensures the trace instance sees the filtered events only.
Signed-off-by: Alan Maguire alan.maguire@oracle.com --- tools/testing/selftests/bpf/prog_tests/trace_btf.c | 150 +++++++++++++++++++++ .../selftests/bpf/progs/netif_receive_skb.c | 48 +++++++ 2 files changed, 198 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_btf.c create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
diff --git a/tools/testing/selftests/bpf/prog_tests/trace_btf.c b/tools/testing/selftests/bpf/prog_tests/trace_btf.c new file mode 100644 index 0000000..64a920f --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/trace_btf.c @@ -0,0 +1,150 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <test_progs.h> + +#include "netif_receive_skb.skel.h" + +#define TRACEFS "/sys/kernel/debug/tracing/" +#define TRACEBUF TRACEFS "trace_pipe" +#define TRACEID "100" +#define TRACEINST TRACEFS "instances/traceid" TRACEID "/" +#define TRACEINSTBUF TRACEINST "trace_pipe" +#define TRACEINSTEVENT TRACEINST "events/bpf_trace/bpf_trace_printk/" +#define BTFMSG "(struct sk_buff)" +#define PRINTKMSG "testing,testing" +#define MAXITER 1000 + +static bool findmsg(FILE *fp, const char *msg) +{ + bool found = false; + char *buf = NULL; + size_t buflen; + int iter = 0; + + /* verify our search string is in the trace buffer */ + while (++iter < MAXITER && + (getline(&buf, &buflen, fp) >= 0 || errno == EAGAIN)) { + found = strstr(buf, msg) != NULL; + if (found) + break; + } + free(buf); + + return found; +} + +/* Demonstrate that bpf_trace_btf succeeds with non-zero return values, + * and that filtering by trace instance works. bpf_trace_btf() is called + * with trace_id of 100, so we create a trace instance that enables + * bpf_trace_printk() events and filters on trace_id. The trace instance + * pipe should therefore only see the trace_id == 100 events, i.e. the + * bpf_trace_btf() events. The bpf program also uses bpf_trace_printk() - + * we ensure the global trace_pipe sees those but the instance does not. + */ +void test_trace_btf(void) +{ + struct netif_receive_skb *skel; + struct netif_receive_skb__bss *bss; + FILE *fp = NULL, *fpinst = NULL; + int err, duration = 0; + + skel = netif_receive_skb__open(); + if (CHECK(!skel, "skel_open", "failed to open skeleton\n")) + return; + + err = netif_receive_skb__load(skel); + if (CHECK(err, "skel_load", "failed to load skeleton: %d\n", err)) + goto cleanup; + + bss = skel->bss; + + err = netif_receive_skb__attach(skel); + if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err)) + goto cleanup; + + /* Enable trace instance which filters on trace id associated with + * bpf_trace_btf(). + */ + if (CHECK(system("mkdir " TRACEINST), "could not create trace instance", + "error creating %s", TRACEINST)) + goto cleanup; + + if (CHECK(system("echo "trace_id == " TRACEID "" > " TRACEINSTEVENT "filter"), + "could not specify trace id filter for %s", TRACEID)) + goto cleanup; + if (CHECK(system("echo 1 > " TRACEINSTEVENT "enable"), "could not enable trace event", + "error enabling %s", TRACEINSTEVENT)) + goto cleanup; + + fp = fopen(TRACEBUF, "r"); + if (CHECK(fp == NULL, "could not open trace buffer", + "error %d opening %s", errno, TRACEBUF)) + goto cleanup; + + /* We do not want to wait forever if this test fails... */ + fcntl(fileno(fp), F_SETFL, O_NONBLOCK); + + fpinst = fopen(TRACEINSTBUF, "r"); + if (CHECK(fpinst == NULL, "could not open instance trace buffer", + "error %d opening %s", errno, TRACEINSTBUF)) + goto cleanup; + + /* We do not want to wait forever if this test fails... */ + fcntl(fileno(fpinst), F_SETFL, O_NONBLOCK); + + /* generate receive event */ + system("ping -c 1 127.0.0.1 > /dev/null"); + + /* + * Make sure netif_receive_skb program was triggered + * and it set expected return values from bpf_trace_printk()s + * and all tests ran. + */ + if (CHECK(bss->ret <= 0, + "bpf_trace_btf: got return value", + "ret <= 0 %ld test %d\n", bss->ret, bss->num_subtests)) + goto cleanup; + + if (CHECK(bss->num_subtests != bss->ran_subtests, "check all subtests ran", + "only ran %d of %d tests\n", bss->num_subtests, + bss->ran_subtests)) + goto cleanup; + + /* All messages should be in global trace_pipe */ + if (CHECK(!findmsg(fp, BTFMSG), "global trace pipe should have BTF", + "btf data not in trace pipe")) + goto cleanup; + + + if (CHECK(!findmsg(fp, PRINTKMSG), + "global trace pipe should have printk", + "trace_printk data not in global trace pipe")) + goto cleanup; + + + if (CHECK(!findmsg(fpinst, BTFMSG), + "instance trace pipe should have BTF", + "btf data not in instance trace pipe")) + goto cleanup; + + + /* we filtered on trace id, so bpf_trace_printk() message should not + * be in our trace instance log. + */ + CHECK(findmsg(fpinst, PRINTKMSG), + "instance trace pipe should not have printk", + "trace_printk data should not be in instance trace pipe"); + +cleanup: + if (fpinst) + fclose(fpinst); + if (fp) + fclose(fp); + + system("echo 0 > " TRACEINSTEVENT "enable"); + system("rmdir " TRACEINST); + netif_receive_skb__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/netif_receive_skb.c b/tools/testing/selftests/bpf/progs/netif_receive_skb.c new file mode 100644 index 0000000..a14f79b --- /dev/null +++ b/tools/testing/selftests/bpf/progs/netif_receive_skb.c @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020, Oracle and/or its affiliates. */ +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +long ret = 0; +int num_subtests = 0; +int ran_subtests = 0; + +/* Note a trace id of 100 is used, allowing filtering. */ +#define CHECK_TRACE(_p, flags) \ + do { \ + ++num_subtests; \ + if (ret >= 0) { \ + ++ran_subtests; \ + ret = bpf_trace_btf(_p, sizeof(*(_p)), 100, flags); \ + } \ + } while (0) + +/* TRACE_EVENT(netif_receive_skb, + * TP_PROTO(struct sk_buff *skb), + */ +SEC("tp_btf/netif_receive_skb") +int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb) +{ + static const char printkmsg[] = "testing,testing, ran %d so far..."; + static const char skb_type[] = "struct sk_buff"; + static struct btf_ptr p = { }; + + p.ptr = skb; + p.type = skb_type; + + /* This message will go to the global tracing log */ + bpf_trace_printk(printkmsg, sizeof(printkmsg), ran_subtests); + + CHECK_TRACE(&p, 0); + CHECK_TRACE(&p, BTF_TRACE_F_COMPACT); + CHECK_TRACE(&p, BTF_TRACE_F_NONAME); + CHECK_TRACE(&p, BTF_TRACE_F_PTR_RAW); + CHECK_TRACE(&p, BTF_TRACE_F_ZERO); + CHECK_TRACE(&p, BTF_TRACE_F_COMPACT | BTF_TRACE_F_NONAME | + BTF_TRACE_F_PTR_RAW | BTF_TRACE_F_ZERO); + + return 0; +}
On Fri, Aug 14, 2020 at 02:06:37PM +0100, Alan Maguire wrote:
On Wed, 12 Aug 2020, Alexei Starovoitov wrote:
On Thu, Aug 06, 2020 at 03:42:23PM +0100, Alan Maguire wrote:
The bpf_trace_printk tracepoint is augmented with a "trace_id" field; it is used to allow tracepoint filtering as typed display information can easily be interspersed with other tracing data, making it hard to read. Specifying a trace_id will allow users to selectively trace data, eliminating noise.
Since trace_id is not seen in trace_pipe, how do you expect users to filter by it?
Sorry should have specified this. The approach is to use trace instances and filtering such that we only see events associated with a specific trace_id. There's no need for the trace event to actually display the trace_id - it's still usable as a filter. The steps involved are:
- create a trace instance within which we can specify a fresh set of trace event enablings, filters etc.
mkdir /sys/kernel/debug/tracing/instances/traceid100
- enable the filter for the specific trace id
echo "trace_id == 100" > /sys/kernel/debug/tracing/instances/traceid100/events/bpf_trace/bpf_trace_printk/filter
- enable the trace event
echo 1 > /sys/kernel/debug/tracing/instances/events/bpf_trace/bpf_trace_printk/enable
- ensure the BPF program uses a trace_id 100 when calling bpf_trace_btf()
ouch. I think you interpreted the acceptance of the commit 7fb20f9e901e ("bpf, doc: Remove references to warning message when using bpf_trace_printk()") in the wrong way.
Everything that doc had said is still valid. In particular: -A: This is done to nudge program authors into better interfaces when -programs need to pass data to user space. Like bpf_perf_event_output() -can be used to efficiently stream data via perf ring buffer. -BPF maps can be used for asynchronous data sharing between kernel -and user space. bpf_trace_printk() should only be used for debugging.
bpf_trace_printk is for debugging only. _debugging of bpf programs themselves_. What you're describing above is logging and tracing. It's not debugging of programs. perf buffer, ring buffer, and seq_file interfaces are the right interfaces for tracing, logging, and kernel debugging.
It also feels like workaround. May be let bpf prog print the whole struct in one go with multiple new lines and call trace_bpf_trace_printk(buf) once?
We can do that absolutely, but I'd be interested to get your take on the filtering mechanism before taking that approach. I'll add a description of the above mechanism to the cover letter and patch to be clearer next time too.
I think patch 3 is no go, because it takes bpf_trace_printk in the wrong direction. Instead please refactor it to use string buffer or seq_file as an output. If the user happen to use bpf_trace_printk("%s", buf); after that to print that string buffer to trace_pipe that's user's choice. I can see such use case when program author wants to debug their bpf program. That's fine. But for kernel debugging, on demand and "always on" logging and tracing the documentation should point to sustainable interfaces that don't interfere with each other, can be run in parallel by multiple users, etc.
Also please add interface into bpf_seq_printf. BTF enabled struct prints is useful for iterators too and generalization you've done in this patch pretty much takes it there.
Sure, I'll try and tackle that next time.
+#define BTF_SHOW_COMPACT (1ULL << 0) +#define BTF_SHOW_NONAME (1ULL << 1) +#define BTF_SHOW_PTR_RAW (1ULL << 2) +#define BTF_SHOW_ZERO (1ULL << 3) +#define BTF_SHOW_NONEWLINE (1ULL << 32) +#define BTF_SHOW_UNSAFE (1ULL << 33)
I could have missed it earlier, but what is the motivation to leave the gap in bits? Just do bit 4 and 5 ?
Patch 3 uses the first 4 as flags to bpf_trace_btf(); the final two are not supported for the helper as flag values so I wanted to leave some space for additional bpf_trace_btf() flags. BTF_SHOW_NONEWLINE is always used for bpf_trace_btf(), since the tracing adds a newline for us and we don't want to double up on newlines, so it's ORed in as an implicit argument for the bpf_trace_btf() case. BTF_SHOW_UNSAFE isn't allowed within BPF so it's not available as a flag for the helper.
I see. I think such approach is more error prone than explicit filtering. It's better to check for NONEWLINE and UNSAFE explicitly in those places instead of playing obscure game of upper bits. Also the choice of NONEWLINE vs NEWLINE should be available from the program when it wants to print it into a string or into a seq_file.
On Fri, 14 Aug 2020, Alexei Starovoitov wrote:
On Fri, Aug 14, 2020 at 02:06:37PM +0100, Alan Maguire wrote:
On Wed, 12 Aug 2020, Alexei Starovoitov wrote:
On Thu, Aug 06, 2020 at 03:42:23PM +0100, Alan Maguire wrote:
The bpf_trace_printk tracepoint is augmented with a "trace_id" field; it is used to allow tracepoint filtering as typed display information can easily be interspersed with other tracing data, making it hard to read. Specifying a trace_id will allow users to selectively trace data, eliminating noise.
Since trace_id is not seen in trace_pipe, how do you expect users to filter by it?
Sorry should have specified this. The approach is to use trace instances and filtering such that we only see events associated with a specific trace_id. There's no need for the trace event to actually display the trace_id - it's still usable as a filter. The steps involved are:
- create a trace instance within which we can specify a fresh set of trace event enablings, filters etc.
mkdir /sys/kernel/debug/tracing/instances/traceid100
- enable the filter for the specific trace id
echo "trace_id == 100" > /sys/kernel/debug/tracing/instances/traceid100/events/bpf_trace/bpf_trace_printk/filter
- enable the trace event
echo 1 > /sys/kernel/debug/tracing/instances/events/bpf_trace/bpf_trace_printk/enable
- ensure the BPF program uses a trace_id 100 when calling bpf_trace_btf()
ouch. I think you interpreted the acceptance of the commit 7fb20f9e901e ("bpf, doc: Remove references to warning message when using bpf_trace_printk()") in the wrong way.
Everything that doc had said is still valid. In particular: -A: This is done to nudge program authors into better interfaces when -programs need to pass data to user space. Like bpf_perf_event_output() -can be used to efficiently stream data via perf ring buffer. -BPF maps can be used for asynchronous data sharing between kernel -and user space. bpf_trace_printk() should only be used for debugging.
bpf_trace_printk is for debugging only. _debugging of bpf programs themselves_. What you're describing above is logging and tracing. It's not debugging of programs. perf buffer, ring buffer, and seq_file interfaces are the right interfaces for tracing, logging, and kernel debugging.
It also feels like workaround. May be let bpf prog print the whole struct in one go with multiple new lines and call trace_bpf_trace_printk(buf) once?
We can do that absolutely, but I'd be interested to get your take on the filtering mechanism before taking that approach. I'll add a description of the above mechanism to the cover letter and patch to be clearer next time too.
I think patch 3 is no go, because it takes bpf_trace_printk in the wrong direction. Instead please refactor it to use string buffer or seq_file as an output.
Fair enough. I'm thinking a helper like
long bpf_btf_snprintf(char *str, u32 str_size, struct btf_ptr *ptr, u32 ptr_size, u64 flags);
Then the user can choose perf event or ringbuf interfaces to share the results with userspace.
If the user happen to use bpf_trace_printk("%s", buf); after that to print that string buffer to trace_pipe that's user's choice. I can see such use case when program author wants to debug their bpf program. That's fine. But for kernel debugging, on demand and "always on" logging and tracing the documentation should point to sustainable interfaces that don't interfere with each other, can be run in parallel by multiple users, etc.
The problem with bpf_trace_printk() under this approach is that the string size for %s arguments is very limited; bpf_trace_printk() restricts these to 64 bytes in size. Looks like bpf_seq_printf() restricts a %s string to 128 bytes also. We could add an additional helper for the bpf_seq case which calls bpf_seq_printf() for each component in the object, i.e.
long bpf_seq_btf_printf(struct seq_file *m, struct btf_ptr *ptr, u32 ptr_size, u64 flags);
This would steer users away from bpf_trace_printk() for this use case - since it can print only a small amount of the string - while supporting all the other user-space communication mechanisms.
Alan
On Tue, Aug 18, 2020 at 10:12:05AM +0100, Alan Maguire wrote:
Fair enough. I'm thinking a helper like
long bpf_btf_snprintf(char *str, u32 str_size, struct btf_ptr *ptr, u32 ptr_size, u64 flags);
Then the user can choose perf event or ringbuf interfaces to share the results with userspace.
makes sense.
If the user happen to use bpf_trace_printk("%s", buf); after that to print that string buffer to trace_pipe that's user's choice. I can see such use case when program author wants to debug their bpf program. That's fine. But for kernel debugging, on demand and "always on" logging and tracing the documentation should point to sustainable interfaces that don't interfere with each other, can be run in parallel by multiple users, etc.
The problem with bpf_trace_printk() under this approach is that the string size for %s arguments is very limited; bpf_trace_printk() restricts these to 64 bytes in size. Looks like bpf_seq_printf() restricts a %s string to 128 bytes also. We could add an additional helper for the bpf_seq case which calls bpf_seq_printf() for each component in the object, i.e.
long bpf_seq_btf_printf(struct seq_file *m, struct btf_ptr *ptr, u32 ptr_size, u64 flags);
yeah. this one is needed as well. Please double check that it works out of bpf iterator too. Especially the case when output is not power of 2. bpf_iter.c keeps page sized buffer and will restart iteration on overflow. Could you please modify progs/bpf_iter_task.c to do bpf_seq_btf_printf(seq, {task, }, 0); and check that the output doesn't contain repeated/garbled lines.
Also I'm not sure why you see 64 limit of bpf_trace_printk as a blocker. We could do: if (in_nmi()) use 64 byte on stack else spin_lock_irqsave and use 1k static buffer. and/or we can introduce bpf_trace_puts(const char *buf, u32 size_of_buf); with .arg1_type = ARG_PTR_TO_MEM, .arg2_type = ARG_CONST_SIZE, add null termination check and call trace_bpf_trace_printk() on that buffer. That will avoid a bunch of copies. But that still doesn't make bpf_trace_puts() a good interface for dumping stuff to user space. It's still debug only feature.
A helper is added to support tracing kernel type information in BPF using the BPF Type Format (BTF). Its signature is
long bpf_trace_btf(struct btf_ptr *ptr, u32 btf_ptr_size, u32 trace_id, u64 flags);
struct btf_ptr * specifies
- a pointer to the data to be traced; - the BTF id of the type of data pointed to; or - a string representation of the type of data pointed to - a flags field is provided for future use; these flags are not to be confused with the BTF_TRACE_F_* flags below that control how the btf_ptr is displayed; the flags member of the struct btf_ptr may be used to disambiguate types in kernel versus module BTF, etc; the main distinction is the flags relate to the type and information needed in identifying it; not how it is displayed.
The helper also specifies a trace id which is set for the bpf_trace_printk tracepoint; this allows BPF programs to filter on specific trace ids, ensuring output does not become mixed between different traced events and hard to read.
For example a BPF program with a struct sk_buff *skb could do the following:
static const char *skb_type = "struct sk_buff"; static struct btf_ptr b = { };
b.ptr = skb; b.type = skb_type; bpf_trace_btf(&b, sizeof(b), 0, 0);
Default output in the trace_pipe looks like this:
<idle>-0 [023] d.s. 1825.778400: bpf_trace_printk: (struct sk_buff){ <idle>-0 [023] d.s. 1825.778409: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778410: bpf_trace_printk: (struct){ <idle>-0 [023] d.s. 1825.778412: bpf_trace_printk: .prev = (struct sk_buff *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778413: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778414: bpf_trace_printk: .dev = (struct net_device *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778416: bpf_trace_printk: .dev_scratch = (long unsigned int)18446628460391432192, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778418: bpf_trace_printk: .rbnode = (struct rb_node){ <idle>-0 [023] d.s. 1825.778419: bpf_trace_printk: .rb_right = (struct rb_node *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: .rb_left = (struct rb_node *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778421: bpf_trace_printk: .list = (struct list_head){ <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: .prev = (struct list_head *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778426: bpf_trace_printk: .len = (unsigned int)168, <idle>-0 [023] d.s. 1825.778427: bpf_trace_printk: .mac_len = (__u16)14, <idle>-0 [023] d.s. 1825.778428: bpf_trace_printk: .queue_mapping = (__u16)17, <idle>-0 [023] d.s. 1825.778430: bpf_trace_printk: .head_frag = (__u8)0x1, <idle>-0 [023] d.s. 1825.778431: bpf_trace_printk: .ip_summed = (__u8)0x1, <idle>-0 [023] d.s. 1825.778432: bpf_trace_printk: .l4_hash = (__u8)0x1, <idle>-0 [023] d.s. 1825.778433: bpf_trace_printk: .hash = (__u32)1873247608, <idle>-0 [023] d.s. 1825.778434: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778435: bpf_trace_printk: .napi_id = (unsigned int)8209, <idle>-0 [023] d.s. 1825.778436: bpf_trace_printk: .sender_cpu = (unsigned int)8209, <idle>-0 [023] d.s. 1825.778436: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778437: bpf_trace_printk: .protocol = (__be16)8, <idle>-0 [023] d.s. 1825.778438: bpf_trace_printk: .transport_header = (__u16)226, <idle>-0 [023] d.s. 1825.778439: bpf_trace_printk: .network_header = (__u16)206, <idle>-0 [023] d.s. 1825.778440: bpf_trace_printk: .mac_header = (__u16)192, <idle>-0 [023] d.s. 1825.778440: bpf_trace_printk: .tail = (sk_buff_data_t)374, <idle>-0 [023] d.s. 1825.778441: bpf_trace_printk: .end = (sk_buff_data_t)1728, <idle>-0 [023] d.s. 1825.778442: bpf_trace_printk: .head = (unsigned char *)0x000000009798cb6b, <idle>-0 [023] d.s. 1825.778443: bpf_trace_printk: .data = (unsigned char *)0x0000000064823282, <idle>-0 [023] d.s. 1825.778444: bpf_trace_printk: .truesize = (unsigned int)2304, <idle>-0 [023] d.s. 1825.778445: bpf_trace_printk: .users = (refcount_t){ <idle>-0 [023] d.s. 1825.778445: bpf_trace_printk: .refs = (atomic_t){ <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: .counter = (int)1, <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778448: bpf_trace_printk: }
Flags modifying display are as follows:
- BTF_TRACE_F_COMPACT: no formatting around type information - BTF_TRACE_F_NONAME: no struct/union member names/types - BTF_TRACE_F_PTR_RAW: show raw (unobfuscated) pointer values; equivalent to %px. - BTF_TRACE_F_ZERO: show zero-valued struct/union members; they are not displayed by default
Signed-off-by: Alan Maguire alan.maguire@oracle.com --- include/linux/bpf.h | 1 + include/linux/btf.h | 9 ++-- include/uapi/linux/bpf.h | 63 +++++++++++++++++++++++++ kernel/bpf/core.c | 5 ++ kernel/bpf/helpers.c | 4 ++ kernel/trace/bpf_trace.c | 102 ++++++++++++++++++++++++++++++++++++++++- scripts/bpf_helpers_doc.py | 2 + tools/include/uapi/linux/bpf.h | 63 +++++++++++++++++++++++++ 8 files changed, 243 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6143b6e..f67819d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -934,6 +934,7 @@ struct bpf_event_entry { const char *kernel_type_name(u32 btf_type_id);
const struct bpf_func_proto *bpf_get_trace_printk_proto(void); +const struct bpf_func_proto *bpf_get_trace_btf_proto(void);
typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, unsigned long off, unsigned long len); diff --git a/include/linux/btf.h b/include/linux/btf.h index 46bf9f4..3d31e28 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -6,6 +6,7 @@
#include <linux/types.h> #include <uapi/linux/btf.h> +#include <uapi/linux/bpf.h>
#define BTF_TYPE_EMIT(type) ((void)(type *)0)
@@ -61,10 +62,10 @@ const struct btf_type *btf_type_id_size(const struct btf *btf, * - BTF_SHOW_UNSAFE: skip use of bpf_probe_read() to safely read * data before displaying it. */ -#define BTF_SHOW_COMPACT (1ULL << 0) -#define BTF_SHOW_NONAME (1ULL << 1) -#define BTF_SHOW_PTR_RAW (1ULL << 2) -#define BTF_SHOW_ZERO (1ULL << 3) +#define BTF_SHOW_COMPACT BTF_TRACE_F_COMPACT +#define BTF_SHOW_NONAME BTF_TRACE_F_NONAME +#define BTF_SHOW_PTR_RAW BTF_TRACE_F_PTR_RAW +#define BTF_SHOW_ZERO BTF_TRACE_F_ZERO #define BTF_SHOW_NONEWLINE (1ULL << 32) #define BTF_SHOW_UNSAFE (1ULL << 33)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b134e67..726fee4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3394,6 +3394,36 @@ struct bpf_stack_build_id { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. * + * long bpf_trace_btf(struct btf_ptr *ptr, u32 btf_ptr_size, u32 trace_id, u64 flags) + * Description + * Utilize BTF to trace a representation of *ptr*->ptr, using + * *ptr*->type name or *ptr*->type_id. *ptr*->type_name + * should specify the type *ptr*->ptr points to. Traversing that + * data structure using BTF, the type information and values are + * bpf_trace_printk()ed. Safe copy of the pointer data is + * carried out to avoid kernel crashes during data display. + * Tracing specifies *trace_id* as the id associated with the + * trace event; this can be used to filter trace events + * to show a subset of all traced output, helping to avoid + * the situation where BTF output is intermixed with other + * output. + * + * *flags* is a combination of + * + * **BTF_TRACE_F_COMPACT** + * no formatting around type information + * **BTF_TRACE_F_NONAME** + * no struct/union member names/types + * **BTF_TRACE_F_PTR_RAW** + * show raw (unobfuscated) pointer values; + * equivalent to printk specifier %px. + * **BTF_TRACE_F_ZERO** + * show zero-valued struct/union members; they + * are not displayed by default + * + * Return + * The number of bytes traced, or a negative error in cases of + * failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3538,6 +3568,7 @@ struct bpf_stack_build_id { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(trace_btf), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4446,4 +4477,36 @@ struct bpf_sk_lookup { __u32 local_port; /* Host byte order */ };
+/* + * struct btf_ptr is used for typed pointer display; the + * additional type string/BTF type id are used to render the pointer + * data as the appropriate type via the bpf_trace_btf() helper + * above. A flags field - potentially to specify additional details + * about the BTF pointer (rather than its mode of display) - is + * present for future use. Display flags - BTF_TRACE_F_* - are + * passed to display functions separately. + */ +struct btf_ptr { + void *ptr; + const char *type; + __u32 type_id; + __u32 flags; /* BTF ptr flags; unused at present. */ +}; + +/* + * Flags to control bpf_trace_btf() behaviour. + * - BTF_TRACE_F_COMPACT: no formatting around type information + * - BTF_TRACE_F_NONAME: no struct/union member names/types + * - BTF_TRACE_F_PTR_RAW: show raw (unobfuscated) pointer values; + * equivalent to %px. + * - BTF_TRACE_F_ZERO: show zero-valued struct/union members; they + * are not displayed by default + */ +enum { + BTF_TRACE_F_COMPACT = (1ULL << 0), + BTF_TRACE_F_NONAME = (1ULL << 1), + BTF_TRACE_F_PTR_RAW = (1ULL << 2), + BTF_TRACE_F_ZERO = (1ULL << 3), +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index bde9334..82b3a98 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2214,6 +2214,11 @@ const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) return NULL; }
+const struct bpf_func_proto * __weak bpf_get_trace_btf_proto(void) +{ + return NULL; +} + u64 __weak bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index be43ab3..b9a842b 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -661,6 +661,10 @@ static int __bpf_strtoll(const char *buf, size_t buf_len, u64 flags, if (!perfmon_capable()) return NULL; return bpf_get_trace_printk_proto(); + case BPF_FUNC_trace_btf: + if (!perfmon_capable()) + return NULL; + return bpf_get_trace_btf_proto(); case BPF_FUNC_jiffies64: return &bpf_jiffies64_proto; default: diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 6453a75..92212a1 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -14,8 +14,12 @@ #include <linux/spinlock.h> #include <linux/syscalls.h> #include <linux/error-injection.h> +#include <linux/btf.h> #include <linux/btf_ids.h>
+#include <uapi/linux/bpf.h> +#include <uapi/linux/btf.h> + #include <asm/tlb.h>
#include "trace_probe.h" @@ -555,10 +559,91 @@ static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...) .arg2_type = ARG_CONST_SIZE, };
-const struct bpf_func_proto *bpf_get_trace_printk_proto(void) +#define BTF_TRACE_F_ALL (BTF_TRACE_F_COMPACT | BTF_TRACE_F_NONAME | \ + BTF_TRACE_F_PTR_RAW | BTF_TRACE_F_ZERO) + +BPF_CALL_4(bpf_trace_btf, struct btf_ptr *, ptr, u32, btf_ptr_size, + u32, trace_id, u64, flags) +{ + u8 btf_kind = BTF_KIND_TYPEDEF; + char type_name[KSYM_NAME_LEN]; + const struct btf_type *t; + const struct btf *btf; + const char *btf_type; + s32 btf_id; + int ret; + + if (unlikely(flags & ~(BTF_TRACE_F_ALL))) + return -EINVAL; + + if (btf_ptr_size != sizeof(struct btf_ptr)) + return -EINVAL; + + btf = bpf_get_btf_vmlinux(); + + if (IS_ERR_OR_NULL(btf)) + return PTR_ERR(btf); + + if (ptr->type != NULL) { + ret = copy_from_kernel_nofault(type_name, ptr->type, + sizeof(type_name)); + if (ret) + return ret; + + btf_type = type_name; + + if (strncmp(btf_type, "struct ", strlen("struct ")) == 0) { + btf_kind = BTF_KIND_STRUCT; + btf_type += strlen("struct "); + } else if (strncmp(btf_type, "union ", strlen("union ")) == 0) { + btf_kind = BTF_KIND_UNION; + btf_type += strlen("union "); + } else if (strncmp(btf_type, "enum ", strlen("enum ")) == 0) { + btf_kind = BTF_KIND_ENUM; + btf_type += strlen("enum "); + } + + if (strlen(btf_type) == 0) + return -EINVAL; + + /* + * Assume type specified is a typedef as there's not much + * benefit in specifying int types other than wasting time + * on BTF lookups; we optimize for the most useful path. + * + * Fall back to BTF_KIND_INT if this fails. + */ + btf_id = btf_find_by_name_kind(btf, btf_type, btf_kind); + if (btf_id < 0) + btf_id = btf_find_by_name_kind(btf, btf_type, + BTF_KIND_INT); + } else if (ptr->type_id > 0) + btf_id = ptr->type_id; + else + return -EINVAL; + + if (btf_id > 0) + t = btf_type_by_id(btf, btf_id); + if (btf_id <= 0 || !t) + return -ENOENT; + + return btf_type_trace_show(btf, btf_id, ptr->ptr, trace_id, flags); +} + +static const struct bpf_func_proto bpf_trace_btf_proto = { + .func = bpf_trace_btf, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_MEM, + .arg2_type = ARG_CONST_SIZE, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + +static void bpf_trace_printk_enable(void) { /* - * This program might be calling bpf_trace_printk, + * This program might be calling bpf_trace_[printk|btf], * so enable the associated bpf_trace/bpf_trace_printk event. * Repeat this each time as it is possible a user has * disabled bpf_trace_printk events. By loading a program @@ -567,10 +652,21 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void) */ if (trace_set_clr_event("bpf_trace", "bpf_trace_printk", 1)) pr_warn_ratelimited("could not enable bpf_trace_printk events"); +} +const struct bpf_func_proto *bpf_get_trace_printk_proto(void) +{ + bpf_trace_printk_enable();
return &bpf_trace_printk_proto; }
+const struct bpf_func_proto *bpf_get_trace_btf_proto(void) +{ + bpf_trace_printk_enable(); + + return &bpf_trace_btf_proto; +} + #define MAX_SEQ_PRINTF_VARARGS 12 #define MAX_SEQ_PRINTF_MAX_MEMCPY 6 #define MAX_SEQ_PRINTF_STR_LEN 128 @@ -1139,6 +1235,8 @@ static int bpf_send_signal_common(u32 sig, enum pid_type type) return &bpf_get_current_comm_proto; case BPF_FUNC_trace_printk: return bpf_get_trace_printk_proto(); + case BPF_FUNC_trace_btf: + return bpf_get_trace_btf_proto(); case BPF_FUNC_get_smp_processor_id: return &bpf_get_smp_processor_id_proto; case BPF_FUNC_get_numa_node_id: diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py index 5bfa448..7c7384b 100755 --- a/scripts/bpf_helpers_doc.py +++ b/scripts/bpf_helpers_doc.py @@ -432,6 +432,7 @@ class PrinterHelpers(Printer): 'struct __sk_buff', 'struct sk_msg_md', 'struct xdp_md', + 'struct btf_ptr', ] known_types = { '...', @@ -472,6 +473,7 @@ class PrinterHelpers(Printer): 'struct tcp_request_sock', 'struct udp6_sock', 'struct task_struct', + 'struct btf_ptr', } mapped_types = { 'u8': '__u8', diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index b134e67..726fee4 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3394,6 +3394,36 @@ struct bpf_stack_build_id { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. * + * long bpf_trace_btf(struct btf_ptr *ptr, u32 btf_ptr_size, u32 trace_id, u64 flags) + * Description + * Utilize BTF to trace a representation of *ptr*->ptr, using + * *ptr*->type name or *ptr*->type_id. *ptr*->type_name + * should specify the type *ptr*->ptr points to. Traversing that + * data structure using BTF, the type information and values are + * bpf_trace_printk()ed. Safe copy of the pointer data is + * carried out to avoid kernel crashes during data display. + * Tracing specifies *trace_id* as the id associated with the + * trace event; this can be used to filter trace events + * to show a subset of all traced output, helping to avoid + * the situation where BTF output is intermixed with other + * output. + * + * *flags* is a combination of + * + * **BTF_TRACE_F_COMPACT** + * no formatting around type information + * **BTF_TRACE_F_NONAME** + * no struct/union member names/types + * **BTF_TRACE_F_PTR_RAW** + * show raw (unobfuscated) pointer values; + * equivalent to printk specifier %px. + * **BTF_TRACE_F_ZERO** + * show zero-valued struct/union members; they + * are not displayed by default + * + * Return + * The number of bytes traced, or a negative error in cases of + * failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3538,6 +3568,7 @@ struct bpf_stack_build_id { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(trace_btf), \ /* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4446,4 +4477,36 @@ struct bpf_sk_lookup { __u32 local_port; /* Host byte order */ };
+/* + * struct btf_ptr is used for typed pointer display; the + * additional type string/BTF type id are used to render the pointer + * data as the appropriate type via the bpf_trace_btf() helper + * above. A flags field - potentially to specify additional details + * about the BTF pointer (rather than its mode of display) - is + * present for future use. Display flags - BTF_TRACE_F_* - are + * passed to display functions separately. + */ +struct btf_ptr { + void *ptr; + const char *type; + __u32 type_id; + __u32 flags; /* BTF ptr flags; unused at present. */ +}; + +/* + * Flags to control bpf_trace_btf() behaviour. + * - BTF_TRACE_F_COMPACT: no formatting around type information + * - BTF_TRACE_F_NONAME: no struct/union member names/types + * - BTF_TRACE_F_PTR_RAW: show raw (unobfuscated) pointer values; + * equivalent to %px. + * - BTF_TRACE_F_ZERO: show zero-valued struct/union members; they + * are not displayed by default + */ +enum { + BTF_TRACE_F_COMPACT = (1ULL << 0), + BTF_TRACE_F_NONAME = (1ULL << 1), + BTF_TRACE_F_PTR_RAW = (1ULL << 2), + BTF_TRACE_F_ZERO = (1ULL << 3), +}; + #endif /* _UAPI__LINUX_BPF_H__ */
On Thu, Aug 6, 2020 at 1:24 PM Alan Maguire alan.maguire@oracle.com wrote:
A helper is added to support tracing kernel type information in BPF using the BPF Type Format (BTF). Its signature is
long bpf_trace_btf(struct btf_ptr *ptr, u32 btf_ptr_size, u32 trace_id, u64 flags);
struct btf_ptr * specifies
- a pointer to the data to be traced;
- the BTF id of the type of data pointed to; or
- a string representation of the type of data pointed to
- a flags field is provided for future use; these flags are not to be confused with the BTF_TRACE_F_* flags below that control how the btf_ptr is displayed; the flags member of the struct btf_ptr may be used to disambiguate types in kernel versus module BTF, etc; the main distinction is the flags relate to the type and information needed in identifying it; not how it is displayed.
The helper also specifies a trace id which is set for the bpf_trace_printk tracepoint; this allows BPF programs to filter on specific trace ids, ensuring output does not become mixed between different traced events and hard to read.
For example a BPF program with a struct sk_buff *skb could do the following:
static const char *skb_type = "struct sk_buff"; static struct btf_ptr b = { }; b.ptr = skb; b.type = skb_type; bpf_trace_btf(&b, sizeof(b), 0, 0);
Default output in the trace_pipe looks like this:
<idle>-0 [023] d.s. 1825.778400: bpf_trace_printk: (struct sk_buff){ <idle>-0 [023] d.s. 1825.778409: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778410: bpf_trace_printk: (struct){ <idle>-0 [023] d.s. 1825.778412: bpf_trace_printk: .prev = (struct sk_buff *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778413: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778414: bpf_trace_printk: .dev = (struct net_device *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778416: bpf_trace_printk: .dev_scratch = (long unsigned int)18446628460391432192, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778418: bpf_trace_printk: .rbnode = (struct rb_node){ <idle>-0 [023] d.s. 1825.778419: bpf_trace_printk: .rb_right = (struct rb_node *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: .rb_left = (struct rb_node *)0x000000001658808b, <idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778421: bpf_trace_printk: .list = (struct list_head){ <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: .prev = (struct list_head *)0x00000000b2a3df7e, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778426: bpf_trace_printk: .len = (unsigned int)168, <idle>-0 [023] d.s. 1825.778427: bpf_trace_printk: .mac_len = (__u16)14, <idle>-0 [023] d.s. 1825.778428: bpf_trace_printk: .queue_mapping = (__u16)17, <idle>-0 [023] d.s. 1825.778430: bpf_trace_printk: .head_frag = (__u8)0x1, <idle>-0 [023] d.s. 1825.778431: bpf_trace_printk: .ip_summed = (__u8)0x1, <idle>-0 [023] d.s. 1825.778432: bpf_trace_printk: .l4_hash = (__u8)0x1, <idle>-0 [023] d.s. 1825.778433: bpf_trace_printk: .hash = (__u32)1873247608, <idle>-0 [023] d.s. 1825.778434: bpf_trace_printk: (union){ <idle>-0 [023] d.s. 1825.778435: bpf_trace_printk: .napi_id = (unsigned int)8209, <idle>-0 [023] d.s. 1825.778436: bpf_trace_printk: .sender_cpu = (unsigned int)8209, <idle>-0 [023] d.s. 1825.778436: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778437: bpf_trace_printk: .protocol = (__be16)8, <idle>-0 [023] d.s. 1825.778438: bpf_trace_printk: .transport_header = (__u16)226, <idle>-0 [023] d.s. 1825.778439: bpf_trace_printk: .network_header = (__u16)206, <idle>-0 [023] d.s. 1825.778440: bpf_trace_printk: .mac_header = (__u16)192, <idle>-0 [023] d.s. 1825.778440: bpf_trace_printk: .tail = (sk_buff_data_t)374, <idle>-0 [023] d.s. 1825.778441: bpf_trace_printk: .end = (sk_buff_data_t)1728, <idle>-0 [023] d.s. 1825.778442: bpf_trace_printk: .head = (unsigned char *)0x000000009798cb6b, <idle>-0 [023] d.s. 1825.778443: bpf_trace_printk: .data = (unsigned char *)0x0000000064823282, <idle>-0 [023] d.s. 1825.778444: bpf_trace_printk: .truesize = (unsigned int)2304, <idle>-0 [023] d.s. 1825.778445: bpf_trace_printk: .users = (refcount_t){ <idle>-0 [023] d.s. 1825.778445: bpf_trace_printk: .refs = (atomic_t){ <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: .counter = (int)1, <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778447: bpf_trace_printk: }, <idle>-0 [023] d.s. 1825.778448: bpf_trace_printk: }
Flags modifying display are as follows:
- BTF_TRACE_F_COMPACT: no formatting around type information
- BTF_TRACE_F_NONAME: no struct/union member names/types
- BTF_TRACE_F_PTR_RAW: show raw (unobfuscated) pointer values; equivalent to %px.
- BTF_TRACE_F_ZERO: show zero-valued struct/union members; they are not displayed by default
Signed-off-by: Alan Maguire alan.maguire@oracle.com
include/linux/bpf.h | 1 + include/linux/btf.h | 9 ++-- include/uapi/linux/bpf.h | 63 +++++++++++++++++++++++++ kernel/bpf/core.c | 5 ++ kernel/bpf/helpers.c | 4 ++ kernel/trace/bpf_trace.c | 102 ++++++++++++++++++++++++++++++++++++++++- scripts/bpf_helpers_doc.py | 2 + tools/include/uapi/linux/bpf.h | 63 +++++++++++++++++++++++++ 8 files changed, 243 insertions(+), 6 deletions(-)
[...]
+/*
- struct btf_ptr is used for typed pointer display; the
- additional type string/BTF type id are used to render the pointer
- data as the appropriate type via the bpf_trace_btf() helper
- above. A flags field - potentially to specify additional details
- about the BTF pointer (rather than its mode of display) - is
- present for future use. Display flags - BTF_TRACE_F_* - are
- passed to display functions separately.
- */
+struct btf_ptr {
void *ptr;
const char *type;
__u32 type_id;
__u32 flags; /* BTF ptr flags; unused at present. */
+};
Would it be possible to just utilize __builtin_btf_type_id() to pass BTF type id directly, without this string -> BTF type translation? Please check [0] to see if that would make sense here. Thanks.
[0] https://patchwork.ozlabs.org/project/netdev/patch/20200819194519.3375898-4-a...
[...]
Basic tests verifying various flag combinations for bpf_trace_btf() using a tp_btf program to trace skb data.
Signed-off-by: Alan Maguire alan.maguire@oracle.com --- tools/testing/selftests/bpf/prog_tests/trace_btf.c | 45 ++++++++++++++++++++++ .../selftests/bpf/progs/netif_receive_skb.c | 43 +++++++++++++++++++++ 2 files changed, 88 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_btf.c create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
diff --git a/tools/testing/selftests/bpf/prog_tests/trace_btf.c b/tools/testing/selftests/bpf/prog_tests/trace_btf.c new file mode 100644 index 0000000..e64b69d --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/trace_btf.c @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <test_progs.h> + +#include "netif_receive_skb.skel.h" + +void test_trace_btf(void) +{ + struct netif_receive_skb *skel; + struct netif_receive_skb__bss *bss; + int err, duration = 0; + + skel = netif_receive_skb__open(); + if (CHECK(!skel, "skel_open", "failed to open skeleton\n")) + return; + + err = netif_receive_skb__load(skel); + if (CHECK(err, "skel_load", "failed to load skeleton: %d\n", err)) + goto cleanup; + + bss = skel->bss; + + err = netif_receive_skb__attach(skel); + if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err)) + goto cleanup; + + /* generate receive event */ + system("ping -c 10 127.0.0.1"); + + /* + * Make sure netif_receive_skb program was triggered + * and it set expected return values from bpf_trace_printk()s + * and all tests ran. + */ + if (CHECK(bss->ret <= 0, + "bpf_trace_btf: got return value", + "ret <= 0 %ld test %d\n", bss->ret, bss->num_subtests)) + goto cleanup; + + CHECK(bss->num_subtests != bss->ran_subtests, "check all subtests ran", + "only ran %d of %d tests\n", bss->num_subtests, + bss->ran_subtests); + +cleanup: + netif_receive_skb__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/netif_receive_skb.c b/tools/testing/selftests/bpf/progs/netif_receive_skb.c new file mode 100644 index 0000000..cab764e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/netif_receive_skb.c @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020, Oracle and/or its affiliates. */ +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +long ret = 0; +int num_subtests = 0; +int ran_subtests = 0; + +#define CHECK_TRACE(_p, flags) \ + do { \ + ++num_subtests; \ + if (ret >= 0) { \ + ++ran_subtests; \ + ret = bpf_trace_btf(_p, sizeof(*(_p)), 0, flags);\ + } \ + } while (0) + +/* TRACE_EVENT(netif_receive_skb, + * TP_PROTO(struct sk_buff *skb), + */ +SEC("tp_btf/netif_receive_skb") +int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb) +{ + static const char skb_type[] = "struct sk_buff"; + static struct btf_ptr p = { }; + + p.ptr = skb; + p.type = skb_type; + + CHECK_TRACE(&p, 0); + CHECK_TRACE(&p, BTF_TRACE_F_COMPACT); + CHECK_TRACE(&p, BTF_TRACE_F_NONAME); + CHECK_TRACE(&p, BTF_TRACE_F_PTR_RAW); + CHECK_TRACE(&p, BTF_TRACE_F_ZERO); + CHECK_TRACE(&p, BTF_TRACE_F_COMPACT | BTF_TRACE_F_NONAME | + BTF_TRACE_F_PTR_RAW | BTF_TRACE_F_ZERO); + + return 0; +}
linux-kselftest-mirror@lists.linaro.org