Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
  64359.052209596      NetworkManager    0    1405  smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
	smp_call_function_many_cond+0x1
	smp_call_function+0x39
	on_each_cpu+0x2a
	flush_tlb_kernel_range+0x7b
	__purge_vmap_area_lazy+0x70
	_vm_unmap_aliases.part.42+0xdf
	change_page_attr_set_clr+0x16a
	set_memory_ro+0x26
	bpf_int_jit_compile+0x2f9
	bpf_prog_select_runtime+0xc6
	bpf_prepare_filter+0x523
	sk_attach_filter+0x13
	sock_setsockopt+0x92c
	__sys_setsockopt+0x16a
	__x64_sys_setsockopt+0x20
	do_syscall_64+0x87
	entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry.
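To make that concrete, here's a minimal sketch of the idea - the identifiers (ct_work, ct_work_pending, defer_ipi()) are made up for illustration and don't match the patches verbatim; the entry-side consumer is sketched further down, after the "Kernel entry" section:

  /*
   * Each deferrable IPI gets a type; a per-CPU pending-work bitmask
   * coalesces any number of requests of the same type into a single bit.
   */
  enum ct_work {
          CT_WORK_SYNC_CORE,      /* remote text_poke() wants a sync_core() */
          CT_WORK_KERNEL_TLBI,    /* kernel TLB invalidation */
          CT_WORK_MAX,
  };

  static DEFINE_PER_CPU(atomic_long_t, ct_work_pending);

  /*
   * Housekeeping side: rather than queueing a CSD and sending an IPI, OR in
   * the work bit. In the actual series this is only done when @cpu is
   * currently executing in userspace (per its context tracking state);
   * otherwise a regular IPI is sent as before.
   */
  static void defer_ipi(int cpu, enum ct_work work)
  {
          atomic_long_or(BIT(work), per_cpu_ptr(&ct_work_pending, cpu));
  }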
What about IPIs whose callback takes a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI.
This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry).
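Sticking with the sketch above: a deferred flavour of such a callback would take no argument at all and instead reconstruct - or over-approximate - its input. For the kernel-range TLB flush example, that means flushing everything (the series actually needs a noinstr variant of this, see below):

  /*
   * Sketch only: the IPI'd callback gets a (start, end) range; the deferred
   * flavour takes no argument and conservatively flushes all kernel TLB
   * entries instead.
   */
  static void do_deferred_kernel_tlbi(void)
  {
          __flush_tlb_all();
  }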
Kernel entry vs execution of the deferred operation
===================================================
There is a non-zero length of code that is executed upon kernel entry before the deferred operation can itself be executed (i.e. before we start getting into context_tracking.c proper).
This means one must take extra care about what can happen in the early entry code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core().
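Continuing the sketch from the previous section (hypothetical names again), the entry-side flush would look something like this - it must be noinstr, and it must be reached without executing any instruction a remote text_poke() may have patched, since the deferred sync_core() is precisely what makes such patched code safe to run:

  /*
   * Sketch only: runs early in the user->kernel transition, from the
   * context tracking code. Everything on the path leading here has to be
   * noinstr and free of patchable code (static keys, text_poke() targets).
   */
  static noinstr void ct_work_flush(void)
  {
          unsigned long pending = atomic_long_xchg(this_cpu_ptr(&ct_work_pending), 0);

          if (pending & BIT(CT_WORK_SYNC_CORE))
                  sync_core();            /* serialise against remote text_poke() */
          if (pending & BIT(CT_WORK_KERNEL_TLBI))
                  do_deferred_kernel_tlbi();
  }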
Patches
=======
o Patches 1-9 have been submitted separately and are included for the sake of testing
o Patches 10-14 focus on having objtool detect problematic static key usage in early entry
o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure
o Patch 18 adds text_poke() IPI deferral.
o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral
These ones I'm a lot less confident about, mostly due to lacking instrumentation/verification.
The actual deferred callback is also incomplete as it's not properly noinstr:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section

and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I have *no idea* what a sane implementation would be for Xen so I haven't touched that yet.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2
Testing
=======
Note: this is a different machine than used for v1, because that machine decided to act difficult.
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
  $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
		     -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
		     -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
		     rteval --onlyload --loads-cpulist=$HK_CPUS \
		     --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 30 minutes.
v6.5-rc1 (+ cpumask filtering patches):

  # This is the actual IPI count
  $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
      338 callback=generic_smp_call_function_single_interrupt+0x0

  # These are the different CSDs that caused IPIs
  $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
     9207 func=do_flush_tlb_all
     1116 func=do_sync_core
       62 func=do_kernel_range_flush
        3 func=nohz_full_kick_func

v6.5-rc1 + patches:

  # This is the actual IPI count
  $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
        2 callback=generic_smp_call_function_single_interrupt+0x0

  # These are the different CSDs that caused IPIs
  $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
        2 func=nohz_full_kick_func
The incriminating IPIs are all gone, but note that on the machine I used to test v1 there were still some do_flush_tlb_all() IPIs caused by pcpu_balance_workfn(), since only vmalloc is affected by the deferral mechanism.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the .data..ro_after_init section.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
Revisions
=========
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul)
The new TREE11 case with a 2-bit dynticks counter seems to pass when run against this series.
o Fixed flush_tlb_kernel_range_deferrable() definition
Peter Zijlstra (1):
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
Valentin Schneider (19):
  tracing/filters: Dynamically allocate filter_pred.regex
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Document cpumask filtering
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  context_tracking: Make context_tracking_key __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context-tracking: Introduce work deferral infrastructure
  rcu: Make RCU dynticks counter size configurable
  rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
 Documentation/trace/events.rst                |  14 +
 arch/Kconfig                                  |   9 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/context_tracking_work.h  |  20 ++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   2 +
 arch/x86/kernel/alternative.c                 |  24 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/mm/tlb.c                             |  40 ++-
 include/asm-generic/sections.h                |   5 +
 include/linux/context_tracking.h              |  26 ++
 include/linux/context_tracking_state.h        |  65 +++-
 include/linux/context_tracking_work.h         |  28 ++
 include/linux/jump_label.h                    |   1 +
 include/linux/trace_events.h                  |   1 +
 init/main.c                                   |   1 +
 kernel/context_tracking.c                     |  53 ++-
 kernel/jump_label.c                           |  49 +++
 kernel/rcu/Kconfig                            |  33 ++
 kernel/time/Kconfig                           |   5 +
 kernel/trace/trace_events_filter.c            | 302 ++++++++++++++++--
 mm/vmalloc.c                                  |  19 +-
 tools/objtool/check.c                         |  22 +-
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/special.h       |   2 +
 tools/objtool/special.c                       |   3 +
 .../selftests/rcutorture/configs/rcu/TREE11   |  19 ++
 .../rcutorture/configs/rcu/TREE11.boot        |   1 +
 31 files changed, 695 insertions(+), 64 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
-- 2.31.1
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field.
A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated.
While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline.
No change in behaviour intended.
The kfree()'s were patched via Coccinelle:

  @@
  struct filter_pred *pred;
  @@

  -kfree(pred);
  +free_predicate(pred);
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 64 ++++++++++++++++++------------
 1 file changed, 39 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 1dad64267878c..91fc9990107f1 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -70,15 +70,15 @@ enum filter_pred_fn { };
struct filter_pred { - enum filter_pred_fn fn_num; - u64 val; - u64 val2; - struct regex regex; + struct regex *regex; unsigned short *ops; struct ftrace_event_field *field; - int offset; + u64 val; + u64 val2; + enum filter_pred_fn fn_num; + int offset; int not; - int op; + int op; };
/* @@ -186,6 +186,14 @@ enum { PROCESS_OR = 4, };
+static void free_predicate(struct filter_pred *pred) +{ + if (pred) { + kfree(pred->regex); + kfree(pred); + } +} + /* * Without going into a formal proof, this explains the method that is used in * parsing the logical expressions. @@ -623,7 +631,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds, kfree(inverts); if (prog_stack) { for (i = 0; prog_stack[i].pred; i++) - kfree(prog_stack[i].pred); + free_predicate(prog_stack[i].pred); kfree(prog_stack); } return ERR_PTR(ret); @@ -750,7 +758,7 @@ static int filter_pred_string(struct filter_pred *pred, void *event) char *addr = (char *)(event + pred->offset); int cmp, match;
- cmp = pred->regex.match(addr, &pred->regex, pred->regex.field_len); + cmp = pred->regex->match(addr, pred->regex, pred->regex->field_len);
match = cmp ^ pred->not;
@@ -763,7 +771,7 @@ static __always_inline int filter_pchar(struct filter_pred *pred, char *str) int len;
len = strlen(str) + 1; /* including tailing '\0' */ - cmp = pred->regex.match(str, &pred->regex, len); + cmp = pred->regex->match(str, pred->regex, len);
match = cmp ^ pred->not;
@@ -813,7 +821,7 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event) char *addr = (char *)(event + str_loc); int cmp, match;
- cmp = pred->regex.match(addr, &pred->regex, str_len); + cmp = pred->regex->match(addr, pred->regex, str_len);
match = cmp ^ pred->not;
@@ -836,7 +844,7 @@ static int filter_pred_strrelloc(struct filter_pred *pred, void *event) char *addr = (char *)(&item[1]) + str_loc; int cmp, match;
- cmp = pred->regex.match(addr, &pred->regex, str_len); + cmp = pred->regex->match(addr, pred->regex, str_len);
match = cmp ^ pred->not;
@@ -874,7 +882,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event) { int cmp;
- cmp = pred->regex.match(current->comm, &pred->regex, + cmp = pred->regex->match(current->comm, pred->regex, TASK_COMM_LEN); return cmp ^ pred->not; } @@ -1004,7 +1012,7 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not)
static void filter_build_regex(struct filter_pred *pred) { - struct regex *r = &pred->regex; + struct regex *r = pred->regex; char *search; enum regex_type type = MATCH_FULL;
@@ -1169,7 +1177,7 @@ static void free_prog(struct event_filter *filter) return;
for (i = 0; prog[i].pred; i++) - kfree(prog[i].pred); + free_predicate(prog[i].pred); kfree(prog); }
@@ -1553,9 +1561,12 @@ static int parse_pred(const char *str, void *data, goto err_free; }
- pred->regex.len = len; - strncpy(pred->regex.pattern, str + s, len); - pred->regex.pattern[len] = 0; + pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL); + if (!pred->regex) + goto err_mem; + pred->regex->len = len; + strncpy(pred->regex->pattern, str + s, len); + pred->regex->pattern[len] = 0;
/* This is either a string, or an integer */ } else if (str[i] == ''' || str[i] == '"') { @@ -1597,9 +1608,12 @@ static int parse_pred(const char *str, void *data, goto err_free; }
- pred->regex.len = len; - strncpy(pred->regex.pattern, str + s, len); - pred->regex.pattern[len] = 0; + pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL); + if (!pred->regex) + goto err_mem; + pred->regex->len = len; + strncpy(pred->regex->pattern, str + s, len); + pred->regex->pattern[len] = 0;
filter_build_regex(pred);
@@ -1608,7 +1622,7 @@ static int parse_pred(const char *str, void *data,
} else if (field->filter_type == FILTER_STATIC_STRING) { pred->fn_num = FILTER_PRED_FN_STRING; - pred->regex.field_len = field->size; + pred->regex->field_len = field->size;
} else if (field->filter_type == FILTER_DYN_STRING) { pred->fn_num = FILTER_PRED_FN_STRLOC; @@ -1691,10 +1705,10 @@ static int parse_pred(const char *str, void *data, return i;
err_free: - kfree(pred); + free_predicate(pred); return -EINVAL; err_mem: - kfree(pred); + free_predicate(pred); return -ENOMEM; }
@@ -2287,8 +2301,8 @@ static int ftrace_function_set_filter_pred(struct filter_pred *pred, return ret;
return __ftrace_function_set_filter(pred->op == OP_EQ, - pred->regex.pattern, - pred->regex.len, + pred->regex->pattern, + pred->regex->len, data); }
The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions.
Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask.
The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression.
This enables e.g.:
$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/trace_events.h       |  1 +
 kernel/trace/trace_events_filter.c | 97 +++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 2 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 3930e676436c9..302be73918336 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -807,6 +807,7 @@ enum { FILTER_RDYN_STRING, FILTER_PTR_STRING, FILTER_TRACE_FN, + FILTER_CPUMASK, FILTER_COMM, FILTER_CPU, FILTER_STACKTRACE, diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 91fc9990107f1..cb1863dfa280b 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -64,6 +64,7 @@ enum filter_pred_fn { FILTER_PRED_FN_PCHAR_USER, FILTER_PRED_FN_PCHAR, FILTER_PRED_FN_CPU, + FILTER_PRED_FN_CPUMASK, FILTER_PRED_FN_FUNCTION, FILTER_PRED_FN_, FILTER_PRED_TEST_VISITED, @@ -71,6 +72,7 @@ enum filter_pred_fn {
struct filter_pred { struct regex *regex; + struct cpumask *mask; unsigned short *ops; struct ftrace_event_field *field; u64 val; @@ -94,6 +96,8 @@ struct filter_pred { C(TOO_MANY_OPEN, "Too many '('"), \ C(TOO_MANY_CLOSE, "Too few '('"), \ C(MISSING_QUOTE, "Missing matching quote"), \ + C(MISSING_BRACE_OPEN, "Missing '{'"), \ + C(MISSING_BRACE_CLOSE, "Missing '}'"), \ C(OPERAND_TOO_LONG, "Operand too long"), \ C(EXPECT_STRING, "Expecting string field"), \ C(EXPECT_DIGIT, "Expecting numeric field"), \ @@ -103,6 +107,7 @@ struct filter_pred { C(BAD_SUBSYS_FILTER, "Couldn't find or set field in one of a subsystem's events"), \ C(TOO_MANY_PREDS, "Too many terms in predicate expression"), \ C(INVALID_FILTER, "Meaningless filter expression"), \ + C(INVALID_CPULIST, "Invalid cpulist"), \ C(IP_FIELD_ONLY, "Only 'ip' field is supported for function trace"), \ C(INVALID_VALUE, "Invalid value (did you forget quotes)?"), \ C(NO_FUNCTION, "Function not found"), \ @@ -190,6 +195,7 @@ static void free_predicate(struct filter_pred *pred) { if (pred) { kfree(pred->regex); + kfree(pred->mask); kfree(pred); } } @@ -877,6 +883,26 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event) } }
+/* Filter predicate for cpumask field vs user-provided cpumask */ +static int filter_pred_cpumask(struct filter_pred *pred, void *event) +{ + u32 item = *(u32 *)(event + pred->offset); + int loc = item & 0xffff; + const struct cpumask *mask = (event + loc); + const struct cpumask *cmp = pred->mask; + + switch (pred->op) { + case OP_EQ: + return cpumask_equal(mask, cmp); + case OP_NE: + return !cpumask_equal(mask, cmp); + case OP_BAND: + return cpumask_intersects(mask, cmp); + default: + return 0; + } +} + /* Filter predicate for COMM. */ static int filter_pred_comm(struct filter_pred *pred, void *event) { @@ -1244,8 +1270,12 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
int filter_assign_type(const char *type) { - if (strstr(type, "__data_loc") && strstr(type, "char")) - return FILTER_DYN_STRING; + if (strstr(type, "__data_loc")) { + if (strstr(type, "char")) + return FILTER_DYN_STRING; + if (strstr(type, "cpumask_t")) + return FILTER_CPUMASK; + }
if (strstr(type, "__rel_loc") && strstr(type, "char")) return FILTER_RDYN_STRING; @@ -1357,6 +1387,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event) return filter_pred_pchar(pred, event); case FILTER_PRED_FN_CPU: return filter_pred_cpu(pred, event); + case FILTER_PRED_FN_CPUMASK: + return filter_pred_cpumask(pred, event); case FILTER_PRED_FN_FUNCTION: return filter_pred_function(pred, event); case FILTER_PRED_TEST_VISITED: @@ -1568,6 +1600,67 @@ static int parse_pred(const char *str, void *data, strncpy(pred->regex->pattern, str + s, len); pred->regex->pattern[len] = 0;
+ } else if (!strncmp(str + i, "CPUS", 4)) { + unsigned int maskstart; + char *tmp; + + switch (field->filter_type) { + case FILTER_CPUMASK: + break; + default: + parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i); + goto err_free; + } + + switch (op) { + case OP_EQ: + case OP_NE: + case OP_BAND: + break; + default: + parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i); + goto err_free; + } + + /* Skip CPUS */ + i += 4; + if (str[i++] != '{') { + parse_error(pe, FILT_ERR_MISSING_BRACE_OPEN, pos + i); + goto err_free; + } + maskstart = i; + + /* Walk the cpulist until closing } */ + for (; str[i] && str[i] != '}'; i++); + if (str[i] != '}') { + parse_error(pe, FILT_ERR_MISSING_BRACE_CLOSE, pos + i); + goto err_free; + } + + if (maskstart == i) { + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i); + goto err_free; + } + + /* Copy the cpulist between { and } */ + tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL); + strscpy(tmp, str + maskstart, (i - maskstart) + 1); + + pred->mask = kzalloc(cpumask_size(), GFP_KERNEL); + if (!pred->mask) + goto err_mem; + + /* Now parse it */ + if (cpulist_parse(tmp, pred->mask)) { + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i); + goto err_free; + } + + /* Move along */ + i++; + if (field->filter_type == FILTER_CPUMASK) + pred->fn_num = FILTER_PRED_FN_CPUMASK; + /* This is either a string, or an integer */ } else if (str[i] == ''' || str[i] == '"') { char q = str[i];
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
int filter_assign_type(const char *type) {
- if (strstr(type, "__data_loc") && strstr(type, "char"))
return FILTER_DYN_STRING;
- if (strstr(type, "__data_loc")) {
if (strstr(type, "char"))
return FILTER_DYN_STRING;
if (strstr(type, "cpumask_t"))
return FILTER_CPUMASK;
}
The closing bracket has the wrong indentation.
/* Copy the cpulist between { and } */
tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
strscpy(tmp, str + maskstart, (i - maskstart) + 1);
Need to check kmalloc() failure? And also free tmp?
pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!pred->mask)
goto err_mem;
/* Now parse it */
if (cpulist_parse(tmp, pred->mask)) {
parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
goto err_free;
}
/* Move along */
i++;
if (field->filter_type == FILTER_CPUMASK)
pred->fn_num = FILTER_PRED_FN_CPUMASK;
On 26/07/23 12:41, Josh Poimboeuf wrote:
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
int filter_assign_type(const char *type) {
- if (strstr(type, "__data_loc") && strstr(type, "char"))
return FILTER_DYN_STRING;
- if (strstr(type, "__data_loc")) {
if (strstr(type, "char"))
return FILTER_DYN_STRING;
if (strstr(type, "cpumask_t"))
return FILTER_CPUMASK;
}
The closing bracket has the wrong indentation.
/* Copy the cpulist between { and } */
tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
strscpy(tmp, str + maskstart, (i - maskstart) + 1);
Need to check kmalloc() failure? And also free tmp?
Heh, indeed, shoddy that :-)
Thanks!
pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!pred->mask)
goto err_mem;
/* Now parse it */
if (cpulist_parse(tmp, pred->mask)) {
parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
goto err_free;
}
/* Move along */
i++;
if (field->filter_type == FILTER_CPUMASK)
pred->fn_num = FILTER_PRED_FN_CPUMASK;
-- Josh
On Wed, 26 Jul 2023 12:41:48 -0700 Josh Poimboeuf jpoimboe@kernel.org wrote:
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
int filter_assign_type(const char *type) {
- if (strstr(type, "__data_loc") && strstr(type, "char"))
return FILTER_DYN_STRING;
- if (strstr(type, "__data_loc")) {
if (strstr(type, "char"))
return FILTER_DYN_STRING;
if (strstr(type, "cpumask_t"))
return FILTER_CPUMASK;
}
The closing bracket has the wrong indentation.
/* Copy the cpulist between { and } */
tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
strscpy(tmp, str + maskstart, (i - maskstart) + 1);
Need to check kmalloc() failure? And also free tmp?
I came to state the same thing.
Also, when you do an empty for loop:
for (; str[i] && str[i] != '}'; i++);
Always put the semicolon on the next line, otherwise it is really easy to think that the next line is part of the for loop. That is, instead of the above, do:
	for (; str[i] && str[i] != '}'; i++)
		;
-- Steve
pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!pred->mask)
goto err_mem;
/* Now parse it */
if (cpulist_parse(tmp, pred->mask)) {
parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
goto err_free;
}
/* Move along */
i++;
if (field->filter_type == FILTER_CPUMASK)
pred->fn_num = FILTER_PRED_FN_CPUMASK;
On 29/07/23 15:09, Steven Rostedt wrote:
On Wed, 26 Jul 2023 12:41:48 -0700 Josh Poimboeuf jpoimboe@kernel.org wrote:
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
int filter_assign_type(const char *type) {
- if (strstr(type, "__data_loc") && strstr(type, "char"))
return FILTER_DYN_STRING;
- if (strstr(type, "__data_loc")) {
if (strstr(type, "char"))
return FILTER_DYN_STRING;
if (strstr(type, "cpumask_t"))
return FILTER_CPUMASK;
}
The closing bracket has the wrong indentation.
/* Copy the cpulist between { and } */
tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
strscpy(tmp, str + maskstart, (i - maskstart) + 1);
Need to check kmalloc() failure? And also free tmp?
I came to state the same thing.
Also, when you do an empty for loop:
for (; str[i] && str[i] != '}'; i++);
Always put the semicolon on the next line, otherwise it is really easy to think that the next line is part of the for loop. That is, instead of the above, do:
	for (; str[i] && str[i] != '}'; i++)
		;
Interestingly I don't think I've ever encountered that variant, usually having an empty line (which this lacks) and the indentation level is enough to identify these - regardless, I'll change it.
-- Steve
pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!pred->mask)
goto err_mem;
/* Now parse it */
if (cpulist_parse(tmp, pred->mask)) {
parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
goto err_free;
}
/* Move along */
i++;
if (field->filter_type == FILTER_CPUMASK)
pred->fn_num = FILTER_PRED_FN_CPUMASK;
On Mon, 31 Jul 2023 12:19:51 +0100 Valentin Schneider vschneid@redhat.com wrote:
Also, when you do an empty for loop:
for (; str[i] && str[i] != '}'; i++);
Always put the semicolon on the next line, otherwise it is really easy to think that the next line is part of the for loop. That is, instead of the above, do:
	for (; str[i] && str[i] != '}'; i++)
		;
Interestingly I don't think I've ever encountered that variant, usually having an empty line (which this lacks) and the indentation level is enough to identify these - regardless, I'll change it.
Do a "git grep -B1 -e '^\s*;\s*$'"
You'll find that it is quite common.
Thanks,
-- Steve
Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...
Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids.
Allow these to be filtered by a user-provided cpumask, which enables e.g.:
$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
NOTE: I went with an implicit cpumask conversion of the event field, as
AFAICT predicate_parse() does not support parsing the application of a
function to a field (e.g. 'CPUS(target_cpu) & CPUS{2,4,6,8-32}')
---
 kernel/trace/trace_events_filter.c | 92 ++++++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 11 deletions(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index cb1863dfa280b..1e14f801685a8 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -46,15 +46,19 @@ static const char * ops[] = { OPS }; enum filter_pred_fn { FILTER_PRED_FN_NOP, FILTER_PRED_FN_64, + FILTER_PRED_FN_64_CPUMASK, FILTER_PRED_FN_S64, FILTER_PRED_FN_U64, FILTER_PRED_FN_32, + FILTER_PRED_FN_32_CPUMASK, FILTER_PRED_FN_S32, FILTER_PRED_FN_U32, FILTER_PRED_FN_16, + FILTER_PRED_FN_16_CPUMASK, FILTER_PRED_FN_S16, FILTER_PRED_FN_U16, FILTER_PRED_FN_8, + FILTER_PRED_FN_8_CPUMASK, FILTER_PRED_FN_S8, FILTER_PRED_FN_U8, FILTER_PRED_FN_COMM, @@ -643,6 +647,39 @@ predicate_parse(const char *str, int nr_parens, int nr_preds, return ERR_PTR(ret); }
+static inline int +do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp) +{ + switch (op) { + case OP_EQ: + return cpumask_equal(mask, cmp); + case OP_NE: + return !cpumask_equal(mask, cmp); + case OP_BAND: + return cpumask_intersects(mask, cmp); + default: + return 0; + } +} + +/* Optimisation of do_filter_cpumask() for scalar fields */ +static inline int +do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask) +{ + switch (op) { + case OP_EQ: + return cpumask_test_cpu(cpu, mask) && + cpumask_nth(1, mask) >= nr_cpu_ids; + case OP_NE: + return !cpumask_test_cpu(cpu, mask) || + cpumask_nth(1, mask) < nr_cpu_ids; + case OP_BAND: + return cpumask_test_cpu(cpu, mask); + default: + return 0; + } +} + enum pred_cmp_types { PRED_CMP_TYPE_NOP, PRED_CMP_TYPE_LT, @@ -686,6 +723,18 @@ static int filter_pred_##type(struct filter_pred *pred, void *event) \ } \ }
+#define DEFINE_CPUMASK_COMPARISON_PRED(size) \ +static int filter_pred_##size##_cpumask(struct filter_pred *pred, void *event) \ +{ \ + u##size *addr = (u##size *)(event + pred->offset); \ + unsigned int cpu = *addr; \ + \ + if (cpu >= nr_cpu_ids) \ + return 0; \ + \ + return do_filter_scalar_cpumask(pred->op, cpu, pred->mask); \ +} + #define DEFINE_EQUALITY_PRED(size) \ static int filter_pred_##size(struct filter_pred *pred, void *event) \ { \ @@ -707,6 +756,11 @@ DEFINE_COMPARISON_PRED(u16); DEFINE_COMPARISON_PRED(s8); DEFINE_COMPARISON_PRED(u8);
+DEFINE_CPUMASK_COMPARISON_PRED(64); +DEFINE_CPUMASK_COMPARISON_PRED(32); +DEFINE_CPUMASK_COMPARISON_PRED(16); +DEFINE_CPUMASK_COMPARISON_PRED(8); + DEFINE_EQUALITY_PRED(64); DEFINE_EQUALITY_PRED(32); DEFINE_EQUALITY_PRED(16); @@ -891,16 +945,7 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event) const struct cpumask *mask = (event + loc); const struct cpumask *cmp = pred->mask;
- switch (pred->op) { - case OP_EQ: - return cpumask_equal(mask, cmp); - case OP_NE: - return !cpumask_equal(mask, cmp); - case OP_BAND: - return cpumask_intersects(mask, cmp); - default: - return 0; - } + return do_filter_cpumask(pred->op, mask, cmp); }
/* Filter predicate for COMM. */ @@ -1351,24 +1396,32 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event) switch (pred->fn_num) { case FILTER_PRED_FN_64: return filter_pred_64(pred, event); + case FILTER_PRED_FN_64_CPUMASK: + return filter_pred_64_cpumask(pred, event); case FILTER_PRED_FN_S64: return filter_pred_s64(pred, event); case FILTER_PRED_FN_U64: return filter_pred_u64(pred, event); case FILTER_PRED_FN_32: return filter_pred_32(pred, event); + case FILTER_PRED_FN_32_CPUMASK: + return filter_pred_32_cpumask(pred, event); case FILTER_PRED_FN_S32: return filter_pred_s32(pred, event); case FILTER_PRED_FN_U32: return filter_pred_u32(pred, event); case FILTER_PRED_FN_16: return filter_pred_16(pred, event); + case FILTER_PRED_FN_16_CPUMASK: + return filter_pred_16_cpumask(pred, event); case FILTER_PRED_FN_S16: return filter_pred_s16(pred, event); case FILTER_PRED_FN_U16: return filter_pred_u16(pred, event); case FILTER_PRED_FN_8: return filter_pred_8(pred, event); + case FILTER_PRED_FN_8_CPUMASK: + return filter_pred_8_cpumask(pred, event); case FILTER_PRED_FN_S8: return filter_pred_s8(pred, event); case FILTER_PRED_FN_U8: @@ -1606,6 +1659,7 @@ static int parse_pred(const char *str, void *data,
switch (field->filter_type) { case FILTER_CPUMASK: + case FILTER_OTHER: break; default: parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i); @@ -1658,8 +1712,24 @@ static int parse_pred(const char *str, void *data,
/* Move along */ i++; - if (field->filter_type == FILTER_CPUMASK) + if (field->filter_type == FILTER_CPUMASK) { pred->fn_num = FILTER_PRED_FN_CPUMASK; + } else { + switch (field->size) { + case 8: + pred->fn_num = FILTER_PRED_FN_64_CPUMASK; + break; + case 4: + pred->fn_num = FILTER_PRED_FN_32_CPUMASK; + break; + case 2: + pred->fn_num = FILTER_PRED_FN_16_CPUMASK; + break; + case 1: + pred->fn_num = FILTER_PRED_FN_8_CPUMASK; + break; + } + }
/* This is either a string, or an integer */ } else if (str[i] == ''' || str[i] == '"') {
The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event).
A previous commit added filtering scalar fields by a user-given cpumask, make this work with the CPU common field as well.
This enables doing things like
  $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
		     -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 1e14f801685a8..3009d0c61b532 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -68,6 +68,7 @@ enum filter_pred_fn { FILTER_PRED_FN_PCHAR_USER, FILTER_PRED_FN_PCHAR, FILTER_PRED_FN_CPU, + FILTER_PRED_FN_CPU_CPUMASK, FILTER_PRED_FN_CPUMASK, FILTER_PRED_FN_FUNCTION, FILTER_PRED_FN_, @@ -937,6 +938,14 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event) } }
+/* Filter predicate for current CPU vs user-provided cpumask */ +static int filter_pred_cpu_cpumask(struct filter_pred *pred, void *event) +{ + int cpu = raw_smp_processor_id(); + + return do_filter_scalar_cpumask(pred->op, cpu, pred->mask); +} + /* Filter predicate for cpumask field vs user-provided cpumask */ static int filter_pred_cpumask(struct filter_pred *pred, void *event) { @@ -1440,6 +1449,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event) return filter_pred_pchar(pred, event); case FILTER_PRED_FN_CPU: return filter_pred_cpu(pred, event); + case FILTER_PRED_FN_CPU_CPUMASK: + return filter_pred_cpu_cpumask(pred, event); case FILTER_PRED_FN_CPUMASK: return filter_pred_cpumask(pred, event); case FILTER_PRED_FN_FUNCTION: @@ -1659,6 +1670,7 @@ static int parse_pred(const char *str, void *data,
switch (field->filter_type) { case FILTER_CPUMASK: + case FILTER_CPU: case FILTER_OTHER: break; default: @@ -1714,6 +1726,8 @@ static int parse_pred(const char *str, void *data, i++; if (field->filter_type == FILTER_CPUMASK) { pred->fn_num = FILTER_PRED_FN_CPUMASK; + } else if (field->filter_type == FILTER_CPU) { + pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK; } else { switch (field->size) { case 8:
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask.
Reuse do_filter_scalar_cpumask() when the input mask has a weight of one.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 35 +++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 3009d0c61b532..2fe65ddeb34ef 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -70,6 +70,7 @@ enum filter_pred_fn { FILTER_PRED_FN_CPU, FILTER_PRED_FN_CPU_CPUMASK, FILTER_PRED_FN_CPUMASK, + FILTER_PRED_FN_CPUMASK_CPU, FILTER_PRED_FN_FUNCTION, FILTER_PRED_FN_, FILTER_PRED_TEST_VISITED, @@ -957,6 +958,22 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event) return do_filter_cpumask(pred->op, mask, cmp); }
+/* Filter predicate for cpumask field vs user-provided scalar */ +static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event) +{ + u32 item = *(u32 *)(event + pred->offset); + int loc = item & 0xffff; + const struct cpumask *mask = (event + loc); + unsigned int cpu = pred->val; + + /* + * This inverts the usual usage of the function (field is first element, + * user parameter is second), but that's fine because the (scalar, mask) + * operations used are symmetric. + */ + return do_filter_scalar_cpumask(pred->op, cpu, mask); +} + /* Filter predicate for COMM. */ static int filter_pred_comm(struct filter_pred *pred, void *event) { @@ -1453,6 +1470,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event) return filter_pred_cpu_cpumask(pred, event); case FILTER_PRED_FN_CPUMASK: return filter_pred_cpumask(pred, event); + case FILTER_PRED_FN_CPUMASK_CPU: + return filter_pred_cpumask_cpu(pred, event); case FILTER_PRED_FN_FUNCTION: return filter_pred_function(pred, event); case FILTER_PRED_TEST_VISITED: @@ -1666,6 +1685,7 @@ static int parse_pred(const char *str, void *data,
} else if (!strncmp(str + i, "CPUS", 4)) { unsigned int maskstart; + bool single; char *tmp;
switch (field->filter_type) { @@ -1724,8 +1744,21 @@ static int parse_pred(const char *str, void *data,
/* Move along */ i++; + + /* + * Optimisation: if the user-provided mask has a weight of one + * then we can treat it as a scalar input. + */ + single = cpumask_weight(pred->mask) == 1; + if (single && field->filter_type == FILTER_CPUMASK) { + pred->val = cpumask_first(pred->mask); + kfree(pred->mask); + } + if (field->filter_type == FILTER_CPUMASK) { - pred->fn_num = FILTER_PRED_FN_CPUMASK; + pred->fn_num = single ? + FILTER_PRED_FN_CPUMASK_CPU : + FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK; } else {
On Thu, 20 Jul 2023 17:30:41 +0100 Valentin Schneider vschneid@redhat.com wrote:
/* Move along */ i++;
/*
* Optimisation: if the user-provided mask has a weight of one
* then we can treat it as a scalar input.
*/
single = cpumask_weight(pred->mask) == 1;
if (single && field->filter_type == FILTER_CPUMASK) {
pred->val = cpumask_first(pred->mask);
kfree(pred->mask);
Don't we need: pred->mask = NULL;
here, or the free_predicate() will cause a double free?
-- Steve
}
On 29/07/23 15:34, Steven Rostedt wrote:
On Thu, 20 Jul 2023 17:30:41 +0100 Valentin Schneider vschneid@redhat.com wrote:
/* Move along */ i++;
/*
* Optimisation: if the user-provided mask has a weight of one
* then we can treat it as a scalar input.
*/
single = cpumask_weight(pred->mask) == 1;
if (single && field->filter_type == FILTER_CPUMASK) {
pred->val = cpumask_first(pred->mask);
kfree(pred->mask);
Don't we need: pred->mask = NULL;
here, or the free_predicate() will cause a double free?
We do, thanks for spotting this.
-- Steve
}
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask.
When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2fe65ddeb34ef..54d642fabb7f1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 	 * then we can treat it as a scalar input.
 	 */
 	single = cpumask_weight(pred->mask) == 1;
-	if (single && field->filter_type == FILTER_CPUMASK) {
+	if (single && field->filter_type != FILTER_CPU) {
 		pred->val = cpumask_first(pred->mask);
 		kfree(pred->mask);
 	}
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK;
 	} else if (field->filter_type == FILTER_CPU) {
 		pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+	} else if (single) {
+		pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+		pred->fn_num = select_comparison_fn(pred->op, field->size, false);
+		if (pred->op == OP_NE)
+			pred->not = 1;
 	} else {
 		switch (field->size) {
 		case 8:
On Thu, 20 Jul 2023 17:30:42 +0100 Valentin Schneider vschneid@redhat.com wrote:
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask.
When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand.
Suggested-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Valentin Schneider vschneid@redhat.com
kernel/trace/trace_events_filter.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 2fe65ddeb34ef..54d642fabb7f1 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data, * then we can treat it as a scalar input. */ single = cpumask_weight(pred->mask) == 1;
if (single && field->filter_type == FILTER_CPUMASK) {
if (single && field->filter_type != FILTER_CPU) {
	pred->val = cpumask_first(pred->mask);
	kfree(pred->mask);
}
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
-- Steve
pred->fn_num = select_comparison_fn(pred->op, field->size, false);
if (pred->op == OP_NE)
		pred->not = 1;
	} else {
		switch (field->size) {
		case 8:
On 29/07/23 15:55, Steven Rostedt wrote:
On Thu, 20 Jul 2023 17:30:42 +0100 Valentin Schneider vschneid@redhat.com wrote:
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask.
When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand.
Suggested-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Valentin Schneider vschneid@redhat.com
kernel/trace/trace_events_filter.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 2fe65ddeb34ef..54d642fabb7f1 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data, * then we can treat it as a scalar input. */ single = cpumask_weight(pred->mask) == 1;
if (single && field->filter_type == FILTER_CPUMASK) {
if (single && field->filter_type != FILTER_CPU) {
	pred->val = cpumask_first(pred->mask);
	kfree(pred->mask);
}
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
That's neater, thanks!
-- Steve
pred->fn_num = select_comparison_fn(pred->op, field->size, false);
if (pred->op == OP_NE)
		pred->not = 1;
	} else {
		switch (field->size) {
		case 8:
On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
Heh. Those are not equivalent. The right way to write this is:
	if (pred->op == OP_BAND)
		pred->op = OP_EQ;
regards, dan carpenter
On Mon, 31 Jul 2023 15:07:52 +0300 Dan Carpenter dan.carpenter@linaro.org wrote:
On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
Heh. Those are not equivalent. The right way to write this is:
You mean because of my typo?
	if (pred->op == OP_BAND)
		pred->op = OP_EQ;
But sure, I'm fine with that, and it's probably more readable too.
-- Steve
On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
On Mon, 31 Jul 2023 15:07:52 +0300 Dan Carpenter dan.carpenter@linaro.org wrote:
On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
Heh. Those are not equivalent. The right way to write this is:
You mean because of my typo?
No, I hadn't seen the s/pred/pret/ typo. Your code does:
	if (pred->op != OP_BAND)
		pred->op = true;
	else
		pred->op = OP_EQ;
Really we should probably trigger a static checker warning any time someone does a comparison operation as part of an "x = comparison ?: bar;". Years ago, someone asked me to do that with regards to error codes like:
return ret < 0 ?: -EINVAL;
but I don't remember the results.
regards, dan carpenter
On 31/07/23 19:03, Dan Carpenter wrote:
On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
On Mon, 31 Jul 2023 15:07:52 +0300 Dan Carpenter dan.carpenter@linaro.org wrote:
On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data, FILTER_PRED_FN_CPUMASK; } else if (field->filter_type == FILTER_CPU) { pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else if (single) {
pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
Heh. Those are not equivalent. The right way to write this is:
You mean because of my typo?
No, I hadn't seen the s/pred/pret/ typo. Your code does:
	if (pred->op != OP_BAND)
		pred->op = true;
	else
		pred->op = OP_EQ;
Really we should probably trigger a static checker warning any time someone does a comparison operation as part of an "x = comparison ?: bar;". Years ago, someone asked me to do that with regards to error codes like:
return ret < 0 ?: -EINVAL;
but I don't remember the results.
FWIW this is caught by GCC:
  error: the omitted middle operand in ?: will always be ‘true’, suggest explicit middle operand [-Werror=parentheses]
    pred->op = pred->op != OP_BAND ? : OP_EQ;
regards, dan carpenter
On Mon, 31 Jul 2023 19:03:04 +0300 Dan Carpenter dan.carpenter@linaro.org wrote:
Nit, the above can be written as:
pred->op = pret->op != OP_BAND ? : OP_EQ;
Heh. Those are not equivalent. The right way to write this is:
You mean because of my typo?
No, I hadn't seen the s/pred/pret/ typo. Your code does:
	if (pred->op != OP_BAND)
		pred->op = true;
	else
		pred->op = OP_EQ;
Ah, for some reason I was thinking the ? : was just a nop, but I guess it is to assign the cond value :-/
But of course every place I've done that, it was the condition value I wanted, which was the same as the value being assigned.
Thanks,
-- Steve
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask.
In this case we can directly re-use filter_pred_cpu(), we just need to transform '&' into '==' before executing it.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 54d642fabb7f1..fd72dacc5d1b8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 	 * then we can treat it as a scalar input.
 	 */
 	single = cpumask_weight(pred->mask) == 1;
-	if (single && field->filter_type != FILTER_CPU) {
+	if (single) {
 		pred->val = cpumask_first(pred->mask);
 		kfree(pred->mask);
 	}
@@ -1760,7 +1760,12 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK_CPU :
 				FILTER_PRED_FN_CPUMASK;
 	} else if (field->filter_type == FILTER_CPU) {
-		pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+		if (single) {
+			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+			pred->fn_num = FILTER_PRED_FN_CPU;
+		} else {
+			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+		}
 	} else if (single) {
 		pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
 		pred->fn_num = select_comparison_fn(pred->op, field->size, false);
Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index fd72dacc5d1b8..3a529214a21b7 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -667,6 +667,25 @@ do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp) /* Optimisation of do_filter_cpumask() for scalar fields */ static inline int do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask) +{ + /* + * Per the weight-of-one cpumask optimisations, the mask passed in this + * function has a weight >= 2, so it is never equal to a single scalar. + */ + switch (op) { + case OP_EQ: + return false; + case OP_NE: + return true; + case OP_BAND: + return cpumask_test_cpu(cpu, mask); + default: + return 0; + } +} + +static inline int +do_filter_cpumask_scalar(int op, const struct cpumask *mask, unsigned int cpu) { switch (op) { case OP_EQ: @@ -966,12 +985,7 @@ static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event) const struct cpumask *mask = (event + loc); unsigned int cpu = pred->val;
- /* - * This inverts the usual usage of the function (field is first element, - * user parameter is second), but that's fine because the (scalar, mask) - * operations used are symmetric. - */ - return do_filter_scalar_cpumask(pred->op, cpu, mask); + return do_filter_cpumask_scalar(pred->op, mask, cpu); }
/* Filter predicate for COMM. */
Cpumask, scalar and CPU fields can now be filtered by a user-provided cpumask, document the syntax.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 Documentation/trace/events.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f5fcb8e1218f6..34108d5a55b41 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -219,6 +219,20 @@ the function "security_prepare_creds" and less than the end of that function.
 The ".function" postfix can only be attached to values of size long, and can
 only be compared with "==" or "!=".
 
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+  CPUS{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+  target_cpu & CPUS{17-42}
+
 5.2 Setting filters
 -------------------
From: Peter Zijlstra peterz@infradead.org
When a static_key is marked ro_after_init, its state will never change (after init), therefore jump_label_update() will never need to iterate the entries, and thus module load won't actually need to track this -- avoiding the static_key::next write.
Therefore, mark these keys such that jump_label_add_module() might recognise them and avoid the modification.
Use the special state: 'static_key_linked(key) && !static_key_mod(key)' to denote such keys.
Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-as...
NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
@Peter: I've barely touched this patch, it's just been writing a comment and
fixing benign compilation issues, so credit's all yours really!
---
 include/asm-generic/sections.h |  5 ++++
 include/linux/jump_label.h     |  1 +
 init/main.c                    |  1 +
 kernel/jump_label.c            | 49 ++++++++++++++++++++++++++++++++++
 4 files changed, 56 insertions(+)
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index db13bb620f527..c768de6f19a9a 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -180,6 +180,11 @@ static inline bool is_kernel_rodata(unsigned long addr) addr < (unsigned long)__end_rodata; }
+static inline bool is_kernel_ro_after_init(unsigned long addr) +{ + return addr >= (unsigned long)__start_ro_after_init && + addr < (unsigned long)__end_ro_after_init; +} /** * is_kernel_inittext - checks if the pointer address is located in the * .init.text section diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index f0a949b7c9733..88ef9e776af8d 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -216,6 +216,7 @@ extern struct jump_entry __start___jump_table[]; extern struct jump_entry __stop___jump_table[];
extern void jump_label_init(void); +extern void jump_label_ro(void); extern void jump_label_lock(void); extern void jump_label_unlock(void); extern void arch_jump_label_transform(struct jump_entry *entry, diff --git a/init/main.c b/init/main.c index ad920fac325c3..cb5304ca18f4d 100644 --- a/init/main.c +++ b/init/main.c @@ -1403,6 +1403,7 @@ static void mark_readonly(void) * insecure pages which are W+X. */ rcu_barrier(); + jump_label_ro(); mark_rodata_ro(); rodata_test(); } else diff --git a/kernel/jump_label.c b/kernel/jump_label.c index d9c822bbffb8d..661ef74dee9b7 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -530,6 +530,45 @@ void __init jump_label_init(void) cpus_read_unlock(); }
+static inline bool static_key_sealed(struct static_key *key) +{ + return (key->type & JUMP_TYPE_LINKED) && !(key->type & ~JUMP_TYPE_MASK); +} + +static inline void static_key_seal(struct static_key *key) +{ + unsigned long type = key->type & JUMP_TYPE_TRUE; + key->type = JUMP_TYPE_LINKED | type; +} + +void jump_label_ro(void) +{ + struct jump_entry *iter_start = __start___jump_table; + struct jump_entry *iter_stop = __stop___jump_table; + struct jump_entry *iter; + + if (WARN_ON_ONCE(!static_key_initialized)) + return; + + cpus_read_lock(); + jump_label_lock(); + + for (iter = iter_start; iter < iter_stop; iter++) { + struct static_key *iterk = jump_entry_key(iter); + + if (!is_kernel_ro_after_init((unsigned long)iterk)) + continue; + + if (static_key_sealed(iterk)) + continue; + + static_key_seal(iterk); + } + + jump_label_unlock(); + cpus_read_unlock(); +} + #ifdef CONFIG_MODULES
enum jump_label_type jump_label_init_type(struct jump_entry *entry) @@ -650,6 +689,15 @@ static int jump_label_add_module(struct module *mod) static_key_set_entries(key, iter); continue; } + + /* + * If the key was sealed at init, then there's no need to keep a + * a reference to its module entries - just patch them now and + * be done with it. + */ + if (static_key_sealed(key)) + goto do_poke; + jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL); if (!jlm) return -ENOMEM; @@ -675,6 +723,7 @@ static int jump_label_add_module(struct module *mod) static_key_set_linked(key);
/* Only update if we've changed from our initial state */ +do_poke: if (jump_label_type(iter) != jump_label_init_type(iter)) __jump_label_update(key, iter, iter_stop, true); }
On Thu, Jul 20, 2023 at 05:30:46PM +0100, Valentin Schneider wrote:
From: Peter Zijlstra peterz@infradead.org
When a static_key is marked ro_after_init, its state will never change (after init), therefore jump_label_update() will never need to iterate the entries, and thus module load won't actually need to track this -- avoiding the static_key::next write.
Therefore, mark these keys such that jump_label_add_module() might recognise them and avoid the modification.
Use the special state: 'static_key_linked(key) && !static_key_mod(key)' to denote such keys.
Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-as... NOT-Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Valentin Schneider vschneid@redhat.com
@Peter: I've barely touched this patch, it's just been writing a comment and fixing benign compilation issues, so credit's all yours really!
Ah, it works? Excellent! You can remove the NOT from the SoB then ;-)
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 8936a05f0e5ac..d308330f2910e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com
tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 8936a05f0e5ac..d308330f2910e 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn) list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) { if (!target->sec->noinstr) {
WARN("pv_ops[%d]: %s", idx, target->name);
WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name); file->pv_ops[idx].clean = false;
This is an improvement, though I think it still results in two warnings, with the second not-so-useful warning happening in validate_call().
Ideally it would only show a single warning, I guess that would need a little bit of restructuring the code.
On 28/07/23 10:33, Josh Poimboeuf wrote:
On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com
tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 8936a05f0e5ac..d308330f2910e 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) { if (!target->sec->noinstr) {
WARN("pv_ops[%d]: %s", idx, target->name);
WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name); file->pv_ops[idx].clean = false;
This is an improvement, though I think it still results in two warnings, with the second not-so-useful warning happening in validate_call().
Ideally it would only show a single warning, I guess that would need a little bit of restructuring the code.
You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
Interestingly the second one doesn't seem to have triggered the "pv_ops" bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will return NULL.
Trickling down the file yields:
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
In my case (!PARAVIRT_XXL) pv_ops should look like: [0]: .cpu.io_delay [1]: .mmu.flush_tlb_user()
so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because it uses arch_dest_reloc_offset().
If I use the above to fix up validate_call(), would we still need pv_call_dest() & co?
-- Josh
On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
Interestingly the second one doesn't seem to have triggered the "pv_ops" bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will return NULL.
Yeah, that's weird.
Trickling down the file yields:
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
In my case (!PARAVIRT_XXL) pv_ops should look like: [0]: .cpu.io_delay [1]: .mmu.flush_tlb_user()
so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because it uses arch_dest_reloc_offset().
If I use the above to fix up validate_call(), would we still need pv_call_dest() & co?
The functionality in pv_call_dest() is still needed because it goes through all the possible targets for the .mmu.flush_tlb_user() pointer -- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure they're noinstr.
Ideally it would only print a single warning for this case, something like:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
I left out "pv_ops[1]" because it's already long enough :-)
It would need a little bit of code shuffling. But it's really a preexisting problem so don't feel compelled to fix it with this patch set.
On Mon, Jul 31, 2023 at 04:36:31PM -0500, Josh Poimboeuf wrote:
On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
Interestingly the second one doesn't seem to have triggered the "pv_ops" bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will return NULL.
Yeah, that's weird.
Trickling down the file yields:
vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
In my case (!PARAVIRT_XXL) pv_ops should look like: [0]: .cpu.io_delay [1]: .mmu.flush_tlb_user()
so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because it uses arch_dest_reloc_offset().
If I use the above to fix up validate_call(), would we still need pv_call_dest() & co?
The functionality in pv_call_dest() is still needed because it goes through all the possible targets for the .mmu.flush_tlb_user() pointer -- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure they're noinstr.
Ideally it would only print a single warning for this case, something like:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
But then what about the case where there are multiple implementations and more than one isn't noinstr? IIRC that is where these double prints came from. One is the callsite (always one) and the second is the offending implementation (but there could be more).
I left out "pv_ops[1]" because it's already long enough :-)
The index number is useful when also looking at the assembler, which IIRC is an indexed indirect call.
On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
Ideally it would only print a single warning for this case, something like:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
But then what about the case where there are multiple implementations and more than one isn't noinstr?
The warning would be in the loop in pv_call_dest(), so it would potentially print multiple warnings, one for each potential dest.
IIRC that is where these double prints came from. One is the callsite (always one) and the second is the offending implementation (but there could be more).
It's confusing to warn about the call site and the destination in two separate warnings. That's why I'm proposing combining them into a single warning (which still could end up as multiple warnings if there are multiple affected dests).
I left out "pv_ops[1]" because it's already long enough :-)
The index number is useful when also looking at the assembler, which IIRC is an indexed indirect call.
Ok, so something like so?
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section
On Tue, Aug 01, 2023 at 11:06:36AM -0500, Josh Poimboeuf wrote:
On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
Ideally it would only print a single warning for this case, something like:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
But then what about the case where there are multiple implementations and more than one isn't noinstr?
The warning would be in the loop in pv_call_dest(), so it would potentially print multiple warnings, one for each potential dest.
IIRC that is where these double prints came from. One is the callsite (always one) and the second is the offending implementation (but there could be more).
It's confusing to warn about the call site and the destination in two separate warnings. That's why I'm proposing combining them into a single warning (which still could end up as multiple warnings if there are multiple affected dests).
I left out "pv_ops[1]" because it's already long enough :-)
The index number is useful when also looking at the assembler, which IIRC is an indexed indirect call.
Ok, so something like so?
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section
Sure, that all would work I suppose.
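To make the agreed-upon format concrete, here is a hedged sketch of how the combined warning might be emitted from the existing loop in pv_call_dest() (objtool internals simplified; 'insn', 'idx' and 'target' are assumed to be in scope as they are today, and silencing the now-redundant validate_call() warning is left out):

	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
		if (!target->sec->noinstr) {
			/* One warning per offending implementation, naming both ends. */
			WARN_INSN(insn, "indirect call to pv_ops[%d] (%s) leaves .noinstr.text section",
				  idx, target->name);
			file->pv_ops[idx].clean = false;
		}
	}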
Later commits will depend on having no runtime-mutable text in early entry code. (ab)use the .noinstr section as a marker of early entry code and warn about static keys used in it that can be flipped at runtime.
Suggested-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- tools/objtool/check.c | 20 ++++++++++++++++++++ tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/special.h | 2 ++ tools/objtool/special.c | 3 +++ 4 files changed, 26 insertions(+)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index d308330f2910e..d973bb4df4341 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1968,6 +1968,9 @@ static int add_special_section_alts(struct objtool_file *file) alt->next = orig_insn->alts; orig_insn->alts = alt;
+ if (special_alt->key_sym) + orig_insn->key_sym = special_alt->key_sym; + list_del(&special_alt->list); free(special_alt); } @@ -3476,6 +3479,20 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct return 0; }
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{ + if (state->noinstr && state->instr <= 0) { + if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) { + WARN_INSN(insn, + "Non __ro_after_init static key \"%s\" in .noinstr section", + insn->key_sym->name); + return 1; + } + } + + return 0; +} + static struct instruction *next_insn_to_validate(struct objtool_file *file, struct instruction *insn) { @@ -3625,6 +3642,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func, if (handle_insn_ops(insn, next_insn, &state)) return 1;
+ if (insn->key_sym) + validate_static_key(insn, &state); + switch (insn->type) {
case INSN_RETURN: diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index daa46f1f0965a..35dd21f8f41e1 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -77,6 +77,7 @@ struct instruction { struct symbol *sym; struct stack_op *stack_ops; struct cfi_state *cfi; + struct symbol *key_sym; };
static inline struct symbol *insn_func(struct instruction *insn) diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h index 86d4af9c5aa9d..0e61f34fe3a28 100644 --- a/tools/objtool/include/objtool/special.h +++ b/tools/objtool/include/objtool/special.h @@ -27,6 +27,8 @@ struct special_alt { struct section *new_sec; unsigned long new_off;
+ struct symbol *key_sym; + unsigned int orig_len, new_len; /* group only */ };
diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 91b1950f5bd8a..1f76cfd815bf3 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry, return -1; } alt->key_addend = reloc_addend(key_reloc); + + reloc_to_sec_off(key_reloc, &sec, &offset); + alt->key_sym = find_symbol_by_offset(sec, offset & ~2); }
return 0;
On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{
- if (state->noinstr && state->instr <= 0) {
if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
WARN_INSN(insn,
"Non __ro_after_init static key \"%s\" in .noinstr section",
For consistency with other warnings, this should start with a lowercase "n" and the string literal should be on the same line as the WARN_INSN, like
WARN_INSN(insn, "non __ro_after_init static key "%s" in .noinstr section", ...
diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 91b1950f5bd8a..1f76cfd815bf3 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry, return -1; } alt->key_addend = reloc_addend(key_reloc);
reloc_to_sec_off(key_reloc, &sec, &offset);
alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
Bits 0 and 1 can both store data, should be ~3?
On 28/07/23 10:35, Josh Poimboeuf wrote:
On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{
- if (state->noinstr && state->instr <= 0) {
if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
WARN_INSN(insn,
"Non __ro_after_init static key \"%s\" in .noinstr section",
For consistency with other warnings, this should start with a lowercase "n" and the string literal should be on the same line as the WARN_INSN, like
WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section", ...
diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 91b1950f5bd8a..1f76cfd815bf3 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry, return -1; } alt->key_addend = reloc_addend(key_reloc);
reloc_to_sec_off(key_reloc, &sec, &offset);
alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
Bits 0 and 1 can both store data, should be ~3?
Quite so, that needs to be the same as jump_entry_key().
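For reference, roughly what the kernel-side helper does (shape recalled from include/linux/jump_label.h, relative-entry variant - treat this as a hedged reminder rather than an exact quote): bit 0 carries the branch default and bit 1 the "init" flag, hence both need masking off:

static inline struct static_key *jump_entry_key(const struct jump_entry *entry)
{
	long offset = entry->key & ~3L;

	return (struct static_key *)((unsigned long)&entry->key + offset);
}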
-- Josh
On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
Later commits will depend on having no runtime-mutable text in early entry code. (ab)use the .noinstr section as a marker of early entry code and warn about static keys used in it that can be flipped at runtime.
Similar to my comment on patch 13, this could also use a short justification for adding the feature, i.e. why runtime-mutable text isn't going to be allowed in .noinstr.
Also, please add a short description of the warning (and why it exists) to tools/objtool/Documentation/objtool.txt.
On 28/07/23 11:02, Josh Poimboeuf wrote:
On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
Later commits will depend on having no runtime-mutable text in early entry code. (ab)use the .noinstr section as a marker of early entry code and warn about static keys used in it that can be flipped at runtime.
Similar to my comment on patch 13, this could also use a short justification for adding the feature, i.e. why runtime-mutable text isn't going to be allowed in .noinstr.
Also, please add a short description of the warning (and why it exists) to tools/objtool/Documentation/objtool.txt.
I had missed this, thanks for the pointer.
-- Josh
objtool now warns about it:
vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
[...]
The key can only be enabled (and not disabled) in the __init function ct_cpu_tracker_user(), so mark it as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/context_tracking.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 6ef0b35fc28c5..cc4f3a57f848c 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -432,7 +432,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { } #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h>
-DEFINE_STATIC_KEY_FALSE(context_tracking_key); +DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key); EXPORT_SYMBOL_GPL(context_tracking_key);
static noinstr bool context_tracking_recursion_enter(void)
On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
objtool now warns about it:
vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section [...]
The key can only be enabled (and not disabled) in the __init function ct_cpu_tracker_user(), so mark it as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com
It's best to avoid temporarily introducing warnings. Bots will rightfully complain about that. This patch and the next one should come before the objtool patches.
Also it would be helpful for the commit log to have a brief justification for the patch beyond "fix the objtool warning". Something roughly like:
Soon, runtime-mutable text won't be allowed in .noinstr sections, so that a code patching IPI to a userspace-bound CPU can be safely deferred to the next kernel entry.
'context_tracking_key' is only enabled in __init ct_cpu_tracker_user(). Mark it as __ro_after_init.
On 28/07/23 11:00, Josh Poimboeuf wrote:
On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
objtool now warns about it:
vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section [...]
The key can only be enabled (and not disabled) in the __init function ct_cpu_tracker_user(), so mark it as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com
It's best to avoid temporarily introducing warnings. Bots will rightfully complain about that. This patch and the next one should come before the objtool patches.
Ack, I'll reverse the order of these.
Also it would be helpful for the commit log to have a brief justification for the patch beyond "fix the objtool warning". Something roughly like:
Soon, runtime-mutable text won't be allowed in .noinstr sections, so that a code patching IPI to a userspace-bound CPU can be safely deferred to the next kernel entry.
'context_tracking_key' is only enabled in __init ct_cpu_tracker_user(). Mark it as __ro_after_init.
Looks better indeed, thanks!
-- Josh
objtool now warns about it:
vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section
The key can only be enabled (and not disabled) in the __init function kvm_guest_init(), so mark it as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 1cceac5984daa..319460090a836 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -44,7 +44,7 @@ #include <asm/svm.h> #include <asm/e820/api.h>
-DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled); +DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
static int kvmapf = 1;
On Thu, 2023-07-20 at 17:30 +0100, Valentin Schneider wrote:
objtool now warns about it:
vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section
The key can only be enabled (and not disabled) in the __init function kvm_guest_init(), so mark it as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com
arch/x86/kernel/kvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 1cceac5984daa..319460090a836 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -44,7 +44,7 @@ #include <asm/svm.h> #include <asm/e820/api.h> -DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled); +DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled); static int kvmapf = 1;
Reviewed-by: Maxim Levitsky mlevitsk@redhat.com
Best regards, Maxim Levitsky
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs.
Some callsites can be bent into doing the right thing, such as was done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching or kernel TLB invalidation are prime examples of it.
Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup.
Presence of CONTEXT_KERNEL in the ct_state prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/Kconfig | 9 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 14 +++++ include/linux/context_tracking.h | 25 ++++++++ include/linux/context_tracking_state.h | 62 +++++++++++++++----- include/linux/context_tracking_work.h | 26 ++++++++ kernel/context_tracking.c | 51 +++++++++++++++- kernel/time/Kconfig | 5 ++ 8 files changed, 176 insertions(+), 17 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig index aff2746c8af28..1bcb3bbdddaad 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -871,6 +871,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called.
+config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7422db4097701..71481a80774f6 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -198,6 +198,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0000000000000..5bc29e6b2ed38 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(int work) +{ + switch (work) { + case CONTEXT_WORK_n: + // Do work... + break; + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 6e76b9dba00e7..8aee086d0a25f 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,10 +5,15 @@ #include <linux/sched.h> #include <linux/vtime.h> #include <linux/context_tracking_state.h> +#include <linux/context_tracking_work.h> #include <linux/instrumentation.h>
#include <asm/ptrace.h>
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START, + "Not enough bits for CONTEXT_WORK"); +#endif
#ifdef CONFIG_CONTEXT_TRACKING_USER extern void ct_cpu_track_user(int cpu); @@ -131,6 +136,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); }
+#ifdef CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CONTEXT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index bbff5f7f88030..828fcdb801f73 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -9,21 +9,6 @@ /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define DYNTICK_IRQ_NONIDLE ((LONG_MAX / 2) + 1)
-enum ctx_state { - CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */ - CONTEXT_KERNEL = 0, - CONTEXT_IDLE = 1, - CONTEXT_USER = 2, - CONTEXT_GUEST = 3, - CONTEXT_MAX = 4, -}; - -/* Even value for idle, else odd. */ -#define RCU_DYNTICKS_IDX CONTEXT_MAX - -#define CT_STATE_MASK (CONTEXT_MAX - 1) -#define CT_DYNTICKS_MASK (~CT_STATE_MASK) - struct context_tracking { #ifdef CONFIG_CONTEXT_TRACKING_USER /* @@ -44,6 +29,53 @@ struct context_tracking { #endif };
+enum ctx_state { + /* Following are values */ + CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */ + CONTEXT_KERNEL = 0, + CONTEXT_IDLE = 1, + CONTEXT_USER = 2, + CONTEXT_GUEST = 3, + CONTEXT_MAX = 4, +}; + +/* + * We cram three different things within the same atomic variable: + * + * CONTEXT_STATE_END RCU_DYNTICKS_END + * | CONTEXT_WORK_END | + * | | | + * v v v + * [ context_state ][ context work ][ RCU dynticks counter ] + * ^ ^ ^ + * | | | + * | CONTEXT_WORK_START | + * CONTEXT_STATE_START RCU_DYNTICKS_START + */ + +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE) + +#define CONTEXT_STATE_START 0 +#define CONTEXT_STATE_END (bits_per(CONTEXT_MAX - 1) - 1) + +#define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31) +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS) +#define RCU_DYNTICKS_END (CT_STATE_SIZE - 1) +#define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START) + +#define CONTEXT_WORK_START (CONTEXT_STATE_END + 1) +#define CONTEXT_WORK_END (RCU_DYNTICKS_START - 1) + +/* Make sure all our bits are accounted for */ +static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) + + (CONTEXT_WORK_END + 1 - CONTEXT_WORK_START) + + (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) == + CT_STATE_SIZE); + +#define CT_STATE_MASK GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START) +#define CT_WORK_MASK GENMASK(CONTEXT_WORK_END, CONTEXT_WORK_START) +#define CT_DYNTICKS_MASK GENMASK(RCU_DYNTICKS_END, RCU_DYNTICKS_START) + #ifdef CONFIG_CONTEXT_TRACKING DECLARE_PER_CPU(struct context_tracking, context_tracking); #endif diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h new file mode 100644 index 0000000000000..fb74db8876dd2 --- /dev/null +++ b/include/linux/context_tracking_work.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H +#define _LINUX_CONTEXT_TRACKING_WORK_H + +#include <linux/bitops.h> + +enum { + CONTEXT_WORK_n_OFFSET, + CONTEXT_WORK_MAX_OFFSET +}; + +enum ct_work { + CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET), + CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) +}; + +#include <asm/context_tracking_work.h> + +#ifdef CONFIG_CONTEXT_TRACKING_WORK +extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work); +#else +static inline bool +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } +#endif + +#endif diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index cc4f3a57f848c..1a3f6e355826d 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -72,6 +72,51 @@ static __always_inline void rcu_dynticks_task_trace_exit(void) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static noinstr void ct_work_flush(unsigned long seq) +{ + int bit; + + seq = (seq & CT_WORK_MASK) >> CONTEXT_WORK_START; + + /* + * arch_context_tracking_work() must be noinstr, non-blocking, + * and NMI safe. + */ + for_each_set_bit(bit, &seq, CONTEXT_WORK_MAX) + arch_context_tracking_work(BIT(bit)); +} + +bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{ + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); + unsigned int old; + bool ret = false; + + preempt_disable(); + + old = atomic_read(&ct->state); + /* + * Try setting the work until either + * - the target CPU no longer accepts any more deferred work + * - the work has been set + * + * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE + * as they are regular integers rather than bits, but that doesn't + * matter here: if any of the context state bit is set, the CPU isn't + * in kernel context. + */ + while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret) + ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START)); + + preempt_enable(); + return ret; +} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif + /* * Record entry into an extended quiescent state. This is only to be * called when not already in an extended quiescent state, that is, @@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset) * next idle sojourn. */ rcu_dynticks_task_trace_enter(); // Before ->dynticks update! - seq = ct_state_inc(offset); + seq = ct_state_inc_clear_work(offset); + // RCU is no longer watching. Better be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX)); } @@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) { - int seq; + unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset); + ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_dynticks_task_trace_exit(); // After ->dynticks update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX)); diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index bae8f11070bef..fdb266f2d774b 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production.
+config CONTEXT_TRACKING_WORK + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + default y + config NO_HZ bool "Old Idle dynticks config" help
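To illustrate the intended use of the new interface, here is a hedged sketch of how a sender might combine deferral with the regular IPI fallback (the 'defer_work_or_ipi()' helper and its arguments are made up; the actual bit:callback users come in later patches):

static void defer_work_or_ipi(int cpu, unsigned int work,
			      smp_call_func_t func, void *info)
{
	/*
	 * ct_set_cpu_work() only succeeds while the target CPU is in
	 * user/guest/idle context; otherwise fall back to an immediate IPI.
	 */
	if (!ct_set_cpu_work(cpu, work))
		smp_call_function_single(cpu, func, info, 1);
}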
On Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider wrote:
+enum ctx_state {
- /* Following are values */
- CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */
- CONTEXT_KERNEL = 0,
- CONTEXT_IDLE = 1,
- CONTEXT_USER = 2,
- CONTEXT_GUEST = 3,
- CONTEXT_MAX = 4,
+};
+/*
- We cram three different things within the same atomic variable:
CONTEXT_STATE_END RCU_DYNTICKS_END
| CONTEXT_WORK_END |
| | |
v v v
[ context_state ][ context work ][ RCU dynticks counter ]
^ ^ ^
| | |
| CONTEXT_WORK_START |
- CONTEXT_STATE_START RCU_DYNTICKS_START
Should the layout be displayed in reverse? Well at least I always picture bitmaps in reverse, that's probably due to the direction of the shift arrows. Not sure what is the usual way to picture it though...
- */
+#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+#define CONTEXT_STATE_START 0 +#define CONTEXT_STATE_END (bits_per(CONTEXT_MAX - 1) - 1)
Since you have non overlapping *_START symbols, perhaps the *_END are superfluous?
+#define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31) +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS) +#define RCU_DYNTICKS_END (CT_STATE_SIZE - 1) +#define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
Might be the right time to standardize and fix our naming:
CT_STATE_START, CT_STATE_KERNEL, CT_STATE_USER, ... CT_WORK_START, CT_WORK_*, ... CT_RCU_DYNTICKS_START, CT_RCU_DYNTICKS_IDX
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{
- struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
- unsigned int old;
- bool ret = false;
- preempt_disable();
- old = atomic_read(&ct->state);
- /*
* Try setting the work until either
* - the target CPU no longer accepts any more deferred work
* - the work has been set
*
* NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
* as they are regular integers rather than bits, but that doesn't
* matter here: if any of the context state bit is set, the CPU isn't
* in kernel context.
*/
- while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
That may still miss a recent entry to userspace due to the first plain read, ending with an undesired interrupt.
You need at least one cmpxchg. Well, of course that stays racy by nature because between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised and received, the remote CPU may have gone to userspace already. But it still narrows the window a little.
Thanks.
ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START));
- preempt_enable();
- return ret;
+} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif
/*
- Record entry into an extended quiescent state. This is only to be
- called when not already in an extended quiescent state, that is,
@@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset) * next idle sojourn. */ rcu_dynticks_task_trace_enter(); // Before ->dynticks update!
- seq = ct_state_inc(offset);
- seq = ct_state_inc_clear_work(offset);
- // RCU is no longer watching. Better be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
} @@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) {
- int seq;
- unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset);
- ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_dynticks_task_trace_exit(); // After ->dynticks update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index bae8f11070bef..fdb266f2d774b 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production. +config CONTEXT_TRACKING_WORK
- bool
- depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
- default y
config NO_HZ bool "Old Idle dynticks config" help -- 2.31.1
On 24/07/23 16:52, Frederic Weisbecker wrote:
Le Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider a écrit :
+enum ctx_state {
- /* Following are values */
- CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */
- CONTEXT_KERNEL = 0,
- CONTEXT_IDLE = 1,
- CONTEXT_USER = 2,
- CONTEXT_GUEST = 3,
- CONTEXT_MAX = 4,
+};
+/*
- We cram three different things within the same atomic variable:
CONTEXT_STATE_END RCU_DYNTICKS_END
| CONTEXT_WORK_END |
| | |
v v v
[ context_state ][ context work ][ RCU dynticks counter ]
^ ^ ^
| | |
| CONTEXT_WORK_START |
- CONTEXT_STATE_START RCU_DYNTICKS_START
Should the layout be displayed in reverse? Well at least I always picture bitmaps in reverse, that's probably due to the direction of the shift arrows. Not sure what is the usual way to picture it though...
Surprisingly, I managed to confuse myself with that comment :-) I think I am subconsciously more used to the reverse as well. I've flipped that and put "MSB" / "LSB" at either end.
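Something like this, perhaps (a hedged sketch of the flipped diagram; bit positions assume the 32-bit state word, 2 context-state bits and 16 dynticks bits used in this version of the patch):

/*
 *  MSB                                                    LSB
 *  [ RCU dynticks counter ][ context work ][ context_state ]
 *   31                  16  15           2  1             0
 */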
- */
+#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+#define CONTEXT_STATE_START 0 +#define CONTEXT_STATE_END (bits_per(CONTEXT_MAX - 1) - 1)
Since you have non overlapping *_START symbols, perhaps the *_END are superfluous?
They're only really there to tidy up the GENMASK() further down - it keeps the range and index definitions in one hunk. I tried defining that directly within the GENMASK() themselves but it got too ugly IMO.
+#define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31) +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS) +#define RCU_DYNTICKS_END (CT_STATE_SIZE - 1) +#define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
Might be the right time to standardize and fix our naming:
CT_STATE_START, CT_STATE_KERNEL, CT_STATE_USER, ... CT_WORK_START, CT_WORK_*, ... CT_RCU_DYNTICKS_START, CT_RCU_DYNTICKS_IDX
Heh, I have actually already done this for v3, though I hadn't touched the RCU_DYNTICKS* family. I'll fold that in.
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{
- struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
- unsigned int old;
- bool ret = false;
- preempt_disable();
- old = atomic_read(&ct->state);
- /*
* Try setting the work until either
* - the target CPU no longer accepts any more deferred work
* - the work has been set
*
* NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
* as they are regular integers rather than bits, but that doesn't
* matter here: if any of the context state bit is set, the CPU isn't
* in kernel context.
*/
- while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
That may still miss a recent entry to userspace due to the first plain read, ending with an undesired interrupt.
You need at least one cmpxchg. Well, of course that stays racy by nature because between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised and received, the remote CPU may have gone to userspace already. But it still narrows the window a little.
I can make that a 'do {} while ()' instead to force at least one execution of the cmpxchg().
This is only about reducing the race window, right? If we're executing this just as the target CPU is about to enter userspace, we're going to be in racy territory anyway. Regardless, I'm happy to do that change.
On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
I can make that a 'do {} while ()' instead to force at least one execution of the cmpxchg().
This is only about reducing the race window, right? If we're executing this just as the target CPU is about to enter userspace, we're going to be in racy territory anyway. Regardless, I'm happy to do that change.
Right, it's only about narrowing down the race window. It probably doesn't matter in practice, but it's one less thing to consider for the brain :-)
Also, why bothering with handling CONTEXT_IDLE?
Thanks.
On 24/07/23 21:18, Frederic Weisbecker wrote:
On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
I can make that a 'do {} while ()' instead to force at least one execution of the cmpxchg().
This is only about reducing the race window, right? If we're executing this just as the target CPU is about to enter userspace, we're going to be in racy territory anyway. Regardless, I'm happy to do that change.
Right, it's only about narrowing down the race window. It probably doesn't matter in practice, but it's one less thing to consider for the brain :-)
Ack
Also, why bothering with handling CONTEXT_IDLE?
I have reasons! I just swept them under the rug and didn't mention them :D Also looking at the config dependencies again I got it wrong, but nevertheless that means I get to ramble about it.
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions:
ct_idle_enter()
  ct_kernel_exit()
    ct_state_inc_clear_work()

ct_idle_exit()
  ct_kernel_enter()
    ct_work_flush()
Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
It's a completely different argument than reducing interference for NOHZ_FULL userspace applications and I should have at the very least mentioned it in the cover letter, but it's the exact same backing mechanism.
Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate patch with a proper changelog.
On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
I have reasons! I just swept them under the rug and didn't mention them :D Also looking at the config dependencies again I got it wrong, but nevertheless that means I get to ramble about it.
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions:
ct_idle_enter() ct_kernel_exit() ct_state_inc_clear_work()
ct_idle_exit() ct_kernel_enter() ct_work_flush()
Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
It's a completely different argument than reducing interference for NOHZ_FULL userspace applications and I should have at the very least mentioned it in the cover letter, but it's the exact same backing mechanism.
Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate patch with a proper changelog.
Ok, should that be a separate Kconfig? This can indeed bring a power improvement, but at the cost of more overhead on the sender's side. A balance to be measured...
On 25/07/23 13:22, Frederic Weisbecker wrote:
On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
I have reasons! I just swept them under the rug and didn't mention them :D Also looking at the config dependencies again I got it wrong, but nevertheless that means I get to ramble about it.
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions:
ct_idle_enter() ct_kernel_exit() ct_state_inc_clear_work()
ct_idle_exit() ct_kernel_enter() ct_work_flush()
Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
It's a completely different argument than reducing interference for NOHZ_FULL userspace applications and I should have at the very least mentioned it in the cover letter, but it's the exact same backing mechanism.
Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate patch with a proper changelog.
Ok, should that be a separate Kconfig? This can indeed bring a power improvement, but at the cost of more overhead on the sender's side. A balance to be measured...
Yep agreed, I'll make that an optional config.
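A hedged sketch of what such an optional knob could look like (the symbol name is made up; only HAVE_CONTEXT_TRACKING_WORK and CONTEXT_TRACKING_IDLE are existing symbols):

config CONTEXT_TRACKING_WORK_IDLE
	bool "Defer deferrable IPIs targeting idle CPUs"
	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE
	default n
	help
	  Keep idle CPUs idle for longer by deferring kernel-only IPI
	  callbacks until they exit idle, at the cost of extra work on
	  the IPI sender's side.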
CONTEXT_TRACKING_WORK reduces the size of the dynticks counter to free up some bits for work deferral. Paul suggested making the actual counter size configurable for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. Previous commits have added build-time checks that ensure a kernel with a problematic dynticks counter width can't be built.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/context_tracking.h | 3 ++- include/linux/context_tracking_state.h | 3 +-- kernel/rcu/Kconfig | 33 ++++++++++++++++++++++++++ 3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 8aee086d0a25f..9c0c622bc27bb 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -12,7 +12,8 @@
#ifdef CONFIG_CONTEXT_TRACKING_WORK static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START, - "Not enough bits for CONTEXT_WORK"); + "Not enough bits for CONTEXT_WORK, " + "CONFIG_RCU_DYNTICKS_BITS might be too high"); #endif
#ifdef CONFIG_CONTEXT_TRACKING_USER diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 828fcdb801f73..292a0b7c06948 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -58,8 +58,7 @@ enum ctx_state { #define CONTEXT_STATE_START 0 #define CONTEXT_STATE_END (bits_per(CONTEXT_MAX - 1) - 1)
-#define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31) -#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS) +#define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS) #define RCU_DYNTICKS_END (CT_STATE_SIZE - 1) #define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index bdd7eadb33d8f..1ff2aab24e964 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME Say Y here if you need tighter callback-limit enforcement. Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN + int + depends on !RCU_EXPERT + default 31 if !CONTEXT_TRACKING_WORK + default 16 if CONTEXT_TRACKING_WORK + +config RCU_DYNTICKS_RANGE_BEGIN + int + depends on RCU_EXPERT + default 2 + +config RCU_DYNTICKS_RANGE_END + int + default 31 if !CONTEXT_TRACKING_WORK + default 16 if CONTEXT_TRACKING_WORK + +config RCU_DYNTICKS_BITS_DEFAULT + int + default 31 if !CONTEXT_TRACKING_WORK + default 16 if CONTEXT_TRACKING_WORK + +config RCU_DYNTICKS_BITS + int "Dynticks counter width" if CONTEXT_TRACKING_WORK + range RCU_DYNTICKS_RANGE_BEGIN RCU_DYNTICKS_RANGE_END + default RCU_DYNTICKS_BITS_DEFAULT + help + This option controls the width of the dynticks counter. + + Lower values will make overflows more frequent, which will increase + the likelihood of extending grace-periods. + + Don't touch this unless you are running some tests. + endmenu # "RCU Subsystem"
On 20/07/23 17:30, Valentin Schneider wrote:
index bdd7eadb33d8f..1ff2aab24e964 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME Say Y here if you need tighter callback-limit enforcement. Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN
- int
- depends on !RCU_EXPERT
- default 31 if !CONTEXT_TRACKING_WORK
You'll note that this should be 30 really, because the lower *2* bits are taken by the context state (CONTEXT_GUEST has a value of 3).
This highlights the fragile part of this: the Kconfig values are hardcoded, but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The static_assert() will at least capture any misconfiguration, but having that enforced by the actual Kconfig ranges would be less awkward.
Do we currently have a way of e.g. making a Kconfig file depend on and use values generated by a C header?
On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
On 20/07/23 17:30, Valentin Schneider wrote:
index bdd7eadb33d8f..1ff2aab24e964 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME Say Y here if you need tighter callback-limit enforcement. Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN
- int
- depends on !RCU_EXPERT
- default 31 if !CONTEXT_TRACKING_WORK
You'll note that this should be 30 really, because the lower *2* bits are taken by the context state (CONTEXT_GUEST has a value of 3).
This highlights the fragile part of this: the Kconfig values are hardcoded, but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The static_assert() will at least capture any misconfiguration, but having that enforced by the actual Kconfig ranges would be less awkward.
Do we currently have a way of e.g. making a Kconfig file depend on and use values generated by a C header?
Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig option and let the C code work out what the number of bits should be?
I suppose that there might be a failure whose frequency depended on the number of bits, which might be an argument for keeping something like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using RCU_DYNTICKS_TORTURE for normal testing.
Thoughts?
Thanx, Paul
On 21/07/23 07:10, Paul E. McKenney wrote:
On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
On 20/07/23 17:30, Valentin Schneider wrote:
index bdd7eadb33d8f..1ff2aab24e964 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME Say Y here if you need tighter callback-limit enforcement. Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN
- int
- depends on !RCU_EXPERT
- default 31 if !CONTEXT_TRACKING_WORK
You'll note that this should be 30 really, because the lower *2* bits are taken by the context state (CONTEXT_GUEST has a value of 3).
This highlights the fragile part of this: the Kconfig values are hardcoded, but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The static_assert() will at least capture any misconfiguration, but having that enforced by the actual Kconfig ranges would be less awkward.
Do we currently have a way of e.g. making a Kconfig file depend on and use values generated by a C header?
Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig option and let the C code work out what the number of bits should be?
I suppose that there might be a failure whose frequency depended on the number of bits, which might be an argument for keeping something like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using RCU_DYNTICKS_TORTURE for normal testing.
Thoughts?
AFAICT if we run tests with the minimum possible width, then intermediate values shouldn't have much value.
Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I came up with, as we can let the context tracking code figure out the widths itself and not expose any of that to Kconfig.
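For instance (a hedged sketch; only the symbol name from Paul's suggestion is assumed, and the C side would then pick the width with IS_ENABLED(), much like the CONTEXT_TRACKING_WORK conditional it replaces):

config RCU_DYNTICKS_TORTURE
	bool "Minimize RCU dynticks counter width (torture testing only)"
	depends on RCU_EXPERT
	default n
	help
	  Shrink the dynticks counter to the smallest width the context
	  tracking layout allows, making counter wrap-around far more
	  frequent. Only useful for rcutorture testing.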
Thanx, Paul
On Fri, Jul 21, 2023 at 04:08:10PM +0100, Valentin Schneider wrote:
On 21/07/23 07:10, Paul E. McKenney wrote:
On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
On 20/07/23 17:30, Valentin Schneider wrote:
index bdd7eadb33d8f..1ff2aab24e964 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME Say Y here if you need tighter callback-limit enforcement. Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN
- int
- depends on !RCU_EXPERT
- default 31 if !CONTEXT_TRACKING_WORK
You'll note that this should be 30 really, because the lower *2* bits are taken by the context state (CONTEXT_GUEST has a value of 3).
This highlights the fragile part of this: the Kconfig values are hardcoded, but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The static_assert() will at least capture any misconfiguration, but having that enforced by the actual Kconfig ranges would be less awkward.
Do we currently have a way of e.g. making a Kconfig file depend on and use values generated by a C header?
Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig option and let the C code work out what the number of bits should be?
I suppose that there might be a failure whose frequency depended on the number of bits, which might be an argument for keeping something like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using RCU_DYNTICKS_TORTURE for normal testing.
Thoughts?
AFAICT if we run tests with the minimum possible width, then intermediate values shouldn't have much value.
Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I came up with, as we can let the context tracking code figure out the widths itself and not expose any of that to Kconfig.
Agreed. If a need for variable numbers of bits ever does arise, we can worry about it at that time. And then we would have more information on what a variable-bit facility should look like.
Thanx, Paul
We now have an RCU_EXPERT knob for configuring the size of the dynticks counter: CONFIG_RCU_DYNTICKS_BITS.
Add a torture config for a ridiculously small counter (2 bits). This is a copy of TREE04 with the added counter size restriction.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- .../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++ .../rcutorture/configs/rcu/TREE11.boot | 1 + 2 files changed, 20 insertions(+) create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2 diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot new file mode 100644 index 0000000000000..a8d94caf7d2fd --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot @@ -0,0 +1 @@ +rcutree.rcu_fanout_leaf=4 nohz_full=1-N
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
We now have an RCU_EXPERT knob for configuring the size of the dynticks counter: CONFIG_RCU_DYNTICKS_BITS.
Add a torture config for a ridiculously small counter (2 bits). This is a copy of TREE04 with the added counter size restriction.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com
.../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++ .../rcutorture/configs/rcu/TREE11.boot | 1 + 2 files changed, 20 insertions(+) create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario? That would ensure that it gets tested regularly without extending the time required to run a full set of rcutorture tests.
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot new file mode 100644 index 0000000000000..a8d94caf7d2fd --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot @@ -0,0 +1 @@
+rcutree.rcu_fanout_leaf=4 nohz_full=1-N
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
We now have an RCU_EXPERT knob for configuring the size of the dynticks counter: CONFIG_RCU_DYNTICKS_BITS.
Add a torture config for a ridiculously small counter (2 bits). This is a copy of TREE04 with the added counter size restriction.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com
.../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++ .../rcutorture/configs/rcu/TREE11.boot | 1 + 2 files changed, 20 insertions(+) create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario? That would ensure that it gets tested regularly without extending the time required to run a full set of rcutorture tests.
Please see below for the version of this patch that I am running overnight tests with. Does this one work for you?
Thanx, Paul
------------------------------------------------------------------------
commit 1aa13731e665193cd833edac5ebc86a9c3fea2b7
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jul 20 20:58:41 2023 -0700
rcutorture: Add a test config to torture test low RCU_DYNTICKS width
We now have an RCU_EXPERT knob for configuring the size of the dynticks counter: CONFIG_RCU_DYNTICKS_BITS.
Modify scenario TREE04 to exercise a ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3a..aa7274efd981 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
On 20/07/23 21:00, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario? That would ensure that it gets tested regularly without extending the time required to run a full set of rcutorture tests.
Please see below for the version of this patch that I am running overnight tests with. Does this one work for you?
Yep, that's fine with me. I only went with a separate test file as I wasn't sure how new test options should be handled (merged into existing tests vs new tests created), and didn't want to negatively impact TREE04 or TREE06. If merging into TREE04 is preferred, then I'll do just that and carry this patch moving forward.
Thanks!
On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
On 20/07/23 21:00, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario? That would ensure that it gets tested regularly without extending the time required to run a full set of rcutorture tests.
Please see below for the version of this patch that I am running overnight tests with. Does this one work for you?
Yep, that's fine with me. I only went with a separate test file as I wasn't sure how new test options should be handled (merged into existing tests vs new tests created), and didn't want to negatively impact TREE04 or TREE06. If merging into TREE04 is preferred, then I'll do just that and carry this patch moving forward.
Things worked fine for this one-hour-per-scenario test run on my laptop, except for the CONFIG_SMP=n runs, which all got build errors like the following.
Thanx, Paul
------------------------------------------------------------------------
In file included from ./include/linux/container_of.h:5,
                 from ./include/linux/list.h:5,
                 from ./include/linux/swait.h:5,
                 from ./include/linux/completion.h:12,
                 from ./include/linux/crypto.h:15,
                 from arch/x86/kernel/asm-offsets.c:9:
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
   61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
./include/linux/context_tracking_state.h:70:29: note: in expansion of macro ‘RCU_DYNTICKS_START’
   70 | #define CONTEXT_WORK_END (RCU_DYNTICKS_START - 1)
./include/linux/context_tracking_state.h:74:16: note: in expansion of macro ‘CONTEXT_WORK_END’
   74 |                (CONTEXT_WORK_END + 1 - CONTEXT_WORK_START) +
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
./include/linux/context_tracking_state.h:62:29: note: in expansion of macro ‘CT_STATE_SIZE’
   62 | #define RCU_DYNTICKS_END (CT_STATE_SIZE - 1)
./include/linux/context_tracking_state.h:75:16: note: in expansion of macro ‘RCU_DYNTICKS_END’
   75 |                (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) ==
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
./include/linux/context_tracking_state.h:75:40: note: in expansion of macro ‘RCU_DYNTICKS_START’
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
./include/linux/context_tracking_state.h:76:15: note: in expansion of macro ‘CT_STATE_SIZE’
   76 |                CT_STATE_SIZE);
./include/linux/context_tracking_state.h:73:15: error: expression in static assertion is not an integer
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
make[2]: *** [scripts/Makefile.build:116: arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/git/linux-rcu-1/Makefile:1275: prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
  LD      /home/git/linux-rcu-1/tools/objtool/objtool-in.o
  LINK    /home/git/linux-rcu-1/tools/objtool/objtool
make: *** [Makefile:234: __sub-make] Error 2
On 21/07/23 07:07, Paul E. McKenney wrote:
On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
On 20/07/23 21:00, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 new file mode 100644 index 0000000000000..aa7274efd9819 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11 @@ -0,0 +1,19 @@ +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=n +CONFIG_PREEMPT_DYNAMIC=n +#CHECK#CONFIG_TREE_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=n +CONFIG_NO_HZ_FULL=y +CONFIG_RCU_TRACE=y +CONFIG_RCU_FANOUT=4 +CONFIG_RCU_FANOUT_LEAF=3 +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y +CONFIG_RCU_EQS_DEBUG=y +CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario? That would ensure that it gets tested regularly without extending the time required to run a full set of rcutorture tests.
Please see below for the version of this patch that I am running overnight tests with. Does this one work for you?
Yep that's fine with me. I only went with a separate test file as wasn't sure how new test options should be handled (merged into existing tests vs new tests created), and didn't want to negatively impact TREE04 or TREE06. If merging into TREE04 is preferred, then I'll do just that and carry this path moving forwards.
Things worked fine for this one-hour-per-scenario test run on my laptop,
Many thanks for testing!
except for the CONFIG_SMP=n runs, which all got build errors like the following.
Harumph, yes, !SMP (and !CONTEXT_TRACKING_WORK) doesn't compile nicely; I'll fix that for v3.
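(One possible shape for that v3 fix, sketched here purely for illustration; whether CONFIG_CONTEXT_TRACKING_WORK is the right guard for the layout assertion is an assumption, not something taken from the series.)

/* Sketch only: compile the layout static_assert out when the work bits
 * (and the 'state' field they live in) don't exist, e.g. on !SMP builds. */
#ifdef CONFIG_CONTEXT_TRACKING_WORK
static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
	      (CONTEXT_WORK_END + 1 - CONTEXT_WORK_START) +
	      (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) ==
	      CT_STATE_SIZE);
#endif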
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen.
This leaves us with static keys. Any static key used in early entry code should be only forever-enabled at boot time, IOW __ro_after_init (pretty much like alternatives). Objtool is now able to point at static keys that don't respect this, and all static keys used in early entry code have now been verified as behaving like so.
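(To illustrate the "forever-enabled at boot" pattern described above: a minimal sketch, where the key name and the boot-time check are hypothetical; only the DEFINE_STATIC_KEY_FALSE_RO + static_branch_enable-during-init pattern is the point.)

#include <linux/jump_label.h>
#include <linux/init.h>

/* Hypothetical key consulted somewhere in early entry code. */
DEFINE_STATIC_KEY_FALSE_RO(early_entry_feature_key);

static int __init early_entry_feature_setup(void)
{
	if (early_entry_feature_supported())	/* hypothetical boot-time check */
		static_branch_enable(&early_entry_feature_key);
	return 0;
}
/*
 * Initcalls run before init memory is sealed, so the __ro_after_init key can
 * still be flipped here; once boot is done it can never change again, which
 * is what makes it safe to rely on before the deferred work is flushed.
 */
early_initcall(early_entry_feature_setup);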
Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 +++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 ++--
 arch/x86/kernel/kprobes/opt.c                |  4 ++--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 ++--
 7 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5bc29e6b2ed38..2c66687ce00e2 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(int work) { switch (work) { - case CONTEXT_WORK_n: - // Do work... + case CONTEXT_WORK_SYNC: + sync_core(); break; } } diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index 29832c338cdc5..b6939e965e69d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len); */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok); diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 72646d75b6ffe..fcce480e1919e 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -1933,9 +1934,24 @@ static void do_sync_core(void *info) sync_core(); }
+static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); }
/* @@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); }
- text_poke_sync(); + text_poke_sync_deferrable();
/* * Second step: update all but the first byte of the patched range. @@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + text_poke_sync_deferrable(); }
/* @@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries }
if (do_sync) - text_poke_sync(); + text_poke_sync_deferrable();
/* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index f7f6042eb7e6c..a38c914753397 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); }
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); }
void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 57b0037d0a996..88451a744ceda 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index b05f62ee2344b..8b4542dc51b6d 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply);
if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); }
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index fb74db8876dd2..13fc97b395030 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h>
enum { - CONTEXT_WORK_n_OFFSET, + CONTEXT_WORK_SYNC_OFFSET, CONTEXT_WORK_MAX_OFFSET };
enum ct_work { - CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET), + CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET), CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) };
Interesting series Valentin. Some high-level question/comments on this one:
On Jul 20, 2023, at 12:34 PM, Valentin Schneider vschneid@redhat.com wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke? How often does instruction patching really happen? If it is very infrequent then I am not sure if it is that harmful.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland spends multiple seconds before entering the kernel, and all this while the blocking call waits? Perhaps in such situation you want the real IPI to be sent out instead of the deferred one?
thanks,
- Joel
On 25/07/23 06:49, Joel Fernandes wrote:
Interesting series Valentin. Some high-level question/comments on this one:
On Jul 20, 2023, at 12:34 PM, Valentin Schneider vschneid@redhat.com wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke? How often does instruction patching really happen? If it is very infrequent then I am not sure if it is that harmful.
Being pushed over a latency threshold *once* is enough to impact the latency evaluation of your given system/application.
It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the housekeeping CPUs may be up to (flipping static keys, loading kprobes, using ftrace...) - frequency of the interference isn't such a big part of the reasoning.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland spends multiple seconds before entering the kernel, and all this while the blocking call waits? Perhaps in such situation you want the real IPI to be sent out instead of the deferred one?
The blocking call only waits for CPUs for which it queued a CSD. Deferred calls do not queue a CSD thus do not impact the waiting at all. See smp_call_function_many_cond().
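(A simplified sketch of the behaviour being relied on here, not the actual smp_call_function_many_cond() code; queue_csd_and_send_ipi() and csd_lock_wait_for() are hypothetical stand-ins for the real CSD machinery.)

#include <linux/cpumask.h>
#include <linux/smp.h>

static void sketch_call_function_many_cond(const struct cpumask *mask,
					   smp_call_func_t func, void *info,
					   smp_cond_func_t cond_func)
{
	cpumask_t ipi_mask;	/* on-stack only for the sketch */
	int cpu;

	cpumask_clear(&ipi_mask);
	for_each_cpu(cpu, mask) {
		/* cond_func returning false means the IPI was deferred... */
		if (cond_func && !cond_func(cpu, info))
			continue;	/* ...so no CSD is queued for this CPU */
		queue_csd_and_send_ipi(cpu, func, info);	/* hypothetical */
		cpumask_set_cpu(cpu, &ipi_mask);
	}

	/* Only CPUs that actually got a CSD queued are waited for. */
	for_each_cpu(cpu, &ipi_mask)
		csd_lock_wait_for(cpu);				/* hypothetical */
}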
On 7/25/23 09:36, Valentin Schneider wrote:
On 25/07/23 06:49, Joel Fernandes wrote:
Interesting series Valentin. Some high-level question/comments on this one:
On Jul 20, 2023, at 12:34 PM, Valentin Schneider vschneid@redhat.com wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke? How often does instruction patching really happen? If it is very infrequent then I am not sure if it is that harmful.
Being pushed over a latency threshold *once* is enough to impact the latency evaluation of your given system/application.
It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the housekeeping CPUs may be up to (flipping static keys, loading kprobes, using ftrace...) - frequency of the interference isn't such a big part of the reasoning.
Makes sense.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland spends multiple seconds before entering the kernel, and all this while the blocking call waits? Perhaps in such situation you want the real IPI to be sent out instead of the deferred one?
The blocking call only waits for CPUs for which it queued a CSD. Deferred calls do not queue a CSD thus do not impact the waiting at all. See smp_call_function_many_cond().
Ah I see you are using on_each_cpu_cond(). I should have gone through the other patch before making noise.
thanks,
- Joel
On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
Interesting series Valentin. Some high-level question/comments on this one:
On Jul 20, 2023, at 12:34 PM, Valentin Schneider vschneid@redhat.com wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke? How often does instruction patching really happen? If it is very infrequent then I am not sure if it is that harmful.
Well, it can happen quite a bit, also from things people would not typically 'expect' it.
For instance, the moment you create the first per-task perf event we frob some jump-labels (and again some second after the last one goes away).
The same for a bunch of runtime network configurations.
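(As a concrete userspace illustration of the perf example above: the snippet below is enough to make the kernel flip perf-related static keys, i.e. patch kernel text, even though nothing about it looks like instruction patching from the caller's side. It is an illustration only, not part of the series.)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

int open_task_cycles_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	/* pid = 0, cpu = -1: count this task on any CPU */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}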
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland spends multiple seconds before entering the kernel, and all this while the blocking call waits? Perhaps in such situation you want the real IPI to be sent out instead of the deferred one?
Please re-read what Valentin wrote -- nobody is waiting on anything.
On 7/25/23 09:39, Peter Zijlstra wrote:
On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
Interesting series Valentin. Some high-level question/comments on this one:
On Jul 20, 2023, at 12:34 PM, Valentin Schneider vschneid@redhat.com wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
Does the amount of harm not correspond to practical frequency of text_poke? How often does instruction patching really happen? If it is very infrequent then I am not sure if it is that harmful.
Well, it can happen quite a bit, also from things people would not typically 'expect' it.
For instance, the moment you create the first per-task perf event we frob some jump-labels (and again some second after the last one goes away).
The same for a bunch of runtime network configurations.
Ok cool. I guess I still have memories of that old ARM device I had where modifications to kernel text were forbidden by hardware (it was a security feature). That made kprobes unusable...
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
If it is not infrequent, then are you handling the case where userland spends multiple seconds before entering the kernel, and all this while the blocking call waits? Perhaps in such situation you want the real IPI to be sent out instead of the deferred one?
Please re-read what Valentin wrote -- nobody is waiting on anything.
Makes sense. To be fair I received his email 3 minutes before yours ;-). But thank you both for clarifying!
- Joel
Kernel TLB invalidation IPIs are a common source of interference on NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not accessing any kernel addresses, these invalidations do not need to happen immediately, and can be deferred until the next user->kernel transition.
Rather than make __flush_tlb_all() noinstr, add a minimal noinstr variant that doesn't try to leverage INVPCID.
FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be invalidated?
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 ++++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 17 +++++++++++++++++
 include/linux/context_tracking_state.h       |  4 ++++
 include/linux/context_tracking_work.h        |  2 ++
 5 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 2c66687ce00e2..9d4f021b5a45b 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -3,6 +3,7 @@ #define _ASM_X86_CONTEXT_TRACKING_WORK_H
#include <asm/sync_core.h> +#include <asm/tlbflush.h>
static __always_inline void arch_context_tracking_work(int work) { @@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work) case CONTEXT_WORK_SYNC: sync_core(); break; + case CONTEXT_WORK_TLBI: + __flush_tlb_all_noinstr(); + break; } }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 80450e1d5385a..323b971987af7 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -17,6 +17,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask);
void __flush_tlb_all(void); +void noinstr __flush_tlb_all_noinstr(void);
#define TLB_FLUSH_ALL -1UL #define TLB_GENERATION_INVALID 0 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 267acf27480af..631df9189ded4 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1237,6 +1237,23 @@ void __flush_tlb_all(void) } EXPORT_SYMBOL_GPL(__flush_tlb_all);
+void noinstr __flush_tlb_all_noinstr(void) +{ + /* + * This is for invocation in early entry code that cannot be + * instrumented. A RMW to CR4 works for most cases, but relies on + * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID + * would require also resetting CR3.PCID, so just try with CR4.PGE, else + * do the CR3 write. + * + * TODO: paravirt + */ + if (cpu_feature_enabled(X86_FEATURE_PGE)) + __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4)); + else + flush_tlb_local(); +} + void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) { struct flush_tlb_info *info; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 292a0b7c06948..3571c62cbb9cd 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -62,6 +62,10 @@ enum ctx_state { #define RCU_DYNTICKS_END (CT_STATE_SIZE - 1) #define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
+/* + * When CONFIG_CONTEXT_TRACKING_WORK=n, _END is 1 behind _START, which makes + * the CONTEXT_WORK size computation below 0, which is what we want! + */ #define CONTEXT_WORK_START (CONTEXT_STATE_END + 1) #define CONTEXT_WORK_END (RCU_DYNTICKS_START - 1)
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index 13fc97b395030..47d5ced39a43a 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -6,11 +6,13 @@
enum { CONTEXT_WORK_SYNC_OFFSET, + CONTEXT_WORK_TLBI_OFFSET, CONTEXT_WORK_MAX_OFFSET };
enum ct_work { CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET), + CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET), CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) };
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as the latter are hit by the resulting flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
This does require a guarantee that nothing in the vmalloc range can be accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are AFAICT a safe exception, as a task running in userspace needs to enter kernelspace to execute do_exit() before its stack can be vfree'd.
XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or somesuch?
Blindly deferring any and all flush of the kernel mappings is a risky move, so introduce a variant of flush_tlb_kernel_range() that explicitly allows deferral. Use it for vunmap flushes.
Note that while flush_tlb_kernel_range() may end up issuing a full flush (including user mappings), this only happens when reaching an invalidation range threshold where it is cheaper to do a full flush than to individually invalidate each page in the range via INVLPG. IOW, it doesn't *require* invalidating user mappings, and thus remains safe to defer until a later kernel entry.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 ++++++++++++++++++++---
 mm/vmalloc.c                    | 19 ++++++++++++++-----
 3 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 323b971987af7..0b9b1f040c476 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end); +extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 631df9189ded4..bb18b35e61b4a 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -10,6 +10,7 @@ #include <linux/debugfs.h> #include <linux/sched/smt.h> #include <linux/task_work.h> +#include <linux/context_tracking.h>
#include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -1045,6 +1046,11 @@ static void do_flush_tlb_all(void *info) __flush_tlb_all(); }
+static bool do_kernel_flush_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI); +} + void flush_tlb_all(void) { count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); @@ -1061,12 +1067,13 @@ static void do_kernel_range_flush(void *info) flush_tlb_one_kernel(addr); }
-void flush_tlb_kernel_range(unsigned long start, unsigned long end) +static inline void +__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end) { /* Balance as user space task's flush, a bit conservative */ if (end == TLB_FLUSH_ALL || (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) { - on_each_cpu(do_flush_tlb_all, NULL, 1); + on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1); } else { struct flush_tlb_info *info;
@@ -1074,13 +1081,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) info = get_flush_tlb_info(NULL, start, end, 0, false, TLB_GENERATION_INVALID);
- on_each_cpu(do_kernel_range_flush, info, 1); + on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
put_flush_tlb_info(); preempt_enable(); } }
+void flush_tlb_kernel_range(unsigned long start, unsigned long end) +{ + __flush_tlb_kernel_range(NULL, start, end); +} + +void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end) +{ + __flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end); +} + /* * This can be used from process context to figure out what the value of * CR3 is without needing to do a (slow) __read_cr3(). diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 93cf99aba335b..e08b6c7d22fb6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -439,6 +439,15 @@ void vunmap_range_noflush(unsigned long start, unsigned long end) __vunmap_range_noflush(start, end); }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end) +{ + flush_tlb_kernel_range(start, end); +} +#else +#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end) +#endif + /** * vunmap_range - unmap kernel virtual addresses * @addr: start of the VM area to unmap @@ -452,7 +461,7 @@ void vunmap_range(unsigned long addr, unsigned long end) { flush_cache_vunmap(addr, end); vunmap_range_noflush(addr, end); - flush_tlb_kernel_range(addr, end); + flush_tlb_kernel_range_deferrable(addr, end); }
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, @@ -1746,7 +1755,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) list_last_entry(&local_purge_list, struct vmap_area, list)->va_end);
- flush_tlb_kernel_range(start, end); + flush_tlb_kernel_range_deferrable(start, end); resched_threshold = lazy_max_pages() << 1;
spin_lock(&free_vmap_area_lock); @@ -1849,7 +1858,7 @@ static void free_unmap_vmap_area(struct vmap_area *va) flush_cache_vunmap(va->va_start, va->va_end); vunmap_range_noflush(va->va_start, va->va_end); if (debug_pagealloc_enabled_static()) - flush_tlb_kernel_range(va->va_start, va->va_end); + flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
free_vmap_area_noflush(va); } @@ -2239,7 +2248,7 @@ static void vb_free(unsigned long addr, unsigned long size) vunmap_range_noflush(addr, addr + size);
if (debug_pagealloc_enabled_static()) - flush_tlb_kernel_range(addr, addr + size); + flush_tlb_kernel_range_deferrable(addr, addr + size);
spin_lock(&vb->lock);
@@ -2304,7 +2313,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) free_purged_blocks(&purge_list);
if (!__purge_vmap_area_lazy(start, end) && flush) - flush_tlb_kernel_range(start, end); + flush_tlb_kernel_range_deferrable(start, end); mutex_unlock(&vmap_purge_lock); }
On Jul 20, 2023, at 9:30 AM, Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
So I think there are a few assumptions here that it would be good to confirm, and to acknowledge the major one in the commit log (assuming they hold).
There is an assumption that VMAP page-tables are not freed. I actually never paid attention to that, but skimming the code it does seem so. To clarify the issue: if page-tables were freed and their pages reused, page-walk caches for instance could still be used, and “junk” entries from the reused pages would end up being consumed. See [1].
I would also assume the memory-hot-unplug of some sorts is not an issue, (i.e., you cannot have a stale TLB entry pointing to memory that was unplugged).
I also think that there might be speculative code execution using stale TLB entries that would point to memory that has been reused and perhaps controllable by the user. If somehow the CPU/OS is tricked to use the stale executable TLB entries early enough on kernel entry that might be an issue. I guess it is probably theoretical issue, but it would be helpful to confirm.
In general, deferring TLB flushes can be done safely. This patch, I think, takes it one step forward and allows the reuse of the memory before the TLB flush is actually done. This is more dangerous.
[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@gi...
On 21/07/23 18:15, Nadav Amit wrote:
On Jul 20, 2023, at 9:30 AM, Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
So I think there are a few assumptions here that it seems suitable to confirm and acknowledge the major one in the commit log (assuming they hold).
There is an assumption that VMAP page-tables are not freed. I actually never paid attention to that, but skimming the code it does seem so. To clarify the issue: if page-tables were freed and their pages were reused, there would be a problem that page-walk caches for instance would be used and “junk” entries from the reused pages would be used. See [1].
Thanks for looking into this and sharing context. This is an area I don't have much experience with, so help is much appreciated!
Indeed, accessing addresses that should be impacted by a TLB flush *before* executing the deferred flush is an issue. Deferring sync_core() for instruction patching is a similar problem - it's all in the shape of "access @addr impacted by @operation during kernel entry, before actually executing @operation".
AFAICT the only reasonable way to go about the deferral is to prove that no such access happens before the deferred @operation is done. We got to prove that for sync_core() deferral, cf. PATCH 18.
I'd like to reason about it for deferring vunmap TLB flushes:
What addresses in VMAP range, other than the stack, can early entry code access? Yes, the ranges can be checked at runtime, but is there any chance of figuring this out e.g. at build-time?
I would also assume the memory-hot-unplug of some sorts is not an issue, (i.e., you cannot have a stale TLB entry pointing to memory that was unplugged).
I also think that there might be speculative code execution using stale TLB entries that would point to memory that has been reused and perhaps controllable by the user. If somehow the CPU/OS is tricked to use the stale executable TLB entries early enough on kernel entry that might be an issue. I guess it is probably theoretical issue, but it would be helpful to confirm.
In general, deferring TLB flushes can be done safely. This patch, I think, takes it one step forward and allows the reuse of the memory before the TLB flush is actually done. This is more dangerous.
[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@gi...
On 7/24/23 04:32, Valentin Schneider wrote:
AFAICT the only reasonable way to go about the deferral is to prove that no such access happens before the deferred @operation is done. We got to prove that for sync_core() deferral, cf. PATCH 18.
I'd like to reason about it for deferring vunmap TLB flushes:
What addresses in VMAP range, other than the stack, can early entry code access? Yes, the ranges can be checked at runtime, but is there any chance of figuring this out e.g. at build-time?
Nadav was touching on a very important point: TLB flushes for addresses are relatively easy to defer. You just need to ensure that the CPU deferring the flush does an actual flush before it might architecturally consume the contents of the flushed entry.
TLB flushes for freed page tables are another game entirely. The CPU is free to cache any part of the paging hierarchy it wants at any time. It's also free to set accessed and dirty bits at any time, even for instructions that may never execute architecturally.
That basically means that if you have *ANY* freed page table page *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're screwed.
There's no reasoning about accesses or ordering. As soon as the CPU does *anything*, it's out to get you.
You're going to need to do something a lot more radical to deal with free page table pages.
On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
TLB flushes for freed page tables are another game entirely. The CPU is free to cache any part of the paging hierarchy it wants at any time. It's also free to set accessed and dirty bits at any time, even for instructions that may never execute architecturally.
That basically means that if you have *ANY* freed page table page *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're screwed.
There's no reasoning about accesses or ordering. As soon as the CPU does *anything*, it's out to get you.
You're going to need to do something a lot more radical to deal with free page table pages.
Ha! IIRC the only thing we can reasonably do there is to have strict per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is, as long as the per-cpu tables do not contain -- and have never contained -- a particular table page, we can avoid flushing it. Because if it never was there, the CPU also couldn't have speculatively loaded it.
Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd have done them ages ago) and doing them is going to be *major* surgery and pain.
Other than that, we must take the TLBI-IPI when freeing page-table-pages.
But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or at least, I couldn't find it in a hurry either), but if we're going to be doing this, then that file must include a very prominent comment explaining it must never actually do so either.
Not being able to free page-tables might be a 'problem' if we're going to be doing more of HUGE_VMALLOC, because that means it becomes rather hard to swizzle from small to large pages.
Sorry, I missed Dave's email, so now I'm taking my time to page (hah!) all of this in.
On 25/07/23 15:21, Peter Zijlstra wrote:
On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
TLB flushes for freed page tables are another game entirely. The CPU is free to cache any part of the paging hierarchy it wants at any time. It's also free to set accessed and dirty bits at any time, even for instructions that may never execute architecturally.
That basically means that if you have *ANY* freed page table page *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're screwed.
There's no reasoning about accesses or ordering. As soon as the CPU does *anything*, it's out to get you.
OK, I feel like I need to go back do some more reading now, but I think I get the difference. Thanks for spelling it out.
You're going to need to do something a lot more radical to deal with free page table pages.
Ha! IIRC the only thing we can reasonably do there is to have strict per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is, as long as the per-cpu tables do not contain -- and have never contained -- a particular table page, we can avoid flushing it. Because if it never was there, the CPU also couldn't have speculatively loaded it.
Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd have done them ages ago) and doing them is going to be *major* surgery and pain.
Other than that, we must take the TLBI-IPI when freeing page-table-pages.
But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or at least, I couldn't find it in a hurry either), but if we're going to be doing this, then that file must include a very prominent comment explaining it must never actually do so either.
I also couldn't find any freeing of the page-table pages; I'll do another pass and sharpen my quill for a big fat comment.
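(Something along the lines of the sketch below, perhaps; the wording is purely illustrative and not actual kernel text.)

/*
 * Illustrative comment only:
 *
 * The vmalloc area's page tables are never freed. This matters because
 * kernel-range TLB flushes may be deferred on NOHZ_FULL CPUs until their
 * next kernel entry: deferring leaf-entry invalidations is fine, but stale
 * paging-structure cache entries for *freed* page-table pages could be
 * speculatively walked at any time, so freeing them would require an
 * immediate flush on all CPUs.
 */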
Not being able to free page-tables might be a 'problem' if we're going to be doing more of HUGE_VMALLOC, because that means it becomes rather hard to swizzle from small to large pages.
On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
On 7/24/23 04:32, Valentin Schneider wrote:
AFAICT the only reasonable way to go about the deferral is to prove that no such access happens before the deferred @operation is done. We got to prove that for sync_core() deferral, cf. PATCH 18.
I'd like to reason about it for deferring vunmap TLB flushes:
What addresses in VMAP range, other than the stack, can early entry code access? Yes, the ranges can be checked at runtime, but is there any chance of figuring this out e.g. at build-time?
Nadav was touching on a very important point: TLB flushes for addresses are relatively easy to defer. You just need to ensure that the CPU deferring the flush does an actual flush before it might architecturally consume the contents of the flushed entry.
TLB flushes for freed page tables are another game entirely. The CPU is free to cache any part of the paging hierarchy it wants at any time.
Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes TLB (and page table caches) on user->kernel and kernel->user context switches ?
So freeing a kernel pagetable page does not require interrupting a CPU which is in userspace (therefore does not have visibility into kernel pagetables).
It's also free to set accessed and dirty bits at any time, even for instructions that may never execute architecturally.
That basically means that if you have *ANY* freed page table page *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're screwed.
There's no reasoning about accesses or ordering. As soon as the CPU does *anything*, it's out to get you.
You're going to need to do something a lot more radical to deal with free page table pages.
On 7/25/23 09:37, Marcelo Tosatti wrote:
TLB flushes for freed page tables are another game entirely. The CPU is free to cache any part of the paging hierarchy it wants at any time.
Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes TLB (and page table caches) on user->kernel and kernel->user context switches ?
Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at all on user<->kernel switches when PCIDs are enabled.
Second, even if it did, the CPU is still free to cache any portion of the paging hierarchy at any time. Without LASS[1], userspace can even _compel_ walks of the kernel portion of the address space, and we don't have any infrastructure to tell if a freed kernel page is exposed in the user copy of the page tables with PTI.
Third (also ignoring PCIDs), there are plenty of instructions between kernel entry and the MOV-to-CR3 that can flush the TLB. All of those instructions are architecturally permitted to speculatively set Accessed or Dirty bits in any part of the address space. If they run into a freed page table page, things get ugly.
These accesses are not _likely_. There probably isn't a predictor out there that's going to see a:
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
and go off trying to dirty memory in the vmalloc() area. But we'd need some backward *and* forward-looking guarantees from our intrepid CPU designers to promise that this kind of thing is safe yesterday, today and tomorrow. I suspect such a guarantee is going to be hard to obtain.
1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com