+cc Steve
On Sun, May 25, 2025 at 06:18:42PM +0100, David Laight wrote:
On Fri, 23 May 2025 20:01:33 +0200 "Arnd Bergmann" <arnd@arndb.de> wrote:
On Fri, May 23, 2025, at 19:11, Kent Overstreet wrote:
On Fri, May 23, 2025 at 05:17:15PM +0200, Arnd Bergmann wrote:
- KASAN_STACK adds extra redzones for each variable
- KASAN_STACK further prevents stack slots from getting reused inside one function, in order to better pinpoint which instance caused problems like out-of-scope access
- passing structures by value causes them to be put on the stack on some architectures, even when the structure size is only one or two registers
We mainly do this with bkey_s_c, which is just two words: on x86_64, that gets passed in registers. Is riscv different?
Not sure, I think it's mostly older ABIs that are limited, either not passing structures in registers at all, or only possibly one but not two of them.
- sanitizers turn off optimizations that lead to better stack usage
- in some cases, the missed optimization ends up causing local variables to get spilled to the stack many times because of a combination of all the above.
Yeesh.
I suspect we should be running with a larger stack when the sanitizers are running, and perhaps tweak the warnings accordingly. I did a bunch of stack usage work after I found a kmsan build was blowing out the stack, but then running with max stack usage tracing enabled showed it to be a largely non-issue on non-sanitizer builds, IIRC.
Enabling KASAN does double the available stack space. However, I don't think we should use that as an excuse to raise the per-function warning limit, because
- the majority of all function stacks do not grow that much when sanitizers are enabled
- allmodconfig enables KASAN and should still catch mistakes where a driver accidentally puts a large structure on the stack
That is rather annoying when you want to look at the generated code :-(
- 2KB on 64-bit targets is a really large limit. At some point in the past I had a series that lowered the limit to 1536 bytes for 64-bit targets, but I never managed to get all the changes merged.
I've a cunning plan to do a proper static analysis of stack usage. It is a 'simple' matter of getting objtool to output all calls with the stack offset. Indirect calls need the function hashes from fine-ibt, but also need clang to support 'hash seeds' to disambiguate all the void (*)(void *) functions. That'll first barf at all recursion, and then, I expect, show a massive stack use inside snprintf() in some error path.
I suspect recursion will make the results you get with that approach useless.
We already have "trace max stack", but that only checks at process exit, so it doesn't tell you much.
We could do better with tracing - just inject a trampoline that checks the current stack usage against the maximum stack usage we've seen, and emits a trace event with a stack trace if it's greater.
(and now Steve's going to tell us he's already done this :)
On Sun, 25 May 2025 13:36:16 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
+cc Steve
...
I've a cunning plan to do a proper static analysis of stack usage. It is a 'simple' matter of getting objtool to output all calls with the stack offset. Indirect calls need the function hashes from fine-ibt, but also need clang to support 'hash seeds' to disambiguate all the void (*)(void *) functions. That'll first barf at all recursion, and then, I expect, show a massive stack use inside snprintf() in some error path.
I suspect recursion will make the results you get with that approach useless.
Recursion is an issue, but the kernel really doesn't support recursion. So you actually want to know the possible recursion loops anyway. I suspect (hope) most will be the 'recurses only once' type. If not they need some other bound.
We already have "trace max stack", but that only checks at process exit, so it doesn't tell you much.
We could do better with tracing - just inject a trampoline that checks the current stack usage against the maximum stack usage we've seen, and emits a trace event with a stack trace if it's greater.
Both of those only tell you the stack you've used. The static analysis will show you the stack you 'might use'. Which is really much more important.
I did this for an embedded system a long time ago. The outcome was that we didn't have enough memory to allocate the 'worst case' stacks!
David
On Sun, May 25, 2025 at 06:47:57PM +0100, David Laight wrote:
On Sun, 25 May 2025 13:36:16 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
+cc Steve
...
I've a cunning plan to do a proper static analysis of stack usage. It is a 'simple' matter of getting objtool to output all calls with the stack offset. Indirect calls need the function hashes from fine-ibt, but also need clang to support 'hash seeds' to disambiguate all the void (*)(void *) functions. That'll first barf at all recursion, and then, I expect, show a massive stack use inside snprintf() in some error path.
I suspect recursion will make the results you get with that approach useless.
Recursion is an issue, but the kernel really doesn't support recursion. So you actually want to know the possible recursion loops anyway. I suspect (hope) most will be the 'recurses only once' type. If not they need some other bound.
Recursion is a fact of life when you get different subsystems interacting in unpredictable ways.
You can be in one filesystem, and then end up in a fault handler (gup(), or a simple copy to/from user), and then end up in a completely different filesystem - and then you call into the block layer, or networking if it's NFS.
Static analysis might get you some useful data within a subsystem, but it won't tell you much about the kernel as a whole as people are actually running it.
On Sun, 25 May 2025 13:36:16 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
We already have "trace max stack", but that only checks at process exit, so it doesn't tell you much.
Nope, it traces the stack at every function call, but it misses the leaf functions and also doesn't check interrupts as they may use a different stack.
We could do better with tracing - just inject a trampoline that checks the current stack usage against the maximum stack usage we've seen, and emits a trace event with a stack trace if it's greater.
(and now Steve's going to tell us he's already done this :)
Close ;-)
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
Wait.
# cat /sys/kernel/tracing/stack_trace
        Depth    Size   Location    (33 entries)
        -----    ----   --------
  0)     8360      48   __msecs_to_jiffies+0x9/0x30
  1)     8312     104   update_group_capacity+0x95/0x970
  2)     8208     520   update_sd_lb_stats.constprop.0+0x278/0x2f40
  3)     7688     416   sched_balance_find_src_group+0x96/0xe30
  4)     7272     512   sched_balance_rq+0x53f/0x2fe0
  5)     6760     344   sched_balance_newidle+0x6c1/0x1310
  6)     6416      80   pick_next_task_fair+0x55/0xe60
  7)     6336     328   __schedule+0x8a5/0x33d0
  8)     6008      32   schedule+0xe2/0x3b0
  9)     5976      32   io_schedule+0x8f/0xf0
 10)     5944     264   rq_qos_wait+0x12a/0x200
 11)     5680     144   wbt_wait+0x159/0x260
 12)     5536      40   __rq_qos_throttle+0x50/0x90
 13)     5496     320   blk_mq_submit_bio+0x70b/0x1ff0
 14)     5176     240   __submit_bio+0x1b3/0x600
 15)     4936     248   submit_bio_noacct_nocheck+0x546/0xca0
 16)     4688     144   ext4_bio_write_folio+0x69d/0x1870
 17)     4544      64   mpage_submit_folio+0x14c/0x2b0
 18)     4480      96   mpage_process_page_bufs+0x392/0x7a0
 19)     4384     632   mpage_prepare_extent_to_map+0xa5b/0x1080
 20)     3752     496   ext4_do_writepages+0x8af/0x2ee0
 21)     3256     304   ext4_writepages+0x26f/0x5c0
 22)     2952     344   do_writepages+0x183/0x7c0
 23)     2608     152   __writeback_single_inode+0x114/0xb00
 24)     2456     744   writeback_sb_inodes+0x52b/0xdf0
 25)     1712     168   __writeback_inodes_wb+0xf4/0x270
 26)     1544     312   wb_writeback+0x547/0x800
 27)     1232     328   wb_workfn+0x7b1/0xbc0
 28)      904     352   process_one_work+0x85a/0x1450
 29)      552     176   worker_thread+0x5b7/0xf80
 30)      376     168   kthread+0x371/0x720
 31)      208      32   ret_from_fork+0x34/0x70
 32)      176     176   ret_from_fork_asm+0x1a/0x30
The code that does this is in kernel/trace/trace_stack.c
It simply attaches to the function tracer and at every function checks the current stack size.
Hmm, I need to update this because today we even pass the stack pointer via the ftrace_regs if the arch supports it. Using that would allow me to get rid of the hack:
static void check_stack(unsigned long ip, unsigned long *stack)
{
[..]
	this_size = ((unsigned long)stack) & (THREAD_SIZE-1);
	this_size = THREAD_SIZE - this_size;

static void stack_trace_call(unsigned long ip, unsigned long parent_ip,
			     struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	unsigned long stack;
[..]
	check_stack(ip, &stack);
-- Steve
On Sun, May 25, 2025 at 03:25:02PM -0400, Steven Rostedt wrote:
On Sun, 25 May 2025 13:36:16 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
We already have "trace max stack", but that only checks at process exit, so it doesn't tell you much.
Nope, it traces the stack at every function call, but it misses the leaf functions and also doesn't check interrupts as they may use a different stack.
I was thinking of DEBUG_STACK_USAGE :)
We could do better with tracing - just inject a trampoline that checks the current stack usage against the maximum stack usage we've seen, and emits a trace event with a stack trace if it's greater.
(and now Steve's going to tell us he's already done this :)
Close ;-)
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
Wait.
# cat /sys/kernel/tracing/stack_trace
        Depth    Size   Location    (33 entries)
        -----    ----   --------
  0)     8360      48   __msecs_to_jiffies+0x9/0x30
  1)     8312     104   update_group_capacity+0x95/0x970
  2)     8208     520   update_sd_lb_stats.constprop.0+0x278/0x2f40
  3)     7688     416   sched_balance_find_src_group+0x96/0xe30
  4)     7272     512   sched_balance_rq+0x53f/0x2fe0
  5)     6760     344   sched_balance_newidle+0x6c1/0x1310
  6)     6416      80   pick_next_task_fair+0x55/0xe60
  7)     6336     328   __schedule+0x8a5/0x33d0
  8)     6008      32   schedule+0xe2/0x3b0
  9)     5976      32   io_schedule+0x8f/0xf0
 10)     5944     264   rq_qos_wait+0x12a/0x200
 11)     5680     144   wbt_wait+0x159/0x260
 12)     5536      40   __rq_qos_throttle+0x50/0x90
 13)     5496     320   blk_mq_submit_bio+0x70b/0x1ff0
 14)     5176     240   __submit_bio+0x1b3/0x600
 15)     4936     248   submit_bio_noacct_nocheck+0x546/0xca0
 16)     4688     144   ext4_bio_write_folio+0x69d/0x1870
 17)     4544      64   mpage_submit_folio+0x14c/0x2b0
 18)     4480      96   mpage_process_page_bufs+0x392/0x7a0
 19)     4384     632   mpage_prepare_extent_to_map+0xa5b/0x1080
 20)     3752     496   ext4_do_writepages+0x8af/0x2ee0
 21)     3256     304   ext4_writepages+0x26f/0x5c0
 22)     2952     344   do_writepages+0x183/0x7c0
 23)     2608     152   __writeback_single_inode+0x114/0xb00
 24)     2456     744   writeback_sb_inodes+0x52b/0xdf0
 25)     1712     168   __writeback_inodes_wb+0xf4/0x270
 26)     1544     312   wb_writeback+0x547/0x800
 27)     1232     328   wb_workfn+0x7b1/0xbc0
 28)      904     352   process_one_work+0x85a/0x1450
 29)      552     176   worker_thread+0x5b7/0xf80
 30)      376     168   kthread+0x371/0x720
 31)      208      32   ret_from_fork+0x34/0x70
 32)      176     176   ret_from_fork_asm+0x1a/0x30
Nice! This is exactly what I was looking for :)
        Depth    Size   Location    (48 entries)
        -----    ----   --------
  0)     7728      48   __update_load_avg_se+0x9/0x440
  1)     7680      80   update_load_avg+0x25f/0x2b0
  2)     7600      56   set_next_task_fair+0x232/0x290
  3)     7544      48   pick_next_task_fair+0xcf/0x1a0
  4)     7496     120   __schedule+0x284/0xe80
  5)     7376      16   preempt_schedule_irq+0x33/0x50
  6)     7360     136   asm_common_interrupt+0x26/0x40
  7)     7224      48   get_symbol_offset+0x43/0x70
  8)     7176      56   kallsyms_lookup_buildid+0x55/0xf0
  9)     7120      88   __sprint_symbol.isra.0+0x48/0xf0
 10)     7032     720   symbol_string+0xf1/0x120
 11)     6312     120   vsnprintf+0x3dc/0x5d0
 12)     6192     128   bch2_prt_printf+0x57/0x140
 13)     6064      64   bch2_prt_task_backtrace+0x71/0xc0
 14)     6000      40   print_cycle+0x71/0xa0
 15)     5960     104   trace_would_deadlock+0xb6/0x150
 16)     5856     128   break_cycle+0xfe/0x260
 17)     5728     368   bch2_check_for_deadlock+0x35f/0x5f0
 18)     5360      96   six_lock_slowpath.isra.0+0x204/0x4c0
 19)     5264      96   __bch2_btree_node_get+0x384/0x5b0
 20)     5168     336   bch2_btree_path_traverse_one+0x7a5/0xd60
 21)     4832     232   bch2_btree_iter_peek_slot+0x104/0x7f0
 22)     4600     216   btree_key_cache_fill+0xcf/0x1a0
 23)     4384      72   bch2_btree_path_traverse_cached+0x2bd/0x310
 24)     4312     336   bch2_btree_path_traverse_one+0x705/0xd60
 25)     3976     232   bch2_btree_iter_peek_slot+0x104/0x7f0
 26)     3744     424   bch2_check_discard_freespace_key+0x172/0x5e0
 27)     3320     224   bch2_bucket_alloc_freelist+0x422/0x610
 28)     3096      88   bch2_bucket_alloc_trans+0x1f3/0x3a0
 29)     3008     168   bch2_bucket_alloc_set_trans+0xf1/0x360
 30)     2840     184   __open_bucket_add_buckets+0x40b/0x660
 31)     2656      40   open_bucket_add_buckets+0x72/0xf0
 32)     2616     280   bch2_alloc_sectors_start_trans+0x76d/0xd00
 33)     2336     424   __bch2_write+0x1d1/0x11d0
 34)     1912     168   __bch2_writepage+0x3b2/0x790
 35)     1744      72   write_cache_pages+0x5c/0xa0
 36)     1672     176   bch2_writepages+0x67/0xc0
 37)     1496     184   do_writepages+0xcc/0x240
 38)     1312      64   __writeback_single_inode+0x41/0x320
 39)     1248     456   writeback_sb_inodes+0x216/0x4e0
 40)      792      64   __writeback_inodes_wb+0x4c/0xe0
 41)      728     168   wb_writeback+0x19c/0x310
 42)      560     136   wb_workfn+0x2a4/0x400
 43)      424      64   process_one_work+0x18c/0x330
 44)      360      72   worker_thread+0x252/0x3a0
 45)      288      80   kthread+0xf9/0x210
 46)      208      32   ret_from_fork+0x31/0x50
 47)      176     176   ret_from_fork_asm+0x11/0x20