[ Added maintainers for various bits and pieces, since I spent the time trying to look at why those bits and pieces wasted stack-space and caused problems ]
On Tue, Sep 7, 2021 at 3:16 PM Linus Torvalds torvalds@linux-foundation.org wrote:
None of these seem to be new.
That said, all but one of them seem to be (a) very much worth fixing and (b) easy to fix.
The do_tcp_getsockopt() one in tpc.c is a classic case of "lots of different case statements, many of them with their own struct allocations on stack, and all of them disjoint".
So the fix for that would be the traditional one of just making helper functions for the cases that aren't entirely trivial. We've done that before, and it not only fixes stack usage problems, it results in code that is easier to read, and generally actually performs better too (exactly because it avoids sparse stacks and extra D$ use)
The pe_test_uints() one is the same horrendous nasty Kunit pattern that I fixed in commit 4b93c544e90e ("thunderbolt: test: split up test cases in tb_test_credit_alloc_all") that had an even worse case.
The KUNIT macros create all these individually reasonably small initialized structures on stack, and when you have more than a small handful of them the KUNIT infrastructure just makes the stack space explode. Sometimes the compiler will be able to re-use the stack slots, but it seems to be an iffy proposition to depend on it - it seems to be a combination of luck and various config options.
I detest code that exists for debugging or for testing, and that violates fundamental rules and causes more problems in the process.
The mac802.11 one seems to be due to 'struct ieee802_11_elems' being big, and allocated on the stack. I think it's probably made worse there with inlining, ie
- ieee80211_sta_rx_queued_mgmt() has one copy
- ieee80211_rx_mgmt_beacon() is possibly inlined, and has its own copy
but even if it isn't due to that kind of duplication due to inlining, that code is dangerous. Exactly because it has two nested stack frames with that big structure, and they are active at the same time in the callchain whether inlined or not.
And it's *pointlessly* dangerous, because the 'struct ieee802_11_elems elems' in ieee80211_sta_rx_queued_mgmt() is only used for the IEEE80211_STYPE_ACTION case, so it is entirely disjoint from the IEEE80211_STYPE_BEACON case, and those stack allocations simply should not nest like that in the first place.
Making the IEEE80211_STYPE_ACTION case be its own function - like the other cases - and moving the struct there should fix it. Possibly a "noinline" or two necessary to make sure that the compiler doesn't then undo the "these two cases are disjoint" thing.
The qede_config_rx_mode() has a fairly big 'struct qed_filter_params' structure on stack (it's mainly just a union of two other structures). That one is a bit silly, because the very same function *alsu* does a temporary allocation for the 'uc_macs[]' array, and I think it could have literally made that allocation just do both the params and the uc_macs[] array together.
But that "a bit silly" is actually *doubly* silly, because that big structure allocated for the stack that is actually a union, uses the QED_FILTER_TYPE_RX_MODE type of the union. Which in turn is literally *one*single*enum*field*.
So the qede_config_rx_mode() case uses that chunk of kernel stack for a big union for no good reason. It really only wants two words, but the way the code is written, it uses a lot, because the union also has a 'struct qed_filter_mcast_params' member that has an array of [64][ETH_ALEN] bytes in it.
So that's about 400 bytes of stack space entirely wasted if I read the code correctly.
The xhci_reserve_bandwidth() one is because it has an array of 31 'struct xhci_bw_info' on the stack. It's not a huge structure (six 32-bit words), but when you have 31 of those in an array, it's about 750 bytes right there. It should likely just be dynamically allocated - it doesn't seem to be some super-important critical thing where an allocation cannot be done.
The do_sys_poll() thing is a bit sad. The code has been tweaked to basically use 1kB of stack space in one configuration. It overflows it in a lot of other configs. Using stack space for those kinds of top-level functions that are guaranteed to have an empty stack is pretty much the best possible situation, but it's one where we don't really have a good way to try to have some kind of dynamic feedback from the compiler for how much other stack space it is using.
So that do_sys_poll() case is the only one I see where the stack usage is actually fine and explicitly expected. We *aim* for 1kB of stack, and then in some - probably quite a few - situations we go over.
There are many more of these cases. I've seen Hyper-V allocate 'struct cpumask' on the stack, which is once again an absolute no-no that people have apparently just ignored the warning for. When you have NR_CPUS being the maximum of 8k, those bits add up, and a single cpumask is 1kB in size. Which is why you should never do that on stack, and instead use '
cpumask_var_t mask; alloc_cpumask_var(&mask,..)
which will do a much more reasonable job. But the reason I call out hyperv is that as far as I know, hyperv itself doesn't actually support 8192 CPU's. So all that apic noise with 'struct cpumask' that uses 1kB of data when NR_CPUS is set to 8192 is just wasted. Maybe I'm wrong. Adding hyperv people to the cc too.
A lot of the stack frame size warnings are hidden by the fact that our default value for warning about stack usage is 2kB for 64-bit builds.
Probably exactly because people did things like that cpumask thing, and have these arrays of structures that are often even bigger in the 64-bit world.
Linus