On 8/11/2025 9:55 PM, Dave Hansen wrote:
On 8/11/25 02:15, Uladzislau Rezki wrote:
kernel_pte_work.list is a global shared variable, so it forces the producer pte_free_kernel() and the consumer kernel_pte_work_func() to run serialized. On a large system, I don't think you designed this deliberately 🙂
Sorry for jumping in.
Agree, unless it is never considered a hot path or something that can be really contended. It looks like you can just use a per-cpu llist to drain things.
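For illustration only, the per-cpu llist idea can be sketched in userspace C11. This is not the patch under discussion and not the kernel's llist API; it only mimics the push/take-all pattern. All names here (NR_SLOTS, struct lhead, list_push(), list_take_all(), defer_free(), drain_all()) are hypothetical, and a fixed array of slots stands in for real per-CPU variables.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SLOTS 4 /* hypothetical stand-in for the number of CPUs */

struct node {
    struct node *next;
};

struct lhead {
    _Atomic(struct node *) first;
};

/* One list head per slot; in the kernel this would be a per-CPU variable. */
static struct lhead pending[NR_SLOTS];

/* Lock-free push: retry the compare-and-swap until it succeeds. */
static void list_push(struct lhead *h, struct node *n)
{
    struct node *old = atomic_load_explicit(&h->first, memory_order_relaxed);

    do {
        n->next = old; /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Detach the whole list with a single atomic exchange. */
static struct node *list_take_all(struct lhead *h)
{
    return atomic_exchange_explicit(&h->first, (struct node *)NULL,
                                    memory_order_acquire);
}

/* Producer side: queue onto the caller's own slot, so producers running
 * on different slots never touch the same list head. */
static void defer_free(unsigned int slot, struct node *n)
{
    list_push(&pending[slot % NR_SLOTS], n);
}

/* Consumer side: drain every slot and hand each node to 'release'
 * (may be NULL). Returns the number of nodes drained. */
static size_t drain_all(void (*release)(struct node *))
{
    size_t count = 0;

    for (unsigned int i = 0; i < NR_SLOTS; i++) {
        struct node *n = list_take_all(&pending[i]);

        while (n) {
            struct node *next = n->next; /* read before releasing n */

            if (release)
                release(n);
            count++;
            n = next;
        }
    }
    return count;
}
```

In this sketch a producer only touches its own slot's head, and the consumer takes each list in one exchange, so no lock is shared between them. Whether that is worth the extra code compared with a single global list is exactly the trade-off being argued in this thread.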
Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush.
It can easily be identified as a bottleneck by multi-CPU stress tests that create and destroy processes frequently, similar to a heavily loaded multi-process Apache web server. Hot or cold path?
If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU.
Perhaps not "WE": an IPI-driven TLB flush does not seem to be a mechanism shared by all CPUs, at least not on ARM as far as I know.
So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it.
But this is _not_ the place to add complexity to get scalability.
At least, please don't add a bottleneck. How complex would it be to avoid one?
Thanks, Ethan