On 2024/9/10 19:44, Thomas Gleixner wrote:
On Tue, Sep 10 2024 at 12:00, Leizhen wrote:
On 2024/9/10 2:41, Thomas Gleixner wrote:
All related functions have this problem, and all of this code is very strict about boundaries. Instead of accurately doing the refill, purge, etc., we should look into proper batch mode mechanisms. Let me think about it.
It may be helpful to add several arrays to record the first node of each batch in each free list. Take 'percpu_pool' as an example:
struct debug_percpu_free {
	struct hlist_head	free_objs;
	int			obj_free;
	int			batch_idx;		/* new */
	struct hlist_node	*batch_first[4];	/* new: ODEBUG_POOL_PERCPU_SIZE / ODEBUG_BATCH_SIZE */
};
A new free node is added at the head of the list, and a batch is cut from the tail of the list:

	NodeA<-->...<-->NodeB<-->...<-->NodeC<-->NodeD<--> free_objs
	|---one batch---|---one batch---|
	|               |
	batch_first[0]  batch_first[1]
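A minimal sketch of how that bookkeeping could be used to split one batch off the tail, assuming batch_first[batch_idx] points at the first node of the tail-most complete batch as seen when walking from free_objs; the helper name and the index rotation are invented for illustration:

	static struct hlist_node *pcpu_cut_tail_batch(struct debug_percpu_free *pcp)
	{
		struct hlist_node *first;

		if (pcp->obj_free < ODEBUG_BATCH_SIZE)
			return NULL;

		/* Assumed: the oldest complete batch starts here (walking from free_objs) */
		first = pcp->batch_first[pcp->batch_idx];

		/* Truncate the list in front of the batch; 'first' now heads the detached batch */
		*first->pprev = NULL;
		first->pprev = NULL;

		pcp->obj_free -= ODEBUG_BATCH_SIZE;
		pcp->batch_idx = (pcp->batch_idx + 1) % ARRAY_SIZE(pcp->batch_first);
		return first;
	}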
The current data structures are not fit for the purpose. Gluing workarounds into the existing mess just makes it worse.
So the data structures need to be redesigned from the ground up to be fit for the purpose.
allocation:
Using the global pool for single object allocations is wrong
During boot this can be a completely disconnected list, which does not need any accounting, does not need pool_lock, and can just be protected with irqsave like the per CPU pools. It's effectively a per CPU pool because at that point there is only one CPU and everything is single threaded.
The per CPU pool handling is backwards
If the per CPU pool is empty, then the pool needs to be refilled with a batch from the global pool and allocated from there.
Allocation then always happens from the active per CPU batch slot.
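A rough sketch of that allocation side, assuming a slot-based per CPU pool; the names pcpu_pool, pcpu_alloc(), refill_pcpu_from_global() and ODEBUG_PCPU_SLOTS are illustrative, not taken from the actual patch:

	#define ODEBUG_PCPU_SLOTS	4	/* assumed slot count */

	struct pcpu_pool {
		unsigned int		active;				/* active batch slot */
		unsigned int		cnt[ODEBUG_PCPU_SLOTS];		/* objects per slot */
		struct hlist_head	slot[ODEBUG_PCPU_SLOTS];
	};

	/* Invented helper: splices one batch from the global pool into the active slot */
	static bool refill_pcpu_from_global(struct pcpu_pool *pcp);

	static struct debug_obj *pcpu_alloc(struct pcpu_pool *pcp)
	{
		struct hlist_head *head;
		struct debug_obj *obj;

		lockdep_assert_irqs_disabled();

		/* Empty active slot? Refill it with one batch from the global pool. */
		if (!pcp->cnt[pcp->active] && !refill_pcpu_from_global(pcp))
			return NULL;

		/* Allocation always happens from the active batch slot */
		head = &pcp->slot[pcp->active];
		obj = hlist_entry(head->first, struct debug_obj, node);
		hlist_del(&obj->node);
		pcp->cnt[pcp->active]--;
		return obj;
	}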
free:
Early boot
Just put it back on the dedicated boot list and be done
After obj_cache is initialized
Put it back into the per CPU pool, into the active batch slot. If the slot becomes full, make the next slot the active slot. If the full slot was the topmost slot, then move that slot either into the global pool when there is a free slot, or move it to the to_free pool (sketched below).
That means the per CPU pool is different from the global pools as it can allocate/free single objects, while the global pools are strictly stacks of batches. Any movement between per CPU pools and global pools is batch based and just moves lists from one head to another.
That minimizes the pool lock contention and the cache footprint. The global to_free pool must have an extra twist to accommodate non-batch-sized drops and to handle the case where all slots are full, but that's just a trivial detail.
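Putting the free rules above together, the per CPU free side might look roughly like this; pcpu_free() and move_full_batch_to_global() reuse the invented pcpu_pool layout from the allocation sketch and are not code from the patch:

	/* Invented helper: splices one full batch into the global or to_free pool */
	static void move_full_batch_to_global(struct hlist_head *batch);

	static void pcpu_free(struct pcpu_pool *pcp, struct debug_obj *obj)
	{
		unsigned int a = pcp->active;

		/* Free into the active batch slot */
		hlist_add_head(&obj->node, &pcp->slot[a]);
		if (++pcp->cnt[a] < ODEBUG_BATCH_SIZE)
			return;

		/* Slot is full. If it is not the topmost slot, just advance. */
		if (a + 1 < ODEBUG_PCPU_SLOTS) {
			pcp->active = a + 1;
			return;
		}

		/*
		 * The topmost slot filled up: hand the complete batch to the
		 * global pool (or to the to_free pool when the global pool has
		 * no free slot). The move is a list splice from one hlist head
		 * to another, so pool_lock is taken at most once per batch.
		 */
		move_full_batch_to_global(&pcp->slot[a]);
		INIT_HLIST_HEAD(&pcp->slot[a]);
		pcp->cnt[a] = 0;
	}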
That's great. I really admire you for completing the refactor in such a short time. But I have a few minor comments.

1. When kmem_cache_zalloc() is called to allocate objs for filling, if less than one batch of objs is allocated, all of them can be pushed to the local CPU; that is, call pcpu_free() one by one. In this way, the number of free objs cached by pool_global and pool_to_free is always an integer multiple of ODEBUG_BATCH_SIZE.

2. Member tot_cnt of struct global_pool can be deleted. We can get it simply and quickly through (slot_idx * ODEBUG_BATCH_SIZE), which avoids redundant maintenance (see the sketch below).

3. debug_objects_pool_min_level also needs to be adjusted accordingly, to the number of batches of the min level.
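For point 2, the derivation could be as trivial as this (the helper name is assumed, not from the patch):

	/* Whole batches only, so the object count follows from the slot index */
	static inline unsigned int global_pool_objcnt(const struct global_pool *gp)
	{
		return gp->slot_idx * ODEBUG_BATCH_SIZE;
	}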
See the completely untested combo patch against tip core/debugobjects below.
Thanks,
tglx