Hello Glauber,
On Fri, Apr 20, 2012 at 06:57:08PM -0300, Glauber Costa wrote:
This is my current attempt at getting the kmem controller into a mergeable state. IMHO, all the important bits are there, and it shouldn't change *that* much from now on. I am, however, expecting at least a couple more iterations before we sort all the edges out.
This series works with both slub and slab. One of my main goals was to make sure that the interfaces we are creating actually make sense for both allocators.
I did some adaptations to the slab-specific patches, but the bulk of it comes from Suleiman's patches. I did my best to use his patches as-is where possible, so as to preserve authorship information. When that was not possible, I tried to be fair and credit him in the commit message.
In this series, all existing caches get a per-memcg copy lazily, on their first hit from that memcg. The main reason is that, during discussions at the memory summit, we agreed that the fragmentation problems that could arise from creating all of them up front are mitigated by the typically small number of caches in the system (on the order of a few megabytes total for sparsely used caches). Suleiman's lazy creation is kept, although a bit modified. For instance, I now use a locking scheme instead of cmpxchg to make sure cache creation won't fail due to duplicates, which simplifies things quite a bit.
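Roughly, the locking scheme looks like this (a sketch only; the memcg->slabs lookup and the mutex name are illustrative, not necessarily what ends up in the patches):

/*
 * Sketch: the mutex makes the lookup-and-create step atomic, so two
 * concurrent allocations cannot both create a per-memcg copy of the
 * same cache, and creation can never fail due to a duplicate name.
 */
static DEFINE_MUTEX(memcg_cache_mutex);

static struct kmem_cache *memcg_get_or_create_cache(struct mem_cgroup *memcg,
						    struct kmem_cache *cachep)
{
	struct kmem_cache *new;

	mutex_lock(&memcg_cache_mutex);
	new = memcg->slabs[cachep->memcg_params.id];	/* assumed index */
	if (!new)
		new = kmem_cache_dup(memcg, cachep);	/* create + index */
	mutex_unlock(&memcg_cache_mutex);

	/* On failure, fall back to the global (root) cache. */
	return new ? new : cachep;
}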
The slub side is a bit more complex than what I came up with in my slub-only series. The reason is that we previously did not need the cache-selection logic in the allocator itself - it was done by the cache users. But since we now create all caches lazily, that is simply no longer doable.
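The selection point in the allocation path then ends up looking roughly like this (again a sketch; the interrupt/kthread checks are assumptions about where selection must be skipped):

/*
 * Sketch: every allocation entry point must first translate the global
 * cache into the current task's per-memcg copy.
 */
static inline struct kmem_cache *memcg_cache_select(struct kmem_cache *cachep,
						    gfp_t gfp)
{
	/* No task context (irq, kthread): account to the global cache. */
	if (in_interrupt() || !current->mm)
		return cachep;

	return mem_cgroup_get_kmem_cache(cachep, gfp);
}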
I am leaving destruction of caches out of this series, although most of the infrastructure for it is here, since we did it in earlier series. This is basically because Kame is currently reworking destruction for user memcg, and I like the newly proposed behavior a lot more. We all seem to have agreed that reclaim is an interesting problem in itself, so it is not included in this already too complicated series. Please note that this is still marked as experimental, so we have some room. A proper shrinker implementation is a hard requirement for taking the kmem controller out of the experimental state.
I am also not including documentation, but it should only be a matter of merging what we already wrote in earlier series, plus some additions.
The patches look great, thanks a lot for your work!
I finally tried them, and after a few fixes the kmem accounting seems to work fine with slab. The fixes will follow this email, and if they're fine, feel free to fold them into your patches.
However, with slub I'm getting kernel hangs and various traces[1]. It seems that the kernel recurses when trying to call memcg_create_cache_enqueue(): it calls kmalloc_no_account(), which was introduced precisely so as not to recurse into memcg, but looking at the 'slub: provide kmalloc_no_account' patch, I don't see any difference between the _no_account variant and ordinary kmalloc(). Hm.
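For the recursion to stop, I would expect the slub variant to bypass the memcg hook entirely and go straight to the low-level allocator, along these lines (just a sketch of what I expected to find; get_slab()/slab_alloc() are the mm/slub.c internals as I read them, and this is not a tested patch):

/* In mm/slub.c -- sketch, not a tested patch */
void *kmalloc_no_account(size_t size, gfp_t flags)
{
	struct kmem_cache *s = get_slab(size, flags);

	if (unlikely(ZERO_OR_NULL_PTR(s)))
		return s;

	/*
	 * Call the low-level allocator directly, skipping the memcg
	 * cache-selection hook, so memcg_create_cache_enqueue() cannot
	 * end up re-entering itself.
	 */
	return slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_);
}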
OK, slub aside... the accounting works with slab, which is great.
There's another, more generic question: is there any particular reason why you don't want to account slab memory for the root cgroup?
Personally, I'm interested in kmem accounting because I use memcg for low-memory notifications. I install events on the root's memory.usage_in_bytes, and the threshold values are calculated like this:
total_ram - wanted_threshold
So, if we want a notification when there are 64 MB of memory left on a 256 MB machine, we'd install an event at the 192 MB mark (the good thing about usage_in_bytes is that it does account file caches, so the formula stays simple).
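For reference, this is roughly how I install such an event (userspace, via the standard memcg eventfd interface; the /sys/fs/cgroup/memory mount point is an assumption, and error handling is omitted):

#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* 256 MB machine, notify when 64 MB are left: 192 MB threshold. */
	uint64_t threshold = (256ULL - 64ULL) << 20;
	uint64_t dummy;
	char buf[64];

	int efd = eventfd(0, 0);
	int ufd = open("/sys/fs/cgroup/memory/memory.usage_in_bytes", O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);

	/* Format: "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
	snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd,
		 (unsigned long long)threshold);
	write(cfd, buf, strlen(buf));

	read(efd, &dummy, sizeof(dummy));	/* blocks until crossing */
	printf("low-memory threshold crossed\n");
	return 0;
}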
Obviously, without kmem accounting the formula can be very imprecise when the kernel itself (e.g. hardware drivers) starts using a lot of memory. With root slab accounting the problem would be solved, but for some reason you deliberately do not want to account it for the root cgroup. I suspect there are some performance concerns?..
Thanks,
[1]
BUG: unable to handle kernel paging request at ffffffffb2e80900
IP: [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
PGD 160d067 PUD 1611063 PMD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
CPU 0
Pid: 943, comm: bash Not tainted 3.4.0-rc4+ #34 Bochs Bochs
RIP: 0010:[<ffffffff8105940c>]  [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
RSP: 0018:ffff880006305ee8  EFLAGS: 00010006
RAX: 00000000000109c0 RBX: ffff8800071b4e20 RCX: ffff880006306000
RDX: 0000000000000000 RSI: 0000000006306028 RDI: ffff880007c109c0
RBP: ffff880006305f28 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880007c109c0
R13: ffff88000644ddc0 R14: ffff8800071b4e68 R15: 0000000000000000
FS:  00007fad1244c700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffb2e80900 CR3: 00000000063b8000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 943, threadinfo ffff880006306000, task ffff88000644ddc0)
Stack:
 0000000000000000 ffff88000644de08 ffff880007c109c0 ffff880007c109c0
 ffff8800071b4e20 0000000000000000 0000000000000000 0000000000000000
 ffff880006305f48 ffffffff81053304 ffff880007c109c0 ffff880007c109c0
Call Trace:
Code: 76 48 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 18 4c 8b af e0 07 00 00 49 8d 4d 48 48 89 4d c8 49 8b 4d 08 4c 3b 75 c8 8b 71 18 <48> 8b 34 f5 c0 07 65 81 48 8b bc 30 a8 00 00 00 8b 35 3a 3f 5c
RIP  [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
 RSP <ffff880006305ee8>
CR2: ffffffffb2e80900
---[ end trace 78fa9c86bebb1214 ]---
OFF_SLAB is not a CREATE_MASK bit, so we should clear it before calling __kmem_cache_create(), otherwise the kernel gets very upset; see below.
As a side effect, we now let slab reevaluate the off-slab decision, but the decision will be the same, because whether we use off-slab management depends only on the size and the CREATE_MASK bits.
------------[ cut here ]------------
kernel BUG at mm/slab.c:2376!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 14, comm: kworker/0:1 Not tainted 3.4.0-rc4+ #32 Bochs Bochs
RIP: 0010:[<ffffffff810c1839>]  [<ffffffff810c1839>] __kmem_cache_create+0x609/0x650
RSP: 0018:ffff8800072c9c90  EFLAGS: 00010286
RAX: 0000000000000800 RBX: ffffffff81f26bf8 RCX: 000000000000000b
RDX: 000000000000000c RSI: 000000000000000b RDI: ffff8800065c66f8
RBP: ffff8800072c9d40 R08: ffffffff80002800 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800072c8000
R13: ffff8800072c9fd8 R14: ffffffffffffffff R15: ffff8800072c9d0c
FS:  00007f45eb0f2700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffff600400 CR3: 000000000650e000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 14, threadinfo ffff8800072c8000, task ffff88000725d100)
Stack:
 ffff8800072c9cb0 0000000000000000 ffffc9000000c000 ffffffff81621e80
 ffff8800072c9cc0 ffffffff81621e80 ffff8800072c9d40 ffffffff81355cbf
 ffffffff810c1944 0000000000000000 ffffffff81621ec0 ffffffff80002800
Call Trace:
 [<ffffffff81355cbf>] ? mutex_lock_nested+0x26f/0x340
 [<ffffffff810c1944>] ? kmem_cache_dup+0x44/0x110
 [<ffffffff810c2aa0>] ? memcg_create_kmem_cache+0xd0/0xd0
 [<ffffffff810c196b>] kmem_cache_dup+0x6b/0x110
 [<ffffffff810c2a73>] memcg_create_kmem_cache+0xa3/0xd0
 [<ffffffff810c2b1a>] memcg_create_cache_work_func+0x7a/0xe0
 [<ffffffff810405d4>] process_one_work+0x174/0x450
 [<ffffffff81040576>] ? process_one_work+0x116/0x450
 [<ffffffff81040e53>] worker_thread+0x123/0x2d0
 [<ffffffff81040d30>] ? manage_workers.isra.27+0x120/0x120
 [<ffffffff8104639e>] kthread+0x8e/0xa0
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 mm/slab.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/slab.c b/mm/slab.c
index eed72ac..dff87ef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2619,6 +2619,13 @@ kmem_cache_dup(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 		return NULL;
 
 	flags = cachep->flags & ~SLAB_PANIC;
+	/*
+	 * OFF_SLAB is not a CREATE_MASK bit, so we should clear it before
+	 * calling __kmem_cache_create(). As a side effect, we let slab
+	 * reevaluate the off-slab decision; but that is OK, as the bit
+	 * is automatically set depending on the size and other flags.
+	 */
+	flags &= ~CFLGS_OFF_SLAB;
 	mutex_lock(&cache_chain_mutex);
 	new = __kmem_cache_create(memcg, name, obj_size(cachep),
 				  cachep->memcg_params.orig_align, flags, cachep->ctor);
I'm not sure why the code tries to unlock RCU. The only case where slab grabs the lock is around the mem_cgroup_get_kmem_cache() call, which won't result in a call to cache_grow() (which is what tries to unlock RCU).
=====================================
[ BUG: bad unlock balance detected! ]
3.4.0-rc4+ #33 Not tainted
-------------------------------------
swapper/0/0 is trying to release lock (rcu_read_lock) at:
[<ffffffff8134f0b4>] cache_grow.constprop.63+0xe8/0x371
but there are no more locks to release!

other info that might help us debug this:
no locks held by swapper/0/0.

stack backtrace:
Pid: 0, comm: swapper/0 Not tainted 3.4.0-rc4+ #33
Call Trace:
 [<ffffffff8134f0b4>] ? cache_grow.constprop.63+0xe8/0x371
 [<ffffffff8134cf09>] print_unlock_inbalance_bug.part.26+0xd1/0xd9
 [<ffffffff8134f0b4>] ? cache_grow.constprop.63+0xe8/0x371
 [<ffffffff8106865e>] print_unlock_inbalance_bug+0x4e/0x50
 [<ffffffff8134f0b4>] ? cache_grow.constprop.63+0xe8/0x371
 [<ffffffff8106cb26>] __lock_release+0xd6/0xe0
 [<ffffffff8106cb8c>] lock_release+0x5c/0x80
 [<ffffffff8134f0cc>] cache_grow.constprop.63+0x100/0x371
 [<ffffffff8134f5c6>] cache_alloc_refill+0x289/0x2dc
 [<ffffffff810bf682>] ? kmem_cache_alloc+0x92/0x260
 [<ffffffff81676a0f>] ? pidmap_init+0x79/0xb2
 [<ffffffff810bf842>] kmem_cache_alloc+0x252/0x260
 [<ffffffff810bf5f0>] ? kmem_freepages+0x180/0x180
 [<ffffffff81676a0f>] pidmap_init+0x79/0xb2
 [<ffffffff81667aa3>] start_kernel+0x297/0x2f8
 [<ffffffff8166769e>] ? repair_env_string+0x5a/0x5a
 [<ffffffff816672fd>] x86_64_start_reservations+0x101/0x105
 [<ffffffff816673f1>] x86_64_start_kernel+0xf0/0xf7
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 include/linux/slab_def.h | 2 --
 1 file changed, 2 deletions(-)
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index c4f7e45..2d371ae 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -245,13 +245,11 @@ mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
 	 * enabled.
 	 */
 	kmem_cache_get_ref(cachep);
-	rcu_read_unlock();
 }
 
 static inline void
 mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
 {
-	rcu_read_lock();
 	kmem_cache_drop_ref(cachep);
 }
 
The mem_cgroup_put_kmem_cache() function is no longer used, so it can be safely removed.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 include/linux/slab_def.h | 11 -----------
 1 file changed, 11 deletions(-)
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 2d371ae..72ea626 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -232,12 +232,6 @@ kmem_cache_get_ref(struct kmem_cache *cachep)
 }
 
 static inline void
-mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
-{
-	rcu_read_unlock();
-}
-
-static inline void
 mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
 {
 	/*
@@ -266,11 +260,6 @@ kmem_cache_drop_ref(struct kmem_cache *cachep)
 }
 
 static inline void
-mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
-{
-}
-
-static inline void
 mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
 {
 }