There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
Thanks, Pasha
On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin pasha.tatashin@soleen.com wrote:
On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin guro@fb.com wrote:
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Hi Roman,
Hi Pavel,
wow, it's a quite complicated deadlock. Thank you for providing a perfect analysis!
Thank you, it indeed took me a while to fully grasp the deadlock.
Unfortunately, backporting the whole new slab controller isn't an option: it's way too big and invasive.
This is what I thought as well, this is why I want to figure out what is the best way forward.
Do you already have a standalone fix?
Not yet, I do not have a standalone fix. I suspect the best fix would be to address fix css_killed_work_fn() stack so we never have: cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse the order would work. If you have suggestions since you worked on this code recently, please let me know.
Thank you, Pasha
Thanks!
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman