On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to switch the crash dump from piping to the traditional Linux method of saving to files. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after a timeout. However, systemd is never able to kill the service, and we get a hardware reset caused by the watchdog, or a hang during shutdown:
Thread #1: memory hot-remove systemd service
Loops indefinitely, because if there is something still to be migrated
this loop never terminates. However, this loop can be terminated via
signal from systemd after timeout.
__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
                  # Returns 0, meaning there is nothing available to
                  # migrate, no page is PageLRU(page)
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                      NULL, check_pages_isolated_cb);
                  # Returns -EBUSY, meaning there is at least one PFN that
                  # still has to be migrated.
      } while (ret);
Thread #2: css killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd
  ksys_read
   vfs_read
     __vfs_read
       seq_read
         proc_single_show
           proc_cgroup_show
             mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 (systemd) is stuck and unable to deliver the timeout signal to thread #1.
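To make the dependency chain explicit, here is a rough user-space model of it. This is only an illustration, not kernel code: the enum names, NO_DEP, and all three functions are made up for this sketch. Each thread records the one thread it is waiting on, and the check follows that chain to see whether it leads back to the starting thread.

```c
#include <stdbool.h>

/* Hypothetical labels for the three threads described above. */
enum { T_HOTREMOVE, T_CSS_KILLER, T_SYSTEMD, T_COUNT };
#define NO_DEP (-1)   /* thread can make progress on its own */

/* Follow waits_on[] from 'start'; true if the chain returns to 'start'. */
bool in_dependency_cycle(const int waits_on[], int n, int start)
{
    int cur = start;

    for (int steps = 0; steps < n; steps++) {
        cur = waits_on[cur];
        if (cur == NO_DEP)
            return false;   /* chain ends at a thread that can proceed */
        if (cur == start)
            return true;    /* came back around: nobody can proceed */
    }
    return false;
}

/* The reported situation: every thread depends on another one. */
bool reported_scenario_is_stuck(void)
{
    const int waits_on[T_COUNT] = {
        [T_HOTREMOVE]  = T_SYSTEMD,     /* needs the signal from systemd */
        [T_CSS_KILLER] = T_HOTREMOVE,   /* needs mem_hotplug_lock */
        [T_SYSTEMD]    = T_CSS_KILLER,  /* needs cgroup_mutex */
    };

    return in_dependency_cycle(waits_on, T_COUNT, T_HOTREMOVE);
}

/* Same chain, but hot-remove has an exit of its own (assumption). */
bool with_own_exit_is_stuck(void)
{
    const int waits_on[T_COUNT] = {
        [T_HOTREMOVE]  = NO_DEP,
        [T_CSS_KILLER] = T_HOTREMOVE,
        [T_SYSTEMD]    = T_CSS_KILLER,
    };

    return in_dependency_cycle(waits_on, T_COUNT, T_SYSTEMD);
}
```

In this model the chain only closes because the hot-remove loop has no exit other than the signal from systemd. Breaking any one of the three edges is enough to unblock the other two threads.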
The proper fix for both of the problems is to avoid the cgroup_mutex -> mem_hotplug_lock ordering, which was recently fixed in the mainline but is still present in all stable branches. Unfortunately, I do not see a simple way to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series, which is too big for stable.
We too are seeing this on Power systems when stress-testing memory hotplug, but with the following call trace (from hung task timer) instead of Thread #2 above:
__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread
While I understand that Roman's new slab controller patchset will fix this, I also wonder if infinitely looping in the memory unplug path with mem_hotplug_lock held is the right thing to do? Earlier we had a few other exit possibilities in this path (like max retries etc) but those were removed by commits:
72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
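To illustrate the kind of exit I mean, here is a minimal user-space sketch of a retry loop with a bounded number of attempts and a signal check. It is neither the current kernel code nor the code those commits removed. The callback types, offline_loop_model() and the two stub functions are hypothetical names made up for this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one isolation check pass; 0 means done. */
typedef int (*check_fn)(void *ctx);
/* Hypothetical stand-in for a "signal pending?" test; may be NULL. */
typedef bool (*signal_fn)(void *ctx);

/*
 * Model of an offlining loop with two exits besides success:
 * a pending signal (-EINTR) and an exhausted retry budget (-EBUSY).
 */
int offline_loop_model(check_fn check, signal_fn signalled, void *ctx,
                       int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (signalled != NULL && signalled(ctx))
            return -EINTR;
        if (check(ctx) == 0)
            return 0;       /* everything isolated */
    }
    return -EBUSY;          /* give up instead of looping forever */
}

/* Stub: some page can never be migrated. */
int never_isolated(void *ctx)
{
    (void)ctx;
    return -EBUSY;
}

/* Stub: fails *ctx times, then succeeds. ctx points to an int counter. */
int isolated_after_n(void *ctx)
{
    int *left = ctx;

    if (*left > 0) {
        (*left)--;
        return -EBUSY;
    }
    return 0;
}
```

With never_isolated() the model returns -EBUSY once the budget is used up, which is the behaviour the unbounded loop lacks. Whether a fixed retry count is a sensible policy for the kernel is a separate question.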
Or, is the user-space test expected to induce a signal back-off when unplug doesn't complete within a reasonable amount of time?
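If the answer is yes, one way a test could do that is sketched below, under the assumption that the offline request is a blocking write() that returns early when a signal arrives. write_with_deadline() and on_alarm() are names made up for this sketch. It arms SIGALRM without SA_RESTART so that the blocked write() is interrupted instead of being restarted.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked write(). */
static void on_alarm(int sig)
{
    (void)sig;
}

/*
 * Write 'len' bytes to 'fd', giving up after roughly 'secs' seconds.
 * Returns 0 on a complete write, -EIO on a short write, and a negative
 * errno otherwise (-EINTR if the alarm fired before anything was written).
 */
int write_with_deadline(int fd, const char *buf, size_t len, unsigned secs)
{
    struct sigaction sa, old;
    ssize_t n;
    int err;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -errno;

    alarm(secs);
    n = write(fd, buf, len);
    err = (n < 0) ? errno : 0;
    alarm(0);                       /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old, NULL); /* restore previous handler */

    if (n < 0)
        return -err;
    return ((size_t)n == len) ? 0 : -EIO;
}
```

A test would open the memory block's sysfs state file (/sys/devices/system/memory/memoryN/state), call this with the string "offline", and treat -EINTR as "unplug did not complete in time". This only helps if the signal can actually be delivered to the offlining task.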
Regards,
Bharata.