Re: [PATCH v2 00/28] The new cgroup slab memory controller

1 Sep 2020


      On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
...
There appears to be another problem that is related to the
cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to
replace crash dump from piping to Linux traditional save to files
method. However, after trying this workaround, I still observed
hardware watchdog resets during machine  shutdown.
The new problem occurs for the following reason: upon shutdown systemd
calls a service that hot-removes memory, and if hot-removing fails for
some reason systemd kills that service after timeout. However, systemd
is never able to kill the service, and we get hardware reset caused by
watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service
Loops indefinitely, because if there is something still to be migrated
this loop never terminates. However, this loop can be terminated via
signal from systemd after timeout.
__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
                  # Returns 0, meaning there is nothing available to
                  # migrate, no page is PageLRU(page)
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
                  # Returns -EBUSY, meaning there is at least one PFN that
                  # still has to be migrated.
      } while (ret);
Thread #2: ccs killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd
ksys_read
 vfs_read
   __vfs_read
     seq_read
       proc_single_show
         proc_cgroup_show
           mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex ->
mem_hotplug_lock ordering that was recently fixed in the mainline but
still present in all stable branches. Unfortunately, I do not see a
simple fix in how to remove mem_hotplug_lock from
memcg_deactivate_kmem_caches without using Roman's series that is too
big for stable.
We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:
__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread
While I understand that Roman's new slab controller patchset will fix
this, I also wonder if infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do? Earlier we had
a few other exit possibilities in this path (like max retries etc)
but those were removed by commits:
72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
Or, is the user-space test is expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?
Regards,
Bharata.

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v2 00/28] The new cgroup slab memory controller