On Mon, Dec 21, 2020 at 07:35:31PM +0000, Shaoying Xu wrote:
From: Johannes Weiner hannes@cmpxchg.org
[ Upstream commit a983b5ebee57209c99f68c8327072f25e0e6e3da ]
We've seen memory.stat reads in top-level cgroups take up to fourteen seconds during a userspace bug that created tens of thousands of ghost cgroups pinned by lingering page cache.
Even with a more reasonable number of cgroups, aggregating memory.stat is unnecessarily heavy. The complexity is this:
nr_cgroups * nr_stat_items * nr_possible_cpus
where the stat items are ~70 at this point. With 128 cgroups and 128 CPUs - decent, not enormous setups - reading the top-level memory.stat has to aggregate over a million per-cpu counters. This doesn't scale.
Instead of spreading the source of truth across all CPUs, use the per-cpu counters merely to batch updates to shared atomic counters.
This is the same as the per-cpu stocks we use for charging memory to the shared atomic page_counters, and also the way the global vmstat counters are implemented.
Vmstat has elaborate spilling thresholds that depend on the number of CPUs, amount of memory, and memory pressure - carefully balancing the cost of counter updates with the amount of per-cpu error. That's because the vmstat counters are system-wide, but also used for decisions inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is true for the memory controller.
Use the same static batch size we already use for page_counter updates during charging. The per-cpu error in the stats will be 128k, which is an acceptable ratio of cores to memory accounting granularity.
[hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls] Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Vladimir Davydov vdavydov.dev@gmail.com Cc: Michal Hocko mhocko@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org c9019e9: mm: memcontrol: eliminate raw access to stat and event counters Cc: stable@vger.kernel.org 2845426: mm: memcontrol: implement lruvec stat functions on top of each other Cc: stable@vger.kernel.org [shaoyi@amazon.com: resolved the conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 in mm/memcontrol.c by contextual fix] Signed-off-by: Shaoying Xu shaoyi@amazon.com
The excessive complexity in memory.stat reporting was fixed in v4.16 but didn't appear to make it to 4.14 stable. When backporting this patch, there is a small conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 within free_mem_cgroup_per_node_info() of mm/memcontrol.c and can be resolved by contextual fix.
include/linux/memcontrol.h | 96 +++++++++++++++++++++++++++--------------- mm/memcontrol.c | 101 +++++++++++++++++++++++---------------------- 2 files changed, 113 insertions(+), 84 deletions(-)
This patch does not apply to the 4.14.y tree, please fix it up and resend it if you wish to see it applied there.
thanks,
greg k-h