Re: [PATCH 1/2] Add mempressure cgroup

7 Jan 2013

(2013/01/04 17:29), Anton Vorontsov wrote:
...
This commit implements David Rientjes' idea of mempressure cgroup.

The main characteristics are the same as what I've tried to add to the vmevent
API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
pressure index calculation. But we don't expose the index to the userland.
Instead, there are three levels of the pressure:

  o low (just reclaiming, e.g. caches are draining);
  o medium (allocation cost becomes high, e.g. swapping);
  o oom (about to oom very soon).

The rationale behind exposing levels and not the raw pressure index is
described here: http://lkml.org/lkml/2012/11/16/675

For a task it is possible to be in cpuset, memcg and mempressure cgroups
at the same time, so by rearranging the tasks it is possible to watch a
specific pressure (i.e. caused by cpuset and/or memcg).

Note that while this adds the cgroups support, the code is well separated
and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
But this is another story.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>

I'm just curious..
...

  Documentation/cgroups/mempressure.txt |  50 ++++++
  include/linux/cgroup_subsys.h         |   6 +
  include/linux/vmstat.h                |  11 ++
  init/Kconfig                          |  12 ++
  mm/Makefile                           |   1 +
  mm/mempressure.c                      | 330 ++++++++++++++++++++++++++++++++++
  mm/vmscan.c                           |   4 +
  7 files changed, 414 insertions(+)
  create mode 100644 Documentation/cgroups/mempressure.txt
  create mode 100644 mm/mempressure.c

diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
new file mode 100644
index 0000000..dbc0aca
--- /dev/null
+++ b/Documentation/cgroups/mempressure.txt
@@ -0,0 +1,50 @@

Memory pressure cgroup

+~~~~~~~~~~~~~~~~~~~~~~~~~~

Before using the mempressure cgroup, make sure you have it mounted:

# cd /sys/fs/cgroup/
# mkdir mempressure
# mount -t cgroup cgroup ./mempressure -o mempressure

It is possible to combine cgroups, for example you can mount memory
(memcg) and mempressure cgroups together:

# mount -t cgroup cgroup ./mempressure -o memory,mempressure

That way the reported pressure will honour memory cgroup limits. The
same goes for cpusets.

After the hierarchy is mounted, you can use the following API:

/sys/fs/cgroup/.../mempressure.level

+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To maintain the interactivity/memory allocation cost, one can use the
pressure level notifications, and the levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring reclaiming activity might be useful for
maintaining overall system's cache level. Upon notification, the program
(typically "Activity Manager") might analyze vmstat and act in advance
(i.e. prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, there is some mild swapping activity. Upon this event
applications may decide to free any resources that can be easily
reconstructed or re-read from a disk.

The "oom" level means that the system is actively thrashing: it is about
to run out of memory (OOM), or the in-kernel OOM killer is even about to
trigger. Applications should do whatever they can to help the system.

Event control:
  Used to set up an eventfd with a level threshold. The argument to
  the event control specifies the level threshold.
Read:
  Reads memory pressure levels: low, medium or oom.
Write:
  Not implemented.
Test:
  To set up a notification:

  # cgroup_event_listener ./mempressure.level low
  ("low", "medium", "oom" are permitted.)
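
For readers who want to see what the cgroup_event_listener step amounts to in a program, here is a minimal userspace sketch. It assumes the file follows the usual legacy cgroup convention of writing "<eventfd> <control file fd> <argument>" to cgroup.event_control in the same directory. The helper names (level_is_valid, format_event_control, mempressure_listen) are made up for illustration and are not part of the patch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if level is one of the names the documentation permits. */
int level_is_valid(const char *level)
{
	return !strcmp(level, "low") || !strcmp(level, "medium") ||
	       !strcmp(level, "oom");
}

/*
 * Build the line written to cgroup.event_control:
 * "<eventfd> <fd of the control file> <argument>".
 * Returns the string length, or -1 on an unknown level or truncation.
 */
int format_event_control(char *buf, size_t len, int efd, int cfd,
			 const char *level)
{
	int n;

	if (!level_is_valid(level))
		return -1;
	n = snprintf(buf, len, "%d %d %s", efd, cfd, level);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: register for notifications on
 * <dir>/mempressure.level. Returns an eventfd to read() from,
 * or -1 on error.
 */
int mempressure_listen(const char *dir, const char *level)
{
	char path[4096], line[64];
	int efd, cfd, ctl, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/mempressure.level", dir);
	cfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	ctl = open(path, O_WRONLY);

	n = (cfd < 0 || ctl < 0) ? -1 :
	    format_event_control(line, sizeof(line), efd, cfd, level);
	if (n < 0 || write(ctl, line, n) != n) {
		/* Registration failed: release everything. */
		if (cfd >= 0)
			close(cfd);
		close(efd);
		efd = -1;
	}
	if (ctl >= 0)
		close(ctl);
	/*
	 * On success cfd is left open for the lifetime of the listener.
	 * That is a conservative choice for this sketch, not a documented
	 * requirement.
	 */
	return efd;
}
```

A caller would then block in read(efd, &count, sizeof(uint64_t)) and treat each wakeup as "pressure has reached at least the requested level".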

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index f204a7a..b9802e2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
  
  /* */
  
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
+SUBSYS(mpc_cgroup)
+#endif
+
+/* */
+
  #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
  SUBSYS(devices)
  #endif

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index a13291f..c1a66c7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -10,6 +10,17 @@
  
  extern int sysctl_stat_interval;
  
+struct mem_cgroup;
+#ifdef CONFIG_CGROUP_MEMPRESSURE
+extern void vmpressure(struct mem_cgroup *memcg,
+		       ulong scanned, ulong reclaimed);
+extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(struct mem_cgroup *memcg,
+			      ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
+#endif
+
  #ifdef CONFIG_VM_EVENT_COUNTERS
  /*
   * Light weight per cpu counter implementation.

diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..d526249 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -891,6 +891,18 @@ config MEMCG_KMEM
     the kmem extension can use it to guarantee that no group of processes
     will ever exhaust kernel resources alone.
  
+config CGROUP_MEMPRESSURE
+	bool "Memory pressure monitor for Control Groups"
+	help
+	  The memory pressure monitor cgroup provides a facility for
+	  userland programs so that they could easily assist the kernel
+	  with the memory management. So far the API provides simple,
+	  levels-based memory pressure notifications.
+
+	  For more information see Documentation/cgroups/mempressure.txt
+
+	  If unsure, say N.
+
  config CGROUP_HUGETLB
  	bool "HugeTLB Resource Controller for Control Groups"
  	depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL

diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..e69bbda 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
  obj-$(CONFIG_QUICKLIST) += quicklist.o
  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
  obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
  obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/mempressure.c b/mm/mempressure.c
new file mode 100644
index 0000000..ea312bb
--- /dev/null
+++ b/mm/mempressure.c
@@ -0,0 +1,330 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ *		  Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */


+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>



+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging medium/oom levels. Using small window sizes can cause
+ * lot of false positives, but too big window size will delay the
+ * notifications.
+ */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;



Hmm... isn't this window size too small?
If vmscan cannot find a reclaimable page while scanning 2MB worth of pages in a zone,
an oom notification will be sent. Right?
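
To make the arithmetic behind the question concrete, here is a small userspace sketch. The constants are the ones quoted above, and SWAP_CLUSTER_MAX is 32 in include/linux/swap.h. The pressure formula itself is an assumption: the body of mm/mempressure.c is not quoted in this mail, so the sketch simply takes pressure to be the percentage of scanned pages that were not reclaimed. The function names (pressure_percent, pressure_to_level, window_bytes) are illustrative only.

```c
#include <stddef.h>

/* Constants quoted in the patch; SWAP_CLUSTER_MAX as in <linux/swap.h>. */
#define SWAP_CLUSTER_MAX 32UL
static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_oom = 99;

enum pressure_level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_OOM };

/*
 * ASSUMPTION: pressure is the share of scanned pages that could not be
 * reclaimed, in percent. The real calculation in the patch may differ.
 */
unsigned int pressure_percent(unsigned long scanned, unsigned long reclaimed)
{
	if (!scanned)
		return 0;
	if (reclaimed > scanned)
		reclaimed = scanned;
	return 100 - (unsigned int)(reclaimed * 100 / scanned);
}

/* Map a pressure percentage onto the three levels via the thresholds. */
enum pressure_level pressure_to_level(unsigned int pressure)
{
	if (pressure >= vmpressure_level_oom)
		return LEVEL_OOM;
	if (pressure >= vmpressure_level_med)
		return LEVEL_MEDIUM;
	return LEVEL_LOW;
}

/* Amount of memory covered by one window for a given page size. */
unsigned long window_bytes(unsigned long page_size)
{
	return vmpressure_win * page_size;
}
```

Under these assumptions, window_bytes(4096) is 2MB, and a single window of 512 scanned pages with nothing reclaimed gives a pressure of 100, which maps straight to the oom level.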
Thanks,
-Kame

    
