On Mon, Dec 10, 2012 at 01:58:38AM -0800, Anton Vorontsov wrote:
CC: linux-api@
The main changes for the mempressure cgroup:
Added documentation, describes APIs and the purpose;
Implemented shrinker interface, this is based on Andrew's idea and supersedes my "balance" level idea;
The shrinker interface comes with a stress-test utility, which is what Andrew was also asking for: a simple app that we can run to see if the thing works as expected;
Added reclaimer's target_mem_cgroup handling;
As promised, added support for multiple listeners, and fixed some other comments on the previous RFC.
Just for the reference, the first mempressure RFC:
http://lkml.org/lkml/2012/11/28/109
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
 Documentation/cgroups/mempressure.txt    |  89 ++++++
 Documentation/cgroups/mempressure_test.c | 209 +++++++++++++
 include/linux/cgroup_subsys.h            |   6 +
 include/linux/vmstat.h                   |  11 +
 init/Kconfig                             |  12 +
 mm/Makefile                              |   1 +
 mm/mempressure.c                         | 488 +++++++++++++++++++++++++++++++
 mm/vmscan.c                              |   4 +
 8 files changed, 820 insertions(+)
 create mode 100644 Documentation/cgroups/mempressure.txt
 create mode 100644 Documentation/cgroups/mempressure_test.c
 create mode 100644 mm/mempressure.c
diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
new file mode 100644
index 0000000..913accc
--- /dev/null
+++ b/Documentation/cgroups/mempressure.txt
@@ -0,0 +1,89 @@
- Memory pressure cgroup
+~~~~~~~~~~~~~~~~~~~~~~~~~~
- Before using the mempressure cgroup, make sure you have it mounted:
- # cd /sys/fs/cgroup/
- # mkdir mempressure
- # mount -t cgroup cgroup ./mempressure -o mempressure
- After that, you can use the following files:
- /sys/fs/cgroup/.../mempressure.shrinker
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- The file implements userland shrinker (memory reclaimer) interface, so
- that the kernel can ask userland to help with the memory reclaiming
- process.
- There are two basic concepts: chunks and chunks' size. The program must
- tell the kernel the granularity of its allocations (chunk size) and the
- number of reclaimable chunks. The granularity need not be 100% accurate,
- but the more accurate it is, the better. E.g. suppose the application
- has 200 page renders cached (but not displayed), 1MB each. So the chunk
- size is 1MB, and the number of chunks is 200.
- The granularity is specified during shrinker registration (i.e. via
- argument to the event_control cgroup file; and it is OK to register
- multiple shrinkers for different granularities). The number of
- reclaimable chunks is specified by writing to the mempressure.shrinker
- file.
- The notification comes through the eventfd() interface. Upon the
- notification, a read() from the eventfd returns the number of chunks to
- reclaim (free).
- It is assumed that the application will free the specified number of
- chunks before reading from the eventfd again. If that is not the case
- (say the program was not able to reclaim the chunks), then the
- application should re-add that number of chunks by writing to the
- mempressure.shrinker file (otherwise the chunks won't be accounted by
- the kernel, since it assumes that they were reclaimed).
- Event control:
- Used to set up shrinker events. There is only one argument for the
- event control: chunk size in bytes.
- Read:
- Not implemented.
- Write:
- Writes must be in "<eventfd> <number of chunks>" format. Positive
- numbers increment the internal counter, negative numbers decrement it
- (but the kernel prevents the counter from falling down below zero).
- Test:
- See mempressure_test.c
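Since the whole shrinker protocol rides on eventfd counter semantics, here is a standalone sketch (my own illustration, not part of the patch; the kernel-side eventfd_signal() calls are simulated with plain write()s, so no cgroup setup is needed) of why a single read() returns the accumulated chunk count:

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Simulate the kernel signaling twice, then one shrinker read() draining it. */
uint64_t drained_after(uint64_t a, uint64_t b)
{
	int efd = eventfd(0, 0);
	uint64_t n;

	if (efd < 0)
		return 0;
	n = a;
	write(efd, &n, sizeof(n));	/* like eventfd_signal(efd, a) */
	n = b;
	write(efd, &n, sizeof(n));	/* like eventfd_signal(efd, b) */
	read(efd, &n, sizeof(n));	/* returns a + b and resets the counter */
	close(efd);
	return n;
}
```

This is why a shrinker that falls behind sees one large request rather than a queue of small ones.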
I think the interface is broken. One eventfd can be registered to get many different notifications.
The only information you have on POLLIN/read() is "something happened". Then, it's up to userspace to find out what had happened: if it's memory pressure or cgroup is removed or whatever else.
One more point: unlike kernel-side shrinkers, userspace shrinkers cannot be synchronous. I doubt they can be useful in real-world situations.
I personally feel that mempressure.level interface is enough.
- /sys/fs/cgroup/.../mempressure.level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Instead of working at the bytes level (like the shrinkers do), one may
- decide to manage the interactivity/memory allocation cost.
- For this, the cgroup has memory pressure level notifications, and the
- levels are defined like this:
- The "low" level means that the system is reclaiming memory for new
- allocations. Monitoring reclaiming activity might be useful for
- maintaining the system's overall cache level. Upon notification, the program
- (typically "Activity Manager") might analyze vmstat and act in advance
- (i.e. prematurely shutdown unimportant services).
- The "medium" level means that the system is experiencing medium memory
- pressure, there is some mild swapping activity. Upon this event
- applications may decide to free any resources that can be easily
- reconstructed or re-read from a disk. Note that for a fine-grained
- control, you should probably use the shrinker interface, as described
- above.
- The "oom" level means that the system is actively thrashing, it is
- about to run out of memory (OOM), or the in-kernel OOM killer may even
- be on its way to trigger. Applications should do whatever they can to
- help the system.
- Event control:
- Used to set up an eventfd with a level threshold. The argument to
- the event control specifies the level threshold.
- Read:
- Reads memory pressure levels: low, medium or oom.
- Write:
- Not implemented.
- Test:
- To set up a notification:
- # cgroup_event_listener ./mempressure.level low
- ("low", "medium", "oom" are permitted.)
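For illustration, the level-name matching can be sketched in userspace; this mirrors the strcmp() loop in the patch's mpc_register_level_event(), with -1 standing in for -EINVAL (the helper name parse_level() is mine):

```c
#include <string.h>

enum { VMPRESSURE_LOW, VMPRESSURE_MEDIUM, VMPRESSURE_OOM, VMPRESSURE_NUM_LEVELS };

static const char *levels[] = { "low", "medium", "oom" };

/* Return the level index for a threshold string, or -1 if it is unknown. */
int parse_level(const char *arg)
{
	int lvl;

	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++)
		if (!strcmp(levels[lvl], arg))
			return lvl;
	return -1;
}
```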
The interface looks okay to me.
BTW, do you track pressure level changes due to changes in memory[.memsw].limit_in_bytes or memory hotplug?
diff --git a/Documentation/cgroups/mempressure_test.c b/Documentation/cgroups/mempressure_test.c
new file mode 100644
index 0000000..9747fd6
--- /dev/null
+++ b/Documentation/cgroups/mempressure_test.c
@@ -0,0 +1,209 @@
+/*
- mempressure shrinker test
- Copyright 2012 Linaro Ltd.
Anton Vorontsov <anton.vorontsov@linaro.org>
- It is pretty simple: we create two threads, the first one constantly
- tries to allocate memory (more than we physically have), the second
- thread listens to the kernel shrinker notifications and frees the
- requested number of chunks. When we allocate more than the available
- RAM, the two threads start to fight. Ideally, we should not OOM (but if
- we reclaim slower than we allocate, things might OOM). Also, ideally we should not
- grow swap too much.
- The test accepts no arguments, so you can just run it and observe the
- output and memory usage (e.g. 'watch -n 0.2 free -m'). Upon ctrl+c, the
- test prints total amount of bytes we helped to reclaim.
- Compile with -pthread.
- This program is free software; you can redistribute it and/or modify it
- under the terms of the GNU General Public License version 2 as published
- by the Free Software Foundation.
- */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/sysinfo.h>
+#define CG "/sys/fs/cgroup/mempressure"
+#define CG_EVENT_CONTROL (CG "/cgroup.event_control")
+#define CG_SHRINKER (CG "/mempressure.shrinker")
+#define CHUNK_SIZE (1 * 1024 * 1024)
+static size_t num_chunks;
+static void **chunks;
+static pthread_mutex_t *locks;
+static int efd;
+static int sfd;
+static inline void pabort(bool f, int code, const char *str)
+{
- if (!f)
return;
- perror(str);
- printf("(%d)\n", code);
- abort();
+}
+static void init_shrinker(void)
+{
- int cfd;
- int ret;
- char *str;
- cfd = open(CG_EVENT_CONTROL, O_WRONLY);
- pabort(cfd < 0, cfd, CG_EVENT_CONTROL);
- sfd = open(CG_SHRINKER, O_RDWR);
- pabort(sfd < 0, sfd, CG_SHRINKER);
- efd = eventfd(0, 0);
- pabort(efd < 0, efd, "eventfd()");
- ret = asprintf(&str, "%d %d %d\n", efd, sfd, CHUNK_SIZE);
- printf("%s\n", str);
str value is undefined here if asprintf() failed.
- pabort(ret == -1, ret, "control string");
- ret = write(cfd, str, ret + 1);
- pabort(ret == -1, ret, "write() to event_control");
str is leaked.
+}
+static void add_reclaimable(int chunks)
+{
- int ret;
- char *str;
- ret = asprintf(&str, "%d %d\n", efd, CHUNK_SIZE);
s/CHUNK_SIZE/chunks/ ?
same problems with str here.
- pabort(ret == -1, ret, "add_reclaimable, asprintf");
- ret = write(sfd, str, ret + 1);
- pabort(ret <= 0, ret, "add_reclaimable, write");
+}
+static int chunks_to_reclaim(void)
+{
- uint64_t n = 0;
- int ret;
- ret = read(efd, &n, sizeof(n));
- pabort(ret <= 0, ret, "read() from eventfd");
- printf("%d chunks to reclaim\n", (int)n);
- return n;
+}
+static unsigned int reclaimed;
+static void print_stats(int signum)
+{
- printf("\nTOTAL: helped to reclaim %d chunks (%d MB)\n",
reclaimed, reclaimed * CHUNK_SIZE / 1024 / 1024);
- exit(0);
+}
+static void *shrinker_thr_fn(void *arg)
+{
- puts("shrinker thread started");
- sigaction(SIGINT, &(struct sigaction){.sa_handler = print_stats}, NULL);
- while (1) {
unsigned int i = 0;
int n;
n = chunks_to_reclaim();
reclaimed += n;
while (n) {
pthread_mutex_lock(&locks[i]);
if (chunks[i]) {
free(chunks[i]);
chunks[i] = NULL;
n--;
}
pthread_mutex_unlock(&locks[i]);
i = (i + 1) % num_chunks;
}
- }
- return NULL;
+}
+static void consume_memory(void)
+{
- unsigned int i = 0;
- unsigned int j = 0;
- puts("consuming memory...");
- while (1) {
pthread_mutex_lock(&locks[i]);
if (!chunks[i]) {
chunks[i] = malloc(CHUNK_SIZE);
pabort(!chunks[i], 0, "chunks alloc failed");
memset(chunks[i], 0, CHUNK_SIZE);
j++;
}
pthread_mutex_unlock(&locks[i]);
if (j >= num_chunks / 10) {
add_reclaimable(num_chunks / 10);
printf("added %d reclaimable chunks\n", j);
j = 0;
}
i = (i + 1) % num_chunks;
- }
+}
+int main(int argc, char *argv[])
+{
- int ret;
- int i;
- pthread_t shrinker_thr;
- struct sysinfo si;
- ret = sysinfo(&si);
- pabort(ret != 0, ret, "sysinfo()");
- num_chunks = (si.totalram + si.totalswap) * si.mem_unit / 1024 / 1024;
- chunks = malloc(sizeof(*chunks) * num_chunks);
- locks = malloc(sizeof(*locks) * num_chunks);
- pabort(!chunks || !locks, ENOMEM, NULL);
- init_shrinker();
- for (i = 0; i < num_chunks; i++) {
ret = pthread_mutex_init(&locks[i], NULL);
pabort(ret != 0, ret, "pthread_mutex_init");
- }
- ret = pthread_create(&shrinker_thr, NULL, shrinker_thr_fn, NULL);
- pabort(ret != 0, ret, "pthread_create(shrinker)");
- consume_memory();
- ret = pthread_join(shrinker_thr, NULL);
- pabort(ret != 0, ret, "pthread_join(shrinker)");
- return 0;
+}
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index f204a7a..b9802e2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
 /* */
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
+SUBSYS(mpc_cgroup)
+#endif
+/* */
 #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
 SUBSYS(devices)
 #endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..3f7f7d2 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -10,6 +10,17 @@
 extern int sysctl_stat_interval;
+struct mem_cgroup;
+#ifdef CONFIG_CGROUP_MEMPRESSURE
+extern void vmpressure(struct mem_cgroup *memcg,
ulong scanned, ulong reclaimed);
+extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(struct mem_cgroup *memcg,
ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
+#endif
 #ifdef CONFIG_VM_EVENT_COUNTERS
 /*
- Light weight per cpu counter implementation.
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..5c308be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -826,6 +826,18 @@ config MEMCG_KMEM
 	  the kmem extension can use it to guarantee that no group of processes
 	  will ever exhaust kernel resources alone.
+config CGROUP_MEMPRESSURE
- bool "Memory pressure monitor for Control Groups"
- help
The memory pressure monitor cgroup provides a facility for
userland programs so that they can easily assist the kernel
with memory management. This includes simple memory pressure
notifications and a full-fledged userland reclaimer.
For more information see Documentation/cgroups/mempressure.txt
If unsure, say N.
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..40cee19 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/mempressure.c b/mm/mempressure.c
new file mode 100644
index 0000000..e39a33d
--- /dev/null
+++ b/mm/mempressure.c
@@ -0,0 +1,488 @@
+/*
- Linux VM pressure
- Copyright 2012 Linaro Ltd.
Anton Vorontsov <anton.vorontsov@linaro.org>
- Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
- Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
- This program is free software; you can redistribute it and/or modify it
- under the terms of the GNU General Public License version 2 as published
- by the Free Software Foundation.
- */
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
+/*
- Generic VM Pressure routines (no cgroups or any other API details)
- */
+/*
- The window size is the number of scanned pages before we try to analyze
- the scanned/reclaimed ratio (or difference).
- It is used as a rate-limit tunable for the "low" level notification,
- and for averaging the medium/oom levels. Using a small window size can
- cause a lot of false positives, but too big a window size will delay
- the notifications.
- The same window size is also used for the shrinker, so be aware. It might
- be a good idea to derive the window size from the machine size, similar
- to what we do for the vmstat.
- */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;
+enum vmpressure_levels {
- VMPRESSURE_LOW = 0,
- VMPRESSURE_MEDIUM,
- VMPRESSURE_OOM,
- VMPRESSURE_NUM_LEVELS,
+};
+static const char *vmpressure_str_levels[] = {
- [VMPRESSURE_LOW] = "low",
- [VMPRESSURE_MEDIUM] = "medium",
- [VMPRESSURE_OOM] = "oom",
+};
+static enum vmpressure_levels vmpressure_level(uint pressure)
+{
- if (pressure >= vmpressure_level_oom)
return VMPRESSURE_OOM;
- else if (pressure >= vmpressure_level_med)
return VMPRESSURE_MEDIUM;
- return VMPRESSURE_LOW;
+}
+static ulong vmpressure_calc_level(uint win, uint s, uint r)
+{
- ulong p;
- if (!s)
return 0;
- /*
* We calculate the ratio (in percents) of how many pages were
* scanned vs. reclaimed in a given time frame (window). Note that
* time is in VM reclaimer's "ticks", i.e. number of pages
* scanned. This makes it possible to set desired reaction time
* and serves as a ratelimit.
*/
- p = win - (r * win / s);
- p = p * 100 / win;
- pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
- return vmpressure_level(p);
+}
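The ratio arithmetic above is easy to sanity-check in isolation; this is a userspace transcription of the calculation together with the 60/99 thresholds defined earlier in the file (the function names here are mine, not the patch's):

```c
enum vmpressure_levels { VMPRESSURE_LOW, VMPRESSURE_MEDIUM, VMPRESSURE_OOM };

/* Percentage of the window that was scanned but not reclaimed. */
unsigned int vmpressure_ratio(unsigned int win, unsigned int s, unsigned int r)
{
	unsigned long p;

	if (!s)
		return 0;
	p = win - ((unsigned long)r * win / s);
	return p * 100 / win;
}

/* Map a pressure percentage to a level, per vmpressure_level_med/_oom. */
enum vmpressure_levels level_of(unsigned int pressure)
{
	if (pressure >= 99)
		return VMPRESSURE_OOM;
	if (pressure >= 60)
		return VMPRESSURE_MEDIUM;
	return VMPRESSURE_LOW;
}
```

E.g. reclaiming everything scanned yields 0% ("low"), reclaiming nothing yields 100% ("oom").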
+void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
+{
- if (!scanned)
return;
- mpc_vmpressure(memcg, scanned, reclaimed);
+}
+void vmpressure_prio(struct mem_cgroup *memcg, int prio)
+{
- if (prio > vmpressure_level_oom_prio)
return;
- /* OK, the prio is below the threshold, send the pre-OOM event. */
- vmpressure(memcg, vmpressure_win, 0);
+}
+/*
- Memory pressure cgroup code
- */
+struct mpc_event {
- struct eventfd_ctx *efd;
- enum vmpressure_levels level;
- struct list_head node;
+};
+struct mpc_shrinker {
- struct eventfd_ctx *efd;
- size_t chunks;
- size_t chunk_sz;
- struct list_head node;
+};
+struct mpc_state {
- struct cgroup_subsys_state css;
- uint scanned;
- uint reclaimed;
- struct mutex sr_lock;
- struct list_head events;
- struct mutex events_lock;
- struct list_head shrinkers;
- struct mutex shrinkers_lock;
- struct work_struct work;
+};
+static struct mpc_state *wk2mpc(struct work_struct *wk)
+{
- return container_of(wk, struct mpc_state, work);
+}
+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
+{
- return container_of(css, struct mpc_state, css);
+}
+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
+{
- return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
+}
+static struct mpc_state *cg2mpc(struct cgroup *cg)
+{
- return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
+}
+static void mpc_shrinker(struct mpc_state *mpc, ulong s, ulong r)
+{
- struct mpc_shrinker *sh;
- ssize_t to_reclaim_pages = s - r;
- if (!to_reclaim_pages)
return;
- mutex_lock(&mpc->shrinkers_lock);
- /*
* To make accounting more precise and to avoid excessive
* communication with the kernel, we operate on chunks instead of
* bytes. Say, asking to free 8 KBs makes little sense if
* granularity of allocations is 10 MBs. Also, knowing the
* granularity (chunk size) and the number of reclaimable chunks,
* we just ask that N chunks should be freed, and we assume that
* it will be freed, thus we decrement our internal counter
* straight away (i.e. userland does not need to respond how much
* was reclaimed). But, if userland could not free it, it is
* responsible to increment the counter back.
*/
- list_for_each_entry(sh, &mpc->shrinkers, node) {
size_t to_reclaim_chunks;
if (!sh->chunks)
continue;
to_reclaim_chunks = to_reclaim_pages *
PAGE_SIZE / sh->chunk_sz;
to_reclaim_chunks = min(sh->chunks, to_reclaim_chunks);
if (!to_reclaim_chunks)
continue;
sh->chunks -= to_reclaim_chunks;
eventfd_signal(sh->efd, to_reclaim_chunks);
to_reclaim_pages -= to_reclaim_chunks *
sh->chunk_sz / PAGE_SIZE;
if (to_reclaim_pages <= 0)
break;
- }
- mutex_unlock(&mpc->shrinkers_lock);
+}
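The pages-to-chunks conversion in the loop above rounds down, so small scan deltas never reach a coarse-grained shrinker at all. A standalone sketch of that behaviour (PAGE_SIZE hardcoded to 4096 and the helper name pages_to_chunks() are mine, for illustration):

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Whole chunks a request of 'pages' pages maps to, capped at 'avail' chunks. */
size_t pages_to_chunks(size_t pages, size_t chunk_sz, size_t avail)
{
	size_t chunks = pages * PAGE_SIZE / chunk_sz;	/* rounds down */

	return chunks < avail ? chunks : avail;
}
```

With 1MB chunks, a 100-page (400KB) delta converts to zero chunks, i.e. no eventfd signal for that shrinker.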
+static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
+{
- struct mpc_event *ev;
- int level = vmpressure_calc_level(vmpressure_win, s, r);
- mutex_lock(&mpc->events_lock);
- list_for_each_entry(ev, &mpc->events, node) {
if (level >= ev->level)
What about per-level lists?
eventfd_signal(ev->efd, 1);
- }
- mutex_unlock(&mpc->events_lock);
+}
+static void mpc_vmpressure_wk_fn(struct work_struct *wk)
+{
- struct mpc_state *mpc = wk2mpc(wk);
- ulong s;
- ulong r;
- mutex_lock(&mpc->sr_lock);
- s = mpc->scanned;
- r = mpc->reclaimed;
- mpc->scanned = 0;
- mpc->reclaimed = 0;
- mutex_unlock(&mpc->sr_lock);
- mpc_shrinker(mpc, s, r);
- mpc_event(mpc, s, r);
+}
+static void __mpc_vmpressure(struct mpc_state *mpc, ulong s, ulong r)
+{
- mutex_lock(&mpc->sr_lock);
- mpc->scanned += s;
- mpc->reclaimed += r;
- mutex_unlock(&mpc->sr_lock);
- if (s < vmpressure_win || work_pending(&mpc->work))
return;
- schedule_work(&mpc->work);
+}
+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
+{
- /*
* There are two options for implementing cgroup pressure
* notifications:
*
* - Store pressure counter atomically in the task struct. Upon
* hitting 'window' wake up a workqueue that will walk every
* task and sum per-thread pressure into cgroup pressure (to
* which the task belongs). The cons are obvious: bloats task
* struct, have to walk all processes and makes pressure less
* accurate (the window becomes per-thread);
*
* - Store pressure counters in per-cgroup state. This is easy and
* straightforward, and that's how we do things here. But this
* requires us to not put the vmpressure hooks into hotpath,
* since we have to grab some locks.
*/
+#ifdef CONFIG_MEMCG
- if (memcg) {
struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
struct cgroup *cg = css->cgroup;
struct mpc_state *mpc = cg2mpc(cg);
if (mpc)
__mpc_vmpressure(mpc, s, r);
return;
- }
+#endif
- task_lock(current);
- __mpc_vmpressure(tsk2mpc(current), s, r);
- task_unlock(current);
+}
+static struct cgroup_subsys_state *mpc_create(struct cgroup *cg)
+{
- struct mpc_state *mpc;
- mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
- if (!mpc)
return ERR_PTR(-ENOMEM);
- mutex_init(&mpc->sr_lock);
- mutex_init(&mpc->events_lock);
- mutex_init(&mpc->shrinkers_lock);
- INIT_LIST_HEAD(&mpc->events);
- INIT_LIST_HEAD(&mpc->shrinkers);
- INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
- return &mpc->css;
+}
+static void mpc_destroy(struct cgroup *cg)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- kfree(mpc);
+}
+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
struct file *file, char __user *buf,
size_t sz, loff_t *ppos)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- uint level;
- const char *str;
- mutex_lock(&mpc->sr_lock);
- level = vmpressure_calc_level(vmpressure_win,
mpc->scanned, mpc->reclaimed);
- mutex_unlock(&mpc->sr_lock);
- str = vmpressure_str_levels[level];
- return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
+}
+static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd,
const char *args)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- struct mpc_event *ev;
- int lvl;
- for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
if (!strcmp(vmpressure_str_levels[lvl], args))
break;
- }
- if (lvl >= VMPRESSURE_NUM_LEVELS)
return -EINVAL;
- ev = kzalloc(sizeof(*ev), GFP_KERNEL);
- if (!ev)
return -ENOMEM;
- ev->efd = eventfd;
- ev->level = lvl;
- mutex_lock(&mpc->events_lock);
- list_add(&ev->node, &mpc->events);
- mutex_unlock(&mpc->events_lock);
- return 0;
+}
+static void mpc_unregister_event(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- struct mpc_event *ev;
- mutex_lock(&mpc->events_lock);
- list_for_each_entry(ev, &mpc->events, node) {
if (ev->efd != eventfd)
continue;
list_del(&ev->node);
kfree(ev);
break;
- }
- mutex_unlock(&mpc->events_lock);
+}
+static int mpc_register_shrinker(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd,
const char *args)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- struct mpc_shrinker *sh;
- ulong chunk_sz;
- int ret;
- ret = kstrtoul(args, 10, &chunk_sz);
- if (ret)
return ret;
- sh = kzalloc(sizeof(*sh), GFP_KERNEL);
- if (!sh)
return -ENOMEM;
- sh->efd = eventfd;
- sh->chunk_sz = chunk_sz;
- mutex_lock(&mpc->shrinkers_lock);
- list_add(&sh->node, &mpc->shrinkers);
- mutex_unlock(&mpc->shrinkers_lock);
- return 0;
+}
+static void mpc_unregister_shrinker(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- struct mpc_shrinker *sh;
- mutex_lock(&mpc->shrinkers_lock);
- list_for_each_entry(sh, &mpc->shrinkers, node) {
if (sh->efd != eventfd)
continue;
list_del(&sh->node);
kfree(sh);
break;
- }
- mutex_unlock(&mpc->shrinkers_lock);
+}
+static int mpc_write_shrinker(struct cgroup *cg, struct cftype *cft,
const char *str)
+{
- struct mpc_state *mpc = cg2mpc(cg);
- struct mpc_shrinker *sh;
- struct eventfd_ctx *eventfd;
- struct file *file;
- ssize_t chunks;
- int fd;
- int ret;
- ret = sscanf(str, "%d %zd\n", &fd, &chunks);
- if (ret != 2)
return -EINVAL;
- file = fget(fd);
- if (!file)
return -EBADF;
- eventfd = eventfd_ctx_fileget(file);
- mutex_lock(&mpc->shrinkers_lock);
- /* Can avoid the loop once we introduce ->priv for eventfd_ctx. */
- list_for_each_entry(sh, &mpc->shrinkers, node) {
if (sh->efd != eventfd)
continue;
if (chunks < 0 && abs(chunks) > sh->chunks)
sh->chunks = 0;
else
sh->chunks += chunks;
break;
- }
- mutex_unlock(&mpc->shrinkers_lock);
- eventfd_ctx_put(eventfd);
- fput(file);
- return 0;
+}
+static struct cftype mpc_files[] = {
- {
.name = "level",
.read = mpc_read_level,
.register_event = mpc_register_level_event,
.unregister_event = mpc_unregister_event,
mpc_unregister_level_event for consistency.
- },
- {
.name = "shrinker",
.register_event = mpc_register_shrinker,
.unregister_event = mpc_unregister_shrinker,
.write_string = mpc_write_shrinker,
- },
- {},
+};
+struct cgroup_subsys mpc_cgroup_subsys = {
- .name = "mempressure",
- .subsys_id = mpc_cgroup_subsys_id,
- .create = mpc_create,
- .destroy = mpc_destroy,
- .base_cftypes = mpc_files,
+};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..d8ff846 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1877,6 +1877,9 @@ restart:
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
- vmpressure(sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned, nr_reclaimed);
 	/* reclaim/compaction might need reclaim to continue */
 	if (should_continue_reclaim(lruvec, nr_reclaimed,
 				    sc->nr_scanned - nr_scanned, sc))
@@ -2099,6 +2102,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	count_vm_event(ALLOCSTALL);
 	do {
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);
+		vmpressure_prio(sc->target_mem_cgroup, sc->priority);
1.8.0