[RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

List overview All Threads
Download

newer

older

[PATCH resend 0/10] Get rid of...

[ACTIVITY] (Rajanikanth H V)...

Anton Vorontsov

7 Nov 2012 7 Nov '12

10:53 a.m.

Hi all,

This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

- As well as Minchan, KOSAKI Motohiro didn't like the timers, so the timers are gone now; - Pekka Enberg didn't like the complex attributes matching code, and so it is no longer there; - Nobody liked the raw vmstat attributes, and so they were eliminated too.

But, conceptually, it is the exactly the same approach as in v2: three discrete levels of the pressure -- low, medium and oom. The levels are based on the reclaimer inefficiency index as proposed by Mel Gorman, but userland does not see the raw index values. The description why I moved away from reporting the raw 'reclaimer inefficiency index' can be found in v2: http://lkml.org/lkml/2012/10/22/177

While the new API is very simple, it is still extensible (i.e. versioned).

As there are a lot of drastic changes in the API itself, I decided to just add a new files along with vmevent, it is much easier to review it this way (I can prepare a separate patch that removes vmevent files, if we care to preserve the history through the vmevent tree).

Thanks, Anton.

Show replies by date

Anton Vorontsov

7 Nov 7 Nov

11:01 a.m.

New subject: [RFC 1/3] mm: Add VM pressure notifications

This patch introduces vmpressure_fd() system call. The system call creates a new file descriptor that can be used to monitor Linux' virtual memory management pressure. There are three discrete levels of the pressure:

VMPRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaiming activity might be useful for maintaining overall system's cache level.

VMPRESSURE_MEDIUM: The system is experiencing medium memory pressure, there might be some mild swapping activity. Upon this event applications may decide to free any resources that can be easily reconstructed or re-read from a disk.

VMPRESSURE_OOM: The system is actively thrashing, it is about to go out of memory (OOM) or even the in-kernel OOM killer is on its way to trigger. Applications should do whatever they can to help the system.

There are four sysctls to tune the behaviour of the levels:

vmevent_window vmevent_level_medium vmevent_level_oom vmevent_level_oom_priority

Currently vmevent pressure levels are based on the reclaimer inefficiency index (range from 0 to 100). The index shows the relative time spent by the kernel uselessly scanning pages, or, in other words, the percentage of scans of pages (vmevent_window) that were not reclaimed. The higher the index, the more it should be evident that new allocations' cost becomes higher.

The files vmevent_level_medium and vmevent_level_oom accept the index values (by default set to 60 and 99 respectively). A non-existent vmevent_level_low tunable is always set to 0

When index equals to 0, this means that the kernel is reclaiming, but every scanned page has been successfully reclaimed (so the pressure is low). 100 means that the kernel is trying to reclaim, but nothing can be reclaimed (OOM).

Window size is used as a rate-limit tunable for VMPRESSURE_LOW notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So, using small window sizes can cause lot of false positives for _MEDIUM and _OOM levels, but too big window size may delay notifications. By default the window size equals to 256 pages (1MB).

The _OOM level is also attached to the reclaimer's priority. When the system is almost OOM, it might be getting the last reclaimable pages slowly, scanning all the queues, and so we never catch the OOM case via window-size averaging. For this case the priority can be used to determine the pre-OOM condition, the pre-OOM priority level can be set via vmpressure_level_oom_prio sysctl.

Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- Documentation/sysctl/vm.txt | 48 ++++++++ arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 2 + include/linux/vmpressure.h | 128 ++++++++++++++++++++++ kernel/sys_ni.c | 1 + kernel/sysctl.c | 31 ++++++ mm/Kconfig | 13 +++ mm/Makefile | 1 + mm/vmpressure.c | 231 +++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 5 + 10 files changed, 461 insertions(+) create mode 100644 include/linux/vmpressure.h create mode 100644 mm/vmpressure.c

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 078701f..9837fe2 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -44,6 +44,10 @@ Currently, these files are in /proc/sys/vm: - nr_overcommit_hugepages - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order +- vmpressure_window +- vmpressure_level_medium +- vmpressure_level_oom +- vmpressure_level_oom_priority - oom_dump_tasks - oom_kill_allocating_task - overcommit_memory @@ -487,6 +491,50 @@ this is causing problems for your system/application.

==============================================================

+vmpressure_window +vmpressure_level_med +vmpressure_level_oom +vmpressure_level_oom_priority + +These sysctls are used to tune vmpressure_fd(2) behaviour. + +Currently vmpressure pressure levels are based on the reclaimer +inefficiency index (range from 0 to 100). The files vmpressure_level_med +and vmpressure_level_oom accept the index values (by default set to 60 and +99 respectively). A non-existent vmpressure_level_low tunable is always +set to 0 + +When the system is short on idle pages, the new memory is allocated by +reclaiming least recently used resources: kernel scans pages to be +reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and +potentially swapping some pages out). The index shows the relative time +spent by the kernel uselessly scanning pages, or, in other words, the +percentage of scans of pages (vmpressure_window) that were not reclaimed. +The higher the index, the more it should be evident that new allocations' +cost becomes higher. + +When index equals to 0, this means that the kernel is reclaiming, but +every scanned page has been successfully reclaimed (so the pressure is +low). 100 means that the kernel is trying to reclaim, but nothing can be +reclaimed (close to OOM). + +Window size is used as a rate-limit tunable for VMPRESSURE_LOW +notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So, +using small window sizes can cause lot of false positives for _MEDIUM and +_OOM levels, but too big window size may delay notifications. By default +the window size equals to 256 pages (1MB). + +When the system is almost OOM it might be getting the last reclaimable +pages slowly, scanning all the queues, and so we never catch the OOM case +via window-size averaging. For this case there is another mechanism of +detecting the pre-OOM conditions: kernel's reclaimer has a scanning +priority, the higest priority is 0 (reclaimer will scan all the available +pages). Kernel starts scanning with priority set to 12 (queue_length >> +12). So, vmpressure_level_oom_prio should be between 0 and 12 (by default +it is set to 4). + +============================================================== + oom_dump_tasks

Enables a system-wide task dump (excluding kernel threads) to be diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 316449a..6e4fa6a 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -320,6 +320,7 @@ 311 64 process_vm_writev sys_process_vm_writev 312 common kcmp sys_kcmp 313 64 vmevent_fd sys_vmevent_fd +314 64 vmpressure_fd sys_vmpressure_fd

# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 19439c7..3d2587d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -63,6 +63,7 @@ struct getcpu_cache; struct old_linux_dirent; struct perf_event_attr; struct file_handle; +struct vmpressure_config;

#include <linux/types.h> #include <linux/aio_abi.h> @@ -860,4 +861,5 @@ asmlinkage long sys_process_vm_writev(pid_t pid,

asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2); +asmlinkage long sys_vmpressure_fd(struct vmpressure_config __user *config); #endif diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h new file mode 100644 index 0000000..b808b04 --- /dev/null +++ b/include/linux/vmpressure.h @@ -0,0 +1,128 @@ +/* + * Linux VM pressure notifications + * + * Copyright 2011-2012 Pekka Enberg penberg@kernel.org + * Copyright 2011-2012 Linaro Ltd. + * Anton Vorontsov anton.vorontsov@linaro.org + * + * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, + * Minchan Kim and Pekka Enberg. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +#ifndef _LINUX_VMPRESSURE_H +#define _LINUX_VMPRESSURE_H + +#include <linux/types.h> + +/** + * enum vmpressure_level - Memory pressure levels + * @VMPRESSURE_LOW: The system is short on idle pages, losing caches + * @VMPRESSURE_MEDIUM: New allocations' cost becomes high + * @VMPRESSURE_OOM: The system is about to go out-of-memory + */ +enum vmpressure_level { + /* We spread the values, reserving room for new levels. */ + VMPRESSURE_LOW = 1 << 10, + VMPRESSURE_MEDIUM = 1 << 20, + VMPRESSURE_OOM = 1 << 30, +}; + +/** + * struct vmpressure_config - Configuration structure for vmpressure_fd() + * @size: Size of the struct for ABI extensibility + * @threshold: Minimum pressure level of notifications + * + * This structure is used to configure the file descriptor that + * vmpressure_fd() returns. + * + * @size is used to "version" the ABI, it must be initialized to + * 'sizeof(struct vmpressure_config)'. + * + * @threshold should be one of @vmpressure_level values, and specifies + * minimal level of notification that will be delivered. + */ +struct vmpressure_config { + __u32 size; + __u32 threshold; +}; + +/** + * struct vmpressure_event - An event that is returned via vmpressure fd + * @pressure: Most recent system's pressure level + * + * Upon notification, this structure must be read from the vmpressure file + * descriptor. + */ +struct vmpressure_event { + __u32 pressure; +}; + +#ifdef __KERNEL__ + +struct mem_cgroup; + +#ifdef CONFIG_VMPRESSURE + +extern uint vmpressure_win; +extern uint vmpressure_level_med; +extern uint vmpressure_level_oom; +extern uint vmpressure_level_oom_prio; + +extern void __vmpressure(struct mem_cgroup *memcg, + ulong scanned, ulong reclaimed); +static void vmpressure(struct mem_cgroup *memcg, + ulong scanned, ulong reclaimed); + +/* + * OK, we're cheating. The thing is, we have to average s/r ratio by + * gathering a lot of scans (otherwise we might get some local + * false-positives index of '100'). + * + * But... when we're almost OOM we might be getting the last reclaimable + * pages slowly, scanning all the queues, and so we never catch the OOM + * case via averaging. Although the priority will show it for sure. The + * pre-OOM priority value is mostly an empirically taken priority: we + * never observe it under any load, except for last few allocations before + * the OOM (but the exact value is still configurable via sysctl). + */ +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) +{ + if (prio > vmpressure_level_oom_prio) + return; + + /* OK, the prio is below the threshold, send the pre-OOM event. */ + vmpressure(memcg, vmpressure_win, 0); +} + +#else +static inline void __vmpressure(struct mem_cgroup *memcg, + ulong scanned, ulong reclaimed) {} +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {} +#endif /* CONFIG_VMPRESSURE */ + +static inline void vmpressure(struct mem_cgroup *memcg, + ulong scanned, ulong reclaimed) +{ + if (!scanned) + return; + + if (IS_BUILTIN(CONFIG_MEMCG) && memcg) { + /* + * The vmpressure API reports system pressure, for per-cgroup + * pressure, we'll chain cgroups notifications, this is to + * be implemented. + * + * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed); + */ + return; + } + __vmpressure(memcg, scanned, reclaimed); +} + +#endif /* __KERNEL__ */ + +#endif /* _LINUX_VMPRESSURE_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 3ccdbf4..9573a5a 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -192,6 +192,7 @@ cond_syscall(compat_sys_timerfd_gettime); cond_syscall(sys_eventfd); cond_syscall(sys_eventfd2); cond_syscall(sys_vmevent_fd); +cond_syscall(sys_vmpressure_fd);

/* performance counters: */ cond_syscall(sys_perf_event_open); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 87174ef..7c9a3be 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -50,6 +50,7 @@ #include <linux/dnotify.h> #include <linux/syscalls.h> #include <linux/vmstat.h> +#include <linux/vmpressure.h> #include <linux/nfs_fs.h> #include <linux/acpi.h> #include <linux/reboot.h> @@ -1317,6 +1318,36 @@ static struct ctl_table vm_table[] = { .proc_handler = numa_zonelist_order_handler, }, #endif +#ifdef CONFIG_VMPRESSURE + { + .procname = "vmpressure_window", + .data = &vmpressure_win, + .maxlen = sizeof(vmpressure_win), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "vmpressure_level_medium", + .data = &vmpressure_level_med, + .maxlen = sizeof(vmpressure_level_med), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "vmpressure_level_oom", + .data = &vmpressure_level_oom, + .maxlen = sizeof(vmpressure_level_oom), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "vmpressure_level_oom_priority", + .data = &vmpressure_level_oom_prio, + .maxlen = sizeof(vmpressure_level_oom_prio), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \ (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL)) { diff --git a/mm/Kconfig b/mm/Kconfig index cd0ea24e..8a47a5f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -401,6 +401,19 @@ config VMEVENT help If unsure, say N to disable vmevent

+config VMPRESSURE + bool "Enable vmpressure_fd() notifications" + help + This option enables vmpressure_fd() system call, it is used to + notify userland applications about system's virtual memory + pressure state. + + Upon these notifications, userland programs can cooperate with + the kernel (e.g. free easily reclaimable resources), and so + achieving better system's memory management. + + If unsure, say N. + config FRONTSWAP bool "Enable frontswap to cache swap pages if tmem is present" depends on SWAP diff --git a/mm/Makefile b/mm/Makefile index 80debc7..2f08d14 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -57,4 +57,5 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_VMEVENT) += vmevent.o +obj-$(CONFIG_VMPRESSURE) += vmpressure.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o diff --git a/mm/vmpressure.c b/mm/vmpressure.c new file mode 100644 index 0000000..54f35a3 --- /dev/null +++ b/mm/vmpressure.c @@ -0,0 +1,231 @@ +/* + * Linux VM pressure notifications + * + * Copyright 2011-2012 Pekka Enberg penberg@kernel.org + * Copyright 2011-2012 Linaro Ltd. + * Anton Vorontsov anton.vorontsov@linaro.org + * + * Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, + * Minchan Kim and Pekka Enberg. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +#include <linux/anon_inodes.h> +#include <linux/atomic.h> +#include <linux/compiler.h> +#include <linux/vmpressure.h> +#include <linux/syscalls.h> +#include <linux/workqueue.h> +#include <linux/mutex.h> +#include <linux/file.h> +#include <linux/list.h> +#include <linux/poll.h> +#include <linux/slab.h> +#include <linux/swap.h> + +struct vmpressure_watch { + struct vmpressure_config config; + atomic_t pending; + wait_queue_head_t waitq; + struct list_head node; +}; + +static atomic64_t vmpressure_sr; +static uint vmpressure_val; + +static LIST_HEAD(vmpressure_watchers); +static DEFINE_MUTEX(vmpressure_watchers_lock); + +/* Our sysctl tunables, see Documentation/sysctl/vm.txt */ +uint __read_mostly vmpressure_win = SWAP_CLUSTER_MAX * 16; +uint vmpressure_level_med = 60; +uint vmpressure_level_oom = 99; +uint vmpressure_level_oom_prio = 4; + +/* + * This function is called from a workqueue, which can have only one + * execution thread, so we don't need to worry about racing w/ ourselves. + * And so it possible to implement the lock-free logic, using just the + * atomic watch->pending variable. + */ +static void vmpressure_sample(struct vmpressure_watch *watch) +{ + if (atomic_read(&watch->pending)) + return; + if (vmpressure_val < watch->config.threshold) + return; + + atomic_set(&watch->pending, 1); + wake_up(&watch->waitq); +} + +static u64 vmpressure_level(uint pressure) +{ + if (pressure >= vmpressure_level_oom) + return VMPRESSURE_OOM; + else if (pressure >= vmpressure_level_med) + return VMPRESSURE_MEDIUM; + return VMPRESSURE_LOW; +} + +static uint vmpressure_calc_pressure(uint win, uint s, uint r) +{ + ulong p; + + /* + * We calculate the ratio (in percents) of how many pages were + * scanned vs. reclaimed in a given time frame (window). Note that + * time is in VM reclaimer's "ticks", i.e. number of pages + * scanned. This makes it possible set desired reaction time and + * serves as a ratelimit. + */ + p = win - (r * win / s); + p = p * 100 / win; + + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r); + + return vmpressure_level(p); +} + +#define VMPRESSURE_SCANNED_SHIFT (sizeof(u64) * 8 / 2) + +static void vmpressure_wk_fn(struct work_struct *wk) +{ + struct vmpressure_watch *watch; + u64 sr = atomic64_xchg(&vmpressure_sr, 0); + u32 s = sr >> VMPRESSURE_SCANNED_SHIFT; + u32 r = sr & (((u64)1 << VMPRESSURE_SCANNED_SHIFT) - 1); + + vmpressure_val = vmpressure_calc_pressure(vmpressure_win, s, r); + + mutex_lock(&vmpressure_watchers_lock); + list_for_each_entry(watch, &vmpressure_watchers, node) + vmpressure_sample(watch); + mutex_unlock(&vmpressure_watchers_lock); +} +static DECLARE_WORK(vmpressure_wk, vmpressure_wk_fn); + +void __vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed) +{ + /* + * Store s/r combined, so we don't have to worry to synchronize + * them. On modern machines it will be truly atomic; on arches w/o + * 64 bit atomics it will turn into a spinlock (for a small amount + * of CPUs it's not a problem). + * + * Using int-sized atomics is a bad idea as it would only allow to + * count (1 << 16) - 1 pages (256MB), which we can scan pretty + * fast. + * + * We can't have per-CPU counters as this will not catch a case + * when many CPUs scan small amounts (so none of them hit the + * window size limit, and thus we won't send a notification in + * time). + * + * So we shouldn't place vmpressure() into a very hot path. + */ + atomic64_add(scanned << VMPRESSURE_SCANNED_SHIFT | reclaimed, + &vmpressure_sr); + + scanned = atomic64_read(&vmpressure_sr) >> VMPRESSURE_SCANNED_SHIFT; + if (scanned >= vmpressure_win && !work_pending(&vmpressure_wk)) + schedule_work(&vmpressure_wk); +} + +static uint vmpressure_poll(struct file *file, poll_table *wait) +{ + struct vmpressure_watch *watch = file->private_data; + + poll_wait(file, &watch->waitq, wait); + + return atomic_read(&watch->pending) ? POLLIN : 0; +} + +static ssize_t vmpressure_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct vmpressure_watch *watch = file->private_data; + struct vmpressure_event event; + int ret; + + if (count < sizeof(event)) + return -EINVAL; + + ret = wait_event_interruptible(watch->waitq, + atomic_read(&watch->pending)); + if (ret) + return ret; + + event.pressure = vmpressure_val; + if (copy_to_user(buf, &event, sizeof(event))) + return -EFAULT; + + atomic_set(&watch->pending, 0); + + return count; +} + +static int vmpressure_release(struct inode *inode, struct file *file) +{ + struct vmpressure_watch *watch = file->private_data; + + mutex_lock(&vmpressure_watchers_lock); + list_del(&watch->node); + mutex_unlock(&vmpressure_watchers_lock); + + kfree(watch); + return 0; +} + +static const struct file_operations vmpressure_fops = { + .poll = vmpressure_poll, + .read = vmpressure_read, + .release = vmpressure_release, +}; + +SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config) +{ + struct vmpressure_watch *watch; + struct file *file; + int ret; + int fd; + + watch = kzalloc(sizeof(*watch), GFP_KERNEL); + if (!watch) + return -ENOMEM; + + ret = copy_from_user(&watch->config, config, sizeof(*config)); + if (ret) + goto err_free; + + fd = get_unused_fd_flags(O_RDONLY); + if (fd < 0) { + ret = fd; + goto err_free; + } + + file = anon_inode_getfile("[vmpressure]", &vmpressure_fops, watch, + O_RDONLY); + if (IS_ERR(file)) { + ret = PTR_ERR(file); + goto err_fd; + } + + fd_install(fd, file); + + init_waitqueue_head(&watch->waitq); + + mutex_lock(&vmpressure_watchers_lock); + list_add(&watch->node, &vmpressure_watchers); + mutex_unlock(&vmpressure_watchers_lock); + + return fd; +err_fd: + put_unused_fd(fd); +err_free: + kfree(watch); + return ret; +} diff --git a/mm/vmscan.c b/mm/vmscan.c index 99b434b..5439117 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -20,6 +20,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/vmstat.h> +#include <linux/vmpressure.h> #include <linux/file.h> #include <linux/writeback.h> #include <linux/blkdev.h> @@ -1846,6 +1847,9 @@ restart: shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);

+ vmpressure(sc->target_mem_cgroup, + sc->nr_scanned - nr_scanned, nr_reclaimed); + /* reclaim/compaction might need reclaim to continue */ if (should_continue_reclaim(lruvec, nr_reclaimed, sc->nr_scanned - nr_scanned, sc)) @@ -2068,6 +2072,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, count_vm_event(ALLOCSTALL);

do { + vmpressure_prio(sc->target_mem_cgroup, sc->priority); sc->nr_scanned = 0; aborted_reclaim = shrink_zones(zonelist, sc);

-- 1.8.0

Mel Gorman

8 Nov 8 Nov

5:01 p.m.

New subject: [RFC 1/3] mm: Add VM pressure notifications

(Sorry about being very late reviewing this)

On Wed, Nov 07, 2012 at 03:01:28AM -0800, Anton Vorontsov wrote:

...

This patch introduces vmpressure_fd() system call. The system call creates a new file descriptor that can be used to monitor Linux' virtual memory management pressure. There are three discrete levels of the pressure:

Why was eventfd unsuitable? It's a bit trickier to use but there are examples in the kernel where an application is required to do something like

1. open eventfd 2. open a control file, say /proc/sys/vm/vmpressure or if cgroups /sys/fs/cgroup/something/vmpressure 3. write fd_event fd_control [low|medium|oom]. Can be a binary structure you write

and then poll the eventfd. The trickiness is awkward but a library implementation of vmpressure_fd() that mapped onto eventfd properly should be trivial.

I confess I'm not super familiar with eventfd and if this can actually work in practice but I found the introduction of a dedicated syscall surprising. Apologies if this has been discussed already. If it was, it should be in the changelog to prevent stupid questions from drive-by reviewers.

...

VMPRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaiming activity might be useful for maintaining overall system's cache level.

If you do another revision, add a caveat that a streaming reader might be enough to trigger this level. It's not necessarily a problem of course.

...

VMPRESSURE_MEDIUM: The system is experiencing medium memory pressure, there might be some mild swapping activity. Upon this event applications may decide to free any resources that can be easily reconstructed or re-read from a disk.

Good.

...

VMPRESSURE_OOM: The system is actively thrashing, it is about to go out of memory (OOM) or even the in-kernel OOM killer is on its way to trigger. Applications should do whatever they can to help the system.

Good.

...

There are four sysctls to tune the behaviour of the levels:

vmevent_window vmevent_level_medium vmevent_level_oom vmevent_level_oom_priority

Superficially these feel like the might expose implementation details of the pressure implementation and therby indirectly expose the internals of the VM. Should these be debugfs instead of sysctls that spit out a warning if used so it generates a bug report? That won't stop someone depending on them anyway but if these values are changed we should immediately hear why it was necessary.

...

Currently vmevent pressure levels are based on the reclaimer inefficiency index (range from 0 to 100). The index shows the relative time spent by the kernel uselessly scanning pages, or, in other words, the percentage of scans of pages (vmevent_window) that were not reclaimed. The higher the index, the more it should be evident that new allocations' cost becomes higher.

Good.

...

The files vmevent_level_medium and vmevent_level_oom accept the index values (by default set to 60 and 99 respectively). A non-existent vmevent_level_low tunable is always set to 0

When index equals to 0, this means that the kernel is reclaiming, but every scanned page has been successfully reclaimed (so the pressure is low). 100 means that the kernel is trying to reclaim, but nothing can be reclaimed (OOM).

Window size is used as a rate-limit tunable for VMPRESSURE_LOW notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So, using small window sizes can cause lot of false positives for _MEDIUM and _OOM levels, but too big window size may delay notifications. By default the window size equals to 256 pages (1MB).

I think it would be reasonable to leave the window as a sysctl but rename it vmpressure_sensitivity. Tuning it to be very "sensitive" would initially be implemented as the window shrinking.

...

The _OOM level is also attached to the reclaimer's priority. When the system is almost OOM, it might be getting the last reclaimable pages slowly, scanning all the queues, and so we never catch the OOM case via window-size averaging. For this case the priority can be used to determine the pre-OOM condition, the pre-OOM priority level can be set via vmpressure_level_oom_prio sysctl.

Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org

Documentation/sysctl/vm.txt | 48 ++++++++ arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 2 + include/linux/vmpressure.h | 128 ++++++++++++++++++++++ kernel/sys_ni.c | 1 + kernel/sysctl.c | 31 ++++++ mm/Kconfig | 13 +++ mm/Makefile | 1 + mm/vmpressure.c | 231 +++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 5 + 10 files changed, 461 insertions(+) create mode 100644 include/linux/vmpressure.h create mode 100644 mm/vmpressure.c

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 078701f..9837fe2 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -44,6 +44,10 @@ Currently, these files are in /proc/sys/vm:

nr_overcommit_hugepages

nr_trim_pages (only if CONFIG_MMU=n)

numa_zonelist_order

+- vmpressure_window +- vmpressure_level_medium +- vmpressure_level_oom +- vmpressure_level_oom_priority

oom_dump_tasks

oom_kill_allocating_task

overcommit_memory

@@ -487,6 +491,50 @@ this is causing problems for your system/application. ============================================================== +vmpressure_window +vmpressure_level_med +vmpressure_level_oom +vmpressure_level_oom_priority

+These sysctls are used to tune vmpressure_fd(2) behaviour.

Ok, I'm ok with FD being the interface. I think it makes sense and means it can be used with select or poll.

...

+Currently vmpressure pressure levels are based on the reclaimer +inefficiency index (range from 0 to 100). The files vmpressure_level_med +and vmpressure_level_oom accept the index values (by default set to 60 and +99 respectively). A non-existent vmpressure_level_low tunable is always +set to 0

+When the system is short on idle pages, the new memory is allocated by +reclaiming least recently used resources: kernel scans pages to be +reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and +potentially swapping some pages out). The index shows the relative time +spent by the kernel uselessly scanning pages, or, in other words, the +percentage of scans of pages (vmpressure_window) that were not reclaimed. +The higher the index, the more it should be evident that new allocations' +cost becomes higher.

+When index equals to 0, this means that the kernel is reclaiming, but +every scanned page has been successfully reclaimed (so the pressure is +low). 100 means that the kernel is trying to reclaim, but nothing can be +reclaimed (close to OOM).

+Window size is used as a rate-limit tunable for VMPRESSURE_LOW +notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So, +using small window sizes can cause lot of false positives for _MEDIUM and +_OOM levels, but too big window size may delay notifications. By default +the window size equals to 256 pages (1MB).

+When the system is almost OOM it might be getting the last reclaimable +pages slowly, scanning all the queues, and so we never catch the OOM case +via window-size averaging. For this case there is another mechanism of +detecting the pre-OOM conditions: kernel's reclaimer has a scanning +priority, the higest priority is 0 (reclaimer will scan all the available +pages). Kernel starts scanning with priority set to 12 (queue_length >> +12). So, vmpressure_level_oom_prio should be between 0 and 12 (by default +it is set to 4).

Sounds good. Again, be careful on how much implementation detail you expose to the interface. I think the actual user-visible interface should be low, medium, high with a sensitivity tunable but the ranges and window sizes hidden away (or at least in debugfs).

...

+==============================================================

oom_dump_tasks Enables a system-wide task dump (excluding kernel threads) to be diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 316449a..6e4fa6a 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -320,6 +320,7 @@ 311 64 process_vm_writev sys_process_vm_writev 312 common kcmp sys_kcmp 313 64 vmevent_fd sys_vmevent_fd +314 64 vmpressure_fd sys_vmpressure_fd # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 19439c7..3d2587d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -63,6 +63,7 @@ struct getcpu_cache; struct old_linux_dirent; struct perf_event_attr; struct file_handle; +struct vmpressure_config; #include <linux/types.h> #include <linux/aio_abi.h> @@ -860,4 +861,5 @@ asmlinkage long sys_process_vm_writev(pid_t pid, asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2); +asmlinkage long sys_vmpressure_fd(struct vmpressure_config __user *config); #endif diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h new file mode 100644 index 0000000..b808b04 --- /dev/null +++ b/include/linux/vmpressure.h @@ -0,0 +1,128 @@ +/*
Linux VM pressure notifications

Copyright 2011-2012 Pekka Enberg penberg@kernel.org

Copyright 2011-2012 Linaro Ltd.
       Anton Vorontsov <anton.vorontsov@linaro.org>
Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,

Minchan Kim and Pekka Enberg.

This program is free software; you can redistribute it and/or modify it

under the terms of the GNU General Public License version 2 as published

by the Free Software Foundation.

*/
+#ifndef _LINUX_VMPRESSURE_H +#define _LINUX_VMPRESSURE_H

+#include <linux/types.h>

+/**

enum vmpressure_level - Memory pressure levels

@VMPRESSURE_LOW: The system is short on idle pages, losing caches

@VMPRESSURE_MEDIUM: New allocations' cost becomes high

@VMPRESSURE_OOM: The system is about to go out-of-memory

*/

+enum vmpressure_level {

/* We spread the values, reserving room for new levels. */

VMPRESSURE_LOW = 1 << 10,

VMPRESSURE_MEDIUM = 1 << 20,

VMPRESSURE_OOM = 1 << 30,

+};

Once again, be careful on what you expose to userspace. Bear in mind these are compiled to to maintain binary compatability the user-visible structure should be plain enums.

enum vmpressure_level { VM_PRESSURE_LOW, VM_PRESSURE_MEDIUM, VM_PRESSURE_OOM };

These should then be mapped to a kernel internal ranges

enum __vmpressure_level_range_internal { __VM_PRESSURE_LOW = 1<< 10, __VM_PRESSURE_MEDIUM = 1 << 20, }

That allows the kernel internal ranges to change without worrying about userspace compatability.

This comment would apply even if you used eventfd.

I don't mean to bitch about exposing implementation details but a stated goal of this interface was to avoid having applications aware of VM implementation details.

...

+/**

struct vmpressure_config - Configuration structure for vmpressure_fd()

@size: Size of the struct for ABI extensibility

@threshold: Minimum pressure level of notifications

This structure is used to configure the file descriptor that

vmpressure_fd() returns.

@size is used to "version" the ABI, it must be initialized to

'sizeof(struct vmpressure_config)'.

@threshold should be one of @vmpressure_level values, and specifies

minimal level of notification that will be delivered.

*/

+struct vmpressure_config {

__u32 size;

__u32 threshold;

+};

Again I suspect this might be compatible with eventfd. The writing of the eventfd just needs to handle a binary structure instead of strings without having to introduce a dedicated system call.

The versioning of the structure is not a bad idea though but don't use "size". Use a magic value for the high bits and a number of the low bits and #define it VMPRESSURE_NOTIFY_MAGIC1

...

+/**

struct vmpressure_event - An event that is returned via vmpressure fd

@pressure: Most recent system's pressure level

Upon notification, this structure must be read from the vmpressure file

descriptor.

*/

+struct vmpressure_event {

__u32 pressure;

+};

What is the meaning of "pressure" as returned to userspace?

Would it be better if userspace just received an event when the requested threshold was reached but when it reads it just gets a single 0 byte that should not be interpreted?

I say this because the application can only request low, medium or OOM but gets a number back. How should it intepret that number? The value of the number depends on sysctl files and I fear that applications will end up making decisions on the implementation again.

I think it would be a lot safer for Pressure ABI v1 to return only 0 here and see how far that gets. If Android has already gone through this process and *know* they need this number then it should be documented.

If this has already been discussed, it should also be documented :P

I see from a debugging perspective why it might be handy to monitor pressure over time. If so, then maybe a debugfs file would help with a CLEAR warning that no application should depend on its existance (make it depend on CONFIG_DEBUG_VMPRESSURE && CONFIG_DEBUG_VM or something).

...

+#ifdef __KERNEL__

+struct mem_cgroup;

+#ifdef CONFIG_VMPRESSURE

+extern uint vmpressure_win; +extern uint vmpressure_level_med; +extern uint vmpressure_level_oom; +extern uint vmpressure_level_oom_prio;

+extern void __vmpressure(struct mem_cgroup *memcg,
	 ulong scanned, ulong reclaimed);
+static void vmpressure(struct mem_cgroup *memcg,
       ulong scanned, ulong reclaimed);
+/*

OK, we're cheating. The thing is, we have to average s/r ratio by

gathering a lot of scans (otherwise we might get some local

false-positives index of '100').

But... when we're almost OOM we might be getting the last reclaimable

pages slowly, scanning all the queues, and so we never catch the OOM

case via averaging. Although the priority will show it for sure. The

pre-OOM priority value is mostly an empirically taken priority: we

never observe it under any load, except for last few allocations before

the OOM (but the exact value is still configurable via sysctl).

*/

+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) +{
if (prio > vmpressure_level_oom_prio)
return;
/* OK, the prio is below the threshold, send the pre-OOM event. */

vmpressure(memcg, vmpressure_win, 0);
+}

+#else +static inline void __vmpressure(struct mem_cgroup *memcg,
		ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {} +#endif /* CONFIG_VMPRESSURE */

+static inline void vmpressure(struct mem_cgroup *memcg,
	      ulong scanned, ulong reclaimed)
+{
if (!scanned)
return;
if (IS_BUILTIN(CONFIG_MEMCG) && memcg) {
/*
 * The vmpressure API reports system pressure, for per-cgroup
 * pressure, we'll chain cgroups notifications, this is to
 * be implemented.
 *
 * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed);
 */
return;
}

__vmpressure(memcg, scanned, reclaimed);
+}

Ok. Personally I'm ok with memcg support not existing initially. If we can't get the global case right, then the memcg case is impossible.

...

+#endif /* __KERNEL__ */

+#endif /* _LINUX_VMPRESSURE_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 3ccdbf4..9573a5a 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -192,6 +192,7 @@ cond_syscall(compat_sys_timerfd_gettime); cond_syscall(sys_eventfd); cond_syscall(sys_eventfd2); cond_syscall(sys_vmevent_fd); +cond_syscall(sys_vmpressure_fd); /* performance counters: */ cond_syscall(sys_perf_event_open); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 87174ef..7c9a3be 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -50,6 +50,7 @@ #include <linux/dnotify.h> #include <linux/syscalls.h> #include <linux/vmstat.h> +#include <linux/vmpressure.h> #include <linux/nfs_fs.h> #include <linux/acpi.h> #include <linux/reboot.h> @@ -1317,6 +1318,36 @@ static struct ctl_table vm_table[] = { .proc_handler = numa_zonelist_order_handler, }, #endif +#ifdef CONFIG_VMPRESSURE
{
.procname	= "vmpressure_window",
.data		= &vmpressure_win,
.maxlen		= sizeof(vmpressure_win),
.mode		= 0644,
.proc_handler	= proc_dointvec,
},

{
.procname	= "vmpressure_level_medium",
.data		= &vmpressure_level_med,
.maxlen		= sizeof(vmpressure_level_med),
.mode		= 0644,
.proc_handler	= proc_dointvec,
},

{
.procname	= "vmpressure_level_oom",
.data		= &vmpressure_level_oom,
.maxlen		= sizeof(vmpressure_level_oom),
.mode		= 0644,
.proc_handler	= proc_dointvec,
},

{
.procname	= "vmpressure_level_oom_priority",
.data		= &vmpressure_level_oom_prio,
.maxlen		= sizeof(vmpressure_level_oom_prio),
.mode		= 0644,
.proc_handler	= proc_dointvec,
},
+#endif

Talked about this and why I think they should be debugfs already.

...

#if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \ (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL)) { diff --git a/mm/Kconfig b/mm/Kconfig index cd0ea24e..8a47a5f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -401,6 +401,19 @@ config VMEVENT help If unsure, say N to disable vmevent +config VMPRESSURE
bool "Enable vmpressure_fd() notifications"

help
 This option enables vmpressure_fd() system call, it is used to
 notify userland applications about system's virtual memory
 pressure state.
 Upon these notifications, userland programs can cooperate with
 the kernel (e.g. free easily reclaimable resources), and so
 achieving better system's memory management.
 If unsure, say N.

If anything I think this should be default Y. If Android benefits from it, it's plausible that normal desktops might and failing that, monitoring applications on server workloads will. With default N, it's going to be missed by distributions.

I think making it configurable at all is overkill -- maybe the debugfs parts but otherwise build it.

...

config FRONTSWAP bool "Enable frontswap to cache swap pages if tmem is present" depends on SWAP diff --git a/mm/Makefile b/mm/Makefile index 80debc7..2f08d14 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -57,4 +57,5 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_VMEVENT) += vmevent.o +obj-$(CONFIG_VMPRESSURE) += vmpressure.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o diff --git a/mm/vmpressure.c b/mm/vmpressure.c new file mode 100644 index 0000000..54f35a3 --- /dev/null +++ b/mm/vmpressure.c @@ -0,0 +1,231 @@ +/*
Linux VM pressure notifications

Copyright 2011-2012 Pekka Enberg penberg@kernel.org

Copyright 2011-2012 Linaro Ltd.
       Anton Vorontsov <anton.vorontsov@linaro.org>
Based on ideas from KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman,

Minchan Kim and Pekka Enberg.

This program is free software; you can redistribute it and/or modify it

under the terms of the GNU General Public License version 2 as published

by the Free Software Foundation.

*/
+#include <linux/anon_inodes.h> +#include <linux/atomic.h> +#include <linux/compiler.h> +#include <linux/vmpressure.h> +#include <linux/syscalls.h> +#include <linux/workqueue.h> +#include <linux/mutex.h> +#include <linux/file.h> +#include <linux/list.h> +#include <linux/poll.h> +#include <linux/slab.h> +#include <linux/swap.h>

+struct vmpressure_watch {

struct vmpressure_config config;

atomic_t pending;

wait_queue_head_t waitq;

struct list_head node;

+};

+static atomic64_t vmpressure_sr; +static uint vmpressure_val;

+static LIST_HEAD(vmpressure_watchers); +static DEFINE_MUTEX(vmpressure_watchers_lock);

Superficially, this looks like a custom implementation of a chain notifier (include/linux/notifier.h). It's not something the VM makes much use of other than the OOM killer but it's there.

...

+/* Our sysctl tunables, see Documentation/sysctl/vm.txt */ +uint __read_mostly vmpressure_win = SWAP_CLUSTER_MAX * 16; +uint vmpressure_level_med = 60; +uint vmpressure_level_oom = 99; +uint vmpressure_level_oom_prio = 4;

+/*

This function is called from a workqueue, which can have only one

execution thread, so we don't need to worry about racing w/ ourselves.

And so it possible to implement the lock-free logic, using just the

atomic watch->pending variable.

*/

+static void vmpressure_sample(struct vmpressure_watch *watch) +{
if (atomic_read(&watch->pending))
return;
if (vmpressure_val < watch->config.threshold)
return;
atomic_set(&watch->pending, 1);

wake_up(&watch->waitq);
+}

+static u64 vmpressure_level(uint pressure) +{
if (pressure >= vmpressure_level_oom)
return VMPRESSURE_OOM;
else if (pressure >= vmpressure_level_med)
return VMPRESSURE_MEDIUM;
return VMPRESSURE_LOW;
+}

+static uint vmpressure_calc_pressure(uint win, uint s, uint r) +{
ulong p;

/*
* We calculate the ratio (in percents) of how many pages were
* scanned vs. reclaimed in a given time frame (window). Note that
* time is in VM reclaimer's "ticks", i.e. number of pages
* scanned. This makes it possible set desired reaction time and
* serves as a ratelimit.
*/
p = win - (r * win / s);

p = p * 100 / win;

pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);

return vmpressure_level(p);
+}

Ok!

...

+#define VMPRESSURE_SCANNED_SHIFT (sizeof(u64) * 8 / 2)

+static void vmpressure_wk_fn(struct work_struct *wk) +{
struct vmpressure_watch *watch;

u64 sr = atomic64_xchg(&vmpressure_sr, 0);

u32 s = sr >> VMPRESSURE_SCANNED_SHIFT;

u32 r = sr & (((u64)1 << VMPRESSURE_SCANNED_SHIFT) - 1);

vmpressure_val = vmpressure_calc_pressure(vmpressure_win, s, r);

mutex_lock(&vmpressure_watchers_lock);

list_for_each_entry(watch, &vmpressure_watchers, node)
vmpressure_sample(watch);
mutex_unlock(&vmpressure_watchers_lock);
+}

So, if you used notifiers I think this would turn into a blocking_notifier_call_chain() probably. Maybe atomic_notifier_call_chain() depending.

...

+static DECLARE_WORK(vmpressure_wk, vmpressure_wk_fn);

+void __vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed) +{
/*
* Store s/r combined, so we don't have to worry to synchronize
* them. On modern machines it will be truly atomic; on arches w/o
* 64 bit atomics it will turn into a spinlock (for a small amount
* of CPUs it's not a problem).
*
* Using int-sized atomics is a bad idea as it would only allow to
* count (1 << 16) - 1 pages (256MB), which we can scan pretty
* fast.
*
* We can't have per-CPU counters as this will not catch a case
* when many CPUs scan small amounts (so none of them hit the
* window size limit, and thus we won't send a notification in
* time).
*
* So we shouldn't place vmpressure() into a very hot path.
*/
atomic64_add(scanned << VMPRESSURE_SCANNED_SHIFT | reclaimed,
     &vmpressure_sr);
scanned = atomic64_read(&vmpressure_sr) >> VMPRESSURE_SCANNED_SHIFT;

if (scanned >= vmpressure_win && !work_pending(&vmpressure_wk))
schedule_work(&vmpressure_wk);
+}

So after all this, I'm ok with the actual calculation of pressure part and when userspace gets woken up. I'm *WAY* happier with this than I was with notifiers based on free memory so for *just* that part

Acked-by: Mel Gorman mgorman@suse.de

I'm less keen on the actual interface and have explained why but it's up to other people to say whether they feel the same way. If Pekka and the Android people are ok with the interface then I won't object. However, if eventfd cannot be used and a system call really is required then it should be explained *very* carefully in the changelog or it'll just get snagged by another reviewer.

...

+static uint vmpressure_poll(struct file *file, poll_table *wait) +{

struct vmpressure_watch *watch = file->private_data;

poll_wait(file, &watch->waitq, wait);

return atomic_read(&watch->pending) ? POLLIN : 0;

+}

+static ssize_t vmpressure_read(struct file *file, char __user *buf,
	       size_t count, loff_t *ppos)
+{
struct vmpressure_watch *watch = file->private_data;

struct vmpressure_event event;

int ret;

if (count < sizeof(event))
return -EINVAL;
ret = wait_event_interruptible(watch->waitq,
		       atomic_read(&watch->pending));
if (ret)
return ret;
event.pressure = vmpressure_val;

if (copy_to_user(buf, &event, sizeof(event)))
return -EFAULT;
atomic_set(&watch->pending, 0);

return count;
+}

+static int vmpressure_release(struct inode *inode, struct file *file) +{

struct vmpressure_watch *watch = file->private_data;

mutex_lock(&vmpressure_watchers_lock);

list_del(&watch->node);

mutex_unlock(&vmpressure_watchers_lock);

kfree(watch);

return 0;

+}

+static const struct file_operations vmpressure_fops = {

.poll = vmpressure_poll,

.read = vmpressure_read,

.release = vmpressure_release,

+};

+SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config) +{
struct vmpressure_watch *watch;

struct file *file;

int ret;

int fd;

watch = kzalloc(sizeof(*watch), GFP_KERNEL);

if (!watch)
return -ENOMEM;
ret = copy_from_user(&watch->config, config, sizeof(*config));

if (ret)
goto err_free;
fd = get_unused_fd_flags(O_RDONLY);

if (fd < 0) {
ret = fd;
goto err_free;
}

file = anon_inode_getfile("[vmpressure]", &vmpressure_fops, watch,
		  O_RDONLY);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto err_fd;
}

fd_install(fd, file);

init_waitqueue_head(&watch->waitq);

mutex_lock(&vmpressure_watchers_lock);

list_add(&watch->node, &vmpressure_watchers);

mutex_unlock(&vmpressure_watchers_lock);

return fd;
+err_fd:

put_unused_fd(fd);

+err_free:

kfree(watch);

return ret;

+} diff --git a/mm/vmscan.c b/mm/vmscan.c index 99b434b..5439117 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -20,6 +20,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/vmstat.h> +#include <linux/vmpressure.h> #include <linux/file.h> #include <linux/writeback.h> #include <linux/blkdev.h> @@ -1846,6 +1847,9 @@ restart: shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
vmpressure(sc->target_mem_cgroup,
   sc->nr_scanned - nr_scanned, nr_reclaimed);
/* reclaim/compaction might need reclaim to continue */ if (should_continue_reclaim(lruvec, nr_reclaimed, sc->nr_scanned - nr_scanned, sc))
@@ -2068,6 +2072,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, count_vm_event(ALLOCSTALL); do {
vmpressure_prio(sc->target_mem_cgroup, sc->priority);
sc->nr_scanned = 0; aborted_reclaim = shrink_zones(zonelist, sc);
1.8.0

-- Mel Gorman SUSE Labs

Kirill A. Shutemov

5:14 p.m.

New subject: [RFC 1/3] mm: Add VM pressure notifications

On Thu, Nov 08, 2012 at 05:01:24PM +0000, Mel Gorman wrote:

...

(Sorry about being very late reviewing this)

On Wed, Nov 07, 2012 at 03:01:28AM -0800, Anton Vorontsov wrote:

...
This patch introduces vmpressure_fd() system call. The system call creates a new file descriptor that can be used to monitor Linux' virtual memory management pressure. There are three discrete levels of the pressure:

Why was eventfd unsuitable? It's a bit trickier to use but there are examples in the kernel where an application is required to do something like

open eventfd

open a control file, say /proc/sys/vm/vmpressure or if cgroups /sys/fs/cgroup/something/vmpressure

write fd_event fd_control [low|medium|oom]. Can be a binary structure you write

and then poll the eventfd. The trickiness is awkward but a library implementation of vmpressure_fd() that mapped onto eventfd properly should be trivial.

I confess I'm not super familiar with eventfd and if this can actually work in practice

You've described how it works for memory thresholds and oom notifications in memcg. So it works. I also prefer this kind of interface.

See Documentation/cgroups/cgroups.txt section 2.4 and Documentation/cgroups/memory.txt sections 9 and 10.

-- Kirill A. Shutemov

Jonathan Corbet

13 Nov 13 Nov

6:38 p.m.

New subject: [RFC 1/3] mm: Add VM pressure notifications

On Wed, 7 Nov 2012 03:01:28 -0800 Anton Vorontsov anton.vorontsov@linaro.org wrote:

...

This patch introduces vmpressure_fd() system call. The system call creates a new file descriptor that can be used to monitor Linux' virtual memory management pressure.

I noticed a couple of quick things as I was looking this over...

...

+static ssize_t vmpressure_read(struct file *file, char __user *buf,
	       size_t count, loff_t *ppos)
+{
struct vmpressure_watch *watch = file->private_data;

struct vmpressure_event event;

int ret;

if (count < sizeof(event))
return -EINVAL;
ret = wait_event_interruptible(watch->waitq,
		       atomic_read(&watch->pending));

Would it make sense to support non-blocking reads? Perhaps a process would like to simply know that current pressure level?

...

+SYSCALL_DEFINE1(vmpressure_fd, struct vmpressure_config __user *, config) +{
struct vmpressure_watch *watch;

struct file *file;

int ret;

int fd;

watch = kzalloc(sizeof(*watch), GFP_KERNEL);

if (!watch)
return -ENOMEM;
ret = copy_from_user(&watch->config, config, sizeof(*config));

if (ret)
goto err_free;

This is wrong - you'll return the number of uncopied bytes to user space. You'll need a "ret = -EFAULT;" in there somewhere.

jon

Anton Vorontsov

7 Nov 7 Nov

11:01 a.m.

New subject: [RFC 2/3] tools/testing: Add vmpressure-test utility

Just a simple test/example utility for the vmpressure_fd(2) system call.

Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- tools/testing/vmpressure/.gitignore | 1 + tools/testing/vmpressure/Makefile | 30 ++++++++++ tools/testing/vmpressure/vmpressure-test.c | 93 ++++++++++++++++++++++++++++++ 3 files changed, 124 insertions(+) create mode 100644 tools/testing/vmpressure/.gitignore create mode 100644 tools/testing/vmpressure/Makefile create mode 100644 tools/testing/vmpressure/vmpressure-test.c

diff --git a/tools/testing/vmpressure/.gitignore b/tools/testing/vmpressure/.gitignore new file mode 100644 index 0000000..fe5e38c --- /dev/null +++ b/tools/testing/vmpressure/.gitignore @@ -0,0 +1 @@ +vmpressure-test diff --git a/tools/testing/vmpressure/Makefile b/tools/testing/vmpressure/Makefile new file mode 100644 index 0000000..7545f3e --- /dev/null +++ b/tools/testing/vmpressure/Makefile @@ -0,0 +1,30 @@ +WARNINGS := -Wcast-align +WARNINGS += -Wformat +WARNINGS += -Wformat-security +WARNINGS += -Wformat-y2k +WARNINGS += -Wshadow +WARNINGS += -Winit-self +WARNINGS += -Wpacked +WARNINGS += -Wredundant-decls +WARNINGS += -Wstrict-aliasing=3 +WARNINGS += -Wswitch-default +WARNINGS += -Wno-system-headers +WARNINGS += -Wundef +WARNINGS += -Wwrite-strings +WARNINGS += -Wbad-function-cast +WARNINGS += -Wmissing-declarations +WARNINGS += -Wmissing-prototypes +WARNINGS += -Wnested-externs +WARNINGS += -Wold-style-definition +WARNINGS += -Wstrict-prototypes +WARNINGS += -Wdeclaration-after-statement + +CFLAGS = -O3 -g -std=gnu99 $(WARNINGS) + +PROGRAMS = vmpressure-test + +all: $(PROGRAMS) + +clean: + rm -f $(PROGRAMS) *.o +.PHONY: clean diff --git a/tools/testing/vmpressure/vmpressure-test.c b/tools/testing/vmpressure/vmpressure-test.c new file mode 100644 index 0000000..1e448be --- /dev/null +++ b/tools/testing/vmpressure/vmpressure-test.c @@ -0,0 +1,93 @@ +/* + * vmpressure_fd(2) test utility + * + * Copyright 2011-2012 Pekka Enberg penberg@kernel.org + * Copyright 2011-2012 Linaro Ltd. + * Anton Vorontsov anton.vorontsov@linaro.org + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +/* TODO: glibc wrappers */ +#include "../../../include/linux/vmpressure.h" + +#if defined(__x86_64__) +#include "../../../arch/x86/include/generated/asm/unistd_64.h" +#endif +#if defined(__arm__) +#include "../../../arch/arm/include/asm/unistd.h" +#endif + +#include <stdint.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <errno.h> +#include <stdio.h> +#include <poll.h> + +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) + +static void pexit(const char *str) +{ + perror(str); + exit(1); +} + +static int vmpressure_fd(struct vmpressure_config *config) +{ + config->size = sizeof(*config); + + return syscall(__NR_vmpressure_fd, config); +} + +int main(int argc, char *argv[]) +{ + struct vmpressure_config config[] = { + /* + * We could just set the lowest priority, but we want to + * actually test if the thresholds work. + */ + { .threshold = VMPRESSURE_LOW }, + { .threshold = VMPRESSURE_MEDIUM }, + { .threshold = VMPRESSURE_OOM }, + }; + const size_t num = ARRAY_SIZE(config); + struct pollfd pfds[num]; + int i; + + for (i = 0; i < num; i++) { + pfds[i].fd = vmpressure_fd(&config[i]); + if (pfds[i].fd < 0) + pexit("vmpressure_fd failed"); + + pfds[i].events = POLLIN; + } + + while (poll(pfds, num, -1) > 0) { + for (i = 0; i < num; i++) { + struct vmpressure_event event; + + if (!pfds[i].revents) + continue; + + if (read(pfds[i].fd, &event, sizeof(event)) < 0) + pexit("read failed"); + + printf("VM pressure: 0x%.8x (threshold 0x%.8x)\n", + event.pressure, config[i].threshold); + } + } + + perror("poll failed\n"); + + for (i = 0; i < num; i++) { + if (close(pfds[i].fd) < 0) + pexit("close failed"); + } + + exit(1); + return 0; +}

-- 1.8.0

Anton Vorontsov

11:01 a.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

VMPRESSURE_FD(2) Linux Programmer's Manual VMPRESSURE_FD(2)

NAME vmpressure_fd - Linux virtual memory pressure notifications

SYNOPSIS #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <asm/unistd.h> #include <linux/types.h> #include <linux/vmpressure.h>

int vmpressure_fd(struct vmpressure_config *config) { config->size = sizeof(*config); return syscall(__NR_vmpressure_fd, config); }

DESCRIPTION This system call creates a new file descriptor that can be used with blocking (e.g. read(2)) and/or polling (e.g. poll(2)) rou- tines to get notified about system's memory pressure.

Upon these notifications, userland programs can cooperate with the kernel, achieving better system's memory management.

Memory pressure levels There are currently three memory pressure levels, each level is defined via vmpressure_level enumeration, and correspond to these constants:

VMPRESSURE_LOW The system is reclaiming memory for new allocations. Moni- toring reclaiming activity might be useful for maintaining overall system's cache level.

VMPRESSURE_MEDIUM The system is experiencing medium memory pressure, there might be some mild swapping activity. Upon this event, applications may decide to free any resources that can be easily reconstructed or re-read from a disk.

VMPRESSURE_OOM The system is actively thrashing, it is about to out of memory (OOM) or even the in-kernel OOM killer is on its way to trigger. Applications should do whatever they can to help the system. See proc(5) for more information about OOM killer and its configuration options.

Note that the behaviour of some levels can be tuned through the sysctl(5) mechanism. See /usr/src/linux/Documenta- tion/sysctl/vm.txt for various vmpressure_* tunables and their meanings.

Configuration vmpressure_fd(2) accepts vmpressure_config structure to configure the notifications:

struct vmpressure_config { __u32 size; __u32 threshold; };

size is a part of ABI versioning and must be initialized to sizeof(struct vmpressure_config).

threshold is used to setup a minimal value of the pressure upon which the events will be delivered by the kernel (for algebraic comparisons, it is defined that VMPRESSURE_LOW < VMPRES- SURE_MEDIUM < VMPRESSURE_OOM, but applications should not put any meaning into the absolute values.)

Events Upon a notification, application must read out events using read(2) system call. The events are delivered using the follow- ing structure:

struct vmpressure_event { __u32 pressure; };

The pressure shows the most recent system's pressure level.

RETURN VALUE On success, vmpressure_fd() returns a new file descriptor. On error, a negative value is returned and errno is set to indicate the error.

ERRORS vmpressure_fd() can fail with errors similar to open(2).

In addition, the following errors are possible:

EINVAL The failure means that an improperly initalized config structure has been passed to the call.

EFAULT The failure means that the kernel was unable to read the configuration structure, that is, config parameter points to an inaccessible memory.

VERSIONS The system call is available on Linux since kernel 3.8. Library support is yet not provided by any glibc version.

CONFORMING TO The system call is Linux-specific.

EXAMPLE Examples can be found in /usr/src/linux/tools/testing/vmpressure/ directory.

SEE ALSO poll(2), read(2), proc(5), sysctl(5), vmstat(8)

Linux 2012-10-16 VMPRESSURE_FD(2)

Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- man2/vmpressure_fd.2 | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 man2/vmpressure_fd.2

diff --git a/man2/vmpressure_fd.2 b/man2/vmpressure_fd.2 new file mode 100644 index 0000000..eaf07d4 --- /dev/null +++ b/man2/vmpressure_fd.2 @@ -0,0 +1,163 @@ +." Copyright (C) 2008 Michael Kerrisk mtk.manpages@gmail.com +." Copyright (C) 2012 Linaro Ltd. +." Anton Vorontsov anton.vorontsov@linaro.org +." +." Based on ideas from: +." KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka +." Enberg. +." +." This program is free software; you can redistribute it and/or modify +." it under the terms of the GNU General Public License as published by +." the Free Software Foundation; either version 2 of the License, or +." (at your option) any later version. +." +." This program is distributed in the hope that it will be useful, +." but WITHOUT ANY WARRANTY; without even the implied warranty of +." MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +." GNU General Public License for more details. +." +." You should have received a copy of the GNU General Public License +." along with this program; if not, write to the Free Software +." Foundation, Inc., 59 Temple Place, Suite 330, Boston, +." MA 02111-1307 USA +." +.TH VMPRESSURE_FD 2 2012-10-16 Linux "Linux Programmer's Manual" +.SH NAME +vmpressure_fd - Linux virtual memory pressure notifications +.SH SYNOPSIS +.nf +.B #define _GNU_SOURCE +.B #include <unistd.h> +.B #include <sys/syscall.h> +.B #include <asm/unistd.h> +.B #include <linux/types.h> +.B #include <linux/vmpressure.h> +." TODO: libc wrapper + +.BI "int vmpressure_fd(struct vmpressure_config *"config ) +.B +{ +.B + config->size = sizeof(*config); +.B + return syscall(__NR_vmpressure_fd, config); +.B +} +.fi +.SH DESCRIPTION +This system call creates a new file descriptor that can be used with +blocking (e.g. +.BR read (2)) +and/or polling (e.g. +.BR poll (2)) +routines to get notified about system's memory pressure. + +Upon these notifications, userland programs can cooperate with the kernel, +achieving better system's memory management. +.SS Memory pressure levels +There are currently three memory pressure levels, each level is defined +via +.IR vmpressure_level " enumeration," +and correspond to these constants: +.TP +.B VMPRESSURE_LOW +The system is reclaiming memory for new allocations. Monitoring reclaiming +activity might be useful for maintaining overall system's cache level. +.TP +.B VMPRESSURE_MEDIUM +The system is experiencing medium memory pressure, there might be some +mild swapping activity. Upon this event, applications may decide to free +any resources that can be easily reconstructed or re-read from a disk. +.TP +.B VMPRESSURE_OOM +The system is actively thrashing, it is about to out of memory (OOM) or +even the in-kernel OOM killer is on its way to trigger. Applications +should do whatever they can to help the system. See +.BR proc (5) +for more information about OOM killer and its configuration options. +.TP 0 +Note that the behaviour of some levels can be tuned through the +.BR sysctl (5) +mechanism. See +.I /usr/src/linux/Documentation/sysctl/vm.txt +for various +.I vmpressure_* +tunables and their meanings. +.SS Configuration +.BR vmpressure_fd (2) +accepts +.I vmpressure_config +structure to configure the notifications: + +.nf +struct vmpressure_config { + __u32 size; + __u32 threshold; +}; +.fi + +.I size +is a part of ABI versioning and must be initialized to +.IR "sizeof(struct vmpressure_config)" . + +.I threshold +is used to setup a minimal value of the pressure upon which the events +will be delivered by the kernel (for algebraic comparisons, it is defined +that +.BR VMPRESSURE_LOW " <" +.BR VMPRESSURE_MEDIUM " <" +.BR VMPRESSURE_OOM , +but applications should not put any meaning into the absolute values.) +.SS Events +Upon a notification, application must read out events using +.BR read (2) +system call. +The events are delivered using the following structure: + +.nf +struct vmpressure_event { + __u32 pressure; +}; +.fi + +The +.I pressure +shows the most recent system's pressure level. +.SH "RETURN VALUE" +On success, +.BR vmpressure_fd () +returns a new file descriptor. On error, a negative value is returned and +.I errno +is set to indicate the error. +.SH ERRORS +.BR vmpressure_fd () +can fail with errors similar to +.BR open (2). + +In addition, the following errors are possible: +.TP +.B EINVAL +The failure means that an improperly initalized +.I config +structure has been passed to the call. +.TP +.B EFAULT +The failure means that the kernel was unable to read the configuration +structure, that is, +.I config +parameter points to an inaccessible memory. +.SH VERSIONS +The system call is available on Linux since kernel 3.8. Library support is +yet not provided by any glibc version. +.SH CONFORMING TO +The system call is Linux-specific. +.SH EXAMPLE +Examples can be found in +.I /usr/src/linux/tools/testing/vmpressure/ +directory. +.SH "SEE ALSO" +.BR poll (2), +.BR read (2), +.BR proc (5), +.BR sysctl (5), +.BR vmstat (8)

-- 1.8.0

Rik van Riel

2:19 p.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On 11/07/2012 06:01 AM, Anton Vorontsov wrote:

...

Configuration
    vmpressure_fd(2) accepts vmpressure_config structure to configure
    the notifications:

    struct vmpressure_config {
         __u32 size;
         __u32 threshold;
    };

    size is a part of ABI  versioning  and  must  be  initialized  to
    sizeof(struct vmpressure_config).

If you want to use a versioned ABI, why not pass in an actual version number?

Andrew Morton

20 Nov 20 Nov

5:52 a.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov anton.vorontsov@linaro.org wrote:

...

   Upon  these  notifications,  userland programs can cooperate with
   the kernel, achieving better system's memory management.

Well I read through the whole thread and afaict the above is the only attempt to describe why this patchset exists!

How about we step away from implementation details for a while and discuss observed problems, use-cases, requirements and such? What are we actually trying to achieve here?

Anton Vorontsov

6:24 a.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Mon, Nov 19, 2012 at 09:52:11PM -0800, Andrew Morton wrote:

...

On Wed, 7 Nov 2012 03:01:52 -0800 Anton Vorontsov anton.vorontsov@linaro.org wrote:

...
   Upon  these  notifications,  userland programs can cooperate with
   the kernel, achieving better system's memory management.
Well I read through the whole thread and afaict the above is the only attempt to describe why this patchset exists!

Thanks for taking a look. :)

...

How about we step away from implementation details for a while and discuss observed problems, use-cases, requirements and such? What are we actually trying to achieve here?

We try to make userland freeing resources when the system becomes low on memory. Once we're short on memory, sometimes it's better to discard (free) data, rather than let the kernel to drain file caches or even start swapping.

In Android case, the data includes all idling applications' state, some of which might be saved on the disk anyway -- so we don't need to swap apps, we just kill them. Another Android use-case is to kill low-priority tasks (e.g. currently unimportant services -- background/sync daemons, etc.).

There are other use cases: VPS/containers balancing, freeing browser's old pages renders on desktops, etc. But I'll let folks speak for their use cases, as I truly know about Android/embedded only.

But in general, it's the same stuff as the in-kernel shrinker, except that we try to make it available for the userland: the userland knows better about its memory, so we want to let it help with the memory management.

Thanks, Anton.

David Rientjes

6:12 p.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Mon, 19 Nov 2012, Anton Vorontsov wrote:

...

We try to make userland freeing resources when the system becomes low on memory. Once we're short on memory, sometimes it's better to discard (free) data, rather than let the kernel to drain file caches or even start swapping.

To add another usecase: its possible to modify our version of malloc (or any malloc) so that memory that is free()'d can be released back to the kernel only when necessary, i.e. when keeping the extra memory around starts to have a detremental effect on the system, memcg, or cpuset. When there is an abundance of memory available such that allocations need not defragment or reclaim memory to be allocated, it can improve performance to keep a memory arena from which to allocate from immediately without calling the kernel.

Our version of malloc frees memory back to the kernel with madvise(MADV_DONTNEED) which ends up zaping the mapped ptes. With pressure events, we only need to do this when faced with memory pressure; to keep our rss low, we require that thp's max_ptes_none tunable be set to 0; we don't want our applications to use any additional memory. This requires splitting a hugepage anytime memory is free()'d back to the kernel.

I'd like to use this as a hook into malloc() for applications that do not have strict memory footprint requirements to be able to increase performance by keeping around a memory arena from which to allocate.

Mel Gorman

21 Nov 21 Nov

3:01 p.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:

...

On Mon, 19 Nov 2012, Anton Vorontsov wrote:

...
We try to make userland freeing resources when the system becomes low on memory. Once we're short on memory, sometimes it's better to discard (free) data, rather than let the kernel to drain file caches or even start swapping.

To add another usecase: its possible to modify our version of malloc (or any malloc) so that memory that is free()'d can be released back to the kernel only when necessary, i.e. when keeping the extra memory around starts to have a detremental effect on the system, memcg, or cpuset. When there is an abundance of memory available such that allocations need not defragment or reclaim memory to be allocated, it can improve performance to keep a memory arena from which to allocate from immediately without calling the kernel.

A potential third use case is a variation of the first for batch systems. If it's running low priority tasks and a high priority task starts that results in memory pressure then the job scheduler may decide to move the low priority jobs elsewhere (or cancel them entirely).

A similar use case is monitoring systems running high priority workloads that should never swap. It can be easily detected if the system starts swapping but a pressure notification might act as an early warning system that something is happening on the system that might cause the primary workload to start swapping.

-- Mel Gorman SUSE Labs

Andrew Morton

7:39 p.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Wed, 21 Nov 2012 15:01:50 +0000 Mel Gorman mgorman@suse.de wrote:

...

On Tue, Nov 20, 2012 at 10:12:28AM -0800, David Rientjes wrote:

...
On Mon, 19 Nov 2012, Anton Vorontsov wrote:

...
We try to make userland freeing resources when the system becomes low on memory. Once we're short on memory, sometimes it's better to discard (free) data, rather than let the kernel to drain file caches or even start swapping.

To add another usecase: its possible to modify our version of malloc (or any malloc) so that memory that is free()'d can be released back to the kernel only when necessary, i.e. when keeping the extra memory around starts to have a detremental effect on the system, memcg, or cpuset. When there is an abundance of memory available such that allocations need not defragment or reclaim memory to be allocated, it can improve performance to keep a memory arena from which to allocate from immediately without calling the kernel.

A potential third use case is a variation of the first for batch systems. If it's running low priority tasks and a high priority task starts that results in memory pressure then the job scheduler may decide to move the low priority jobs elsewhere (or cancel them entirely).

A similar use case is monitoring systems running high priority workloads that should never swap. It can be easily detected if the system starts swapping but a pressure notification might act as an early warning system that something is happening on the system that might cause the primary workload to start swapping.

I hope Anton's writing all of this down ;)

The proposed API bugs me a bit. It seems simplistic. I need to have a quality think about this. Maybe the result of that think will be to suggest an interface which can be extended in a back-compatible fashion later on, if/when the simplistic nature becomes a problem.

Pekka Enberg

22 Nov 22 Nov

8:52 a.m.

New subject: [RFC 3/3] man-pages: Add man page for vmpressure_fd(2)

On Wed, 21 Nov 2012, Andrew Morton wrote:

...

The proposed API bugs me a bit. It seems simplistic. I need to have a quality think about this. Maybe the result of that think will be to suggest an interface which can be extended in a back-compatible fashion later on, if/when the simplistic nature becomes a problem.

That's exactly why I made a generic vmevent_fd() syscall, not a 'vm pressure' specific ABI.

Pekka

Kirill A. Shutemov

7 Nov 7 Nov

11:21 a.m.

On Wed, Nov 07, 2012 at 02:53:49AM -0800, Anton Vorontsov wrote:

...

Hi all,

This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

As well as Minchan, KOSAKI Motohiro didn't like the timers, so the timers are gone now;

Pekka Enberg didn't like the complex attributes matching code, and so it is no longer there;

Nobody liked the raw vmstat attributes, and so they were eliminated too.

But, conceptually, it is the exactly the same approach as in v2: three discrete levels of the pressure -- low, medium and oom. The levels are based on the reclaimer inefficiency index as proposed by Mel Gorman, but userland does not see the raw index values. The description why I moved away from reporting the raw 'reclaimer inefficiency index' can be found in v2: http://lkml.org/lkml/2012/10/22/177

While the new API is very simple, it is still extensible (i.e. versioned).

Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

-- Kirill A. Shutemov

Pekka Enberg

11:28 a.m.

On Wed, Nov 7, 2012 at 1:21 PM, Kirill A. Shutemov kirill@shutemov.name wrote:

...

...
While the new API is very simple, it is still extensible (i.e. versioned).

Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

Why should you be required to use cgroups to get VM pressure events to userspace?

Kirill A. Shutemov

11:43 a.m.

On Wed, Nov 07, 2012 at 01:28:12PM +0200, Pekka Enberg wrote:

...

On Wed, Nov 7, 2012 at 1:21 PM, Kirill A. Shutemov kirill@shutemov.name wrote:

...
...
While the new API is very simple, it is still extensible (i.e. versioned).

Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

Why should you be required to use cgroups to get VM pressure events to userspace?

Valid point. But in fact you have it on most systems anyway.

I personally don't like to have a syscall per small feature. Isn't it better to have a file-based interface which can be used with normal file syscalls: open()/read()/poll()?

-- Kirill A. Shutemov

David Rientjes

15 Nov 15 Nov

3:21 a.m.

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:

...

...
...
Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

Why should you be required to use cgroups to get VM pressure events to userspace?

Valid point. But in fact you have it on most systems anyway.

I personally don't like to have a syscall per small feature. Isn't it better to have a file-based interface which can be used with normal file syscalls: open()/read()/poll()?

I agree that eventfd is the way to go, but I'll also add that this feature seems to be implemented at a far too coarse of level. Memory, and hence memory pressure, is constrained by several factors other than just the amount of physical RAM which vmpressure_fd is addressing. What about memory pressure caused by cpusets or mempolicies? (Memcg has its own reclaim logic and its own memory thresholds implemented on top of eventfd that people already use.) These both cause high levels of reclaim within the page allocator whereas there may be an abundance of free memory available on the system.

I don't think we want several implementations of memory pressure notifications, so a more generic and flexible interface is going to be needed and I think it can't be done in an extendable way through this vmpressure_fd syscall. Unfortunately, I think that means polling on a per-thread notifier.

Anton Vorontsov

3:39 a.m.

Hi David,

Thanks for your comments!

On Wed, Nov 14, 2012 at 07:21:14PM -0800, David Rientjes wrote:

...

...
...
Why should you be required to use cgroups to get VM pressure events to userspace?

Valid point. But in fact you have it on most systems anyway.

I personally don't like to have a syscall per small feature. Isn't it better to have a file-based interface which can be used with normal file syscalls: open()/read()/poll()?

I agree that eventfd is the way to go, but I'll also add that this feature seems to be implemented at a far too coarse of level. Memory, and hence memory pressure, is constrained by several factors other than just the amount of physical RAM which vmpressure_fd is addressing. What about memory pressure caused by cpusets or mempolicies? (Memcg has its own reclaim logic

Yes, sure, and my plan for per-cgroups vmpressure was to just add the same hooks into cgroups reclaim logic (as far as I understand, we can use the same scanned/reclaimed ratio + reclaimer priority to determine the pressure).

...

and its own memory thresholds implemented on top of eventfd that people already use.) These both cause high levels of reclaim within the page allocator whereas there may be an abundance of free memory available on the system.

Yes, surely global-level vmpressure should be separate for the per-cgroup memory pressure.

But we still want the "global vmpressure" thing, so that we could use it without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't matter much (in the sense that I can do eventfd thing if you folks like it :).

...

I don't think we want several implementations of memory pressure notifications,

Even with a dedicated syscall, why would we need a several implementation of memory pressure? Suppose an app in the root cgroup gets an FD via vmpressure_fd() syscall and then polls it... Do you see any reason why we can't make the underlaying FD switch from global to per-cgroup vmpressure notifications completely transparently for the app? Actually, it must be done transparently.

Oh, or do you mean that we want to monitor cgroups vmpressure outside of the cgroup? I.e. parent cgroup might want to watch child's pressure? Well, for this, the API will have to have a hard dependency for cgroup's sysfs hierarchy -- so how would we use it without cgroups then? :) I see no other option but to have two "APIs" then. (Well, in eventfd case it will be indeed simpler -- we would only have different sysfs paths for cgroups and non-cgroups case... do you see this acceptable?)

Thanks, Anton.

David Rientjes

3:59 a.m.

On Wed, 14 Nov 2012, Anton Vorontsov wrote:

...

...
I agree that eventfd is the way to go, but I'll also add that this feature seems to be implemented at a far too coarse of level. Memory, and hence memory pressure, is constrained by several factors other than just the amount of physical RAM which vmpressure_fd is addressing. What about memory pressure caused by cpusets or mempolicies? (Memcg has its own reclaim logic

Yes, sure, and my plan for per-cgroups vmpressure was to just add the same hooks into cgroups reclaim logic (as far as I understand, we can use the same scanned/reclaimed ratio + reclaimer priority to determine the pressure).

I don't understand, how would this work with cpusets, for example, with vmpressure_fd as defined? The cpuset policy is embedded in the page allocator and skips over zones that are not allowed when trying to find a page of the specified order. Imagine a cpuset bound to a single node that is under severe memory pressure. The reclaim logic will get triggered and cause a notification on your fd when the rest of the system's nodes may have tons of memory available. So now an application that actually is using this interface and is trying to be a good kernel citizen decides to free caches back to the kernel, start ratelimiting, etc, when it actually doesn't have any memory allocated on the nearly-oom cpuset so its memory freeing doesn't actually achieve anything.

Rather, I think it's much better to be notified when an individual process invokes various levels of reclaim up to and including the oom killer so that we know the context that memory freeing needs to happen (or, optionally, the set of processes that could be sacrificed so that this higher priority process may allocate memory).

...

...
and its own memory thresholds implemented on top of eventfd that people already use.) These both cause high levels of reclaim within the page allocator whereas there may be an abundance of free memory available on the system.

Yes, surely global-level vmpressure should be separate for the per-cgroup memory pressure.

I disagree, I think if you have a per-thread memory pressure notification if and when it starts down the page allocator slowpath, through the various states of reclaim (perhaps on a scale of 0-100 as described), and including the oom killer that you can target eventual memory freeing that actually is useful.

...

But we still want the "global vmpressure" thing, so that we could use it without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't matter much (in the sense that I can do eventfd thing if you folks like it :).

Most processes aren't going to care if they are running into memory pressure and have no implementation to free memory back to the kernel or start ratelimiting themselves. They will just continue happily along until they get the memory they want or they get oom killed. The ones that do, however, or a job scheduler or monitor that is watching over the memory usage of a set of tasks, will be able to do something when notified.

In the hopes of a single API that can do all this and not a reimplementation for various types of memory limitations (it seems like what you're suggesting is at least three different APIs: system-wide via vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual cpuset threshold), I'm hoping that we can have a single interface that can be polled on to determine when individual processes are encountering memory pressure. And if I'm not running in your oom cpuset, I don't care about your memory pressure.

Anton Vorontsov

7:34 a.m.

Hi David,

Thanks again for your inspirational comments!

On Wed, Nov 14, 2012 at 07:59:52PM -0800, David Rientjes wrote:

...

...
...
I agree that eventfd is the way to go, but I'll also add that this feature seems to be implemented at a far too coarse of level. Memory, and hence memory pressure, is constrained by several factors other than just the amount of physical RAM which vmpressure_fd is addressing. What about memory pressure caused by cpusets or mempolicies? (Memcg has its own reclaim logic

Yes, sure, and my plan for per-cgroups vmpressure was to just add the same hooks into cgroups reclaim logic (as far as I understand, we can use the same scanned/reclaimed ratio + reclaimer priority to determine the pressure).

[Answers reordered]

...

Rather, I think it's much better to be notified when an individual process invokes various levels of reclaim up to and including the oom killer so that we know the context that memory freeing needs to happen (or, optionally, the set of processes that could be sacrificed so that this higher priority process may allocate memory).

I think I understand what you're saying, and surely it makes sense, but I don't know how you see this implemented on the API level.

Getting struct {pid, pressure} pairs that cause the pressure at the moment? And the monitor only gets <pids> that are in the same cpuset? How about memcg limits?..

[...]

...

...
But we still want the "global vmpressure" thing, so that we could use it without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't matter much (in the sense that I can do eventfd thing if you folks like it :).

Most processes aren't going to care if they are running into memory pressure and have no implementation to free memory back to the kernel or start ratelimiting themselves. They will just continue happily along until they get the memory they want or they get oom killed. The ones that do, however, or a job scheduler or monitor that is watching over the memory usage of a set of tasks, will be able to do something when notified.

Yup, this is exactly how we want to use this. In Android we have "Activity Manager" thing, which acts exactly how you describe: it's a tasks monitor.

...

In the hopes of a single API that can do all this and not a reimplementation for various types of memory limitations (it seems like what you're suggesting is at least three different APIs: system-wide via vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual cpuset threshold), I'm hoping that we can have a single interface that can be polled on to determine when individual processes are encountering memory pressure. And if I'm not running in your oom cpuset, I don't care about your memory pressure.

I'm not sure to what exactly you are opposing. :) You don't want to have three "kinds" pressures, or you don't what to have three different interfaces to each of them, or both?

...

I don't understand, how would this work with cpusets, for example, with vmpressure_fd as defined? The cpuset policy is embedded in the page allocator and skips over zones that are not allowed when trying to find a page of the specified order. Imagine a cpuset bound to a single node that is under severe memory pressure. The reclaim logic will get triggered and cause a notification on your fd when the rest of the system's nodes may have tons of memory available.

Yes, I see your point: we have many ways to limit resources, so it makes it hard to identify the cause of the "pressure" and thus how to deal with it, since the pressure might be caused by different kinds of limits, and freeing memory from one bucket doesn't mean that the memory will be available to the process that is requesting the memory.

So we do want to know whether a specific cpuset is under pressure, whether a specific memcg is under pressure, or whether the system (and kernel itself) lacks memory.

And we want to have a single API for this? Heh. :)

The other idea might be this (I'm describing it in detail so that you could actually comment on what exactly you don't like in this):

1. Obtain the fd via eventfd();

2. The fd can be passed to these files:

I) Say /sys/kernel/mm/memory_pressure

If we don't use cpusets/memcg or even have CGROUPS=n, this will be system's/global memory pressure. Pass the fd to this file and start polling.

If we do use cpusets or memcg, the API will still work, but we have two options for its behaviour:

a) This will only report the pressure when we're reclaiming with say (global_reclaim() && node_isset(zone_to_nid(zone), current->mems_allowed)) == 1. (Basically, we want to see pressure of kernel slabs allocations or any non-soft limits).

b) If 'filtering' cpusets/memcg seems too hard, we can say that these notifications are the "sum" of global+memcg+cpuset. It doesn't make sense to actually monitor these, though, so if the monitor is aware of cgroups, just 'goto II) and/or III)'.

II) /sys/fs/cgroup/cpuset/.../cpuset.memory_pressure (yeah, we have it already)

Pass the fd to this file to monitor per-cpuset pressure. So, if you get the pressure from here, it makes sense to free resources from this cpuset.

III) /sys/fs/cgroup/memory/.../memory.pressure

Pass the fd to this file to monitor per-memcg pressure. If you get the pressure from here, it only makes sense to free resources from this memcg.

3. The pressure level values (and their meaning) and the format of the files are the same, and this what defines the "API".

So, if "memory monitor/supervisor app" is aware of cpusets, it manages memory at this level. If both cpuset and memcg is used, then it has to monitor both files, and act accordingly. And if we don't use cpusets/memcg (or even have cgroups=n), we can just watch the global reclaimer's pressure.

Do I understand correctly that you don't like this? Just to make sure. :)

Thanks, Anton.

David Rientjes

8:11 a.m.

On Wed, 14 Nov 2012, Anton Vorontsov wrote:

...

Thanks again for your inspirational comments!

Heh, not sure I've been too inspirational (probably more annoying than anything else). I really do want generic memory pressure notifications in the kernel and already have some ideas on how I can tie it into our malloc arenas, so please do keep working on it.

...

I think I understand what you're saying, and surely it makes sense, but I don't know how you see this implemented on the API level.

Getting struct {pid, pressure} pairs that cause the pressure at the moment? And the monitor only gets <pids> that are in the same cpuset? How about memcg limits?..

Depends on whether you want to support mempolicies or not and the argument could go either way:

- FOR supporting mempolicies: memory that you're mbind() too can become depleted and since there is no fallback then you have no way to prevent lots of reclaim and/or invoking the oom killer, it would be disappointing to not be able to get notifications of such a condition.

- AGAINST supporting mempolicies: you only need to support memory isolation for cgroups (memcg and cpusets) and thus can implement your own memory pressure cgroup that you can use to aggregate tasks and then replace memcg memory thresholds with co-mounting this new cgroup that would notify on an eventfd anytime one of the attached processes experiences memory pressure.

...

...
Most processes aren't going to care if they are running into memory pressure and have no implementation to free memory back to the kernel or start ratelimiting themselves. They will just continue happily along until they get the memory they want or they get oom killed. The ones that do, however, or a job scheduler or monitor that is watching over the memory usage of a set of tasks, will be able to do something when notified.

Yup, this is exactly how we want to use this. In Android we have "Activity Manager" thing, which acts exactly how you describe: it's a tasks monitor.

In addition to that, I think I can hook into our implementation of malloc which frees memory back to the kernel with MADV_DONTNEED and zaps individual ptes to poke holes in the memory it allocates to actually cache the memory that we free() and then re-use it under normal circumstances to return cache-hot memory on the next allocation but under memory pressure, as triggered by your interface (but for threads attached to a memcg facing memcg limits), drain the memory back to the kernel immediately.

...

...
In the hopes of a single API that can do all this and not a reimplementation for various types of memory limitations (it seems like what you're suggesting is at least three different APIs: system-wide via vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual cpuset threshold), I'm hoping that we can have a single interface that can be polled on to determine when individual processes are encountering memory pressure. And if I'm not running in your oom cpuset, I don't care about your memory pressure.

I'm not sure to what exactly you are opposing. :) You don't want to have three "kinds" pressures, or you don't what to have three different interfaces to each of them, or both?

The three pressures are a separate topic (I think it would be better to have some measure of memory pressure similar to your reclaim scale and allow users to get notifications at levels they define). I really dislike having multiple interfaces that are all different from one another depending on the context.

Given what we have right now with memory thresholds in memcg, if we were to merge vmpressure_fd, then we're significantly limiting the usecase since applications need not know if they are attached to a memcg or not: it's a type of virtualization that the admin may setup but another admin may be running unconstrained on a system with much more memory. So for your usecase of a job monitor, that would work fine for global oom conditions but the application no longer has an API to use if it wants to know when it itself is feeling memory pressure.

I think others have voiced their opinion on trying to create a single API for memory pressure notifications as well, it's just a hard problem and takes a lot of work to determine how we can make it easy to use and understand and extendable at the same time.

...

...
I don't understand, how would this work with cpusets, for example, with vmpressure_fd as defined? The cpuset policy is embedded in the page allocator and skips over zones that are not allowed when trying to find a page of the specified order. Imagine a cpuset bound to a single node that is under severe memory pressure. The reclaim logic will get triggered and cause a notification on your fd when the rest of the system's nodes may have tons of memory available.

Yes, I see your point: we have many ways to limit resources, so it makes it hard to identify the cause of the "pressure" and thus how to deal with it, since the pressure might be caused by different kinds of limits, and freeing memory from one bucket doesn't mean that the memory will be available to the process that is requesting the memory.

So we do want to know whether a specific cpuset is under pressure, whether a specific memcg is under pressure, or whether the system (and kernel itself) lacks memory.

And we want to have a single API for this? Heh. :)

Might not be too difficult if you implement your own cgroup to aggregate these tasks for which you want to know memory pressure events; it would have to be triggered for the task trying to allocate memory at any given time and how hard it was to allocate that memory in the slowpath, tie it back to that tasks' memory pressure cgroup, and then report the trigger if it's over a user-defined threshold normalized to the 0-100 scale. Then you could co-mount this cgroup with memcg, cpusets, or just do it for the root cgroup for users who want to monitor the entire system (CONFIG_CGROUPS is enabled by default).

Anton Vorontsov

8:52 a.m.

On Thu, Nov 15, 2012 at 12:11:47AM -0800, David Rientjes wrote: [...]

...

Might not be too difficult if you implement your own cgroup to aggregate these tasks for which you want to know memory pressure events; it would have to be triggered for the task trying to allocate memory at any given time and how hard it was to allocate that memory in the slowpath, tie it back to that tasks' memory pressure cgroup, and then report the trigger if it's over a user-defined threshold normalized to the 0-100 scale. Then you could co-mount this cgroup with memcg, cpusets, or just do it for the root cgroup for users who want to monitor the entire system

This seems doable. But

...

(CONFIG_CGROUPS is enabled by default).

Hehe, you're saying that we have to have cgroups=y. :) But some folks were deliberately asking us to make the cgroups optional.

OK, here is what I can try to do:

- Implement memory pressure cgroup as you described, by doing so we'd make the thing play well with cpusets and memcg;

- This will be eventfd()-based;

- Once done, we will have a solution for pretty much every major use-case (i.e. servers, desktops and Android, they all have cgroups enabled);

(- Optionally, if there will be a demand, for CGROUPS=n we can implement a separate sysfs file with the exactly same eventfd interface, it will only report global pressure. This will be for folks that don't want the cgroups for some reason. The interface can be discussed separately.)

Thanks, Anton.

David Rientjes

9:25 p.m.

On Thu, 15 Nov 2012, Anton Vorontsov wrote:

...

Hehe, you're saying that we have to have cgroups=y. :) But some folks were deliberately asking us to make the cgroups optional.

Enabling just CONFIG_CGROUPS (which is enabled by default) and no other current cgroups increases the size of the kernel text by less than 0.3% with x86_64 defconfig:

text data bss dec hex filename 10330039 1038912 1118208 12487159 be89f7 vmlinux.disabled 10360993 1041624 1122304 12524921 bf1d79 vmlinux.enabled

I understand that users with minimally-enabled configs for an optimized memory footprint will have a higher percentage because their kernel is already smaller (~1.8% increase for allnoconfig), but I think the cost of enabling the cgroups code to be able to mount a vmpressure cgroup (which I'd rename to be "mempressure" to be consistent with "memcg" but it's only an opinion) is relatively small and allows for a much more maintainable and extendable feature to be included: it already provides the cgroup.event_control interface that supports eventfd that makes implementation much easier. It also makes writing a library on top of the cgroup to be much easier because of the standardization.

I'm more concerned about what to do with the memcg memory thresholds and whether they can be replaced with this new cgroup. If so, then we'll have to figure out how to map those triggers to use the new cgroup's interface in a way that doesn't break current users that open and pass the fd of memory.usage_in_bytes to cgroup.event_control for memcg.

...

OK, here is what I can try to do:

Implement memory pressure cgroup as you described, by doing so we'd make the thing play well with cpusets and memcg;

This will be eventfd()-based;

Should be based on cgroup.event_control, see how memcg interfaces its memory thresholds with this in Documentation/cgroups/memory.txt.

...

Once done, we will have a solution for pretty much every major use-case (i.e. servers, desktops and Android, they all have cgroups enabled);

Excellent! I'd be interested in hearing anybody else's opinions, especially those from the memcg world, so we make sure that everybody is happy with the API that you've described.

Glauber Costa

16 Nov 16 Nov

9:33 a.m.

On 11/16/2012 01:25 AM, David Rientjes wrote:

...

On Thu, 15 Nov 2012, Anton Vorontsov wrote:

...
Hehe, you're saying that we have to have cgroups=y. :) But some folks were deliberately asking us to make the cgroups optional.

Enabling just CONFIG_CGROUPS (which is enabled by default) and no other current cgroups increases the size of the kernel text by less than 0.3% with x86_64 defconfig:

text data bss dec hex filename 10330039 1038912 1118208 12487159 be89f7 vmlinux.disabled 10360993 1041624 1122304 12524921 bf1d79 vmlinux.enabled

I understand that users with minimally-enabled configs for an optimized memory footprint will have a higher percentage because their kernel is already smaller (~1.8% increase for allnoconfig), but I think the cost of enabling the cgroups code to be able to mount a vmpressure cgroup (which I'd rename to be "mempressure" to be consistent with "memcg" but it's only an opinion) is relatively small and allows for a much more maintainable and extendable feature to be included: it already provides the cgroup.event_control interface that supports eventfd that makes implementation much easier. It also makes writing a library on top of the cgroup to be much easier because of the standardization.

I'm more concerned about what to do with the memcg memory thresholds and whether they can be replaced with this new cgroup. If so, then we'll have to figure out how to map those triggers to use the new cgroup's interface in a way that doesn't break current users that open and pass the fd of memory.usage_in_bytes to cgroup.event_control for memcg.

...
OK, here is what I can try to do:

Implement memory pressure cgroup as you described, by doing so we'd make the thing play well with cpusets and memcg;

This will be eventfd()-based;

Should be based on cgroup.event_control, see how memcg interfaces its memory thresholds with this in Documentation/cgroups/memory.txt.

...

Once done, we will have a solution for pretty much every major use-case (i.e. servers, desktops and Android, they all have cgroups enabled);

Excellent! I'd be interested in hearing anybody else's opinions, especially those from the memcg world, so we make sure that everybody is happy with the API that you've described.

Just CC'd them all.

My personal take:

Most people hate memcg due to the cost it imposes. I've already demonstrated that with some effort, it doesn't necessarily have to be so. (http://lwn.net/Articles/517634/)

The one thing I missed on that work, was precisely notifications. If you can come up with a good notifications scheme that *lives* in memcg, but does not *depend* in the memcg infrastructure, I personally think it could be a big win.

Doing this in memcg has the advantage that the "per-group" vs "global" is automatically solved, since the root memcg is just another name for "global".

I honestly like your low/high/oom scheme better than memcg's "threshold-in-bytes". I would also point out that those thresholds are *far* from exact, due to the stock charging mechanism, and can be wrong by as much as O(#cpus). So far, nobody complained. So in theory it should be possible to convert memcg to low/high/oom, while still accepting writes in bytes, that would be thrown in the closest bucket.

Another thing from one of your e-mails, that may shift you in the memcg direction:

"2. The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered."

It should, now, in the latest -mm, although it won't do per-group reclaim (yet).

I am also failing to see how cpusets would be involved in here. I understand that you may have free memory in terms of size, but still be further restricted by cpuset. But I also think that having multiple entry points for this buy us nothing at all. So the choices I see are:

1) If cpuset + memcg are comounted, take this into account when deciding low / high / oom. This is yet another advantage over the "threshold in bytes" interface, in which you can transparently take other issues into account while keeping the interface.

2) If they are not, just ignore this effect.

The fallback in 2) sounds harsh, but I honestly think this is the price to pay for the insanity of mounting those things in different hierarchies, and we do have a plan to have all those things eventually together anyway. If you have two cgroups dealing with memory, and set them up in orthogonal ways, I really can't see how we can bring sanity to that. So just admitting and unleashing the insanity may be better, if it brings up our urge to fix it. It worked for Batman, why wouldn't it work for us?

David Rientjes

8:04 p.m.

On Fri, 16 Nov 2012, Glauber Costa wrote:

...

My personal take:

Most people hate memcg due to the cost it imposes. I've already demonstrated that with some effort, it doesn't necessarily have to be so. (http://lwn.net/Articles/517634/)

The one thing I missed on that work, was precisely notifications. If you can come up with a good notifications scheme that *lives* in memcg, but does not *depend* in the memcg infrastructure, I personally think it could be a big win.

This doesn't allow users of cpusets without memcg to have an API for memory pressure, that's why I thought it should be a new cgroup that can be mounted alongside any existing cgroup, any cgroup in the future, or just by itself.

...

Doing this in memcg has the advantage that the "per-group" vs "global" is automatically solved, since the root memcg is just another name for "global".

That's true of any cgroup.

...

I honestly like your low/high/oom scheme better than memcg's "threshold-in-bytes". I would also point out that those thresholds are *far* from exact, due to the stock charging mechanism, and can be wrong by as much as O(#cpus). So far, nobody complained. So in theory it should be possible to convert memcg to low/high/oom, while still accepting writes in bytes, that would be thrown in the closest bucket.

I'm wondering if we should have more than three different levels.

...

Another thing from one of your e-mails, that may shift you in the memcg direction:

"2. The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered."

It should, now, in the latest -mm, although it won't do per-group reclaim (yet).

Not sure where that was written, but I certainly didn't write it and it's not really relevant in this discussion: memory pressure notifications would be triggered by reclaim when trying to allocate memory; why we need to reclaim or how we got into that state is tangential. It certainly may be because a lot of slab was allocated, but that's not the only case.

...

I am also failing to see how cpusets would be involved in here. I understand that you may have free memory in terms of size, but still be further restricted by cpuset. But I also think that having multiple entry points for this buy us nothing at all. So the choices I see are:

Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Glauber Costa

9:12 p.m.

Hey,

On 11/17/2012 12:04 AM, David Rientjes wrote:

...

On Fri, 16 Nov 2012, Glauber Costa wrote:

...
My personal take:

Most people hate memcg due to the cost it imposes. I've already demonstrated that with some effort, it doesn't necessarily have to be so. (http://lwn.net/Articles/517634/)

The one thing I missed on that work, was precisely notifications. If you can come up with a good notifications scheme that *lives* in memcg, but does not *depend* in the memcg infrastructure, I personally think it could be a big win.

This doesn't allow users of cpusets without memcg to have an API for memory pressure, that's why I thought it should be a new cgroup that can be mounted alongside any existing cgroup, any cgroup in the future, or just by itself.

...
Doing this in memcg has the advantage that the "per-group" vs "global" is automatically solved, since the root memcg is just another name for "global".

That's true of any cgroup.

Yes. But memcg happens to also deal with memory usage, and already have a notification mechanism =)

...

...
I honestly like your low/high/oom scheme better than memcg's "threshold-in-bytes". I would also point out that those thresholds are *far* from exact, due to the stock charging mechanism, and can be wrong by as much as O(#cpus). So far, nobody complained. So in theory it should be possible to convert memcg to low/high/oom, while still accepting writes in bytes, that would be thrown in the closest bucket.

I'm wondering if we should have more than three different levels.

In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

...

...
Another thing from one of your e-mails, that may shift you in the memcg direction:

"2. The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered."

It should, now, in the latest -mm, although it won't do per-group reclaim (yet).

Not sure where that was written, but I certainly didn't write it

Indeed you didn't, Anton did. It's his proposal, so I actually meant him everytime I said "you". The fact that you were the last responder made it confusing - sorry.

...

and it's not really relevant in this discussion: memory pressure notifications would be triggered by reclaim when trying to allocate memory; why we need to reclaim or how we got into that state is tangential.

My understanding is that one of the advantages he was pointing of his mechanism over memcg, is that it would allow one to count slab memory as well, which memcg won't do (it will, now).

...

...
I am also failing to see how cpusets would be involved in here. I understand that you may have free memory in terms of size, but still be further restricted by cpuset. But I also think that having multiple entry points for this buy us nothing at all. So the choices I see are:

Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage. And it is not that moving a task to cpuset disallows you to do any of this: you could, as long as the same set of tasks are mounted in a corresponding memcg.

Of course there are a couple use cases that could benefit from the orthogonality, but I doubt it would justify the complexity in this case.

David Rientjes

9:57 p.m.

On Sat, 17 Nov 2012, Glauber Costa wrote:

...

...
I'm wondering if we should have more than three different levels.

In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

...

...
Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

...

And it is not that moving a task to cpuset disallows you to do any of this: you could, as long as the same set of tasks are mounted in a corresponding memcg.

Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

Anton Vorontsov

17 Nov 17 Nov

1:21 a.m.

On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:

...

...
...
I'm wondering if we should have more than three different levels.

In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

You were not Cc'ed, so let me repeat why I ended up w/ the levels (not necessary three levels), instead of relying on the 0..100 scale:

The main change is that I decided to go with discrete levels of the pressure.

When I started writing the man page, I had to describe the 'reclaimer inefficiency index', and while doing this I realized that I'm describing how the kernel is doing the memory management, which we try to avoid in the vmevent. And applications don't really care about these details: reclaimers, its inefficiency indexes, scanning window sizes, priority levels, etc. -- it's all "not interesting", and purely kernel's stuff. So I guess Mel Gorman was right, we need some sort of levels.

What applications (well, activity managers) are really interested in is this:

1. Do we we sacrifice resources for new memory allocations (e.g. files cache)? 2. Does the new memory allocations' cost becomes too high, and the system hurts because of this? 3. Are we about to OOM soon?

And here are the answers:

1. VMEVENT_PRESSURE_LOW 2. VMEVENT_PRESSURE_MED 3. VMEVENT_PRESSURE_OOM

There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI.

Later I came up with the fourth level:

Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild/balance pressure level).

I.e. the fourth level can serve as a two-way communication w/ the kernel. But again, this would be just an extension, I don't want to introduce this now.

...

...
...
Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean

You meant 'why not'?

...

mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

...
And it is not that moving a task to cpuset disallows you to do any of this: you could, as long as the same set of tasks are mounted in a corresponding memcg.

Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

There's also an API "inconvenince issue" with memcg's usage_in_bytes stuff: applications have a hard time resetting the threshold to 'emulate' the pressure notifications, and they also have to count bytes (like 'total - used = free') to set the threshold. While a separate 'pressure' notifications shows exactly what apps actually want to know: the pressure.

Thanks, Anton.

David Rientjes

18 Nov 18 Nov

10:53 p.m.

On Fri, 16 Nov 2012, Anton Vorontsov wrote:

...

The main change is that I decided to go with discrete levels of the pressure.

When I started writing the man page, I had to describe the 'reclaimer inefficiency index', and while doing this I realized that I'm describing how the kernel is doing the memory management, which we try to avoid in the vmevent. And applications don't really care about these details: reclaimers, its inefficiency indexes, scanning window sizes, priority levels, etc. -- it's all "not interesting", and purely kernel's stuff. So I guess Mel Gorman was right, we need some sort of levels.

What applications (well, activity managers) are really interested in is this:

Do we we sacrifice resources for new memory allocations (e.g. files cache)?

Does the new memory allocations' cost becomes too high, and the system hurts because of this?

Are we about to OOM soon?

And here are the answers:

VMEVENT_PRESSURE_LOW

VMEVENT_PRESSURE_MED

VMEVENT_PRESSURE_OOM

There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI.

Later I came up with the fourth level:

Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild/balance pressure level).

I.e. the fourth level can serve as a two-way communication w/ the kernel. But again, this would be just an extension, I don't want to introduce this now.

That certainly makes sense, it would be too much of a usage and maintenance burden to assume that the implementation of the VM is to remain the same.

...

...
The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean

You meant 'why not'?

Yes, sorry.

...

...
mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

There's also an API "inconvenince issue" with memcg's usage_in_bytes stuff: applications have a hard time resetting the threshold to 'emulate' the pressure notifications, and they also have to count bytes (like 'total

used = free') to set the threshold. While a separate 'pressure'

notifications shows exactly what apps actually want to know: the pressure.

Agreed.

Glauber Costa

19 Nov 19 Nov

2 p.m.

On 11/17/2012 05:21 AM, Anton Vorontsov wrote:

...

On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:

...
...
...
I'm wondering if we should have more than three different levels.

In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

You were not Cc'ed, so let me repeat why I ended up w/ the levels (not necessary three levels), instead of relying on the 0..100 scale:

The main change is that I decided to go with discrete levels of the pressure.

When I started writing the man page, I had to describe the 'reclaimer inefficiency index', and while doing this I realized that I'm describing how the kernel is doing the memory management, which we try to avoid in the vmevent. And applications don't really care about these details: reclaimers, its inefficiency indexes, scanning window sizes, priority levels, etc. -- it's all "not interesting", and purely kernel's stuff. So I guess Mel Gorman was right, we need some sort of levels.

What applications (well, activity managers) are really interested in is this:

Do we we sacrifice resources for new memory allocations (e.g. files cache)?

Does the new memory allocations' cost becomes too high, and the system hurts because of this?

Are we about to OOM soon?

And here are the answers:

VMEVENT_PRESSURE_LOW

VMEVENT_PRESSURE_MED

VMEVENT_PRESSURE_OOM

There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI.

Later I came up with the fourth level:

Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild/balance pressure level).

I.e. the fourth level can serve as a two-way communication w/ the kernel. But again, this would be just an extension, I don't want to introduce this now.

...
...
...
Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean

You meant 'why not'?

...
mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

...
And it is not that moving a task to cpuset disallows you to do any of this: you could, as long as the same set of tasks are mounted in a corresponding memcg.

Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

There's also an API "inconvenince issue" with memcg's usage_in_bytes stuff: applications have a hard time resetting the threshold to 'emulate' the pressure notifications, and they also have to count bytes (like 'total

used = free') to set the threshold. While a separate 'pressure'

notifications shows exactly what apps actually want to know: the pressure.

Anton,

The API you propose is way superior than memcg's current interface IMHO. That is why my proposal is to move memcg to yours, and deprecate the old interface.

We can do this easily by allowing writes to happen, and then moving them to the closest pressure bucket. More or less what was done for timers to reduce wakeups.

What I noted in a previous e-mail, is that memcg triggers notifications based on "usage" *before* the stock is drained. This means it can be wrong by as much as 32 * NR_CPUS * PAGE_SIZE, and so far, nobody seemed to care.

Glauber Costa

1:57 p.m.

On 11/17/2012 01:57 AM, David Rientjes wrote:

...

On Sat, 17 Nov 2012, Glauber Costa wrote:

...
...
I'm wondering if we should have more than three different levels.

In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

This is unrelated to what I am talking about. I am talking about pre-defined values with a specific event meaning (in his patchset, 3) vs arbitrary numbers valued in bytes.

...

...
...
Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

...
And it is not that moving a task to cpuset disallows you to do any of this: you could, as long as the same set of tasks are mounted in a corresponding memcg.

Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

I think we should shoot the duck where it is going, not where it is. A good interface is more important than overhead, since this overhead is by no means fundamental - memcg is fixable, and we would all benefit from it.

Now, whether or not memcg is the right interface is a different discussion - let's have it!

David Rientjes

20 Nov 20 Nov

6:02 p.m.

On Mon, 19 Nov 2012, Glauber Costa wrote:

...

...
...
In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

This is unrelated to what I am talking about. I am talking about pre-defined values with a specific event meaning (in his patchset, 3) vs arbitrary numbers valued in bytes.

Right, and I don't see how you can map the memcg thresholds onto Anton's scheme that heavily relies upon reclaim activity; what bucket does a threshold of 48MB in a memcg with a limit of 64MB fit into? Perhaps you have some formula in mind that would do this, but I don't see how it works correctly without factoring in configuration options (memory compaction), type of allocation (GFP_ATOMIC won't trigger Anton's reclaim scheme like GFP_KERNEL), altered min_free_kbytes, etc.

This begs the question of whether the new cgroup should be considered as a replacement for memory thresholds within memcg in the first place; certainly both can coexist just fine.

...

...
Same thing with a separate mempressure cgroup. The point is that there will be users of this cgroup that do not want the overhead imposed by memcg (which is why it's disabled in defconfig) and there's no direct dependency that causes it to be a part of memcg.

I think we should shoot the duck where it is going, not where it is. A good interface is more important than overhead, since this overhead is by no means fundamental - memcg is fixable, and we would all benefit from it.

Now, whether or not memcg is the right interface is a different discussion - let's have it!

I don't see memcg as being a prerequisite for any of this, I think Anton's cgroup can coexist with memcg thresholds, it allows for notifications in cpusets as well when they face memory pressure, and users need not enable memcg for this functionality (and memcg is pretty darn large in its memory footprint, I'd rather not see it fragmented either for something that can standalone with increased functionality).

But let's try the question in reverse: is there any specific reasons why this can't be implemented separately? I sure know the cpusets + no-memcg configuration would benefit from it.

Kirill A. Shutemov

21 Nov 21 Nov

9:30 a.m.

On Tue, Nov 20, 2012 at 10:02:45AM -0800, David Rientjes wrote:

...

On Mon, 19 Nov 2012, Glauber Costa wrote:

...
...
...
In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

This is unrelated to what I am talking about. I am talking about pre-defined values with a specific event meaning (in his patchset, 3) vs arbitrary numbers valued in bytes.

Right, and I don't see how you can map the memcg thresholds onto Anton's scheme

BTW, there's interface for OOM notification in memcg. See oom_control. I guess other pressure levels can also fit to the interface.

-- Kirill A. Shutemov

leonid.moiseichuk＠nokia.com

11:32 a.m.

-----Original Message----- From: ext Kirill A. Shutemov [mailto:kirill@shutemov.name] Sent: 21 November, 2012 11:31 ...

BTW, there's interface for OOM notification in memcg. See oom_control. I guess other pressure levels can also fit to the interface.

--- Hi,

I have tracking this conversation very little, but as person somehow related to this round of development and requestor of memcg notification mechanism in past (Kirill implemented that) I have to point there are reasons not to use memcg. The situation in latest kernels could be different but practically in past the following troubles were observed with memcg: 1. by default memcg is turned off on Android (at least on 4.1 I see it) 2. you need to produce memory partitioning, and that maybe non-trivial task in general case when apps/use cases are not so limited 3. memcg takes into account cached memory. Yes, you can play with MADV_DONTNEED as it was mentioned but in generic case that is insane 4. memcg need be extended in a way you need to track some other kinds of memory 5. in case of situation in some partition changed fast (e.g. process moved to another partition) it may cause pages trashing and device lock. The in-kernel lock was fixed in May 2012, but even pages trashing knock out device number of seconds (even minutes).

Thus, I would prefer to avoid memcg even it is powerful feature.

Memory notifications are quite irrelevant to partitioning and cgroups. The use-case is related to user-space handling low memory. Meaning the functionality should be accurate with specific granularity (e.g. 1 MB) and time (0.25s is OK) but better to have it as simple and battery-friendly. I prefer to have pseudo-device-based text API because it is easy to debug and investigate. It would be nice if it will be possible to use simple scripting to point what kind of memory on which levels need to be tracked but private/shared dirty is #1 and memcg cannot handle it.

There are two use-cases related to this notification feature:

1. direct usage -> reaction to coming low memory situation and do something ahead of time. E.g. system calibrated to 80% dirty memory border, and if we crossed it we can compensate device slowness by flushing application caches, closing background images even notify user but without killing apps by any OOM killer and corruption unsaved data.

2. permission to do some heavy actions. If memory level is low enough for some application use case (e.g. 50 MB available) application can start heavy use-case, otherwise - do something to prevent potential problems.

So, seems to me, the levels depends from application memory usage e.g. calculator does not need memory information but browser and image gallery needs. Thus, tracking daemons in user-space looks as overhead, and such construction we used in n900 (ke-recv -> dbus -> apps) is quite fragile and slow.

These bits [1] was developed initially for n9 to replace memcg notifications with great support of kernel community about a year ago. Unfortunately for n9 I was a bit late and code was integrated to another product's kernel (say M), but at last Summer project M was forced to die due to moving product line to W. Practically arm device produced signals ON/OFF which fit well into space/time requirements, so I have what I need. Even it is quite primitive code but I prefer do not over-engineering complexity without necessity.

Best Wishes, Leonid

PS: but seems code related to vmpressure_fd solves some other problem so you can ignore my speech.

[1] http://maemo.gitorious.org/maemo-tools/libmemnotify/blobs/master/src/kernel/...

Glauber Costa

11:54 a.m.

Hi,

...

Memory notifications are quite irrelevant to partitioning and cgroups. The use-case is related to user-space handling low memory. Meaning the functionality should be accurate with specific granularity (e.g. 1 MB) and time (0.25s is OK) but better to have it as simple and battery-friendly. I prefer to have pseudo-device-based text API because it is easy to debug and investigate. It would be nice if it will be possible to use simple scripting to point what kind of memory on which levels need to be tracked but private/shared dirty is #1 and memcg cannot handle it.

If that is the case, then fine. The reason I jumped in talking about memcg, is that it was mentioned that at some point we'd like to have those notifications on a per-group basis.

So I'll say it again: if this is always global, there is no reason any cgroup needs to be involved. If this turns out to be per-process, as Anton suggested in a recent e-mail, I don't see any reason to have cgroups involved as well.

But if this needs to be extended to be per-cgroup, then past experience shows that we need to be really careful not to start duplicating infrastructure, and creating inter-dependencies like it happened to other groups in the past.

leonid.moiseichuk＠nokia.com

1:48 p.m.

-----Original Message----- From: ext Glauber Costa [mailto:glommer@parallels.com] Sent: 21 November, 2012 13:55 .... So I'll say it again: if this is always global, there is no reason any cgroup needs to be involved. If this turns out to be per-process, as Anton suggested in a recent e-mail, I don't see any reason to have cgroups involved as well. -----

Per-process memory tracking has no much sense: process should consume all available memory but work fast. Also this approach required knowledge about process deps to take into account dependencies e.g. in dbus or Xorg. If you need to know how much memory process consumed in particular moment you can use /proc/self/smaps, that is easier.

Best Wishes, Leonid

Michal Hocko

26 Nov 26 Nov

9:35 p.m.

[Sorry to jump in that late]

On Tue 20-11-12 10:02:45, David Rientjes wrote:

...

On Mon, 19 Nov 2012, Glauber Costa wrote:

...
...
...
In the case I outlined below, for backwards compatibility. What I actually mean is that memcg *currently* allows arbitrary notifications. One way to merge those, while moving to a saner 3-point notification, is to still allow the old writes and fit them in the closest bucket.

Yeah, but I'm wondering why three is the right answer.

This is unrelated to what I am talking about. I am talking about pre-defined values with a specific event meaning (in his patchset, 3) vs arbitrary numbers valued in bytes.

Right, and I don't see how you can map the memcg thresholds onto Anton's scheme that heavily relies upon reclaim activity; what bucket does a threshold of 48MB in a memcg with a limit of 64MB fit into? Perhaps you have some formula in mind that would do this, but I don't see how it works correctly without factoring in configuration options (memory compaction), type of allocation (GFP_ATOMIC won't trigger Anton's reclaim scheme like GFP_KERNEL), altered min_free_kbytes, etc.

This begs the question of whether the new cgroup should be considered as a replacement for memory thresholds within memcg in the first place; certainly both can coexist just fine.

Absolutely agreed. Yes those two things are inherently different. Information that "you have passed half of your limit" is something totally different than "you should slow down". Although I am not entirely sure what the first is one good for (to be honest), but I believe there are users out there.

I do not think that mixing those two makes much sense. They have different usecases and until we have users for the thresholds one we should keep it.

[...]

Thanks

-- Michal Hocko SUSE Labs

Glauber Costa

19 Nov 19 Nov

2:19 p.m.

...

...
...
Umm, why do users of cpusets not want to be able to trigger memory pressure notifications?

Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

Forgot this one:

Because there is a huge ongoing work going on by Tejun aiming at reducing the effects of orthogonal hierarchy. There are many controllers today that are "close enough" to each other (cpu, cpuacct; net_prio, net_cls), and in practice, it brought more problems than it solved.

So yes, *maybe* mempressure is the answer, but it need to be justified with care. Long term, I think a saner notification API for memcg will lead us to a better and brighter future.

There is also yet another aspect: This scheme works well for global notifications. If we would always want this to be global, this would work neatly. But as already mentioned in this thread, at some point we'll want this to work for a group of processes as well. At that point, you'll have to count how much memory is being used, so you can determine whether or not pressure is going on. You will, then, have to redo all the work memcg already does.

David Rientjes

20 Nov 20 Nov

6:23 p.m.

On Mon, 19 Nov 2012, Glauber Costa wrote:

...

...
...
Because cpusets only deal with memory placement, not memory usage.

The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system may have a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why implement a simple and clean mempressure cgroup that can be used alone or co-existing with either memcg or cpusets?

Forgot this one:

Because there is a huge ongoing work going on by Tejun aiming at reducing the effects of orthogonal hierarchy. There are many controllers today that are "close enough" to each other (cpu, cpuacct; net_prio, net_cls), and in practice, it brought more problems than it solved.

I'm very happy that Tejun is working on that, but I don't see how it's relevant here: I'm referring to users who are not using memcg specifically. This is what others brought up earlier in the thread: they do not want to be required to use memcg for this functionality.

There are users of cpusets today that do not enable nor comount memcg. I argue that a mempressure cgroup allows them this functionality without the memory footprint of memcg (not only in text, but requiring page_cgroup). Additionally, there are probably users who do not want either cpusets or memcg and want notifications from mempressure at a global level. Users who care so much about the memory pressure of their systems probably have strict footprint requirements, it would be a complete shame to require a semi-tractor trailer when all I want is a compact car.

...

So yes, *maybe* mempressure is the answer, but it need to be justified with care. Long term, I think a saner notification API for memcg will lead us to a better and brighter future.

You can easily comount mempressure with your memcg, this is not anything new.

...

There is also yet another aspect: This scheme works well for global notifications. If we would always want this to be global, this would work neatly. But as already mentioned in this thread, at some point we'll want this to work for a group of processes as well. At that point, you'll have to count how much memory is being used, so you can determine whether or not pressure is going on. You will, then, have to redo all the work memcg already does.

Anton can correct me if I'm wrong, but I certainly don't think this is where mempressure is headed: I don't think any accounting needs to be done and, if it is, it's a design issue that should be addressed now rather than later. I believe notifications should occur on current's mempressure cgroup depending on its level of reclaim: nobody cares if your memcg has a limit of 64GB when you only have 32GB of RAM, we'll want the notification.

Glauber Costa

21 Nov 21 Nov

8:27 a.m.

On 11/20/2012 10:23 PM, David Rientjes wrote:

...

Anton can correct me if I'm wrong, but I certainly don't think this is where mempressure is headed: I don't think any accounting needs to be done and, if it is, it's a design issue that should be addressed now rather than later. I believe notifications should occur on current's mempressure cgroup depending on its level of reclaim: nobody cares if your memcg has a limit of 64GB when you only have 32GB of RAM, we'll want the notification.

My main concern is that to trigger those notifications, one would have to first determine whether or not the particular group of tasks is under pressure. And to do that, we need to somehow know how much memory we are using, and how much we are reclaiming, etc. On a system-wide level, we have this information. On a grouplevel, this is already accounted by memcg.

In fact, the current code already seems to rely on memcg:

+ vmpressure(sc->target_mem_cgroup, + sc->nr_scanned - nr_scanned, nr_reclaimed);

Now, let's start simple: Assume we will have a different cgroup. We want per-group pressure notifications for that group. How would you determine that the specific group is under pressure?

Anton Vorontsov

8:46 a.m.

On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:

...

On 11/20/2012 10:23 PM, David Rientjes wrote:

...
Anton can correct me if I'm wrong, but I certainly don't think this is where mempressure is headed: I don't think any accounting needs to be done

Yup, I'd rather not do any accounting, at least not in bytes.

...

...
and, if it is, it's a design issue that should be addressed now rather than later. I believe notifications should occur on current's mempressure cgroup depending on its level of reclaim: nobody cares if your memcg has a limit of 64GB when you only have 32GB of RAM, we'll want the notification.

My main concern is that to trigger those notifications, one would have to first determine whether or not the particular group of tasks is under pressure.

As far as I understand, the notifications will be triggered by a process that tries to allocate memory. So, effectively that would be a per-process pressure.

So, if one process in a group is suffering, we notify that "a process in a group is under pressure", and the notification goes to a cgroup listener

...

And to do that, we need to somehow know how much memory we are using, and how much we are reclaiming, etc. On a system-wide level, we have this information. On a grouplevel, this is already accounted by memcg.

In fact, the current code already seems to rely on memcg:
vmpressure(sc->target_mem_cgroup,
   sc->nr_scanned - nr_scanned, nr_reclaimed);

Well, I'm yet unsure about the details, but I guess in "mempressure" cgroup approach, this will be derived from the current->, i.e. a task.

But note that we won't report pressure to a memcg cgroup, we will notify only mempressure cgroup. But a process can be in both of them simultaneously. In the code, the mempressure and memcg will not depend on each other.

...

Now, let's start simple: Assume we will have a different cgroup. We want per-group pressure notifications for that group. How would you determine that the specific group is under pressure?

If a process that tries to allocate memory & causes reclaim is a part of the cgroup, then cgroup has a pressure.

At least that's very brief understanding of the idea, details to be investigated... But I welcome David to comment whether I got everything correctly. :)

Thanks, Anton.

Glauber Costa

9:25 a.m.

On 11/21/2012 12:46 PM, Anton Vorontsov wrote:

...

On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:

...
On 11/20/2012 10:23 PM, David Rientjes wrote:

...
Anton can correct me if I'm wrong, but I certainly don't think this is where mempressure is headed: I don't think any accounting needs to be done

Yup, I'd rather not do any accounting, at least not in bytes.

It doesn't matter here, but memcg doesn't do any accounting in bytes as well. It only display it in bytes, but internally, it's all pages. The bytes representation is convenient, because then you can be agnostic of page sizes.

...

...
...
and, if it is, it's a design issue that should be addressed now rather than later. I believe notifications should occur on current's mempressure cgroup depending on its level of reclaim: nobody cares if your memcg has a limit of 64GB when you only have 32GB of RAM, we'll want the notification.

My main concern is that to trigger those notifications, one would have to first determine whether or not the particular group of tasks is under pressure.

As far as I understand, the notifications will be triggered by a process that tries to allocate memory. So, effectively that would be a per-process pressure.

So, if one process in a group is suffering, we notify that "a process in a group is under pressure", and the notification goes to a cgroup listener

If you effectively have a per-process mechanism, why do you need an extra cgroup at all?

It seems to me that this is simply something that should be inherited over fork, and then you register the notifier in your first process, and it will be valid for everybody in the process tree.

If you need tasks in different processes to respond to the same notifier, then you just register the same notifier in two different processes.

Anton Vorontsov

7 Nov 7 Nov

11:43 a.m.

On Wed, Nov 07, 2012 at 01:21:36PM +0200, Kirill A. Shutemov wrote: [...]

...

Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

There are a few reasons we don't use cgroup notifications:

1. We're not interested in the absolute number of pages/KB of available memory, as provided by cgroup memory controller. What we're interested in is the amount of easily reclaimable memory and new memory allocations' cost.

We can have plenty of "free" memory, of which say 90% will be caches, and say 10% idle. But we do want to differentiate these types of memory (although not going into details about it), i.e. we want to get notified when kernel is reclaiming. And we also want to know when the memory comes from swapping others' pages out (well, actually we don't call it swap, it's "new allocations cost becomes high" -- it might be a result of many factors (swapping, fragmentation, etc.) -- and userland might analyze the situation when this happens).

Exposing all the VM details to userland is not an option -- it is not possible to build a stable ABI on this. Plus, it makes it really hard for userland to deal with all the low level details of Linux VM internals.

So, no, raw numbers of "free/used KBs" are not interesting at all.

1.5. But it is important to understand that vmpressure_fd() is not orthogonal to cgroups (like it was with vmevent_fd()). We want it to be "cgroup'able" too. :) But optionally.

2. The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered.

But no, this is not the main issue -- per "1.", we're not interested in kilobytes.

3. Some folks don't like cgroups: it has a penalty for kernel size, for performance and memory wastage. But again, it's not the main issue with memcg.

Thanks, Anton.

Kirill A. Shutemov

12:11 p.m.

On Wed, Nov 07, 2012 at 03:43:46AM -0800, Anton Vorontsov wrote:

...

On Wed, Nov 07, 2012 at 01:21:36PM +0200, Kirill A. Shutemov wrote: [...]

...
Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

There are a few reasons we don't use cgroup notifications:

We're not interested in the absolute number of pages/KB of available memory, as provided by cgroup memory controller. What we're interested in is the amount of easily reclaimable memory and new memory allocations' cost.

We can have plenty of "free" memory, of which say 90% will be caches, and say 10% idle. But we do want to differentiate these types of memory (although not going into details about it), i.e. we want to get notified when kernel is reclaiming. And we also want to know when the memory comes from swapping others' pages out (well, actually we don't call it swap, it's "new allocations cost becomes high" -- it might be a result of many factors (swapping, fragmentation, etc.) -- and userland might analyze the situation when this happens).

Exposing all the VM details to userland is not an option

IIUC, you want MemFree + Buffers + Cached + SwapCached, right? It's already exposed to userspace.

...

-- it is not possible to build a stable ABI on this. Plus, it makes it really hard for userland to deal with all the low level details of Linux VM internals.

So, no, raw numbers of "free/used KBs" are not interesting at all.

1.5. But it is important to understand that vmpressure_fd() is not orthogonal to cgroups (like it was with vmevent_fd()). We want it to be "cgroup'able" too. :) But optionally.

The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered.

Almost there. Glauber works on it.

-- Kirill A. Shutemov

Anton Vorontsov

12:28 p.m.

On Wed, Nov 07, 2012 at 02:11:10PM +0200, Kirill A. Shutemov wrote: [...]

...

...
We can have plenty of "free" memory, of which say 90% will be caches, and say 10% idle. But we do want to differentiate these types of memory (although not going into details about it), i.e. we want to get notified when kernel is reclaiming. And we also want to know when the memory comes from swapping others' pages out (well, actually we don't call it swap, it's "new allocations cost becomes high" -- it might be a result of many factors (swapping, fragmentation, etc.) -- and userland might analyze the situation when this happens).

Exposing all the VM details to userland is not an option

IIUC, you want MemFree + Buffers + Cached + SwapCached, right? It's already exposed to userspace.

How? If you mean vmstat, then no, that interface is not efficient at all: we have to poll it from userland, which is no go for embedded (although, as a workaround it can be done via deferrable timers in userland, which I posted a few months ago).

But even with polling vmstat via deferrable timers, it leaves us with the ugly timers-based approach (and no way to catch the pre-OOM conditions). With vmpressure_fd() we have the synchronous notifications right from the core (upon which, you can, if you want to, analyze the vmstat).

...

...

The last time I checked, cgroups memory controller did not (and I guess still does not) not account kernel-owned slabs. I asked several times why so, but nobody answered.

Almost there. Glauber works on it.

It's good to hear, but still, the number of "used KBs" is a bad (or irrelevant) metric for the pressure. We'd still need to analyze the memory in more details, and "'limit - used' KBs" doesn't tell us anything about the cost of the available memory.

Thanks, Anton.

Greg Thelen

5:20 p.m.

On Wed, Nov 07 2012, Kirill A. Shutemov wrote:

...

On Wed, Nov 07, 2012 at 02:53:49AM -0800, Anton Vorontsov wrote:

...
Hi all,

This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

As well as Minchan, KOSAKI Motohiro didn't like the timers, so the timers are gone now;

Pekka Enberg didn't like the complex attributes matching code, and so it is no longer there;

Nobody liked the raw vmstat attributes, and so they were eliminated too.

But, conceptually, it is the exactly the same approach as in v2: three discrete levels of the pressure -- low, medium and oom. The levels are based on the reclaimer inefficiency index as proposed by Mel Gorman, but userland does not see the raw index values. The description why I moved away from reporting the raw 'reclaimer inefficiency index' can be found in v2: http://lkml.org/lkml/2012/10/22/177

While the new API is very simple, it is still extensible (i.e. versioned).

Sorry, I didn't follow previous discussion on this, but could you explain what's wrong with memory notifications from memcg? As I can see you can get pretty similar functionality using memory thresholds on the root cgroup. What's the point?

Related question: are there plans to extend this system call to provide per-cgroup vm pressure notification?

Pekka Enberg

8:52 p.m.

Hi Greg,

On 11/7/12 7:20 PM, Greg Thelen wrote:

...

Related question: are there plans to extend this system call to provide per-cgroup vm pressure notification?

Yes, that's something that needs to be addressed before we can ever consider merging something like this to mainline. We probably need help with that, though. Preferably from someone who knows cgroups. :-)

Pekka

Pekka Enberg

11:30 a.m.

Hi Anton,

On Wed, Nov 7, 2012 at 12:53 PM, Anton Vorontsov anton.vorontsov@linaro.org wrote:

...

This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

As well as Minchan, KOSAKI Motohiro didn't like the timers, so the timers are gone now;

Pekka Enberg didn't like the complex attributes matching code, and so it is no longer there;

Nobody liked the raw vmstat attributes, and so they were eliminated too.

I love the API and implementation simplifications but I hate the new ABI. It's a specialized, single-purpose syscall and bunch of procfs tunables and I don't see how it's 'extensible' to anything but VM

If people object to vmevent_fd() system call, we should consider using something more generic like perf_event_open() instead of inventing our own special purpose ABI.

Pekka

Pekka Enberg

11:31 a.m.

On Wed, Nov 7, 2012 at 1:30 PM, Pekka Enberg penberg@kernel.org wrote:

...

I love the API and implementation simplifications but I hate the new ABI. It's a specialized, single-purpose syscall and bunch of procfs tunables and I don't see how it's 'extensible' to anything but VM

s/anything but VM/anything but VM pressure notification/

Anton Vorontsov

12:06 p.m.

On Wed, Nov 07, 2012 at 01:30:16PM +0200, Pekka Enberg wrote: [...]

...

I love the API and implementation simplifications but I hate the new ABI. It's a specialized, single-purpose syscall and bunch of procfs tunables and I don't see how it's 'extensible' to anything but VM

It is extensible to VM pressure notifications, yeah. We're probably not going to add the raw vmstat values to it (and that's why we changed the name). But having three levels is not the best thing we can do -- we can do better. As I described here:

http://lkml.org/lkml/2012/10/25/115

That is, later we might want to tell the kernel how much reclaimable memory userland has. So this can be two-way communication, which to me sounds pretty cool. :) And who knows what we'll do after that.

But these are just plans. We might end up not having this, but we always have an option to have it one day.

...

If people object to vmevent_fd() system call, we should consider using something more generic like perf_event_open() instead of inventing our own special purpose ABI.

Ugh. While I *love* perf, but, IIUC, it was designed for other things: handling tons of events, so it has many stuff that are completely unnecessary here: we don't need ring buffers, formats, 7+k LOC, etc. Folks will complain that we need the whole perf stuff for such a simple thing (just like cgroups).

Also note that for pre-OOM we have to be really fast, i.e. use shortest possible path (and, btw, that's why in this version the read() now can be blocking -- and so we no longer have to do two poll()+read() syscalls, just single read is now possible).

So I really don't see the need for perf here: it doesn't result in any code reuse, but instead it just complicates our task. As for ABI maintenance point of view, it is just the same thing as the dedicated syscall.

Thanks, Anton.

Luiz Capitulino

9 Nov 9 Nov

8:32 a.m.

Hi Anton,

On Wed, 7 Nov 2012 02:53:49 -0800 Anton Vorontsov anton.vorontsov@linaro.org wrote:

...

Hi all,

This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

Which tree is this against? I'd like to try this series, but it doesn't apply to Linus tree.

Anton Vorontsov

9:04 a.m.

On Fri, Nov 09, 2012 at 09:32:03AM +0100, Luiz Capitulino wrote:

...

Anton Vorontsov anton.vorontsov@linaro.org wrote:

...
This is the third RFC. As suggested by Minchan Kim, the API is much simplified now (comparing to vmevent_fd):

Which tree is this against? I'd like to try this series, but it doesn't apply to Linus tree.

Thanks for trying!

The tree is a mix of Pekka's linux-vmevent tree and Linus' tree. You can just clone my tree to get the whole thing:

git://git.infradead.org/users/cbou/linux-vmevent.git

Note that the tree is rebasable. Also be sure to select CONFIG_VMPRESSURE, not CONFIG_VMEVENT.

Thanks! Anton.

4725

days inactive

4744

days old

linaro-kernel@lists.linaro.org

52 comments

participants

tags (0)

participants (14)

Andrew Morton
Anton Vorontsov
Anton Vorontsov
David Rientjes
Glauber Costa
Greg Thelen
Jonathan Corbet
Kirill A. Shutemov
leonid.moiseichuk＠nokia.com
Luiz Capitulino
Mel Gorman
Michal Hocko
Pekka Enberg
Rik van Riel