Hi all,
So this is the second RFC. The main change is that I decided to go with discrete pressure levels.
When I started writing the man page, I had to describe the 'reclaimer inefficiency index', and while doing this I realized that I was describing how the kernel does its memory management, which is exactly what we try to avoid exposing in vmevent. And applications don't really care about these details: reclaimers, their inefficiency indexes, scanning window sizes, priority levels, etc. -- it's all "not interesting", purely kernel-internal stuff. So I guess Mel Gorman was right, we need some sort of levels.
What applications (well, activity managers) are really interested in is this:
1. Do we sacrifice resources for new memory allocations (e.g. file caches)?
2. Does the cost of new memory allocations become too high, so the system hurts because of it?
3. Are we about to OOM soon?
And here are the answers:
1. VMEVENT_PRESSURE_LOW
2. VMEVENT_PRESSURE_MED
3. VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition for it, but it's possible to introduce new levels later without breaking the ABI. The levels are described in more detail in the patches, and the thresholds are still tunable, but now via sysctls, not via the vmevent_fd() call itself (i.e. we don't need to rebuild applications to adjust the window size or other mm "details").
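For illustration, this is roughly how an activity manager might react to the three levels after reading a pressure value from its vmevent fd (just a sketch; the helper functions are hypothetical placeholders for whatever the manager does to release memory):

        switch (pressure) {
        case VMEVENT_PRESSURE_LOW:
                /* Kernel reclaims for new allocations: watch the cache level. */
                trim_own_caches();              /* hypothetical helper */
                break;
        case VMEVENT_PRESSURE_MED:
                /*
                 * Allocations are getting expensive: drop anything that can
                 * be rebuilt or re-read from disk.
                 */
                release_reconstructible_data(); /* hypothetical helper */
                break;
        case VMEVENT_PRESSURE_OOM:
                /*
                 * We are about to OOM: free whatever is possible, e.g. kill
                 * unneeded background services.
                 */
                kill_background_services();     /* hypothetical helper */
                break;
        }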
What I couldn't fix in this RFC is making the vmevent_{scanned,reclaimed} counters per-CPU (there's a comment describing the problem with this). But I made the path lockless and tried to make it very lightweight (plus I moved the vmevent_pressure() call to a "colder" path).
Thanks, Anton.
This patch introduces VMEVENT_ATTR_PRESSURE, an attribute that reports the Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaim activity might be useful for maintaining the overall system cache level.

VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure; there is some mild swapping activity. Upon this event, applications may decide to free any resources that can be easily reconstructed or re-read from disk.

VMEVENT_PRESSURE_OOM: The system is actively thrashing, it is about to run out of memory (OOM), or the in-kernel OOM killer is about to trigger. Applications should do whatever they can to help the system.
There are three sysctls to tune the behaviour of the levels:
vmevent_window
vmevent_level_med
vmevent_level_oom
Currently the vmevent pressure levels are based on the reclaimer inefficiency index (range from 0 to 100). The index shows the relative time spent by the kernel uselessly scanning pages or, in other words, the percentage of scanned pages (out of vmevent_window) that were not reclaimed. The higher the index, the higher the cost of new allocations.

The files vmevent_level_med and vmevent_level_oom accept the index values (by default set to 60 and 99 respectively). A non-existent vmevent_level_low tunable is implicitly always 0.

When the index equals 0, the kernel is reclaiming, but every scanned page has been successfully reclaimed (so the pressure is low). 100 means that the kernel is trying to reclaim, but nothing can be reclaimed (close to OOM).

The window size is used as a rate-limit tunable for VMEVENT_PRESSURE_LOW notifications and as an averaging window for the VMEVENT_PRESSURE_{MED,OOM} levels. So, using small window sizes can cause a lot of false positives for the _MED and _OOM levels, but too big a window size may delay notifications.

By default the window size equals 256 pages (1MB).
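For illustration, with the calculation used by this patch (see vmevent_calc_pressure() below) and purely example numbers:

        /*
         * window = 256 (the default), scanned = 256, reclaimed = 100:
         *
         *      p = window - reclaimed * window / scanned
         *        = 256 - 100 * 256 / 256 = 156
         *      p = p * 100 / window
         *        = 156 * 100 / 256 = 60        (integer math)
         *
         * 60 >= vmevent_level_med (default 60), so this window reports
         * VMEVENT_PRESSURE_MED; the index would have to reach 99 to
         * report VMEVENT_PRESSURE_OOM.
         */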
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 Documentation/sysctl/vm.txt |  37 +++++++++++++++
 include/linux/vmevent.h     |  42 +++++++++++++++++
 kernel/sysctl.c             |  24 ++++++++++
 mm/vmevent.c                | 107 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                 |  19 ++++++++
 5 files changed, 229 insertions(+)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 078701f..ff0023b 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -44,6 +44,9 @@ Currently, these files are in /proc/sys/vm: - nr_overcommit_hugepages - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order +- vmevent_window +- vmevent_level_med +- vmevent_level_oom - oom_dump_tasks - oom_kill_allocating_task - overcommit_memory @@ -487,6 +490,40 @@ this is causing problems for your system/application.
==============================================================
+vmevent_window +vmevent_level_med +vmevent_level_oom + +These sysctls are used to tune vmevent_fd(2) behaviour. + +Currently vmevent pressure levels are based on the reclaimer inefficiency +index (range from 0 to 100). The files vmevent_level_med and +vmevent_level_oom accept the index values (by default set to 60 and 99 +respectively). A non-existent vmevent_level_low tunable is always set to 0 + +When the system is short on idle pages, the new memory is allocated by +reclaiming least recently used resources: kernel scans pages to be +reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and +potentially swapping some pages out), and the index shows the relative +time spent by the kernel uselessly scanning pages, or, in other words, the +percentage of scans of pages (vmevent_window) that were not reclaimed. The +higher the index, the more it should be evident that new allocations' cost +becomes higher. + +When index equals to 0, this means that the kernel is reclaiming, but +every scanned page has been successfully reclaimed (so the pressure is +low). 100 means that the kernel is trying to reclaim, but nothing can be +reclaimed (close to OOM). + +Window size is used as a rate-limit tunable for VMEVENT_PRESSURE_LOW +notifications and for averaging for VMEVENT_PRESSURE_{MED,OOM} levels. So, +using small window sizes can cause lot of false positives for _MED and +_OOM levels, but too big window size may delay notifications. + +By default the window size equals to 256 pages (1MB). + +============================================================== + oom_dump_tasks
Enables a system-wide task dump (excluding kernel threads) to be diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index b1c4016..a0e6641 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -10,10 +10,18 @@ enum { VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL, VMEVENT_ATTR_NR_FREE_PAGES = 2UL, VMEVENT_ATTR_NR_SWAP_PAGES = 3UL, + VMEVENT_ATTR_PRESSURE = 4UL,
VMEVENT_ATTR_MAX /* non-ABI */ };
+/* We spread the values, reserving room for new levels, if ever needed. */
+enum {
+	VMEVENT_PRESSURE_LOW	= 1 << 10,
+	VMEVENT_PRESSURE_MED	= 1 << 11,
+	VMEVENT_PRESSURE_OOM	= 1 << 12,
+};
+
 /*
  * Attribute state bits for threshold
  */
@@ -97,4 +105,38 @@ struct vmevent_event {
 	struct vmevent_attr attrs[];
 };
+#ifdef __KERNEL__ + +struct mem_cgroup; + +extern void __vmevent_pressure(struct mem_cgroup *memcg, + ulong scanned, + ulong reclaimed); + +static inline void vmevent_pressure(struct mem_cgroup *memcg, + ulong scanned, + ulong reclaimed) +{ + if (!scanned) + return; + + if (IS_BUILTIN(CONFIG_MEMCG) && memcg) { + /* + * The vmevent API reports system pressure, for per-cgroup + * pressure, we'll chain cgroups notifications, this is to + * be implemented. + * + * memcg_vm_pressure(target_mem_cgroup, scanned, reclaimed); + */ + return; + } + __vmevent_pressure(memcg, scanned, reclaimed); +} + +extern uint vmevent_window; +extern uint vmevent_level_med; +extern uint vmevent_level_oom; + +#endif + #endif /* _LINUX_VMEVENT_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 87174ef..e00d3fb 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -50,6 +50,7 @@ #include <linux/dnotify.h> #include <linux/syscalls.h> #include <linux/vmstat.h> +#include <linux/vmevent.h> #include <linux/nfs_fs.h> #include <linux/acpi.h> #include <linux/reboot.h> @@ -1317,6 +1318,29 @@ static struct ctl_table vm_table[] = { .proc_handler = numa_zonelist_order_handler, }, #endif +#ifdef CONFIG_VMEVENT + { + .procname = "vmevent_window", + .data = &vmevent_window, + .maxlen = sizeof(vmevent_window), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "vmevent_level_med", + .data = &vmevent_level_med, + .maxlen = sizeof(vmevent_level_med), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "vmevent_level_oom", + .data = &vmevent_level_oom, + .maxlen = sizeof(vmevent_level_oom), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \ (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL)) { diff --git a/mm/vmevent.c b/mm/vmevent.c index 8195897..11ce5ef 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -4,6 +4,7 @@ #include <linux/vmevent.h> #include <linux/syscalls.h> #include <linux/workqueue.h> +#include <linux/mutex.h> #include <linux/file.h> #include <linux/list.h> #include <linux/poll.h> @@ -28,8 +29,22 @@ struct vmevent_watch {
/* poll */ wait_queue_head_t waitq; + + /* Our node in the pressure watchers list. */ + struct list_head pwatcher; };
+static atomic64_t vmevent_pressure_sr;
+static uint vmevent_pressure_val;
+
+static LIST_HEAD(vmevent_pwatchers);
+static DEFINE_MUTEX(vmevent_pwatchers_lock);
+
+/* Our sysctl tunables, see Documentation/sysctl/vm.txt */
+uint __read_mostly vmevent_window = SWAP_CLUSTER_MAX * 16;
+uint vmevent_level_med = 60;
+uint vmevent_level_oom = 99;
+
 typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch,
				       struct vmevent_attr *attr);
@@ -97,10 +112,21 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch, return totalram_pages; }
+static u64 vmevent_attr_pressure(struct vmevent_watch *watch,
+				 struct vmevent_attr *attr)
+{
+	if (vmevent_pressure_val >= vmevent_level_oom)
+		return VMEVENT_PRESSURE_OOM;
+	else if (vmevent_pressure_val >= vmevent_level_med)
+		return VMEVENT_PRESSURE_MED;
+	return VMEVENT_PRESSURE_LOW;
+}
+
 static vmevent_attr_sample_fn attr_samplers[] = {
 	[VMEVENT_ATTR_NR_AVAIL_PAGES]	= vmevent_attr_avail_pages,
 	[VMEVENT_ATTR_NR_FREE_PAGES]	= vmevent_attr_free_pages,
 	[VMEVENT_ATTR_NR_SWAP_PAGES]	= vmevent_attr_swap_pages,
+	[VMEVENT_ATTR_PRESSURE]		= vmevent_attr_pressure,
 };
static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr) @@ -239,6 +265,73 @@ static void vmevent_start_timer(struct vmevent_watch *watch) vmevent_schedule_watch(watch); }
+static uint vmevent_calc_pressure(uint win, uint s, uint r)
+{
+	ulong p;
+
+	/*
+	 * We calculate the ratio (in percents) of how many pages were
+	 * scanned vs. reclaimed in a given time frame (window). Note that
+	 * time is in VM reclaimer's "ticks", i.e. number of pages
+	 * scanned. This makes it possible set desired reaction time and
+	 * serves as a ratelimit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;
+
+	pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+	return p;
+}
+
+#define VMEVENT_SCANNED_SHIFT (sizeof(u64) * 8 / 2)
+
+static void vmevent_pressure_wk_fn(struct work_struct *wk)
+{
+	struct vmevent_watch *watch;
+	u64 sr = atomic64_xchg(&vmevent_pressure_sr, 0);
+	u32 s = sr >> VMEVENT_SCANNED_SHIFT;
+	u32 r = sr & (((u64)1 << VMEVENT_SCANNED_SHIFT) - 1);
+
+	vmevent_pressure_val = vmevent_calc_pressure(vmevent_window, s, r);
+
+	mutex_lock(&vmevent_pwatchers_lock);
+	list_for_each_entry(watch, &vmevent_pwatchers, pwatcher)
+		vmevent_sample(watch);
+	mutex_unlock(&vmevent_pwatchers_lock);
+}
+static DECLARE_WORK(vmevent_pressure_wk, vmevent_pressure_wk_fn);
+
+void __vmevent_pressure(struct mem_cgroup *memcg,
+			ulong scanned,
+			ulong reclaimed)
+{
+	/*
+	 * Store s/r combined, so we don't have to worry to synchronize
+	 * them. On modern machines it will be truly atomic; on arches w/o
+	 * 64 bit atomics it will turn into a spinlock (for a small amount
+	 * of CPUs it's not a problem).
+	 *
+	 * Using int-sized atomics is a bad idea as it would only allow to
+	 * count (1 << 16) - 1 pages (256MB), which we can scan pretty
+	 * fast.
+	 *
+	 * We can't have per-CPU counters as this will not catch a case
+	 * when many CPUs scan small amounts (so none of them hit the
+	 * window size limit, and thus we won't send a notification in
+	 * time).
+	 *
+	 * So we shouldn't place vmevent_pressure() into a very hot path.
+	 */
+	atomic64_add(scanned << VMEVENT_SCANNED_SHIFT | reclaimed,
+		     &vmevent_pressure_sr);
+
+	scanned = atomic64_read(&vmevent_pressure_sr) >> VMEVENT_SCANNED_SHIFT;
+	if (scanned >= vmevent_window &&
+	    !work_pending(&vmevent_pressure_wk))
+		schedule_work(&vmevent_pressure_wk);
+}
+
 static unsigned int vmevent_poll(struct file *file, poll_table *wait)
 {
 	struct vmevent_watch *watch = file->private_data;
@@ -300,6 +393,11 @@ static int vmevent_release(struct inode *inode, struct file *file)
cancel_delayed_work_sync(&watch->work);
+ if (watch->pwatcher.next) { + mutex_lock(&vmevent_pwatchers_lock); + list_del(&watch->pwatcher); + mutex_unlock(&vmevent_pwatchers_lock); + } kfree(watch);
return 0; @@ -328,6 +426,7 @@ static int vmevent_setup_watch(struct vmevent_watch *watch) { struct vmevent_config *config = &watch->config; struct vmevent_attr *attrs = NULL; + bool pwatcher = 0; unsigned long nr; int i;
@@ -340,6 +439,8 @@ static int vmevent_setup_watch(struct vmevent_watch *watch)
if (attr->type >= VMEVENT_ATTR_MAX) continue; + else if (attr->type == VMEVENT_ATTR_PRESSURE) + pwatcher = 1;
size = sizeof(struct vmevent_attr) * (nr + 1);
@@ -363,6 +464,12 @@ static int vmevent_setup_watch(struct vmevent_watch *watch) watch->sample_attrs = attrs; watch->nr_attrs = nr;
+ if (pwatcher) { + mutex_lock(&vmevent_pwatchers_lock); + list_add(&watch->pwatcher, &vmevent_pwatchers); + mutex_unlock(&vmevent_pwatchers_lock); + } + return 0; }
diff --git a/mm/vmscan.c b/mm/vmscan.c index 99b434b..cd3bd19 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -20,6 +20,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/vmstat.h> +#include <linux/vmevent.h> #include <linux/file.h> #include <linux/writeback.h> #include <linux/blkdev.h> @@ -1846,6 +1847,9 @@ restart: shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
+	vmevent_pressure(sc->target_mem_cgroup,
+			 sc->nr_scanned - nr_scanned, nr_reclaimed);
+
 	/* reclaim/compaction might need reclaim to continue */
 	if (should_continue_reclaim(lruvec, nr_reclaimed,
				    sc->nr_scanned - nr_scanned, sc))
@@ -2068,6 +2072,21 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	count_vm_event(ALLOCSTALL);
 	do {
+		/*
+		 * OK, we're cheating. The thing is, we have to average
+		 * s/r ratio by gathering a lot of scans (otherwise we
+		 * might get some local false-positives index of '100').
+		 *
+		 * But... when we're almost OOM we might be getting the
+		 * last reclaimable pages slowly, scanning all the queues,
+		 * and so we never catch the OOM case via averaging.
+		 * Although the priority will show it for sure. 3 is an
+		 * empirically taken priority: we never observe it under
+		 * any load, except for last few allocations before OOM.
+		 */
+		if (sc->priority <= 3)
+			vmevent_pressure(sc->target_mem_cgroup,
+					 vmevent_window, 0);
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);
On Mon, 22 Oct 2012, Anton Vorontsov wrote:
This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaiming activity might be useful for maintaining overall system's cache level.
VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure, there is some mild swapping activity. Upon this event applications may decide to free any resources that can be easily reconstructed or re-read from a disk.
Nit:
s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
Other than that, I'm OK with this. Mel and others, what are your thoughts on this?
Anton, have you tested this with real world scenarios? How does it stack up against Android's low memory killer, for example?
Pekka
Hello Pekka,
Thanks for taking a look into this!
On Wed, Oct 24, 2012 at 12:03:10PM +0300, Pekka Enberg wrote:
On Mon, 22 Oct 2012, Anton Vorontsov wrote:
This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaiming activity might be useful for maintaining overall system's cache level.
VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure, there is some mild swapping activity. Upon this event applications may decide to free any resources that can be easily reconstructed or re-read from a disk.
Nit:
s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
Sure thing, will change.
Other than that, I'm OK with this. Mel and others, what are your thoughts on this?
Anton, have you tested this with real world scenarios?
Yup, I was mostly testing it on a desktop. I.e. in a KVM instance I was running a full Fedora 17 desktop with a lot of apps opened. The pressure index was pretty good in the sense that it was indeed reflecting the sluggishness of the system during swap activity. It's not ideal, i.e. the index might drop slightly for some time, but we're usually interested in an "above some value" threshold, so it should be fine.

The _LOW level is defined very strictly and cannot be tuned at all. So it's very solid, and that's what we mostly use for Android.

The _OOM level is also defined quite strictly, so from the API point of view it's also solid and should not be a problem.

Although the problem with _OOM is delivering the event in time (i.e. we must be quick in predicting it, before the OOM killer triggers). Today the patch has a shortcut for the _OOM level: we send the _OOM notification when the reclaimer's priority is below the empirically found value '3' (we might make it tunable via a sysctl too, but that would expose another mm detail -- although a sysctl sounds not as bad as exposing something in the C API; we have plenty of mm knobs in /proc/sys/vm/ already).

The real tunable is the _MED level, and it should be tuned based on the desired system behaviour that I described in more detail in this long post: http://lkml.org/lkml/2012/10/7/29.

Based on my observations, I wouldn't say that we have plenty of room to tune the value, though. Usual swapping activity causes the index to rise to, say, 30%, and when the system can't keep up, it rises to 50..90 (but we still have plenty of swap space, so the system is far away from OOM, although it is thrashing). Ideally I'd prefer not to have any sysctl at all, but I believe the _MED level is really based on the user's definition of "medium".
How does it stack up against Android's low memory killer, for example?
The LMK driver is effectively using what we call _LOW pressure notifications here, so by definition it is enough to build a full replacement for the in-kernel LMK using just the _LOW level. But in the future, we might want to use _MED as well, e.g. to kill unneeded services based not on the cache level, but on the pressure.
Thanks, Anton.
On Wed, Oct 24, 2012 at 07:23:21PM -0700, Anton Vorontsov wrote:
Hello Pekka,
Thanks for taking a look into this!
On Wed, Oct 24, 2012 at 12:03:10PM +0300, Pekka Enberg wrote:
On Mon, 22 Oct 2012, Anton Vorontsov wrote:
This patch introduces VMEVENT_ATTR_PRESSURE, the attribute reports Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: Notifies that the system is reclaiming memory for new allocations. Monitoring reclaiming activity might be useful for maintaining overall system's cache level.
VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure, there is some mild swapping activity. Upon this event applications may decide to free any resources that can be easily reconstructed or re-read from a disk.
Nit:
s/VMEVENT_PRESSURE_MED/VMEVENT_PRESSURE_MEDIUM/
Sure thing, will change.
Other than that, I'm OK with this. Mel and others, what are your thoughts on this?
Anton, have you tested this with real world scenarios?
Yup, I was mostly testing it on a desktop. I.e. in a KVM instance I was running a full Fedora 17 desktop with a lot of apps opened. The pressure index was pretty good in the sense that it was indeed reflecting the sluggishness of the system during swap activity. It's not ideal, i.e. the index might drop slightly for some time, but we're usually interested in an "above some value" threshold, so it should be fine.

The _LOW level is defined very strictly and cannot be tuned at all. So it's very solid, and that's what we mostly use for Android.

The _OOM level is also defined quite strictly, so from the API point of view it's also solid and should not be a problem.
One concern I have when looking at the code is whether we should consider high-order page allocations. Currently the OOM killer doesn't kill anyone when the VM suffers from high-order allocations, because killing doesn't help to get physically contiguous memory in the normal case. The same rule could be applied here.
Although the problem with _OOM is delivering the event in time (i.e. we must be quick in predicting it, before OOMK triggers). Today the patch has
Absolutely. It was the biggest challenge.
a shortcut for _OOM level: we send _OOM notification when reclaimer's priority is below empirically found value '3' (we might make it tunable via sysctl too, but that would expose another mm detail -- although sysctl sounds not that bad as exposing something in the C API; we have plenty of mm knobs in /proc/sys/vm/ already).
Hmm, I'm not sure that depending on such a magic value is a good idea, but I have no better idea, so I will shut up :(
The real tunable is _MED level, and this should be tuned based on the desired system's behaviour that I described in more detail in this long post: http://lkml.org/lkml/2012/10/7/29.
Based on my observations, I wouldn't say that we have plenty of room to tune the value, though. Usual swapping activity causes the index to rise to, say, 30%, and when the system can't keep up, it rises to 50..90 (but we still have plenty of swap space, so the system is far away from OOM, although it is thrashing). Ideally I'd prefer not to have any sysctl at all, but I believe the _MED level is really based on the user's definition of "medium".
How does it stack up against Android's low memory killer, for example?
The LMK driver is effectively using what we call _LOW pressure notifications here, so by definition it is enough to build a full replacement for the in-kernel LMK using just the _LOW level. But in the future, we might want to use _MED as well, e.g. kill unneeded services based not on the cache level, but based on the pressure.
Good idea. Thanks for sticking with this, Anton!
Thanks, Anton.
VMEVENT_FD(2) Linux Programmer's Manual VMEVENT_FD(2)
NAME vmevent_fd - Linux virtual memory management events
SYNOPSIS
       #define _GNU_SOURCE
       #include <unistd.h>
       #include <sys/syscall.h>
       #include <asm/unistd.h>
       #include <linux/types.h>
       #include <linux/vmevent.h>
syscall(__NR_vmevent_fd, config);
DESCRIPTION This system call creates a new file descriptor that can be used with polling routines (e.g. poll(2)) to get notified about various in-kernel virtual memory management events that might be of interest to userspace. The interface can also be used to efficiently monitor memory usage (e.g. the number of idle and swap pages).

Applications can make the overall system's memory management more nimble by adjusting their resource usage upon the notifications.

Attributes Attributes are the basic concept; they are described by the following structure:

       struct vmevent_attr {
               __u64 value;
               __u32 type;
               __u32 state;
       };
type may correspond to these values:
VMEVENT_ATTR_NR_AVAIL_PAGES The attribute reports the total number of available pages in the system, not including swap space (i.e. just total RAM). value is used to set up a threshold (in number of pages) upon which the event will be delivered by the kernel.

Upon notifications the kernel updates all configured attributes, so this attribute is mostly used without any thresholds, just for getting the value together with other attributes and avoiding reading and parsing /proc/vmstat.

VMEVENT_ATTR_NR_FREE_PAGES The attribute reports the total amount of unused (idle) RAM in the system.

value is used to set up a threshold (in number of pages) upon which the event will be delivered by the kernel.

VMEVENT_ATTR_NR_SWAP_PAGES The attribute reports the total number of swapped pages.

value is used to set up a threshold (in number of pages) upon which the event will be delivered by the kernel.
VMEVENT_ATTR_PRESSURE The attribute reports Linux virtual memory management pressure. There are three discrete levels:
VMEVENT_PRESSURE_LOW: By setting the threshold to this value it's possible to watch whether the system is reclaiming memory for new allocations. Monitoring reclaim activity might be useful for maintaining the overall system cache level.

VMEVENT_PRESSURE_MED: The system is experiencing medium memory pressure; there is some mild swapping activity. Upon this event, applications may decide to free any resources that can be easily reconstructed or re-read from disk.

VMEVENT_PRESSURE_OOM: The system is actively thrashing, it is about to run out of memory (OOM), or the in-kernel OOM killer is about to trigger. Applications should do whatever they can to help the system. See proc(5) for more information about the OOM killer and its configuration options.

value is used to set up a threshold upon which the event will be delivered by the kernel (for algebraic comparisons, it is defined that VMEVENT_PRESSURE_LOW < VMEVENT_PRESSURE_MED < VMEVENT_PRESSURE_OOM, but applications should not put any meaning into the absolute values).

state is used to set up the thresholds' behaviour; the following flags can be bitwise OR'ed:
VMEVENT_ATTR_STATE_VALUE_LT Notification will be delivered when an attribute is less than a user-specified value.
VMEVENT_ATTR_STATE_VALUE_GT Notifications will be delivered when an attribute is greater than a user-specified value.
VMEVENT_ATTR_STATE_VALUE_EQ Notifications will be delivered when an attribute is equal to a user-specified value.
VMEVENT_ATTR_STATE_EDGE_TRIGGER Events will only be delivered when an attribute crosses the value threshold.
Events Upon a notification, the application must read out the events using the read(2) system call. The events are delivered using the following structure:

       struct vmevent_event {
               __u32 counter;
               __u32 padding;
               struct vmevent_attr attrs[];
       };

counter specifies the number of reported attributes, and the attrs array contains a copy of the configured attributes, with each vmevent_attr's value overwritten with the attribute's current value.
Config vmevent_fd(2) accepts a vmevent_config structure to configure the notifications:

       struct vmevent_config {
               __u32 size;
               __u32 counter;
               __u64 sample_period_ns;
               struct vmevent_attr attrs[VMEVENT_CONFIG_MAX_ATTRS];
       };

size must be initialized to sizeof(struct vmevent_config).

counter specifies the number of initialized attrs elements.

sample_period_ns specifies the sampling period in nanoseconds. Applications are recommended to set this value to the highest suitable period. (Note that for some attributes the delivery timing is not based on the sampling period, e.g. VMEVENT_ATTR_PRESSURE.)
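A minimal usage sketch, assuming the interface described in this page (error handling omitted; the particular threshold and state flags are only an example):

       #define _GNU_SOURCE
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>
       #include <poll.h>
       #include <sys/syscall.h>
       #include <asm/unistd.h>
       #include <linux/types.h>
       #include <linux/vmevent.h>

       int main(void)
       {
               size_t ev_size = sizeof(struct vmevent_event) +
                                sizeof(struct vmevent_attr);
               struct vmevent_event *ev = malloc(ev_size);
               struct vmevent_config config;
               struct pollfd pfd;
               int fd;

               /* Watch a single attribute: the pressure level. */
               memset(&config, 0, sizeof(config));
               config.size = sizeof(config);
               config.counter = 1;
               config.sample_period_ns = 1000000000ULL; /* not used for pressure */
               config.attrs[0].type = VMEVENT_ATTR_PRESSURE;
               config.attrs[0].value = VMEVENT_PRESSURE_MED;
               config.attrs[0].state = VMEVENT_ATTR_STATE_VALUE_GT |
                                       VMEVENT_ATTR_STATE_VALUE_EQ |
                                       VMEVENT_ATTR_STATE_EDGE_TRIGGER;

               fd = syscall(__NR_vmevent_fd, &config);

               pfd.fd = fd;
               pfd.events = POLLIN;

               /* Wait until the pressure level crosses VMEVENT_PRESSURE_MED. */
               while (poll(&pfd, 1, -1) > 0) {
                       if (read(fd, ev, ev_size) <= 0)
                               break;
                       printf("pressure: %llu\n",
                              (unsigned long long)ev->attrs[0].value);
               }

               close(fd);
               free(ev);
               return 0;
       }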
RETURN VALUE On success, vmevent_fd() returns a new file descriptor. On error, a negative value is returned and errno is set to indicate the error.
ERRORS vmevent_fd() can fail with errors similar to open(2).
In addition, the following errors are possible:
EINVAL The failure means that an improperly initialized config structure has been passed to the call (this also includes improperly initialized attrs arrays).

EFAULT The failure means that the kernel was unable to read the configuration structure, that is, the config parameter points to inaccessible memory.

VERSIONS The system call is available on Linux since kernel 3.8. Library support is not yet provided by any glibc version.
CONFORMING TO The system call is Linux-specific.
EXAMPLE Examples can be found in the /usr/src/linux/tools/testing/vmevent/ directory.
SEE ALSO poll(2), read(2), proc(5), vmstat(8)
Linux 2012-10-16 VMEVENT_FD(2)
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 man2/vmevent_fd.2 | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 man2/vmevent_fd.2
diff --git a/man2/vmevent_fd.2 b/man2/vmevent_fd.2 new file mode 100644 index 0000000..b631455 --- /dev/null +++ b/man2/vmevent_fd.2 @@ -0,0 +1,235 @@ +." Copyright (C) 2008 Michael Kerrisk mtk.manpages@gmail.com +." Copyright (C) 2012 Linaro Ltd. +." Anton Vorontsov anton.vorontsov@linaro.org +." Based on ideas from: +." KOSAKI Motohiro, Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka +." Enberg. +." +." This program is free software; you can redistribute it and/or modify +." it under the terms of the GNU General Public License as published by +." the Free Software Foundation; either version 2 of the License, or +." (at your option) any later version. +." +." This program is distributed in the hope that it will be useful, +." but WITHOUT ANY WARRANTY; without even the implied warranty of +." MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +." GNU General Public License for more details. +." +." You should have received a copy of the GNU General Public License +." along with this program; if not, write to the Free Software +." Foundation, Inc., 59 Temple Place, Suite 330, Boston, +." MA 02111-1307 USA +." +.TH VMEVENT_FD 2 2012-10-16 Linux "Linux Programmer's Manual" +.SH NAME +vmevent_fd - Linux virtual memory management events +.SH SYNOPSIS +.nf +.B #define _GNU_SOURCE +.B #include <unistd.h> +.B #include <sys/syscall.h> +.B #include <asm/unistd.h> +.B #include <linux/types.h> +.B #include <linux/vmevent.h> + +." TODO: libc wrapper +.BI "syscall(__NR_vmevent_fd, "config ); +.fi +.SH DESCRIPTION +This system call creates a new file descriptor that can be used with polling +routines (e.g. +.BR poll (2)) +to get notified about various in-kernel virtual memory management events +that might be of interest for userspace. The interface can +also be used to effeciently monitor memory usage (e.g. number of idle and +swap pages). + +Applications can make overall system's memory management more nimble by +adjusting theirs resources usage upon the notifications. +.SS Attributes +Attributes are the basic concept, they are described by the following +structure: + +.nf +struct vmevent_attr { + __u64 value; + __u32 type; + __u32 state; +}; +.fi + +.I type +may correspond to these values: +.TP +.B VMEVENT_ATTR_NR_AVAIL_PAGES +The attribute reports total number of available pages in the system, not +including swap space (i.e. just total RAM). +.I value +is used to setup a threshold (in number or pages) upon which the event +will be delivered by the kernel. + +Upon notifications kernel updates all configured attributes, so the +attribute is mostly used without any thresholds, just for getting the +value together with other attributes and avoid reading and parsing +.IR /proc/vmstat . +.TP +.B VMEVENT_ATTR_NR_FREE_PAGES +The attribute reports total number of unused (idle) RAM in the system. + +.I value +is used to setup a threshold (in number or pages) upon which the event +will be delivered by the kernel. +.TP +.B VMEVENT_ATTR_NR_SWAP_PAGES +The attribute reports total number of swapped pages. + +.I value +is used to setup a threshold (in number or pages) upon which the event +will be delivered by the kernel. +.TP +.B VMEVENT_ATTR_PRESSURE +The attribute reports Linux virtual memory management pressure. There are +three discrete levels: + +.BR VMEVENT_PRESSURE_LOW : +By setting the threshold to this value it's possible to watch whether +system is reclaiming memory for new allocations. Monitoring reclaiming +activity might be useful for maintaining overall system's cache level. 
+ +.BR VMEVENT_PRESSURE_MED : +The system is experiencing medium memory pressure, there is some mild +swapping activity. Upon this event applications may decide to free any +resources that can be easily reconstructed or re-read from a disk. + +.BR VMEVENT_PRESSURE_OOM : +The system is actively thrashing, it is about to out of memory (OOM) or +even the in-kernel OOM killer is on its way to trigger. Applications +should do whatever they can to help the system. See +.BR proc (5) +for more information about OOM killer and its configuration options. + +.I value +is used to setup a threshold upon which the event will be delivered by +the kernel (for algebraic comparisons, it is defined that +.BR VMEVENT_PRESSURE_LOW " <" +.BR VMEVENT_PRESSURE_MED " <" +.BR VMEVENT_PRESSURE_OOM , +but applications should not put any meaning into the absolute values.) + +.TP +.I state +is used to setup thresholds' behaviour, the following flags can be bitwise +OR'ed: +.... +.TP +.B VMEVENT_ATTR_STATE_VALUE_LT +Notification will be delivered when an attribute is less than a +user-specified +.IR "value" . +.TP +.B VMEVENT_ATTR_STATE_VALUE_GT +Notifications will be delivered when an attribute is greater than a +user-specified +.IR "value" . +.TP +.B VMEVENT_ATTR_STATE_VALUE_EQ +Notifications will be delivered when an attribute is equal to a +user-specified +.IR "value" . +.TP +.B VMEVENT_ATTR_STATE_EDGE_TRIGGER +Events will be only delivered when an attribute crosses +.I value +threshold. +.SS Events +Upon a notification, application must read out events using +.BR read (2) +system call. +The events are delivered using the following structure: + +.nf +struct vmevent_event { + __u32 counter; + __u32 padding; + struct vmevent_attr attrs[]; +}; +.fi + +The +.I counter +specifies a number of reported attributes, and the +.I attrs +array contains a copy of configured attributes, with +.IR "vmevent_attr" 's +.I value +overwritten to attribute's value. +.SS Config +.BR vmevent_fd (2) +accepts +.I vmevent_config +structure to configure the notifications: + +.nf +struct vmevent_config { + __u32 size; + __u32 counter; + __u64 sample_period_ns; + struct vmevent_attr attrs[VMEVENT_CONFIG_MAX_ATTRS]; +}; +.fi + +.I size +must be initialized to +.IR "sizeof(struct vmevent_config)" . + +.I counter +specifies a number of initialized +.I attrs +elements. + +.I sample_period_ns +specifies sampling period in nanoseconds. For applications it is +recommended to set this value to a highest suitable period. (Note that for +some attributes the delivery timing is not based on the sampling period, +e.g. +.IR VMEVENT_ATTR_PRESSURE .) +.SH "RETURN VALUE" +On success, +.BR vmevent_fd () +returns a new file descriptor. On error, a negative value is returned and +.I errno +is set to indicate the error. +.SH ERRORS +.BR vmevent_fd () +can fail with errors similar to +.BR open (2). + +In addition, the following errors are possible: +.TP +.B EINVAL +The failure means that an improperly initalized +.I config +structure has been passed to the call (this also includes improperly +initialized +.I attrs +arrays). +.TP +.B EFAULT +The failure means that the kernel was unable to read the configuration +structure, that is, +.I config +parameter points to an inaccessible memory. +.SH VERSIONS +The system call is available on Linux since kernel 3.8. Library support is +yet not provided by any glibc version. +.SH CONFORMING TO +The system call is Linux-specific. +.SH EXAMPLE +Examples can be found in +.I /usr/src/linux/tools/testing/vmevent/ +directory. 
+.SH "SEE ALSO" +.BR poll (2), +.BR read (2), +.BR proc (5), +.BR vmstat (8)
Hi Anton,
On Mon, Oct 22, 2012 at 04:19:28AM -0700, Anton Vorontsov wrote:
Hi all,
So this is the second RFC. The main change is that I decided to go with discrete levels of the pressure.
I am very happy with that, because I have already yelled about it several times.
When I started writing the man page, I had to describe the 'reclaimer inefficiency index', and while doing this I realized that I'm describing how the kernel is doing the memory management, which we try to avoid in the vmevent. And applications don't really care about these details: reclaimers, its inefficiency indexes, scanning window sizes, priority levels, etc. -- it's all "not interesting", and purely kernel's stuff. So I guess Mel Gorman was right, we need some sort of levels.
What applications (well, activity managers) are really interested in is this:
- Do we sacrifice resources for new memory allocations (e.g. file caches)?
- Does the cost of new memory allocations become too high, so the system hurts because of it?
- Are we about to OOM soon?
Good, but I think 3 is never easy. Still, an early notification would be better than a late notification, which can kill someone.
And here are the answers:
- VMEVENT_PRESSURE_LOW
- VMEVENT_PRESSURE_MED
- VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI. The levels described in more details in the patches, and the stuff is still tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we don't need to rebuild applications to adjust window size or other mm "details").
What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed} stuff per-CPU (there's a comment describing the problem with this). But I made it lockless and tried to make it very lightweight (plus I moved the vmevent_pressure() call to a more "cold" path).
Your description doesn't explain why we need the new vmevent_fd(2). Of course, it's very flexible and makes it easy to add new VM knobs, but the only thing we are about to use now is VMEVENT_ATTR_PRESSURE. Are there any other use cases for the swap or free attributes, or potential users? Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-based polling of vmevent if possible. KOSAKI's mem_notify doesn't use such a timer.
I don't object, but we need a rationale for adding a new system call, which will have to be maintained forever once we add it.
Thanks, Anton.
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim minchan@kernel.org wrote:
Your description doesn't explain why we need the new vmevent_fd(2). Of course, it's very flexible and makes it easy to add new VM knobs, but the only thing we are about to use now is VMEVENT_ATTR_PRESSURE. Are there any other use cases for the swap or free attributes, or potential users? Adding vmevent_fd without them is rather overkill.
What ABI would you use instead?
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim minchan@kernel.org wrote:
I don't object but we need rationale for adding new system call which should be maintained forever once we add it.
Agreed.
Hi Pekka,
On Thu, Oct 25, 2012 at 09:44:52AM +0300, Pekka Enberg wrote:
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim minchan@kernel.org wrote:
Your description doesn't include why we need new vmevent_fd(2). Of course, it's very flexible and potential to add new VM knob easily but the thing we is about to use now is only VMEVENT_ATTR_PRESSURE. Is there any other use cases for swap or free? or potential user? Adding vmevent_fd without them is rather overkill.
What ABI would you use instead?
I thought a /dev/some_knob like mem_notify plus epoll would be enough, but please keep in mind that I'm not strongly against vmevent_fd. My point is that the description should explain why the other candidates are not good, or why vmevent_fd is better. (But at least, I still don't like vmevent timer polling, and I hope we use it only as a last resort if we can't find anything else.)
On Thu, Oct 25, 2012 at 9:40 AM, Minchan Kim minchan@kernel.org wrote:
I don't object but we need rationale for adding new system call which should be maintained forever once we add it.
Agreed.
Hello Minchan,
Thanks a lot for the email!
On Thu, Oct 25, 2012 at 03:40:09PM +0900, Minchan Kim wrote: [...]
What applications (well, activity managers) are really interested in is this:
- Do we sacrifice resources for new memory allocations (e.g. file caches)?
- Does the cost of new memory allocations become too high, so the system hurts because of it?
- Are we about to OOM soon?
Good but I think 3 is never easy. But early notification would be better than late notification which can kill someone.
Well, basically these are two fixed (strictly defined) levels (low and oom) + one flexible level (med), whose meaning can be slightly tuned (but we still have a meaningful definition for it).
So, I guess it's a good start. :)
And here are the answers:
- VMEVENT_PRESSURE_LOW
- VMEVENT_PRESSURE_MED
- VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI. The levels described in more details in the patches, and the stuff is still tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we don't need to rebuild applications to adjust window size or other mm "details").
What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed} stuff per-CPU (there's a comment describing the problem with this). But I made it lockless and tried to make it very lightweight (plus I moved the vmevent_pressure() call to a more "cold" path).
Your description doesn't include why we need new vmevent_fd(2). Of course, it's very flexible and potential to add new VM knob easily but the thing we is about to use now is only VMEVENT_ATTR_PRESSURE. Is there any other use cases for swap or free? or potential user?
The number of idle pages by itself might not be that interesting, but the cache+idle level is quite interesting.
By definition, _MED happens when performance already degraded, slightly, but still -- we can be swapping.
But _LOW notifications come when the kernel is just reclaiming, so by using _LOW notifications + watching the cache level, we can very easily predict the swapping activity long before we even reach _MED pressure.
E.g. if idle+cache drops below the amount of memory that userland can free, we'd indeed like to start freeing stuff (this somewhat resembles the current logic that we have in the in-kernel LMK).
Sure, we can read and parse /proc/vmstat upon _LOW events (and that was my backup plan), but reporting the stuff together would make things much nicer.
Although, I somewhat doubt that it is OK to report raw numbers, so this needs some thinking to develop more elegant solution.
Maybe it makes sense to implement something like PRESSURE_MILD with an additional nr_pages threshold, which basically hints the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild pressure level). So, essentially it would be:
if (pressure_index >= oom_level)
        return PRESSURE_OOM;
else if (pressure_index >= med_level)
        return PRESSURE_MEDIUM;
else if (userland_reclaimable_pages >= nr_reclaimable_pages)
        return PRESSURE_MILD;
return PRESSURE_LOW;
I must admit I like the idea more than exposing NR_FREE and the like, but the scheme reminds me of the blended attributes, which we abandoned. Although the definition sounds better now, and we seem to be doing it in the right place.
And if we go this way, then sure, we won't need any other attributes, and so we could make the API much simpler.
Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-base polling of vmevent if possbile. mem_notify of KOSAKI doesn't use such timer.
For pressure notifications we don't use the timers. We also read the vmstat counters together with the pressure, so "pressure + counters" effectively turns it into non-timer based polling. :)
But yes, hopefully we can get rid of the raw counters and timers; I don't like them either.
I don't object but we need rationale for adding new system call which should be maintained forever once we add it.
We can do it via eventfd, or a /dev/ chardev (which has been discussed, and people didn't like it, IIRC), or signals (which have also been discussed, and there are problems with that approach as well).
I'm not sure why having a syscall is a big issue. If we're making an eventfd interface, then we'd need to maintain the /sys/.../ ABI the same way as we maintain the syscall. What's the difference? A dedicated syscall is just a simpler interface; we don't need to mess with opening and passing things through /sys/.../.
Personally I don't have any preference (except that I dislike chardevs and ioctls :), I just want to see the pros and cons of all the solutions, and so far the syscall seems like the easiest way? Anyway, I'm totally open to changing it into whatever fits best.
Thanks, Anton.
On Thu, Oct 25, 2012 at 02:08:14AM -0700, Anton Vorontsov wrote: [...]
Maybe it makes sense to implement something like PRESSURE_MILD with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild pressure level). So, essentially it will be
if (pressure_index >= oom_level)
        return PRESSURE_OOM;
else if (pressure_index >= med_level)
        return PRESSURE_MEDIUM;
else if (userland_reclaimable_pages >= nr_reclaimable_pages)
        return PRESSURE_MILD;
...or we can call it PRESSURE_BALANCE, just to be precise and clear.
On Thu, Oct 25, 2012 at 02:08:14AM -0700, Anton Vorontsov wrote:
Hello Minchan,
Thanks a lot for the email!
On Thu, Oct 25, 2012 at 03:40:09PM +0900, Minchan Kim wrote: [...]
What applications (well, activity managers) are really interested in is this:
- Do we we sacrifice resources for new memory allocations (e.g. files cache)?
- Does the new memory allocations' cost becomes too high, and the system hurts because of this?
- Are we about to OOM soon?
Good but I think 3 is never easy. But early notification would be better than late notification which can kill someone.
Well, basically these are two fixed (strictly defined) levels (low and oom) + one flexible level (med), which meaning can be slightly tuned (but we still have a meaningful definition for it).
I mean detection of "3) Are we about to OOM soon" isn't easy.
So, I guess it's a good start. :)
Absolutely!
And here are the answers:
- VMEVENT_PRESSURE_LOW
- VMEVENT_PRESSURE_MED
- VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI. The levels described in more details in the patches, and the stuff is still tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we don't need to rebuild applications to adjust window size or other mm "details").
What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed} stuff per-CPU (there's a comment describing the problem with this). But I made it lockless and tried to make it very lightweight (plus I moved the vmevent_pressure() call to a more "cold" path).
Your description doesn't include why we need new vmevent_fd(2). Of course, it's very flexible and potential to add new VM knob easily but the thing we is about to use now is only VMEVENT_ATTR_PRESSURE. Is there any other use cases for swap or free? or potential user?
Number of idle pages by itself might be not that interesting, but cache+idle level is quite interesting.
By definition, _MED happens when performance already degraded, slightly, but still -- we can be swapping.
But _LOW notifications are coming when kernel is just reclaiming, so by using _LOW notifications + watching for cache level we can very easily predict the swapping activity long before we have even _MED pressure.
So, for seeing the cache level, we need a new vmevent_attr?
E.g. if idle+cache drops below amount of memory that userland can free, we'd indeed like to start freeing stuff (this somewhat resembles current logic that we have in the in-kernel LMK).
Sure, we can read and parse /proc/vmstat upon _LOW events (and that was my backup plan), but reporting stuff together would make things much nicer.
My concern is that users can imagine various scenarios with vmstat, and they might start to require new vmevent_attrs in the future, so vmevent_fd will get bloated and mm guys will have to take care of vmevent_fd whenever they add a new vmstat counter. I don't like it. Users can do it by just reading /proc/vmstat. So I support your backup plan.
Although, I somewhat doubt that it is OK to report raw numbers, so this needs some thinking to develop more elegant solution.
Indeed.
Maybe it makes sense to implement something like PRESSURE_MILD with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild pressure level). So, essentially it will be
if (pressure_index >= oom_level)
        return PRESSURE_OOM;
else if (pressure_index >= med_level)
        return PRESSURE_MEDIUM;
else if (userland_reclaimable_pages >= nr_reclaimable_pages)
        return PRESSURE_MILD;
return PRESSURE_LOW;
I must admit I like the idea more than exposing NR_FREE and stuff, but the scheme reminds me the blended attributes, which we abandoned. Although, the definition sounds better now, and we seem to be doing it in the right place.
And if we go this way, then sure, we won't need any other attributes, and so we could make the API much simpler.
That's what I want! If there isn't any user who is really willing to use it, let's drop it. Don't try to persuade with imaginary scenarios, because we should be careful about introducing new ABI.
Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-base polling of vmevent if possbile. mem_notify of KOSAKI doesn't use such timer.
For pressure notifications we don't use the timers. We also read the
Hmm, when I look at the code, the timer still works and can notify the user. No?
vmstat counters together with the pressure, so "pressure + counters" effectively turns it into non-timer based polling. :)
But yes, hopefully we can get rid of the raw counters and timers, I don't them it too.
You and I are reaching a conclusion, at least.
I don't object but we need rationale for adding new system call which should be maintained forever once we add it.
We can do it via eventfd, or /dev/chardev (which has been discussed and people didn't like it, IIRC), or signals (which also has been discussed and there are problems with this approach as well).
I'm not sure why having a syscall is a big issue. If we're making eventfd interface, then we'd need to maintain /sys/.../ ABI the same way as we maintain the syscall. What's the difference? A dedicated syscall is just a
No difference. What I want is just to remove unnecessary stuff from vmevent_fd and keep it simple. If we do it via a /dev/ chardev, I expect we can do the necessary things for VM pressure. But if we can put vmevent_fd on a diet, it would be better. If so, maybe we have to rename vmevent_fd to lowmem_fd or vmpressure_fd.
simpler interface, we don't need to mess with opening and passing things through /sys/.../.
Personally I don't have any preference (except that I distaste chardev and ioctls :), I just want to see pros and cons of all the solutions, and so far the syscall seems like an easiest way? Anyway, I'm totally open to changing it into whatever fits best.
Yep. The interface stuff isn't a big concern for low memory notification, so I'm not strongly against it either.
Thanks, Anton.
On Fri, Oct 26, 2012 at 11:37:20AM +0900, Minchan Kim wrote: [...]
Of course, it's very flexible and potential to add new VM knob easily but the thing we is about to use now is only VMEVENT_ATTR_PRESSURE. Is there any other use cases for swap or free? or potential user?
Number of idle pages by itself might be not that interesting, but cache+idle level is quite interesting.
By definition, _MED happens when performance already degraded, slightly, but still -- we can be swapping.
But _LOW notifications are coming when kernel is just reclaiming, so by using _LOW notifications + watching for cache level we can very easily predict the swapping activity long before we have even _MED pressure.
So, for seeing cache level, we need new vmevent_attr?
Hopefully not. We're not interested in the raw values of the cache level; what we want is to tell the kernel how many "easily reclaimable pages" userland has, and get notified when the kernel believes that it's a good time for userland to help. I.e. this new _MILD level:
Maybe it makes sense to implement something like PRESSURE_MILD with an additional nr_pages threshold, which basically hits the kernel about how many easily reclaimable pages userland has (that would be a part of our definition for the mild pressure level). So, essentially it will be
if (pressure_index >= oom_level) return PRESSURE_OOM; else if (pressure_index >= med_level) return PRESSURE_MEDIUM; else if (userland_reclaimable_pages >= nr_reclaimable_pages) return PRESSURE_MILD; return PRESSURE_LOW;
I must admit I like the idea more than exposing NR_FREE and stuff, but the scheme reminds me the blended attributes, which we abandoned. Although, the definition sounds better now, and we seem to be doing it in the right place.
And if we go this way, then sure, we won't need any other attributes, and so we could make the API much simpler.
That's what I want! If there isn't any user who really are willing to use it, let's drop it. Do not persuade with imaginary scenario because we should be careful to introduce new ABI.
Yeah, I think you're right. Let's make vmevent_fd slim first. I won't even focus on the _MILD/_BALANCE level for now; we can do it later, and we always have /proc/vmstat even if _MILD turns out to be a bad idea.
Reading /proc/vmstat is a bit more overhead, but it's not that much at all (especially when we don't have to timer-poll the vmstat).
Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-base polling of vmevent if possbile. mem_notify of KOSAKI doesn't use such timer.
For pressure notifications we don't use the timers. We also read the
Hmm, when I see the code, timer still works and can notify to user. No?
Yes, I was mostly saying that it is technically not required anymore, but you're right, the code still fires the timer (it just runs needlessly for the pressure attr).
Bad wording on my side.
[..]
We can do it via eventfd, or /dev/chardev (which has been discussed and people didn't like it, IIRC), or signals (which also has been discussed and there are problems with this approach as well).
I'm not sure why having a syscall is a big issue. If we're making eventfd interface, then we'd need to maintain /sys/.../ ABI the same way as we maintain the syscall. What's the difference? A dedicated syscall is just a
No difference. What I want is just to remove unnecessary stuff in vmevent_fd and keep it as simple. If we do via /dev/chardev, I expect we can do necessary things for VM pressure. But if we can diet with vmevent_fd, It would be better. If so, maybe we have to change vmevent_fd to lowmem_fd or vmpressure_fd.
Sure, then I'm starting the work to slim the API down, and we'll see how things are going to look after that.
Thanks a lot!
Anton.