On Thu, Oct 25, 2012 at 02:08:14AM -0700, Anton Vorontsov wrote:
Hello Minchan,
Thanks a lot for the email!
On Thu, Oct 25, 2012 at 03:40:09PM +0900, Minchan Kim wrote: [...]
What applications (well, activity managers) are really interested in is this:
- Do we sacrifice resources for new memory allocations (e.g. the file cache)?
- Has the cost of new memory allocations become so high that the system hurts because of it?
- Are we about to OOM soon?
Good, but I think (3) is never easy. Still, an early notification would be better than a late one, which can end up killing someone.
Well, basically these are two fixed (strictly defined) levels (low and oom) + one flexible level (med), whose meaning can be slightly tuned (but we still have a meaningful definition for it).
I mean that detecting "(3) Are we about to OOM soon?" isn't easy.
So, I guess it's a good start. :)
Absolutely!
And here are the answers:
- VMEVENT_PRESSURE_LOW
- VMEVENT_PRESSURE_MED
- VMEVENT_PRESSURE_OOM
There is no "high" pressure, since I really don't see any definition of it, but it's possible to introduce new levels without breaking ABI. The levels described in more details in the patches, and the stuff is still tunable, but now via sysctls, not the vmevent_fd() call itself (i.e. we don't need to rebuild applications to adjust window size or other mm "details").
What I couldn't fix in this RFC is making vmevent_{scanned,reclaimed} stuff per-CPU (there's a comment describing the problem with this). But I made it lockless and tried to make it very lightweight (plus I moved the vmevent_pressure() call to a more "cold" path).
Your description doesn't include why we need a new vmevent_fd(2). Of course, it's very flexible and has the potential to add new VM knobs easily, but the only thing we are about to use now is VMEVENT_ATTR_PRESSURE. Are there any other use cases for swap or free, or potential users?
The number of idle pages by itself might not be that interesting, but the cache+idle level is quite interesting.
By definition, _MED happens when performance has already degraded, if only slightly -- we can already be swapping.
But _LOW notifications come when the kernel is merely reclaiming, so by using _LOW notifications + watching the cache level we can quite easily predict swapping activity long before we even reach _MED pressure.
So, to see the cache level, we need a new vmevent_attr?
E.g. if idle+cache drops below the amount of memory that userland can free, we'd indeed like to start freeing stuff (this somewhat resembles the current logic that we have in the in-kernel LMK).
Sure, we can read and parse /proc/vmstat upon _LOW events (and that was my backup plan), but reporting stuff together would make things much nicer.
My concern is that users can imagine various scenarios with vmstat, so they might start requiring new vmevent_attrs in the future; vmevent_fd would get bloated, and mm guys would have to care about vmevent_fd whenever they add a new vmstat counter. I don't like that. Users can do it by just reading /proc/vmstat, so I support your backup plan.
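(As a point of reference, a minimal sketch of that backup plan: on a _LOW notification, read the relevant counters from /proc/vmstat. The counter names below are standard vmstat fields; the particular "free + inactive file" estimate is just one possible heuristic, not something the patches prescribe.)

    #include <stdio.h>
    #include <string.h>

    /* Return a /proc/vmstat counter by name, in pages; -1 if not found. */
    static long vmstat_read(const char *name)
    {
            char key[64];
            long val = -1, v;
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                    return -1;
            while (fscanf(f, "%63s %ld", key, &v) == 2) {
                    if (!strcmp(key, name)) {
                            val = v;
                            break;
                    }
            }
            fclose(f);
            return val;
    }

    /* Called upon a _LOW event: rough estimate of easily reclaimable memory. */
    static long estimate_reclaimable_pages(void)
    {
            return vmstat_read("nr_free_pages") + vmstat_read("nr_inactive_file");
    }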
Although, I somewhat doubt that it is OK to report raw numbers, so this needs some thinking to develop a more elegant solution.
Indeed.
Maybe it makes sense to implement something like PRESSURE_MILD with an additional nr_pages threshold, which basically hints the kernel about how many easily reclaimable pages userland has (that would be part of our definition for the mild pressure level). So, essentially it would be:
    if (pressure_index >= oom_level)
            return PRESSURE_OOM;
    else if (pressure_index >= med_level)
            return PRESSURE_MEDIUM;
    else if (userland_reclaimable_pages >= nr_reclaimable_pages)
            return PRESSURE_MILD;
    return PRESSURE_LOW;
I must admit I like the idea more than exposing NR_FREE and the like, but the scheme reminds me of the blended attributes, which we abandoned. Although, the definition sounds better now, and we seem to be doing it in the right place.
And if we go this way, then sure, we won't need any other attributes, and so we could make the API much simpler.
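(Purely hypothetical, just to restate the idea above in code form -- none of these names exist in the patches. The point is that a single threshold supplied at registration time could replace per-attribute configuration.)

    /*
     * Hypothetical simplified registration: userland declares how many
     * pages it can give back on demand, and the kernel uses that as the
     * nr_reclaimable_pages threshold for the MILD level.
     */
    struct vmpressure_config {
            __u32 size;                 /* sizeof(struct vmpressure_config) */
            __u64 reclaimable_pages;    /* pages userland can free on request */
    };

    /* fd = vmpressure_fd(&config); read() then reports LOW/MILD/MED/OOM. */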
That's what I want! If there isn't any user who is really willing to use it, let's drop it. Don't argue from imaginary scenarios, because we should be careful about introducing a new ABI.
Adding vmevent_fd without them is rather overkill.
And I want to avoid timer-based polling in vmevent if possible. KOSAKI's mem_notify doesn't use such a timer.
For pressure notifications we don't use the timers. We also read the
Hmm, when I look at the code, the timer still runs and can notify the user. No?
vmstat counters together with the pressure, so "pressure + counters" effectively turns it into non-timer-based polling. :)
But yes, hopefully we can get rid of the raw counters and timers; I don't want them either.
You and I are reaching a conclusion, at least.
I don't object, but we need a rationale for adding a new system call, which will have to be maintained forever once we add it.
We can do it via eventfd, or a /dev/chardev (which has been discussed and people didn't like it, IIRC), or signals (which have also been discussed, and there are problems with that approach as well).
I'm not sure why having a syscall is a big issue. If we're making an eventfd interface, then we'd need to maintain the /sys/.../ ABI the same way we'd maintain the syscall. What's the difference? A dedicated syscall is just a
No difference. What I want is just to remove the unnecessary stuff in vmevent_fd and keep it simple. If we go via /dev/chardev, I expect we can do the necessary things for VM pressure. But if we can slim down vmevent_fd, that would be better. If so, maybe we have to rename vmevent_fd to lowmem_fd or vmpressure_fd.
simpler interface: we don't need to mess with opening and passing things through /sys/.../.
Personally I don't have any preference (except that I dislike chardevs and ioctls :), I just want to see the pros and cons of all the solutions, and so far the syscall seems like the easiest way? Anyway, I'm totally open to changing it to whatever fits best.
Yep. The interface stuff isn't a big concern for low memory notification, so I'm not strongly against it either.
Thanks, Anton.