Hello Rik,
Thanks for looking into this!
On Tue, May 01, 2012 at 05:04:21PM -0400, Rik van Riel wrote:
> On 05/01/2012 09:18 AM, Anton Vorontsov wrote:
> > This patch implements a new event type: it triggers whenever a value becomes greater than a user-specified threshold. It complements the 'less-than' trigger type.
> >
> > Also, let's implement a one-shot mode for the events: when set, userspace will receive only one notification per boundary crossing.
> >
> > Now, when both LT and GT are set on the same level, the event works as a cross event type: it triggers whenever a value crosses the threshold from the lesser-values side to the greater-values side, and vice versa.
> >
> > We use these event types in a userspace low-memory killer: we get a notification when memory becomes low, so we start freeing memory by killing unneeded processes, and we get a notification when memory crosses the threshold from the other side, so we know that we have freed enough memory.
> How are these vmevents supposed to work with cgroups?
Currently these are independent subsystems. If you have memcg enabled, you can do almost anything* with the memory, as memcg has all the needed hooks in the mm/ subsystem (it is more like a "memory management tracer" nowadays :-).
But cgroups have their cost, both a performance penalty and memory wastage. For example, in the best case, memcg constantly consumes 0.5% of RAM just to track memory usage; this is 5 MB on a 1 GB "embedded" machine. To some people it feels just wrong to waste that memory for mere notifications.
Of course, this alone can be considered a lame argument for making another subsystem (instead of "fixing" the current one). But see below: vmevent is just a convenient ABI.
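To show what that ABI looks like from userspace, here is a rough sketch of the low-memory killer flow described above. Take it as an illustration only: vmevent_fd() is the new syscall from this series, and the structure and constant names (struct vmevent_config, VMEVENT_ATTR_NR_FREE_PAGES, the VMEVENT_ATTR_STATE_* flags) follow the RFC patches, so they may well change:

#include <string.h>
#include <unistd.h>
#include "vmevent.h"	/* uapi header from this patch series */

int main(void)
{
	struct vmevent_config config;
	struct {
		struct vmevent_event event;
		struct vmevent_attr attr;
	} buf;
	int fd;

	memset(&config, 0, sizeof(config));
	config.size = sizeof(config);
	config.sample_period_ns = 1000ULL * 1000 * 1000;	/* sample once a second */
	config.counter = 1;

	/*
	 * LT and GT on the same attribute with the same value: the "cross"
	 * type from this patch. One event fires when free pages fall below
	 * the threshold, another when they rise back above it.
	 */
	config.attrs[0].type = VMEVENT_ATTR_NR_FREE_PAGES;
	config.attrs[0].state = VMEVENT_ATTR_STATE_VALUE_LT |
				VMEVENT_ATTR_STATE_VALUE_GT |
				VMEVENT_ATTR_STATE_ONE_SHOT;
	config.attrs[0].value = 4096;	/* threshold, in pages */

	fd = vmevent_fd(&config);	/* wrapper around syscall(__NR_vmevent_fd, ...) */

	for (;;) {
		read(fd, &buf, sizeof(buf));	/* blocks until the threshold is crossed */
		if (buf.attr.value < 4096) {
			/* memory is low: start killing unneeded processes */
		} else {
			/* crossed back: we have freed enough memory */
		}
	}
}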
> What do we do when a cgroup nears its limit, and there is no more swap space available?
>
> What do we do when a cgroup nears its limit, and there is swap space available?
As of now, this is all orthogonal to vmevent; vmevent doesn't know about cgroups. If the kernel has memcg enabled, one should probably* go with it (or better, with its ABI). At least for now.
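For comparison, memcg already has an eventfd-based threshold ABI: you register an eventfd and a threshold against memory.usage_in_bytes by writing to cgroup.event_control, and the eventfd fires each time usage crosses the threshold in either direction. A minimal sketch, assuming the memory controller is mounted at /sys/fs/cgroup/memory and with error handling omitted for brevity:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	int usage = open("/sys/fs/cgroup/memory/memory.usage_in_bytes", O_RDONLY);
	int ctrl = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
	int efd = eventfd(0, 0);
	char buf[64];
	uint64_t ticks;

	/* Format: "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
	snprintf(buf, sizeof(buf), "%d %d %llu", efd, usage,
		 (unsigned long long)(64 << 20));	/* notify around 64 MB */
	write(ctrl, buf, strlen(buf));

	/* Blocks until usage crosses the 64 MB threshold, in either direction. */
	read(efd, &ticks, sizeof(ticks));
	return 0;
}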
> It would be nice to be able to share the same code for embedded, desktop and server workloads...
It would be great indeed, but so far I don't see much that vmevent could share. Plus, sharing the code at this point is not that interesting: it's a mere 500 lines of code (compared to more than 10K lines for cgroups, and that's not counting the memcg_ hooks and logic that are spread all over mm/).
Today the vmevent code is mostly an ABI implementation; there is very little memory management logic in it (in contrast to memcg).
Personally, I would rather consider sharing the ABI at some point, i.e. making a memcg backend for vmevent. That would be pretty cool. And once done, vmevent would be cgroups-aware (if memcg is enabled, of course; if not, vmevent would still work, with no memcg-related overhead).
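Purely as a thought experiment (none of these names exist in any tree), such a backend split might look like this: the vmevent core keeps the fd handling and the ABI, and only asks a backend for the current attribute values, so the same userspace program works with or without memcg:

/* Hypothetical kernel-side split; all names here are made up. */
struct vmevent_watch;			/* per-fd state owned by the vmevent core */

struct vmevent_backend {
	/* Return the current value of one attribute (e.g. free pages). */
	unsigned long (*sample)(struct vmevent_watch *watch, unsigned int type);
};

/* Default backend: global VM counters, what vmevent samples today. */
extern const struct vmevent_backend vmevent_global_backend;

/* With CONFIG_MEMCG: the same attributes, but scoped to a cgroup. */
extern const struct vmevent_backend vmevent_memcg_backend;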
* For low memory notifications, there are still some unresolved issues with memcg. Mainly, slab accounting for the root cgroup: the slab accounting currently under development doesn't account for the kernel's internal memory consumption, and it doesn't account slab memory for the root cgroup at all.
A few days ago I asked[1] why memcg doesn't do all this, and whether that is a design decision or just an implementation detail (so that we have a chance to fix it).
But so far there has been no feedback. We'll see how things turn out.
[1] http://lkml.org/lkml/2012/4/30/115
Thanks!