On Mon, 19 Nov 2012, Glauber Costa wrote:
Because cpusets only deal with memory placement, not memory usage.
The set of nodes that a thread is allowed to allocate from may face memory pressure up to and including oom while the rest of the system still has a ton of free memory. Your solution is to compile and mount memcg if you want notifications of memory pressure on those nodes. Others in this thread have already said they don't want to rely on memcg for any of this and, as Anton showed, this can be tied directly into the VM without any help from memcg as it sits today. So why not implement a simple and clean mempressure cgroup that can be used alone or co-exist with either memcg or cpusets?
Forgot this one:
Because Tejun has a huge ongoing effort aimed at reducing the effects of orthogonal hierarchies. There are many controllers today that are "close enough" to each other (cpu, cpuacct; net_prio, net_cls), and in practice, that has brought more problems than it solved.
I'm very happy that Tejun is working on that, but I don't see how it's relevant here: I'm specifically referring to users who are not using memcg. This is what others brought up earlier in the thread: they do not want to be required to use memcg for this functionality.
There are users of cpusets today that neither enable nor comount memcg. I argue that a mempressure cgroup gives them this functionality without the memory footprint of memcg (not only in text, but also the page_cgroup it requires). Additionally, there are probably users who want neither cpusets nor memcg and want notifications from mempressure at a global level. Users who care this much about the memory pressure of their systems probably have strict footprint requirements; it would be a complete shame to require a semi-tractor trailer when all I want is a compact car.
So yes, *maybe* mempressure is the answer, but it needs to be justified with care. Long term, I think a saner notification API for memcg will lead us to a better and brighter future.
You can easily comount mempressure with your memcg, this is not anything new.
There is also yet another aspect: this scheme works well for global notifications. If we only ever wanted this to be global, it would work neatly. But as already mentioned in this thread, at some point we'll want this to work for a group of processes as well. At that point, you'll have to count how much memory is being used so you can determine whether or not pressure is going on. You will then have to redo all the work memcg already does.
Anton can correct me if I'm wrong, but I certainly don't think this is where mempressure is headed: I don't think any accounting needs to be done and, if it does, that's a design issue that should be addressed now rather than later. I believe notifications should occur on current's mempressure cgroup depending on its level of reclaim: nobody cares that your memcg has a limit of 64GB when you only have 32GB of RAM; we'll want the notification.