Re: [PATCH v3] memcg: Add memory.pressure_level events

3 Apr 2013

On Tue, Mar 26, 2013 at 01:46:56PM -0700, Andrew Morton wrote:
[...]
...
...
+The file memory.pressure_level is only used to setup an eventfd,
+read/write operations are no implemented.
[...]
...
Did we tell people how to use the eventfd interface anywhere?
Good point. In v4 I added a detailed instructions on how to setup the file
descriptors.
...
...

Add support for accounting huge pages (as a separate controller)
Make per-cgroup scanner reclaim not-shared pages first

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,47 @@
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H



+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>



+struct vmpressure {

unsigned int scanned;
unsigned int reclaimed;
/* The lock is used to keep the scanned/reclaimed above in sync. */
struct mutex sr_lock;

struct list_head events;

A comment describing what goes at `events' would be nice.  Reference
"struct vmpressure_event".
Done.
...
...

/* Have to grab the lock on events traversal or modifications. */
struct mutex events_lock;

struct work_struct work;

+};
...
+/*


The window size is the number of scanned pages before we try to analyze



the scanned/reclaimed ratio (or difference).







It is used as a rate-limit tunable for the "low" level notification,



and for averaging medium/critical levels. Using small window sizes can



cause lot of false positives, but too big window size will delay the



notifications.







TODO: Make the window size depend on machine size, as we do for vmstat



thresholds.



Here "the window size" refers to vmpressure_win, yes?
Yup.
(To make it clear, in the new version I added a direct reference to the
vmpressure_win.)
...
...

*/

+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
vmpressure_level_critical_prio is a bit mysterious and undocumented. 
Please document it here and/or at vmpressure_prio().
I added documentation in v4.
...
...
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,

unsigned long scanned, unsigned long reclaimed)



Exported function and a primary inteface.  Needs nice documentation, please ;)
Sure thing, all exported function now come with kernel-doc comments.
[...]
...
...

*/


if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))

I'm surprised at __GFP_HIGHMEM's inclusion.  On some machines the great
majority of user memory is in highmem.  What's up?
In the new revision I included this comment:
/*
         * Here we only want to account pressure that userland is able to
         * help us with. For example, suppose that DMA zone is under
         * pressure; if we notify userland about that kind of pressure,
         * then it will be mostly a waste as it will trigger unnecessary
         * freeing of memory by userland (since userland is more likely to
         * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
         * is why we include only movable, highmem and FS/IO pages.
         * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
         * we account it too.
         */
        if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
                return;
...
...

if (!scanned)
return;



mutex_lock(&vmpr->sr_lock);
vmpr->scanned += scanned;
vmpr->reclaimed += reclaimed;

See, here we're accumulating into a 32-bit variable quantities which used
to be held in 64-bit variables.    The overflow risk gets higher...
I see. I fixed this.
...
...

mutex_unlock(&vmpr->sr_lock);

if (scanned < vmpressure_win || work_pending(&vmpr->work))
return;


schedule_work(&vmpr->work);

+}



+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
Documentation please.
Yup, done.
...
...
+{

if (prio > vmpressure_level_critical_prio)
return;



/*
* OK, the prio is below the threshold, updating vmpressure



But you never told me what that threshold is for!  And I have no means
of working out why you chose "3", nor the effects of altering it, etc.
True. This is explained it in a comment now.
[...]
...
...
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,

	      struct eventfd_ctx *eventfd, const char *args)



Document the interface, please.
Done.
...
...
+{

struct vmpressure *vmpr = cg_to_vmpr(cg);
struct vmpressure_event *ev;
int lvl;

These abbreviations are rather unlinuxy.  wk->work, vmpr->vmpressure,
lvl->level, etc.
Yeah, I agree. Although, 'vmpressure' as a function-scope variable is
kinda too long, the code becomes really hard to read. But in memcg struct
and global namespace I now use the full 'vmpressure' name.
...
...

for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
if (!strcmp(vmpressure_str_levels[lvl], args))


	break;


}

if (lvl >= VMPRESSURE_NUM_LEVELS)
return -EINVAL;



ev = kzalloc(sizeof(*ev), GFP_KERNEL);
if (!ev)
return -ENOMEM;



ev->efd = eventfd;
ev->level = lvl;

mutex_lock(&vmpr->events_lock);
list_add(&ev->node, &vmpr->events);

What's the upper bound on the length of this list?
As of now, it is controlled by the cgroup core, so I would say the number
of opened FDs, and if that is a problem, it should be fixed for everyone.
The good thing is that the list is per-cgroup, it is not global.
Thanks for the review, Andrew!
Anton

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [PATCH v3] memcg: Add memory.pressure_level events