On Tue, Mar 26, 2013 at 01:46:56PM -0700, Andrew Morton wrote: [...]
+The file memory.pressure_level is only used to setup an eventfd, +read/write operations are no implemented.
[...]
Did we tell people how to use the eventfd interface anywhere?
Good point. In v4 I added a detailed instructions on how to setup the file descriptors.
- Add support for accounting huge pages (as a separate controller)
- Make per-cgroup scanner reclaim not-shared pages first
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h --- /dev/null +++ b/include/linux/vmpressure.h @@ -0,0 +1,47 @@ +#ifndef __LINUX_VMPRESSURE_H +#define __LINUX_VMPRESSURE_H
+#include <linux/mutex.h> +#include <linux/list.h> +#include <linux/workqueue.h> +#include <linux/gfp.h> +#include <linux/types.h> +#include <linux/cgroup.h>
+struct vmpressure {
- unsigned int scanned;
- unsigned int reclaimed;
- /* The lock is used to keep the scanned/reclaimed above in sync. */
- struct mutex sr_lock;
- struct list_head events;
A comment describing what goes at `events' would be nice. Reference "struct vmpressure_event".
Done.
- /* Have to grab the lock on events traversal or modifications. */
- struct mutex events_lock;
- struct work_struct work;
+};
...
+/*
- The window size is the number of scanned pages before we try to analyze
- the scanned/reclaimed ratio (or difference).
- It is used as a rate-limit tunable for the "low" level notification,
- and for averaging medium/critical levels. Using small window sizes can
- cause lot of false positives, but too big window size will delay the
- notifications.
- TODO: Make the window size depend on machine size, as we do for vmstat
- thresholds.
Here "the window size" refers to vmpressure_win, yes?
Yup.
(To make it clear, in the new version I added a direct reference to the vmpressure_win.)
- */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16; +static const unsigned int vmpressure_level_med = 60; +static const unsigned int vmpressure_level_critical = 95; +static const unsigned int vmpressure_level_critical_prio = 3;
vmpressure_level_critical_prio is a bit mysterious and undocumented. Please document it here and/or at vmpressure_prio().
I added documentation in v4.
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed)
Exported function and a primary inteface. Needs nice documentation, please ;)
Sure thing, all exported function now come with kernel-doc comments.
[...]
*/
- if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
I'm surprised at __GFP_HIGHMEM's inclusion. On some machines the great majority of user memory is in highmem. What's up?
In the new revision I included this comment:
/* * Here we only want to account pressure that userland is able to * help us with. For example, suppose that DMA zone is under * pressure; if we notify userland about that kind of pressure, * then it will be mostly a waste as it will trigger unnecessary * freeing of memory by userland (since userland is more likely to * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That * is why we include only movable, highmem and FS/IO pages. * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so * we account it too. */ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS))) return;
- if (!scanned)
return;
- mutex_lock(&vmpr->sr_lock);
- vmpr->scanned += scanned;
- vmpr->reclaimed += reclaimed;
See, here we're accumulating into a 32-bit variable quantities which used to be held in 64-bit variables. The overflow risk gets higher...
I see. I fixed this.
- mutex_unlock(&vmpr->sr_lock);
- if (scanned < vmpressure_win || work_pending(&vmpr->work))
return;
- schedule_work(&vmpr->work);
+}
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
Documentation please.
Yup, done.
+{
- if (prio > vmpressure_level_critical_prio)
return;
- /*
* OK, the prio is below the threshold, updating vmpressure
But you never told me what that threshold is for! And I have no means of working out why you chose "3", nor the effects of altering it, etc.
True. This is explained it in a comment now.
[...]
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
struct eventfd_ctx *eventfd, const char *args)
Document the interface, please.
Done.
+{
- struct vmpressure *vmpr = cg_to_vmpr(cg);
- struct vmpressure_event *ev;
- int lvl;
These abbreviations are rather unlinuxy. wk->work, vmpr->vmpressure, lvl->level, etc.
Yeah, I agree. Although, 'vmpressure' as a function-scope variable is kinda too long, the code becomes really hard to read. But in memcg struct and global namespace I now use the full 'vmpressure' name.
- for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
if (!strcmp(vmpressure_str_levels[lvl], args))
break;
- }
- if (lvl >= VMPRESSURE_NUM_LEVELS)
return -EINVAL;
- ev = kzalloc(sizeof(*ev), GFP_KERNEL);
- if (!ev)
return -ENOMEM;
- ev->efd = eventfd;
- ev->level = lvl;
- mutex_lock(&vmpr->events_lock);
- list_add(&ev->node, &vmpr->events);
What's the upper bound on the length of this list?
As of now, it is controlled by the cgroup core, so I would say the number of opened FDs, and if that is a problem, it should be fixed for everyone. The good thing is that the list is per-cgroup, it is not global.
Thanks for the review, Andrew!
Anton