On 3 April 2013 22:08, David Miller <davem(a)davemloft.net> wrote:
> From: Viresh Kumar <viresh.kumar(a)linaro.org>
> Date: Wed, 3 Apr 2013 14:59:44 +0530
>
>> On 1 April 2013 10:11, Viresh Kumar <viresh.kumar(a)linaro.org> wrote:
>>> On 31 March 2013 22:10, David Miller <davem(a)davemloft.net> wrote:
>>>>> On 26 March 2013 09:55, Viresh Kumar <viresh.kumar(a)linaro.org> wrote:
>>>>>> From: Viresh Kumar <viresh.kumar(a)linaro.org>
>>>>>> Date: Mon, 25 Mar 2013 11:20:23 +0530
>>>>>> Subject: [PATCH] sparc: cpufreq: move cpufreq driver to drivers/cpufreq
>>>
>>>> Subject line still has the "spark" typo.
>>>
>>> Your mail was scary, really... HOW can I do it??
>>>
>>> And then I saw how you got it wrong. I haven't sent a new mail, so the
>>> mail's subject remains the same... I copied V2 into the same mail. Check
>>> above, the subject looks fine :)
>>
>> Hi David,
>>
>> I think all pending issues are fixed now... Can I have your Ack, please?
>> Or maybe more comments :)
>
> Acked-by: David S. Miller <davem(a)davemloft.net>
Adding everybody else on Cc.
The noop function stubs are not necessary because the header file is only
included in files which are compiled when CONFIG_CPU_IDLE is set.
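For reference, here is a minimal sketch of a typical user of this header
(the foo_* names are hypothetical, not from any tree; the WFI state macro
is the one this header already provides). A file like this is only built
when CONFIG_CPU_IDLE=y, which is why the !CONFIG_CPU_IDLE stub could
never be reached:

#include <linux/cpuidle.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/cpuidle.h>

/* the macro expands to { .enter = arm_cpuidle_simple_enter, ... } */
static struct cpuidle_driver foo_idle_driver = {
	.name		= "foo_idle",
	.owner		= THIS_MODULE,
	.states[0]	= ARM_CPUIDLE_WFI_STATE_PWR(UINT_MAX),
	.state_count	= 1,
};

static int __init foo_idle_init(void)
{
	return cpuidle_register_driver(&foo_idle_driver);
}
device_initcall(foo_idle_init);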
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
arch/arm/include/asm/cpuidle.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/arm/include/asm/cpuidle.h b/arch/arm/include/asm/cpuidle.h
index 2fca60a..7367787 100644
--- a/arch/arm/include/asm/cpuidle.h
+++ b/arch/arm/include/asm/cpuidle.h
@@ -1,13 +1,8 @@
#ifndef __ASM_ARM_CPUIDLE_H
#define __ASM_ARM_CPUIDLE_H
-#ifdef CONFIG_CPU_IDLE
extern int arm_cpuidle_simple_enter(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index);
-#else
-static inline int arm_cpuidle_simple_enter(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index) { return -ENODEV; }
-#endif
+ struct cpuidle_driver *drv, int index);
/* Common ARM WFI state */
#define ARM_CPUIDLE_WFI_STATE_PWR(p) {\
--
1.7.9.5
Hi Rafael,
I noticed this patchset went to the linux-arm-project in patchwork, but
it was addressed to you / linux-pm. AFAIU you rely on patchwork to
manage the patches, so I was wondering if you could have missed them?
Thanks
-- Daniel
On 3 April 2013 15:12, Hans-Christian Egtvedt <egtvedt(a)samfundet.no> wrote:
> Around Wed 03 Apr 2013 14:58:29 +0530 or thereabout, Viresh Kumar wrote:
>> Ping!!
>
> If it still builds, then
>
> Acked-by: Hans-Christian Egtvedt <egtvedt(a)samfundet.no>
>
> It only uses global header files, so it should be fine. Thanks.
Ahh!! I removed the others from my last Ping!! mail (didn't want to spam
their inboxes), but yes, they are aware of your Ack now.
Thanks.
With this patch, userland applications that want to maintain
interactivity and manage memory allocation cost can use the pressure
level notifications. The levels are defined as follows:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups, A->B->C. Now you set up an event listener on cgroups A, B
and C, and suppose group C experiences some pressure. In this situation,
only group C will receive the notification, i.e. groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or ask us to implement pass-through events,
explaining why you would need them).
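For illustration, here is the level calculation this patch performs in
mm/vmpressure.c, as a sketch with made-up sample numbers (it mirrors
vmpressure_calc_level(); the medium/critical thresholds are 60 and 95):

static enum vmpressure_levels example_window(void)
{
	unsigned long scanned = 512, reclaimed = 64;	  /* one window */
	unsigned long scale = scanned + reclaimed;	  /* 576 */
	unsigned long pressure;

	pressure = scale - (reclaimed * scale / scanned); /* 576 - 72 = 504 */
	pressure = pressure * 100 / scale;		  /* 50400 / 576 = 87 */

	/* 87 >= 60 ("medium") but < 95 ("critical") */
	return vmpressure_level(pressure);		  /* VMPRESSURE_MEDIUM */
}

So listeners registered for "low" or "medium" on that cgroup would be
signalled, while "critical" listeners would not.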
Performance-wise, the memory pressure notification feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
Signed-off-by: Anton Vorontsov <anton.vorontsov(a)linaro.org>
Acked-by: Kirill A. Shutemov <kirill(a)shutemov.name>
---
Hi all,
Here is a shiny new v3!
In v3:
- No changes in the code, just updated commit message to incorporate the
answer to Minchan Kim's comment regarding applicability to embedded use
cases in the light of memcg performance overhead, plus gave some
references to Glauber Costa's memcg work.
- Rebased onto 3.9.0-rc3-next-20130321.
In v2:
- Addressed Glauber Costa's comments:
o Use parent_mem_cgroup() instead of own parent function (also suggested
by Kamezawa). This change also affected events distribution logic, so
it became more like memory thresholds notifications, i.e. we deliver
the event to the cgroup where the event originated, not to the parent
cgroup; (This also addresses Kamezawa's remark regarding which cgroup
receives which event.)
o Register vmpressure cgroup file directly in memcontrol.c.
- Addressed Greg Thelen's comments:
o Fixed bool/int inconsistency in the code;
o Fixed nr_scanned accounting;
o Don't use cryptic 's', 'r' abbreviations; get rid of confusing
'window' argument.
- Addressed Kamezawa Hiroyuki's comments:
o Moved declarations from mm/internal.h into linux/vmpressure.h;
o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially
comparing to the memcg accounting). If it ever causes any measurable
performance effect, we want to fix it, not paper it over with a
Kconfig option. :-)
o Removed read operation on pressure_level cgroup file. In apps, we only
use notifications, we don't need the content of the file, so let's
keep things simple for now. Plus this resolves questions like what
should we return there when the system is not reclaiming;
o Reworded documentation;
o Improved comments for vmpressure_prio().
Old changelogs/submissions:
v2: http://lkml.org/lkml/2013/2/18/577
v1: http://lkml.org/lkml/2013/2/10/140
mempressure cgroup: http://lkml.org/lkml/2013/1/4/55
Documentation/cgroups/memory.txt | 61 +++++++++-
include/linux/vmpressure.h | 47 ++++++++
mm/Makefile | 2 +-
mm/memcontrol.c | 28 +++++
mm/vmpressure.c | 252 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 ++
6 files changed, 396 insertions(+), 2 deletions(-)
create mode 100644 include/linux/vmpressure.h
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..0c004de 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
+ - memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
+ memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,64 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies for managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining the cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+prematurely shut down unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure, the system might be making swap, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to out of memory (OOM) or even the in-kernel OOM killer is on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups, A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or ask us to implement pass-through events,
+explaining why you would need them).
+
+The file memory.pressure_level is only used to set up an eventfd;
+read/write operations are not implemented.
+
+Test:
+
+ Here is a small example script that makes a new cgroup, sets up a
+ memory limit, sets up a notification in the cgroup, and then makes the
+ child cgroup experience critical pressure:
+
+ # cd /sys/fs/cgroup/memory/
+ # mkdir foo
+ # cd foo
+ # cgroup_event_listener memory.pressure_level low &
+ # echo 8000000 > memory.limit_in_bytes
+ # echo 8000000 > memory.memsw.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
new file mode 100644
index 0000000..fa84783
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,47 @@
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>
+
+struct vmpressure {
+ unsigned int scanned;
+ unsigned int reclaimed;
+ /* The lock is used to keep the scanned/reclaimed above in sync. */
+ struct mutex sr_lock;
+
+ struct list_head events;
+ /* Have to grab the lock on events traversal or modifications. */
+ struct mutex events_lock;
+
+ struct work_struct work;
+};
+
+struct mem_cgroup;
+
+#ifdef CONFIG_MEMCG
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+#endif /* CONFIG_MEMCG */
+
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args);
+extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd);
+
+#endif /* __LINUX_VMPRESSURE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..72c5acb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..2482f2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
+#include <linux/vmpressure.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
@@ -376,6 +377,9 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif
+
+ struct vmpressure vmpr;
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -576,6 +580,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
return container_of(s, struct mem_cgroup, css);
}
+/* Some nice accessors for the vmpressure. */
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+ return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+ return &mem_cgroup_from_css(css)->vmpr;
+}
+
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return (memcg == root_mem_cgroup);
@@ -6074,6 +6096,11 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "pressure_level",
+ .register_event = vmpressure_register_event,
+ .unregister_event = vmpressure_unregister_event,
+ },
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
@@ -6365,6 +6392,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
+ vmpressure_init(&memcg->vmpr);
return &memcg->css;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..ae0ff8e
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,252 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov(a)linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include <linux/vmpressure.h>
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging the medium/critical levels. Using small window
+ * sizes can cause a lot of false positives, but too big a window size
+ * will delay the notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+ if (pressure >= vmpressure_level_critical)
+ return VMPRESSURE_CRITICAL;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
+ unsigned int reclaimed)
+{
+ unsigned long scale = scanned + reclaimed;
+ unsigned long pressure;
+
+ if (!scanned)
+ return VMPRESSURE_LOW;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ pressure = scale - (reclaimed * scale / scanned);
+ pressure = pressure * 100 / scale;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure,
+ scanned, reclaimed);
+
+ return vmpressure_level(pressure);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+ /*
+ * So far we are only interested in application memory, or, in the
+ * case of low pressure, in FS/IO memory reclaim. We are also
+ * interested in indirect reclaim (kswapd sets sc->gfp_mask to
+ * GFP_KERNEL).
+ */
+ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+ return;
+
+ if (!scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ vmpr->scanned += scanned;
+ vmpr->reclaimed += reclaimed;
+ mutex_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
+ return;
+ schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /*
+ * OK, the prio is below the threshold, so update the vmpressure
+ * information before diving into a long round of large-scale
+ * vmscan shrinking.
+ */
+ vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+ return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+ return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure_event *ev;
+ int level = vmpressure_calc_level(scanned, reclaimed);
+ bool signalled = false;
+
+ mutex_lock(&vmpr->events_lock);
+
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (level >= ev->level) {
+ eventfd_signal(ev->efd, 1);
+ signalled = true;
+ }
+ }
+
+ mutex_unlock(&vmpr->events_lock);
+
+ return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct cgroup *cg = vmpr_to_css(vmpr)->cgroup;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cg);
+
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ return NULL;
+ return memcg_to_vmpr(memcg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct vmpressure *vmpr = wk_to_vmpr(wk);
+ unsigned long s;
+ unsigned long r;
+
+ mutex_lock(&vmpr->sr_lock);
+ s = vmpr->scanned;
+ r = vmpr->reclaimed;
+ vmpr->scanned = 0;
+ vmpr->reclaimed = 0;
+ mutex_unlock(&vmpr->sr_lock);
+
+ do {
+ if (vmpressure_event(vmpr, s, r))
+ break;
+ /*
+ * If not handled, propagate the event upward into the
+ * hierarchy.
+ */
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+ int lvl;
+
+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+ if (!strcmp(vmpressure_str_levels[lvl], args))
+ break;
+ }
+
+ if (lvl >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = lvl;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+
+ return 0;
+}
+
+void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->sr_lock);
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..616e2bb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/vmpressure.h>
#include <linux/vmstat.h>
#include <linux/file.h>
#include <linux/writeback.h>
@@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
}
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
+
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned,
+ sc->nr_reclaimed - nr_reclaimed);
+
} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
}
@@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.1.4
With this patch, userland applications that want to maintain
interactivity and manage memory allocation cost can use the pressure
level notifications. The levels are defined as follows:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups, A->B->C. Now you set up an event listener on cgroups A, B
and C, and suppose group C experiences some pressure. In this situation,
only group C will receive the notification, i.e. groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or ask us to implement pass-through events,
explaining why you would need them).
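As a concrete illustration, a minimal userland sketch of the
registration sequence documented in the patch below (the cgroup path
/sys/fs/cgroup/memory/foo is an assumption, and error handling is
omitted for brevity):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	/* assumes the cgroup was created and limits were set beforehand */
	int efd = eventfd(0, 0);
	int pfd = open("/sys/fs/cgroup/memory/foo/memory.pressure_level",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/foo/cgroup.event_control",
		       O_WRONLY);
	char line[64];
	uint64_t count;

	/* "<event_fd> <fd of memory.pressure_level> <level>" */
	snprintf(line, sizeof(line), "%d %d low", efd, pfd);
	write(cfd, line, strlen(line));

	for (;;) {
		/* blocks until the kernel signals "low" pressure or higher */
		read(efd, &count, sizeof(count));
		printf("vmpressure event (count=%llu)\n",
		       (unsigned long long)count);
	}
}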
Performance-wise, the memory pressure notification feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
Signed-off-by: Anton Vorontsov <anton.vorontsov(a)linaro.org>
Acked-by: Kirill A. Shutemov <kirill(a)shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com>
---
Hi all,
Thanks for the previous reviews!
In v4 I addressed Andrew's and Kamezawa's comments:
- Documented public interfaces and tunables;
- Added documentation for eventfd interface;
- Some cosmetic changes: code rearrangements and variable renames
(wk->work, lvl->level, etc.);
- Changed types for page counters from 'unsigned int' to 'unsigned long';
this avoids possible overflows;
- Added Kamezawa's Ack, and rebased onto 3.9.0-rc5-next-20130402+.
In v3:
- No changes in the code, just updated commit message to incorporate the
answer to Minchan Kim's comment regarding applicability to embedded use
cases in the light of memcg performance overhead, plus gave some
references to Glauber Costa's memcg work.
- Rebased onto 3.9.0-rc3-next-20130321.
Old changelogs/submissions:
v3: http://lkml.org/lkml/2013/3/22/31
v2: http://lkml.org/lkml/2013/2/18/577
v1: http://lkml.org/lkml/2013/2/10/140
mempressure cgroup: http://lkml.org/lkml/2013/1/4/55
Documentation/cgroups/memory.txt | 70 +++++++-
include/linux/vmpressure.h | 48 +++++
mm/Makefile | 2 +-
mm/memcontrol.c | 29 +++
mm/vmpressure.c | 374 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 +
6 files changed, 529 insertions(+), 2 deletions(-)
create mode 100644 include/linux/vmpressure.h
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 3aaa984..1178e23 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
+ - memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
+ memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,73 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies for managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining the cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+prematurely shut down unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure, the system might be making swap, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to out of memory (OOM) or even the in-kernel OOM killer is on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take an immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups, A->B->C. Now you set up an event listener on cgroups A, B
+and C, and suppose group C experiences some pressure. In this situation,
+only group C will receive the notification, i.e. groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or ask us to implement pass-through events,
+explaining why you would need them).
+
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
+ to cgroup.event_control.
+
+The application will be notified through the eventfd when memory
+pressure is at the specified level (or higher). Read/write operations
+on memory.pressure_level are not implemented.
+
+Test:
+
+ Here is a small example script that makes a new cgroup, sets up a
+ memory limit, sets up a notification in the cgroup, and then makes the
+ child cgroup experience critical pressure:
+
+ # cd /sys/fs/cgroup/memory/
+ # mkdir foo
+ # cd foo
+ # cgroup_event_listener memory.pressure_level low &
+ # echo 8000000 > memory.limit_in_bytes
+ # echo 8000000 > memory.memsw.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
new file mode 100644
index 0000000..2e86259
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>
+
+struct vmpressure {
+ unsigned long scanned;
+ unsigned long reclaimed;
+ /* The lock is used to keep the scanned/reclaimed above in sync. */
+ struct mutex sr_lock;
+
+ /* The list of vmpressure_event structs. */
+ struct list_head events;
+ /* Have to grab the lock on events traversal or modifications. */
+ struct mutex events_lock;
+
+ struct work_struct work;
+};
+
+struct mem_cgroup;
+
+#ifdef CONFIG_MEMCG
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+#endif /* CONFIG_MEMCG */
+
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
+extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args);
+extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd);
+
+#endif /* __LINUX_VMPRESSURE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..72c5acb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..64d75a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
+#include <linux/vmpressure.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
@@ -315,6 +316,9 @@ struct mem_cgroup {
/* thresholds for mem+swap usage. RCU-protected */
struct mem_cgroup_thresholds memsw_thresholds;
+ /* vmpressure notifications */
+ struct vmpressure vmpressure;
+
union {
/* For oom notifier event fd */
struct list_head oom_notify;
@@ -376,6 +380,7 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -576,6 +581,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
return container_of(s, struct mem_cgroup, css);
}
+/* Some nice accessors for the vmpressure. */
+struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
+{
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ return &memcg->vmpressure;
+}
+
+struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
+{
+ return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
+}
+
+struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css)
+{
+ return &mem_cgroup_from_css(css)->vmpressure;
+}
+
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return (memcg == root_mem_cgroup);
@@ -6074,6 +6097,11 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "pressure_level",
+ .register_event = vmpressure_register_event,
+ .unregister_event = vmpressure_unregister_event,
+ },
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
@@ -6365,6 +6393,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
+ vmpressure_init(&memcg->vmpressure);
return &memcg->css;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..ccbdc9e
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,374 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov(a)linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/log2.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include <linux/vmpressure.h>
+
+/*
+ * The window size (vmpressure_win) is the number of scanned pages before
+ * we try to analyze the scanned/reclaimed ratio. So the window is used
+ * as a rate-limit tunable for the "low" level notification, and also for
+ * averaging the ratio for the medium/critical levels. Using small window
+ * sizes can cause a lot of false positives, but too big a window size
+ * will delay the notifications.
+ *
+ * As the vmscan reclaimer logic works with chunks which are multiple of
+ * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ */
+static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+
+/*
+ * These thresholds are used when we account memory pressure through
+ * the scanned/reclaimed ratio. The current values were chosen
+ * empirically. In essence, they are percentages: the higher the value,
+ * the more unsuccessful reclaims there were.
+ */
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+
+/*
+ * When there are too few pages left to scan, vmpressure() may miss the
+ * critical pressure, as the number of pages will be less than the
+ * "window size". However, in that case the vmscan priority will rise
+ * fast, as the reclaimer will try to scan the LRUs more deeply.
+ *
+ * The vmscan logic considers these special priorities:
+ *
+ * prio == DEF_PRIORITY (12): reclaimer starts with that value
+ * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
+ * prio == 0 : close to OOM, kernel scans every page in an lru
+ *
+ * Any value in this range is acceptable for this tunable (i.e. from 12
+ * to 0). The current value of vmpressure_level_critical_prio was chosen
+ * empirically, but the number, in essence, means that we consider the
+ * critical level when the scanning depth is ~10% of the lru size (vmscan
+ * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
+ * eighth).
+ */
+static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
+
+static struct vmpressure *work_to_vmpressure(struct work_struct *work)
+{
+ return container_of(work, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpressure(struct cgroup *cg)
+{
+ return css_to_vmpressure(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct cgroup *cg = vmpressure_to_css(vmpr)->cgroup;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cg);
+
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ return NULL;
+ return memcg_to_vmpressure(memcg);
+}
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned long pressure)
+{
+ if (pressure >= vmpressure_level_critical)
+ return VMPRESSURE_CRITICAL;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+ unsigned long reclaimed)
+{
+ unsigned long scale = scanned + reclaimed;
+ unsigned long pressure;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ pressure = scale - (reclaimed * scale / scanned);
+ pressure = pressure * 100 / scale;
+
+ pr_debug("%s: %3lu (s: %lu r: %lu)\n", __func__, pressure,
+ scanned, reclaimed);
+
+ return vmpressure_level(pressure);
+}
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure_event *ev;
+ enum vmpressure_levels level;
+ bool signalled = false;
+
+ level = vmpressure_calc_level(scanned, reclaimed);
+
+ mutex_lock(&vmpr->events_lock);
+
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (level >= ev->level) {
+ eventfd_signal(ev->efd, 1);
+ signalled = true;
+ }
+ }
+
+ mutex_unlock(&vmpr->events_lock);
+
+ return signalled;
+}
+
+static void vmpressure_work_fn(struct work_struct *work)
+{
+ struct vmpressure *vmpr = work_to_vmpressure(work);
+ unsigned long scanned;
+ unsigned long reclaimed;
+
+ /*
+ * Several contexts might be calling vmpressure(), so it is
+ * possible that the work was rescheduled again before the old
+ * work context cleared the counters. In that case we will run
+ * just after the old work returns, but then scanned might be zero
+ * here. No need for any locks here since we don't care if
+ * vmpr->reclaimed is in sync.
+ */
+ if (!vmpr->scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ scanned = vmpr->scanned;
+ reclaimed = vmpr->reclaimed;
+ vmpr->scanned = 0;
+ vmpr->reclaimed = 0;
+ mutex_unlock(&vmpr->sr_lock);
+
+ do {
+ if (vmpressure_event(vmpr, scanned, reclaimed))
+ break;
+ /*
+ * If not handled, propagate the event upward into the
+ * hierarchy.
+ */
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/**
+ * vmpressure() - Account memory pressure through scanned/reclaimed ratio
+ * @gfp: reclaimer's gfp mask
+ * @memcg: cgroup memory controller handle
+ * @scanned: number of pages scanned
+ * @reclaimed: number of pages reclaimed
+ *
+ * This function should be called from the vmscan reclaim path to account
+ * "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
+ * pressure index is then further refined and averaged over time.
+ *
+ * This function does not return any value.
+ */
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+
+ /*
+ * Here we only want to account pressure that userland is able to
+ * help us with. For example, suppose that DMA zone is under
+ * pressure; if we notify userland about that kind of pressure,
+ * then it will be mostly a waste as it will trigger unnecessary
+ * freeing of memory by userland (since userland is more likely to
+ * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
+ * is why we include only movable, highmem and FS/IO pages.
+ * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
+ * we account it too.
+ */
+ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+ return;
+
+ /*
+ * If we got here with no pages scanned, then that is an indicator
+ * that the reclaimer was unable to find any shrinkable LRUs at the
+ * current scanning depth. But it does not mean that we should
+ * report the critical pressure, yet. If the scanning priority
+ * (scanning depth) goes too high (deep), we will be notified
+ * through vmpressure_prio(). But so far, keep calm.
+ */
+ if (!scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ vmpr->scanned += scanned;
+ vmpr->reclaimed += reclaimed;
+ scanned = vmpr->scanned;
+ mutex_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
+ return;
+ schedule_work(&vmpr->work);
+}
+
+/**
+ * vmpressure_prio() - Account memory pressure through reclaimer priority level
+ * @gfp: reclaimer's gfp mask
+ * @memcg: cgroup memory controller handle
+ * @prio: reclaimer's priority
+ *
+ * This function should be called from the reclaim path every time the
+ * vmscan reclaiming priority (scanning depth) changes.
+ *
+ * This function does not return any value.
+ */
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ /*
+ * We only use prio for accounting the critical level. For more info,
+ * see the comment for the vmpressure_level_critical_prio variable above.
+ */
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /*
+ * OK, the prio is below the threshold, so update the vmpressure
+ * information before the shrinker dives into a long round of
+ * large-scale vmscan. Passing scanned = vmpressure_win, reclaimed = 0
+ * to vmpressure() basically means that we signal the 'critical'
+ * level.
+ */
+ vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+/**
+ * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
+ * @cg: cgroup that is interested in vmpressure notifications
+ * @cft: cgroup control files handle
+ * @eventfd: eventfd context to link notifications with
+ * @args: event arguments (used to set up a pressure level threshold)
+ *
+ * This function associates eventfd context with the vmpressure
+ * infrastructure, so that the notifications will be delivered to the
+ * @eventfd. The @args parameter is a string that denotes pressure level
+ * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
+ * "critical").
+ *
+ * This function should not be used directly, just pass it to (struct
+ * cftype).register_event, and then cgroup core will handle everything by
+ * itself.
+ */
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct vmpressure *vmpr = cg_to_vmpressure(cg);
+ struct vmpressure_event *ev;
+ int level;
+
+ for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
+ if (!strcmp(vmpressure_str_levels[level], args))
+ break;
+ }
+
+ if (level >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = level;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+
+ return 0;
+}
+
+/**
+ * vmpressure_unregister_event() - Unbind eventfd from vmpressure
+ * @cg: cgroup handle
+ * @cft: cgroup control files handle
+ * @eventfd: eventfd context that was used to link vmpressure with the @cg
+ *
+ * This function does internal manipulations to detach the @eventfd from
+ * the vmpressure notifications, and then frees internal resources
+ * associated with the @eventfd (but the @eventfd itself is not freed).
+ *
+ * This function should not be used directly, just pass it to (struct
+ * cftype).unregister_event, and then cgroup core will handle everything
+ * by itself.
+ */
+void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = cg_to_vmpressure(cg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
+/**
+ * vmpressure_init() - Initialize vmpressure control structure
+ * @vmpr: Structure to be initialized
+ *
+ * This function should be called on every allocated vmpressure structure
+ * before any usage.
+ */
+void vmpressure_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->sr_lock);
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_work_fn);
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..616e2bb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/vmpressure.h>
#include <linux/vmstat.h>
#include <linux/file.h>
#include <linux/writeback.h>
@@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
}
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
+
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned,
+ sc->nr_reclaimed - nr_reclaimed);
+
} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
}
@@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.1.4
The next patch will automatically set up the broadcast timer for the
different cpuidle drivers when an idle state stops its timer. This
will be part of the generic code.
But some ARM boards, like s3c64xx, use cpuidle without
CONFIG_GENERIC_CLOCKEVENTS_BUILD set. Hence the cpuidle framework
will be compiled with the code that is supposed to be generic, that
is, with clockevents_notify and the different enum values.
Also, in that configuration clockevents_notify is a no-op macro. This
is fine, except that the usual calling code is:
int cpu = smp_processor_id();
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &cpu);
which raises a warning for the variable cpu, which is never used.
Move the clock_event_nofitiers enum definition out of the
CONFIG_GENERIC_CLOCKEVENTS_BUILD section to prevent a compilation
error when these values are used in the code.
Change the clockevents_notify macro to a static inline no-op function
to prevent the compilation warning.
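To make the warning concrete, here is a distilled sketch (the caller is
hypothetical, the two clockevents_notify variants are alternatives shown
side by side, and the usual -Wall build is assumed):

/* old: the macro discards its arguments entirely */
#define clockevents_notify(reason, arg) do { } while (0)

static void foo_enter_broadcast(void)
{
	int cpu = smp_processor_id();

	/* after preprocessing, 'cpu' is never referenced, so gcc
	 * emits: warning: unused variable 'cpu' */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
}

/* new: a static inline function evaluates its arguments, so 'cpu'
 * counts as used and the warning disappears, while the empty body
 * is still optimized away */
static inline void clockevents_notify(unsigned long reason, void *arg) {}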
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
include/linux/clockchips.h | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 6634652..f9fd937 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -8,6 +8,20 @@
#ifndef _LINUX_CLOCKCHIPS_H
#define _LINUX_CLOCKCHIPS_H
+/* Clock event notification values */
+enum clock_event_nofitiers {
+ CLOCK_EVT_NOTIFY_ADD,
+ CLOCK_EVT_NOTIFY_BROADCAST_ON,
+ CLOCK_EVT_NOTIFY_BROADCAST_OFF,
+ CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
+ CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
+ CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
+ CLOCK_EVT_NOTIFY_SUSPEND,
+ CLOCK_EVT_NOTIFY_RESUME,
+ CLOCK_EVT_NOTIFY_CPU_DYING,
+ CLOCK_EVT_NOTIFY_CPU_DEAD,
+};
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BUILD
#include <linux/clocksource.h>
@@ -26,20 +40,6 @@ enum clock_event_mode {
CLOCK_EVT_MODE_RESUME,
};
-/* Clock event notification values */
-enum clock_event_nofitiers {
- CLOCK_EVT_NOTIFY_ADD,
- CLOCK_EVT_NOTIFY_BROADCAST_ON,
- CLOCK_EVT_NOTIFY_BROADCAST_OFF,
- CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
- CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
- CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
- CLOCK_EVT_NOTIFY_SUSPEND,
- CLOCK_EVT_NOTIFY_RESUME,
- CLOCK_EVT_NOTIFY_CPU_DYING,
- CLOCK_EVT_NOTIFY_CPU_DEAD,
-};
-
/*
* Clock event features
*/
@@ -173,7 +173,7 @@ extern int tick_receive_broadcast(void);
#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void clockevents_notify(unsigned long reason, void *arg);
#else
-# define clockevents_notify(reason, arg) do { } while (0)
+static inline void clockevents_notify(unsigned long reason, void *arg) {}
#endif
#else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -181,7 +181,7 @@ extern void clockevents_notify(unsigned long reason, void *arg);
static inline void clockevents_suspend(void) {}
static inline void clockevents_resume(void) {}
-#define clockevents_notify(reason, arg) do { } while (0)
+static inline void clockevents_notify(unsigned long reason, void *arg) {}
#endif
--
1.7.9.5
The earlier definitions of affected and related CPUs were:
Related_cpus: CPUs which run at the same hardware frequency.
Affected_cpus: CPUs which need to have their frequency coordinated by software.
These definitions were confusing, as they didn't communicate the real
difference between the two.
Following are the new definitions of these variables:
Related_cpus: All (Online & Offline) CPUs that run at the same hardware frequency.
Affected_cpus: Online CPUs that run at the same hardware frequency.
The above definitions are more consistent with the latest cpufreq core code.
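For reference, a small sketch of how these lists are read through
libcpufreq, mirroring the loops in cpufreq-info.c (cpu 0 is an arbitrary
example; the put calls are the usual libcpufreq cleanup helpers):

#include <stdio.h>
#include "cpufreq.h"	/* libcpufreq header used by cpufreq-info.c */

static void print_cpu_list(struct cpufreq_affected_cpus *cpus)
{
	/* same traversal as the patched debug_output_one() */
	while (cpus->next) {
		printf("%d ", cpus->cpu);
		cpus = cpus->next;
	}
	printf("%d\n", cpus->cpu);
}

int main(void)
{
	struct cpufreq_affected_cpus *cpus;

	/* all (Online & Offline) CPUs that run at the same hw frequency */
	cpus = cpufreq_get_related_cpus(0);
	if (cpus) {
		print_cpu_list(cpus);
		cpufreq_put_related_cpus(cpus);
	}

	/* Online CPUs that run at the same hardware frequency */
	cpus = cpufreq_get_affected_cpus(0);
	if (cpus) {
		print_cpu_list(cpus);
		cpufreq_put_affected_cpus(cpus);
	}

	return 0;
}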
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
tools/power/cpupower/utils/cpufreq-info.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/power/cpupower/utils/cpufreq-info.c b/tools/power/cpupower/utils/cpufreq-info.c
index 28953c9..a81d4ec 100644
--- a/tools/power/cpupower/utils/cpufreq-info.c
+++ b/tools/power/cpupower/utils/cpufreq-info.c
@@ -247,7 +247,7 @@ static void debug_output_one(unsigned int cpu)
cpus = cpufreq_get_related_cpus(cpu);
if (cpus) {
- printf(_(" CPUs which run at the same hardware frequency: "));
+ printf(_(" All (Online & Offline) CPUs that run at the same hardware frequency: "));
while (cpus->next) {
printf("%d ", cpus->cpu);
cpus = cpus->next;
@@ -258,7 +258,7 @@ static void debug_output_one(unsigned int cpu)
cpus = cpufreq_get_affected_cpus(cpu);
if (cpus) {
- printf(_(" CPUs which need to have their frequency coordinated by software: "));
+ printf(_(" Online CPUs that run at the same hardware frequency: "));
while (cpus->next) {
printf("%d ", cpus->cpu);
cpus = cpus->next;
--
1.7.12.rc2.18.g61b472e
Reentrancy into the clock framework from the clk.h API is necessary for
clocks that are prepared and unprepared via i2c_transfer (which includes
many PMICs and discrete audio chips) as well as for several other use
cases.
This patch implements reentrancy by adding two global atomic_t's which
track the context of the current caller. Context in this case is the
return value from get_current(). One context variable is for slow
operations protected by the prepare_lock mutex and the other is for fast
operations protected by the enable_lock spinlock.
The clk.h API implementations are modified to first see if the relevant
global lock is already held and, if so, compare the global context (set by
whoever is holding the lock) against their own context (via a call to
get_current()). If the two match, then this function is a nested call
from the task already holding the lock, and we proceed. If the context
does not match, we proceed to take the lock as usual and wait for the
existing task to complete.
This patch does not increase concurrency for unrelated calls into the
clock framework. Instead it simply allows reentrancy by the single task
which is currently holding the global clock framework lock.
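As an example of the nesting this enables, consider a PMIC clock whose
prepare op does an i2c transfer (a sketch; the foo_* names and register
layout are hypothetical, not from any existing driver):

#include <linux/clk.h>
#include <linux/clk-provider.h>
#include <linux/i2c.h>

struct foo_pmic_clk {
	struct clk_hw hw;
	struct i2c_client *client;
};

/* .prepare op: runs with prepare_lock held and prepare_context set
 * to get_current() */
static int foo_pmic_clk_prepare(struct clk_hw *hw)
{
	struct foo_pmic_clk *pclk = container_of(hw, struct foo_pmic_clk, hw);
	u8 buf[2] = { 0x01, 0x01 };	/* hypothetical enable reg/bit */
	struct i2c_msg msg = {
		.addr	= pclk->client->addr,
		.flags	= 0,
		.len	= sizeof(buf),
		.buf	= buf,
	};

	/*
	 * i2c_transfer() may wake the i2c controller, whose driver in
	 * turn calls clk_prepare() on its own bus clock. That nested
	 * call comes from the same task, matches prepare_context and
	 * proceeds instead of deadlocking on prepare_lock.
	 */
	return i2c_transfer(pclk->client->adapter, &msg, 1) == 1 ? 0 : -EIO;
}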
Signed-off-by: Mike Turquette <mturquette(a)linaro.org>
Cc: Rajagopal Venkat <rajagopal.venkat(a)linaro.org>
Cc: David Brown <davidb(a)codeaurora.org>
Cc: Ulf Hansson <ulf.hansson(a)linaro.org>
Cc: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
---
drivers/clk/clk.c | 255 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 186 insertions(+), 69 deletions(-)
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 5e8ffff..17432a5 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -19,9 +19,12 @@
#include <linux/of.h>
#include <linux/device.h>
#include <linux/init.h>
+#include <linux/sched.h>
static DEFINE_SPINLOCK(enable_lock);
static DEFINE_MUTEX(prepare_lock);
+static atomic_t prepare_context;
+static atomic_t enable_context;
static HLIST_HEAD(clk_root_list);
static HLIST_HEAD(clk_orphan_list);
@@ -456,27 +459,6 @@ unsigned int __clk_get_prepare_count(struct clk *clk)
return !clk ? 0 : clk->prepare_count;
}
-unsigned long __clk_get_rate(struct clk *clk)
-{
- unsigned long ret;
-
- if (!clk) {
- ret = 0;
- goto out;
- }
-
- ret = clk->rate;
-
- if (clk->flags & CLK_IS_ROOT)
- goto out;
-
- if (!clk->parent)
- ret = 0;
-
-out:
- return ret;
-}
-
unsigned long __clk_get_flags(struct clk *clk)
{
return !clk ? 0 : clk->flags;
@@ -566,6 +548,35 @@ struct clk *__clk_lookup(const char *name)
return NULL;
}
+/*** locking & reentrancy ***/
+
+static void clk_fwk_lock(void)
+{
+ /* hold the framework-wide lock, context == NULL */
+ mutex_lock(&prepare_lock);
+
+ /* set context for any reentrant calls */
+ atomic_set(&prepare_context, (int) get_current());
+}
+
+static void clk_fwk_unlock(void)
+{
+ /* clear the context */
+ atomic_set(&prepare_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
+ mutex_unlock(&prepare_lock);
+}
+
+static bool clk_is_reentrant(void)
+{
+ if (mutex_is_locked(&prepare_lock))
+ if ((void *) atomic_read(&prepare_context) == get_current())
+ return true;
+
+ return false;
+}
+
/*** clk api ***/
void __clk_unprepare(struct clk *clk)
@@ -600,9 +611,15 @@ void __clk_unprepare(struct clk *clk)
*/
void clk_unprepare(struct clk *clk)
{
- mutex_lock(&prepare_lock);
+ /* re-enter if call is from the same context */
+ if (clk_is_reentrant()) {
+ __clk_unprepare(clk);
+ return;
+ }
+
+ clk_fwk_lock();
__clk_unprepare(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
}
EXPORT_SYMBOL_GPL(clk_unprepare);
@@ -648,10 +665,16 @@ int clk_prepare(struct clk *clk)
{
int ret;
- mutex_lock(&prepare_lock);
- ret = __clk_prepare(clk);
- mutex_unlock(&prepare_lock);
+ /* re-enter if call is from the same context */
+ if (clk_is_reentrant()) {
+ ret = __clk_prepare(clk);
+ goto out;
+ }
+ clk_fwk_lock();
+ ret = __clk_prepare(clk);
+ clk_fwk_unlock();
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_prepare);
@@ -692,8 +715,27 @@ void clk_disable(struct clk *clk)
{
unsigned long flags;
+ /* this call re-enters if it is from the same context */
+ if (spin_is_locked(&enable_lock)) {
+ if ((void *) atomic_read(&enable_context) == get_current()) {
+ __clk_disable(clk);
+ return;
+ }
+ }
+
+ /* hold the framework-wide lock, context == NULL */
spin_lock_irqsave(&enable_lock, flags);
+
+ /* set context for any reentrant calls */
+ atomic_set(&enable_context, (int) get_current());
+
+ /* disable the clock(s) */
__clk_disable(clk);
+
+ /* clear the context */
+ atomic_set(&enable_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
spin_unlock_irqrestore(&enable_lock, flags);
}
EXPORT_SYMBOL_GPL(clk_disable);
@@ -745,10 +787,29 @@ int clk_enable(struct clk *clk)
unsigned long flags;
int ret;
+ /* this call re-enters if it is from the same context */
+ if (spin_is_locked(&enable_lock)) {
+ if ((void *) atomic_read(&enable_context) == get_current()) {
+ ret = __clk_enable(clk);
+ goto out;
+ }
+ }
+
+ /* hold the framework-wide lock, context == NULL */
spin_lock_irqsave(&enable_lock, flags);
+
+ /* set context for any reentrant calls */
+ atomic_set(&enable_context, (int) get_current());
+
+ /* enable the clock(s) */
ret = __clk_enable(clk);
- spin_unlock_irqrestore(&enable_lock, flags);
+ /* clear the context */
+ atomic_set(&enable_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
+ spin_unlock_irqrestore(&enable_lock, flags);
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_enable);
@@ -792,10 +853,17 @@ long clk_round_rate(struct clk *clk, unsigned long rate)
{
unsigned long ret;
- mutex_lock(&prepare_lock);
+ /* this call re-enters if it is from the same context */
+ if (clk_is_reentrant()) {
+ ret = __clk_round_rate(clk, rate);
+ goto out;
+ }
+
+ clk_fwk_lock();
ret = __clk_round_rate(clk, rate);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_round_rate);
@@ -877,6 +945,30 @@ static void __clk_recalc_rates(struct clk *clk, unsigned long msg)
__clk_recalc_rates(child, msg);
}
+unsigned long __clk_get_rate(struct clk *clk)
+{
+ unsigned long ret;
+
+ if (!clk) {
+ ret = 0;
+ goto out;
+ }
+
+ if (clk->flags & CLK_GET_RATE_NOCACHE)
+ __clk_recalc_rates(clk, 0);
+
+ ret = clk->rate;
+
+ if (clk->flags & CLK_IS_ROOT)
+ goto out;
+
+ if (!clk->parent)
+ ret = 0;
+
+out:
+ return ret;
+}
+
/**
* clk_get_rate - return the rate of clk
* @clk: the clk whose rate is being returned
@@ -889,14 +981,22 @@ unsigned long clk_get_rate(struct clk *clk)
{
unsigned long rate;
- mutex_lock(&prepare_lock);
+ /*
+ * FIXME - any locking here seems heavy weight
+ * can clk->rate be replaced with an atomic_t?
+ * same logic can likely be applied to prepare_count & enable_count
+ */
- if (clk && (clk->flags & CLK_GET_RATE_NOCACHE))
- __clk_recalc_rates(clk, 0);
+ if (clk_is_reentrant()) {
+ rate = __clk_get_rate(clk);
+ goto out;
+ }
+ clk_fwk_lock();
rate = __clk_get_rate(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return rate;
}
EXPORT_SYMBOL_GPL(clk_get_rate);
@@ -1073,6 +1173,39 @@ static void clk_change_rate(struct clk *clk)
clk_change_rate(child);
}
+int __clk_set_rate(struct clk *clk, unsigned long rate)
+{
+ int ret = 0;
+ struct clk *top, *fail_clk;
+
+ /* bail early if nothing to do */
+ if (rate == clk->rate)
+ return 0;
+
+ if ((clk->flags & CLK_SET_RATE_GATE) && clk->prepare_count) {
+ return -EBUSY;
+ }
+
+ /* calculate new rates and get the topmost changed clock */
+ top = clk_calc_new_rates(clk, rate);
+ if (!top)
+ return -EINVAL;
+
+ /* notify that we are about to change rates */
+ fail_clk = clk_propagate_rate_change(top, PRE_RATE_CHANGE);
+ if (fail_clk) {
+ pr_warn("%s: failed to set %s rate\n", __func__,
+ fail_clk->name);
+ clk_propagate_rate_change(top, ABORT_RATE_CHANGE);
+ return -EBUSY;
+ }
+
+ /* change the rates */
+ clk_change_rate(top);
+
+ return ret;
+}
+
/**
* clk_set_rate - specify a new rate for clk
* @clk: the clk whose rate is being changed
@@ -1096,44 +1229,18 @@ static void clk_change_rate(struct clk *clk)
*/
int clk_set_rate(struct clk *clk, unsigned long rate)
{
- struct clk *top, *fail_clk;
int ret = 0;
- /* prevent racing with updates to the clock topology */
- mutex_lock(&prepare_lock);
-
- /* bail early if nothing to do */
- if (rate == clk->rate)
- goto out;
-
- if ((clk->flags & CLK_SET_RATE_GATE) && clk->prepare_count) {
- ret = -EBUSY;
- goto out;
- }
-
- /* calculate new rates and get the topmost changed clock */
- top = clk_calc_new_rates(clk, rate);
- if (!top) {
- ret = -EINVAL;
- goto out;
- }
-
- /* notify that we are about to change rates */
- fail_clk = clk_propagate_rate_change(top, PRE_RATE_CHANGE);
- if (fail_clk) {
- pr_warn("%s: failed to set %s rate\n", __func__,
- fail_clk->name);
- clk_propagate_rate_change(top, ABORT_RATE_CHANGE);
- ret = -EBUSY;
+ if (clk_is_reentrant()) {
+ ret = __clk_set_rate(clk, rate);
goto out;
}
- /* change the rates */
- clk_change_rate(top);
+ clk_fwk_lock();
+ ret = __clk_set_rate(clk, rate);
+ clk_fwk_unlock();
out:
- mutex_unlock(&prepare_lock);
-
return ret;
}
EXPORT_SYMBOL_GPL(clk_set_rate);
@@ -1148,10 +1255,16 @@ struct clk *clk_get_parent(struct clk *clk)
{
struct clk *parent;
- mutex_lock(&prepare_lock);
+ if (clk_is_reentrant()) {
+ parent = __clk_get_parent(clk);
+ goto out;
+ }
+
+ clk_fwk_lock();
parent = __clk_get_parent(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return parent;
}
EXPORT_SYMBOL_GPL(clk_get_parent);
@@ -1330,6 +1443,7 @@ out:
int clk_set_parent(struct clk *clk, struct clk *parent)
{
int ret = 0;
+ bool reenter;
if (!clk || !clk->ops)
return -EINVAL;
@@ -1337,8 +1451,10 @@ int clk_set_parent(struct clk *clk, struct clk *parent)
if (!clk->ops->set_parent)
return -ENOSYS;
- /* prevent racing with updates to the clock topology */
- mutex_lock(&prepare_lock);
+ reenter = clk_is_reentrant();
+
+ if (!reenter)
+ clk_fwk_lock();
if (clk->parent == parent)
goto out;
@@ -1367,7 +1483,8 @@ int clk_set_parent(struct clk *clk, struct clk *parent)
__clk_reparent(clk, parent);
out:
- mutex_unlock(&prepare_lock);
+ if (!reenter)
+ clk_fwk_unlock();
return ret;
}
--
1.7.10.4
This patch series adds DRM FIMD DT support for Exynos4 DT machines,
specifically for the Exynos4412 SoC.
changes since v9:
- dropped the patch "ARM: dts: Add lcd pinctrl node entries for EXYNOS4412 SoC"
as the gpios in the newly added nodes "lcd_en" and "lcd_sync" in this patch
were already pulled high by the existing "lcd_clk" node.
- addressed comments from Sylwester Nawrocki <sylvester.nawrocki(a)gmail.com>
and Thomas Abraham <thomas.abraham(a)linaro.org>
changes since v8:
- addressed comments to add missing documentation for clock and clock-names
properties as pointed out by Sachin Kamat <sachin.kamat(a)linaro.org>
changes since v7:
- rebased to kgene's "for-next"
- Migrated to Common Clock Framework
- removed the patch "ARM: dts: Add FIMD AUXDATA node entry for exynos4 DT",
as migration to Common Clock Framework will NOT need this.
- addressed the comments raised by Sachin Kamat <sachin.kamat(a)linaro.org>
changes since v6:
- addressed comments and added interrupt-names = "fifo", "vsync", "lcd_sys"
in exynos4.dtsi and re-ordered the interrupt numbering to match the order in
the interrupt combiner IP, as suggested by Sylwester Nawrocki <sylvester.nawrocki(a)gmail.com>.
changes since v5:
- renamed the fimd binding documentation file to "samsung-fimd.txt",
since it covers not only the exynos display controller but also earlier
samsung display controllers.
- rephrased an ambiguous statement about the interrupt combiner in the
fimd binding documentation as pointed out by
Sachin Kamat <sachin.kamat(a)linaro.org>
changes since v4:
- moved the fimd binding documentation to Documentation/devicetree/bindings/video/
as suggested by Sylwester Nawrocki <sylvester.nawrocki(a)gmail.com>
- added more fimd compatibility strings in the fimd documentation as
discussed at https://patchwork.kernel.org/patch/2144861/ with
Sylwester Nawrocki <sylvester.nawrocki(a)gmail.com> and
Tomasz Figa <tomasz.figa(a)gmail.com>
- modified the compatible string for exynos4 fimd to "exynos4210-fimd" and
for exynos5 fimd to "exynos5250-fimd", following the rule that a compatible
value should be named after the first specific SoC model in which this
particular IP version was included, as discussed at
https://patchwork.kernel.org/patch/2144861/
- documented more about the interrupt combiner and the ordering of its
interrupts, as suggested by Sylwester Nawrocki <sylvester.nawrocki(a)gmail.com>
changes since v3:
- rebased on
http://git.kernel.org/?p=linux/kernel/git/kgene/linux-samsung.git;a=shortlo…
changes since v2:
- added alias to 'fimd@11c00000' node
(reported by: Rahul Sharma <r.sh.open(a)gmail.com>)
- removed 'lcd0_data' node as there was already a similar node lcd_data24
(reported by: Jingoo Han <jg1.han(a)samsung.com>)
- replaced spaces with tabs in display-timing node
changes since v1:
- added new patch to add FIMD DT binding Documentation
- removed patch enabling SAMSUNG_DEV_BACKLIGHT and SAMSUNG_DEV_PWM
for mach-exynos4 DT
- added 'status' property to fimd DT node
This series is based on kgene's "for-next" branch:
https://git.kernel.org/cgit/linux/kernel/git/kgene/linux-samsung.git/log/?h…
Vikas Sajjan (3):
ARM: dts: Add FIMD node to exynos4
ARM: dts: Add FIMD node and display timing node to
exynos4412-origen.dts
ARM: dts: Add FIMD DT binding Documentation
.../devicetree/bindings/video/samsung-fimd.txt | 65 ++++++++++++++++++++
arch/arm/boot/dts/exynos4.dtsi | 12 ++++
arch/arm/boot/dts/exynos4412-origen.dts | 21 +++++++
3 files changed, 98 insertions(+)
create mode 100644 Documentation/devicetree/bindings/video/samsung-fimd.txt
--
1.7.9.5