With this patch, userland applications that want to maintain
interactivity or keep memory allocation cost under control can use the
pressure level notifications. The levels are defined as follows:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g.
shut down unimportant services early).
The "medium" level means that the system is experiencing medium memory
pressure: the system might be swapping, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing: it is
about to run out of memory (OOM), or the in-kernel OOM killer is even on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other
statistics, so it is advisable to take immediate action.
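As a rough illustration of how a level is derived from reclaim activity, here is a userspace sketch of the arithmetic used by vmpressure_calc_level() in the diff below. The thresholds (60 and 95) are taken from this patch; the names pressure_percent() and pressure_level() are made up for the example and do not exist in the kernel.

```c
enum level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL };

/*
 * Percentage of scanned pages that could NOT be reclaimed, using the
 * same integer arithmetic as vmpressure_calc_level() in this patch.
 */
static unsigned long pressure_percent(unsigned long scanned,
				      unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure;

	if (!scanned)
		return 0;

	pressure = scale - (reclaimed * scale / scanned);
	return pressure * 100 / scale;
}

/* Map the percentage to a level: 60 is "medium", 95 is "critical". */
static enum level pressure_level(unsigned long pressure)
{
	if (pressure >= 95)
		return LEVEL_CRITICAL;
	if (pressure >= 60)
		return LEVEL_MEDIUM;
	return LEVEL_LOW;
}
```

For example, a window of 512 scanned pages with 100 of them reclaimed works out to 80, which falls into the "medium" range, while a window with nothing reclaimed works out to 100, which is "critical".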
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups, A->B->C, you set up an event listener on each of A, B
and C, and group C experiences some pressure. In this situation,
only group C will receive the notification; groups A and B will not
receive it. This is done to avoid excessive "broadcasting" of messages,
which disturbs the system and is especially bad if we are low on
memory or thrashing. So, organize the cgroups wisely, or propagate the
events manually (or ask us to implement pass-through events,
explaining why you would need them).
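The delivery rule can be modelled in a few lines of userspace C. This is only a sketch: struct group and deliver() are hypothetical stand-ins, and has_listener is assumed to mean "has at least one listener registered at or below the level of the event", which is what counts as "handled" in this patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup with pressure listeners. */
struct group {
	const char *name;
	struct group *parent;
	bool has_listener;	/* a listener that matches the event level */
	int notified;		/* number of events this group received */
};

/*
 * Walk upward from the group where the pressure originated and stop at
 * the first group that handles the event. Returns that group, or NULL
 * if nobody in the hierarchy was listening.
 */
static struct group *deliver(struct group *origin)
{
	struct group *g;

	for (g = origin; g; g = g->parent) {
		if (g->has_listener) {
			g->notified++;
			return g;
		}
	}
	return NULL;
}
```

With listeners on A, B and C, an event that originates in C stops at C. If C has no matching listener, the same event is delivered to B instead, and A still sees nothing.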
Performance-wise, the memory pressure notifications feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of the memcg features. Unfortunately, in the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
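For reference, the userland side could look roughly like the sketch below. It assumes the cgroup v1 event_control convention that cgroup_event_listener (used in the test script of this patch) relies on, i.e. writing "<eventfd> <fd of control file> <args>" to cgroup.event_control. The helper names format_event_control() and pressure_listen() are made up for this example and are not part of the patch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build the line for cgroup.event_control:
 *   "<eventfd> <fd of memory.pressure_level> <level>"
 * Returns 0 on success, -1 if it does not fit into buf.
 */
static int format_event_control(char *buf, size_t len, int efd, int pfd,
				const char *level)
{
	int n = snprintf(buf, len, "%d %d %s", efd, pfd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: subscribe to `level` ("low", "medium" or
 * "critical") in the memcg directory `dir`. Returns an eventfd to
 * read() from, or -1 on error.
 */
static int pressure_listen(const char *dir, const char *level)
{
	char path[4096], line[64];
	int efd, pfd, cfd, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.pressure_level", dir);
	pfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	cfd = open(path, O_WRONLY);

	if (pfd >= 0 && cfd >= 0 &&
	    !format_event_control(line, sizeof(line), efd, pfd, level) &&
	    write(cfd, line, strlen(line)) == (ssize_t)strlen(line))
		ret = efd;

	if (cfd >= 0)
		close(cfd);
	if (pfd >= 0)
		close(pfd);
	if (ret < 0)
		close(efd);
	return ret;
}
```

A caller would then block in read() on the returned descriptor with a uint64_t buffer; each successful read means that at least one event at or above the requested level was delivered to this cgroup.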
Signed-off-by: Anton Vorontsov <anton.vorontsov(a)linaro.org>
Acked-by: Kirill A. Shutemov <kirill(a)shutemov.name>
---
Hi all,
Here is a shiny new v3!
In v3:
- No changes in the code, just updated commit message to incorporate the
answer to Minchan Kim's comment regarding applicability to embedded use
cases in the light of memcg performance overhead, plus gave some
references to Glauber Costa's memcg work.
- Rebased onto 3.9.0-rc3-next-20130321.
In v2:
- Addressed Glauber Costa's comments:
o Use parent_mem_cgroup() instead of own parent function (also suggested
by Kamezawa). This change also affected events distribution logic, so
it became more like memory thresholds notifications, i.e. we deliver
the event to the cgroup where the event originated, not to the parent
cgroup; (This also addresses Kamezawa's remark regarding which cgroup
receives which event.)
o Register vmpressure cgroup file directly in memcontrol.c.
- Addressed Greg Thelen's comments:
o Fixed bool/int inconsistency in the code;
o Fixed nr_scanned accounting;
o Don't use cryptic 's', 'r' abbreviations; get rid of confusing
'window' argument.
- Addressed Kamezawa Hiroyuki's comments:
o Moved declarations from mm/internal.h into linux/vmpressure.h;
o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially
compared to the memcg accounting). If it ever causes any measurable
performance effect, we want to fix it, not paper it over with a
Kconfig option. :-)
o Removed read operation on pressure_level cgroup file. In apps, we only
use notifications, we don't need the content of the file, so let's
keep things simple for now. Plus this resolves questions like what
should we return there when the system is not reclaiming;
o Reworded documentation;
o Improved comments for vmpressure_prio().
Old changelogs/submissions:
v2: http://lkml.org/lkml/2013/2/18/577
v1: http://lkml.org/lkml/2013/2/10/140
mempressure cgroup: http://lkml.org/lkml/2013/1/4/55
Documentation/cgroups/memory.txt | 61 +++++++++-
include/linux/vmpressure.h | 47 ++++++++
mm/Makefile | 2 +-
mm/memcontrol.c | 28 +++++
mm/vmpressure.c | 252 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 ++
6 files changed, 396 insertions(+), 2 deletions(-)
create mode 100644 include/linux/vmpressure.h
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..0c004de 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
+ - memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
+ memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,64 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
-11. TODO
+11. Memory Pressure
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+shut down unimportant services early).
+
+The "medium" level means that the system is experiencing medium memory
+pressure: the system might be swapping, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing: it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even on
+its way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult vmstat or any other
+statistics, so it is advisable to take immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups, A->B->C, you set up an event listener on each of A, B
+and C, and group C experiences some pressure. In this situation,
+only group C will receive the notification; groups A and B will not
+receive it. This is done to avoid excessive "broadcasting" of messages,
+which disturbs the system and is especially bad if we are low on
+memory or thrashing. So, organize the cgroups wisely, or propagate the
+events manually (or ask us to implement pass-through events,
+explaining why you would need them).
+
+The file memory.pressure_level is only used to set up an eventfd;
+read/write operations are not implemented.
+
+Test:
+
+ Here is a small script example that makes a new cgroup, sets up a
+ memory limit, sets up a notification in the cgroup and then makes the
+ child cgroup experience critical pressure:
+
+ # cd /sys/fs/cgroup/memory/
+ # mkdir foo
+ # cd foo
+ # cgroup_event_listener memory.pressure_level low &
+ # echo 8000000 > memory.limit_in_bytes
+ # echo 8000000 > memory.memsw.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
new file mode 100644
index 0000000..fa84783
--- /dev/null
+++ b/include/linux/vmpressure.h
@@ -0,0 +1,47 @@
+#ifndef __LINUX_VMPRESSURE_H
+#define __LINUX_VMPRESSURE_H
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/cgroup.h>
+
+struct vmpressure {
+ unsigned int scanned;
+ unsigned int reclaimed;
+ /* The lock is used to keep the scanned/reclaimed above in sync. */
+ struct mutex sr_lock;
+
+ struct list_head events;
+ /* Have to grab the lock on events traversal or modifications. */
+ struct mutex events_lock;
+
+ struct work_struct work;
+};
+
+struct mem_cgroup;
+
+#ifdef CONFIG_MEMCG
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+#endif /* CONFIG_MEMCG */
+
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args);
+extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd);
+
+#endif /* __LINUX_VMPRESSURE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..72c5acb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..2482f2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
+#include <linux/vmpressure.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
@@ -376,6 +377,9 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif
+
+ struct vmpressure vmpr;
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -576,6 +580,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
return container_of(s, struct mem_cgroup, css);
}
+/* Some nice accessors for the vmpressure. */
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+ return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+ return &mem_cgroup_from_css(css)->vmpr;
+}
+
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
return (memcg == root_mem_cgroup);
@@ -6074,6 +6096,11 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "pressure_level",
+ .register_event = vmpressure_register_event,
+ .unregister_event = vmpressure_unregister_event,
+ },
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
@@ -6365,6 +6392,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
+ vmpressure_init(&memcg->vmpr);
return &memcg->css;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..ae0ff8e
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,252 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov(a)linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include <linux/vmpressure.h>
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging medium/critical levels. Using small window sizes can
+ * cause a lot of false positives, but too big a window size will delay the
+ * notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+ if (pressure >= vmpressure_level_critical)
+ return VMPRESSURE_CRITICAL;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned,
+ unsigned int reclaimed)
+{
+ unsigned long scale = scanned + reclaimed;
+ unsigned long pressure;
+
+ if (!scanned)
+ return VMPRESSURE_LOW;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ pressure = scale - (reclaimed * scale / scanned);
+ pressure = pressure * 100 / scale;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure,
+ scanned, reclaimed);
+
+ return vmpressure_level(pressure);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+ /*
+ * So far we are only interested in application memory, or, in case
+ * of low pressure, in FS/IO memory reclaim. We are also
+ * interested in indirect reclaim (kswapd sets sc->gfp_mask to
+ * GFP_KERNEL).
+ */
+ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+ return;
+
+ if (!scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ vmpr->scanned += scanned;
+ vmpr->reclaimed += reclaimed;
+ mutex_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
+ return;
+ schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /*
+ * OK, the prio is at or below the threshold: update the vmpressure
+ * information before diving into long shrinking of long range
+ * vmscan.
+ */
+ vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+ return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+ return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure_event *ev;
+ int level = vmpressure_calc_level(scanned, reclaimed);
+ bool signalled = false;
+
+ mutex_lock(&vmpr->events_lock);
+
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (level >= ev->level) {
+ eventfd_signal(ev->efd, 1);
+ signalled = true;
+ }
+ }
+
+ mutex_unlock(&vmpr->events_lock);
+
+ return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct cgroup *cg = vmpr_to_css(vmpr)->cgroup;
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cg);
+
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ return NULL;
+ return memcg_to_vmpr(memcg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct vmpressure *vmpr = wk_to_vmpr(wk);
+ unsigned long s;
+ unsigned long r;
+
+ mutex_lock(&vmpr->sr_lock);
+ s = vmpr->scanned;
+ r = vmpr->reclaimed;
+ vmpr->scanned = 0;
+ vmpr->reclaimed = 0;
+ mutex_unlock(&vmpr->sr_lock);
+
+ do {
+ if (vmpressure_event(vmpr, s, r))
+ break;
+ /*
+ * If not handled, propagate the event upward into the
+ * hierarchy.
+ */
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+ int lvl;
+
+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+ if (!strcmp(vmpressure_str_levels[lvl], args))
+ break;
+ }
+
+ if (lvl >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = lvl;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+
+ return 0;
+}
+
+void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->sr_lock);
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..616e2bb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/vmpressure.h>
#include <linux/vmstat.h>
#include <linux/file.h>
#include <linux/writeback.h>
@@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
}
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
+
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned,
+ sc->nr_reclaimed - nr_reclaimed);
+
} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
}
@@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.1.4
The next patch will automatically set up the broadcast timer for
the different cpuidle drivers when one idle state stops its timer.
This will be part of the generic code.
But some ARM boards, like s3c64xx, use cpuidle without
CONFIG_GENERIC_CLOCKEVENTS_BUILD set. Hence the cpuidle framework
will be compiled with code that is supposed to be generic, that is,
code that uses clockevents_notify and the notification enum.
Also, the function clockevents_notify is a noop macro. This is fine,
except that the usual code is:
int cpu = smp_processor_id();
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &cpu);
and that raises a warning because the variable cpu is not used.
Move the clock_event_nofitiers enum definition out of the
CONFIG_GENERIC_CLOCKEVENTS_BUILD section to prevent a compilation
error when these values are used in the code.
Change the clockevents_notify macro to a static inline noop function
to prevent a compilation warning.
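The difference between the two stubs can be seen in a small standalone sketch. Everything here is a stand-in: notify_inline(), fake_processor_id() and idle_enter() are made-up names that only mimic clockevents_notify(), smp_processor_id() and a cpuidle caller.

```c
/*
 * Old style stub, shown only as a comment because it is what triggers
 * the warning: the preprocessor drops both arguments, so a caller's
 * 'cpu' variable is never referenced.
 *
 *   #define notify_macro(reason, arg) do { } while (0)
 */

/* New style stub: still a noop, but the arguments remain visible. */
static inline void notify_inline(unsigned long reason, void *arg) {}

static int calls;

/* Stand-in for smp_processor_id(). */
static int fake_processor_id(void)
{
	calls++;
	return 3;
}

/* Typical caller pattern from the commit message. */
static void idle_enter(void)
{
	int cpu = fake_processor_id();

	/* '&cpu' is a real use of the variable, so no unused warning. */
	notify_inline(1, &cpu);
}
```

With the macro variant the same caller would leave cpu initialized but never referenced, which is what gcc reports under -Wall.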
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
include/linux/clockchips.h | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 6634652..f9fd937 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -8,6 +8,20 @@
#ifndef _LINUX_CLOCKCHIPS_H
#define _LINUX_CLOCKCHIPS_H
+/* Clock event notification values */
+enum clock_event_nofitiers {
+ CLOCK_EVT_NOTIFY_ADD,
+ CLOCK_EVT_NOTIFY_BROADCAST_ON,
+ CLOCK_EVT_NOTIFY_BROADCAST_OFF,
+ CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
+ CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
+ CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
+ CLOCK_EVT_NOTIFY_SUSPEND,
+ CLOCK_EVT_NOTIFY_RESUME,
+ CLOCK_EVT_NOTIFY_CPU_DYING,
+ CLOCK_EVT_NOTIFY_CPU_DEAD,
+};
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BUILD
#include <linux/clocksource.h>
@@ -26,20 +40,6 @@ enum clock_event_mode {
CLOCK_EVT_MODE_RESUME,
};
-/* Clock event notification values */
-enum clock_event_nofitiers {
- CLOCK_EVT_NOTIFY_ADD,
- CLOCK_EVT_NOTIFY_BROADCAST_ON,
- CLOCK_EVT_NOTIFY_BROADCAST_OFF,
- CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
- CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
- CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
- CLOCK_EVT_NOTIFY_SUSPEND,
- CLOCK_EVT_NOTIFY_RESUME,
- CLOCK_EVT_NOTIFY_CPU_DYING,
- CLOCK_EVT_NOTIFY_CPU_DEAD,
-};
-
/*
* Clock event features
*/
@@ -173,7 +173,7 @@ extern int tick_receive_broadcast(void);
#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern void clockevents_notify(unsigned long reason, void *arg);
#else
-# define clockevents_notify(reason, arg) do { } while (0)
+static inline void clockevents_notify(unsigned long reason, void *arg) {}
#endif
#else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -181,7 +181,7 @@ extern void clockevents_notify(unsigned long reason, void *arg);
static inline void clockevents_suspend(void) {}
static inline void clockevents_resume(void) {}
-#define clockevents_notify(reason, arg) do { } while (0)
+static inline void clockevents_notify(unsigned long reason, void *arg) {}
#endif
--
1.7.9.5
Earlier definitions of affected and related cpus were:
Related_cpus: CPUs which run at the same hardware frequency.
Affected_cpus: CPUs which need to have their frequency coordinated by software.
These definitions were very confusing as they don't communicate the real
difference between them.
The new definitions of these variables are:
Related_cpus: All (Online & Offline) CPUs that run at the same hardware frequency.
Affected_cpus: Online CPUs that run at the same hardware frequency.
The above definitions are more consistent with the latest cpufreq core code.
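Under the new definitions the relationship between the two sets is a simple intersection with the online CPUs. A minimal sketch, assuming a made-up bitmask representation (bit n set means CPU n is in the set) and a hypothetical helper name:

```c
#include <stdint.h>

/*
 * related: all CPUs (online and offline) sharing one hardware frequency.
 * online:  CPUs that are currently online.
 * Returns the "affected" set: the online subset of the related CPUs.
 */
static uint32_t affected_cpus(uint32_t related, uint32_t online)
{
	return related & online;
}
```

For instance, if CPUs 0-3 share a clock (related = 0x0f) and only CPUs 0 and 2 are online (online = 0x05), the affected set is CPUs 0 and 2 (0x05).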
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
tools/power/cpupower/utils/cpufreq-info.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/power/cpupower/utils/cpufreq-info.c b/tools/power/cpupower/utils/cpufreq-info.c
index 28953c9..a81d4ec 100644
--- a/tools/power/cpupower/utils/cpufreq-info.c
+++ b/tools/power/cpupower/utils/cpufreq-info.c
@@ -247,7 +247,7 @@ static void debug_output_one(unsigned int cpu)
cpus = cpufreq_get_related_cpus(cpu);
if (cpus) {
- printf(_(" CPUs which run at the same hardware frequency: "));
+ printf(_(" All (Online & Offline) CPUs that run at the same hardware frequency: "));
while (cpus->next) {
printf("%d ", cpus->cpu);
cpus = cpus->next;
@@ -258,7 +258,7 @@ static void debug_output_one(unsigned int cpu)
cpus = cpufreq_get_affected_cpus(cpu);
if (cpus) {
- printf(_(" CPUs which need to have their frequency coordinated by software: "));
+ printf(_(" Online CPUs that run at the same hardware frequency: "));
while (cpus->next) {
printf("%d ", cpus->cpu);
cpus = cpus->next;
--
1.7.12.rc2.18.g61b472e
Reentrancy into the clock framework from the clk.h api is necessary for
clocks that are prepared and unprepared via i2c_transfer (which includes
many PMICs and discrete audio chips) as well as for several other use
cases.
This patch implements reentrancy by adding two global atomic_t's which
track the context of the current caller. Context in this case is the
return value from get_current(). One context variable is for slow
operations protected by the prepare_mutex and the other is for fast
operations protected by the enable_lock spinlock.
The clk.h api implementations are modified to first see if the relevant
global lock is already held and, if so, compare the global context (set by
whoever is holding the lock) against their own context (via a call to
get_current()). If the two match, then this function is a nested call
from the one already holding the lock and we proceed. If the contexts
do not match, then proceed to call mutex_lock and wait for the
existing task to complete.
This patch does not increase concurrency for unrelated calls into the
clock framework. Instead it simply allows reentrancy by the single task
which is currently holding the global clock framework lock.
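The pattern can be tried out in userspace with C11 atomics. This is only a sketch of the idea, not the kernel code: the caller's context is passed in as a plain integer instead of get_current(), a toy spin lock replaces prepare_lock, and all of the fwk_* and demo_* names are made up for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_bool fwk_locked;		/* stands in for prepare_lock */
static atomic_uintptr_t fwk_context;	/* stands in for prepare_context */

/* True if the lock is held and was taken by the caller's own context. */
static bool fwk_is_reentrant(uintptr_t me)
{
	return atomic_load(&fwk_locked) && atomic_load(&fwk_context) == me;
}

static void fwk_lock(uintptr_t me)
{
	bool expected = false;

	/* Toy spin lock; the patch sleeps in mutex_lock() instead. */
	while (!atomic_compare_exchange_weak(&fwk_locked, &expected, true))
		expected = false;

	/* set context for any reentrant calls */
	atomic_store(&fwk_context, me);
}

static void fwk_unlock(void)
{
	/* clear the context before releasing the lock */
	atomic_store(&fwk_context, 0);
	atomic_store(&fwk_locked, false);
}

static int demo_prepare(uintptr_t me, int nested);

/*
 * Locked body. A real .prepare callback might call back into the clk
 * api (e.g. through an i2c transfer); 'nested' simulates that depth.
 */
static int demo_prepare_locked(uintptr_t me, int nested)
{
	return nested > 0 ? 1 + demo_prepare(me, nested - 1) : 1;
}

/* Public entry point: re-enter if the call is from the same context. */
static int demo_prepare(uintptr_t me, int nested)
{
	int ret;

	if (fwk_is_reentrant(me))
		return demo_prepare_locked(me, nested);

	fwk_lock(me);
	ret = demo_prepare_locked(me, nested);
	fwk_unlock();
	return ret;
}
```

A call such as demo_prepare(1, 2) takes the lock once, re-enters twice without deadlocking, and leaves both the lock and the context cleared when it returns.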
Signed-off-by: Mike Turquette <mturquette(a)linaro.org>
Cc: Rajagopal Venkat <rajagopal.venkat(a)linaro.org>
Cc: David Brown <davidb(a)codeaurora.org>
Cc: Ulf Hansson <ulf.hansson(a)linaro.org>
Cc: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
---
drivers/clk/clk.c | 255 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 186 insertions(+), 69 deletions(-)
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 5e8ffff..17432a5 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -19,9 +19,12 @@
#include <linux/of.h>
#include <linux/device.h>
#include <linux/init.h>
+#include <linux/sched.h>
static DEFINE_SPINLOCK(enable_lock);
static DEFINE_MUTEX(prepare_lock);
+static atomic_t prepare_context;
+static atomic_t enable_context;
static HLIST_HEAD(clk_root_list);
static HLIST_HEAD(clk_orphan_list);
@@ -456,27 +459,6 @@ unsigned int __clk_get_prepare_count(struct clk *clk)
return !clk ? 0 : clk->prepare_count;
}
-unsigned long __clk_get_rate(struct clk *clk)
-{
- unsigned long ret;
-
- if (!clk) {
- ret = 0;
- goto out;
- }
-
- ret = clk->rate;
-
- if (clk->flags & CLK_IS_ROOT)
- goto out;
-
- if (!clk->parent)
- ret = 0;
-
-out:
- return ret;
-}
-
unsigned long __clk_get_flags(struct clk *clk)
{
return !clk ? 0 : clk->flags;
@@ -566,6 +548,35 @@ struct clk *__clk_lookup(const char *name)
return NULL;
}
+/*** locking & reentrancy ***/
+
+static void clk_fwk_lock(void)
+{
+ /* hold the framework-wide lock, context == NULL */
+ mutex_lock(&prepare_lock);
+
+ /* set context for any reentrant calls */
+ atomic_set(&prepare_context, (int) get_current());
+}
+
+static void clk_fwk_unlock(void)
+{
+ /* clear the context */
+ atomic_set(&prepare_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
+ mutex_unlock(&prepare_lock);
+}
+
+static bool clk_is_reentrant(void)
+{
+ if (mutex_is_locked(&prepare_lock))
+ if ((void *) atomic_read(&prepare_context) == get_current())
+ return true;
+
+ return false;
+}
+
/*** clk api ***/
void __clk_unprepare(struct clk *clk)
@@ -600,9 +611,15 @@ void __clk_unprepare(struct clk *clk)
*/
void clk_unprepare(struct clk *clk)
{
- mutex_lock(&prepare_lock);
+ /* re-enter if call is from the same context */
+ if (clk_is_reentrant()) {
+ __clk_unprepare(clk);
+ return;
+ }
+
+ clk_fwk_lock();
__clk_unprepare(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
}
EXPORT_SYMBOL_GPL(clk_unprepare);
@@ -648,10 +665,16 @@ int clk_prepare(struct clk *clk)
{
int ret;
- mutex_lock(&prepare_lock);
- ret = __clk_prepare(clk);
- mutex_unlock(&prepare_lock);
+ /* re-enter if call is from the same context */
+ if (clk_is_reentrant()) {
+ ret = __clk_prepare(clk);
+ goto out;
+ }
+ clk_fwk_lock();
+ ret = __clk_prepare(clk);
+ clk_fwk_unlock();
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_prepare);
@@ -692,8 +715,27 @@ void clk_disable(struct clk *clk)
{
unsigned long flags;
+ /* this call re-enters if it is from the same context */
+ if (spin_is_locked(&enable_lock)) {
+ if ((void *) atomic_read(&enable_context) == get_current()) {
+ __clk_disable(clk);
+ return;
+ }
+ }
+
+ /* hold the framework-wide lock, context == NULL */
spin_lock_irqsave(&enable_lock, flags);
+
+ /* set context for any reentrant calls */
+ atomic_set(&enable_context, (int) get_current());
+
+ /* disable the clock(s) */
__clk_disable(clk);
+
+ /* clear the context */
+ atomic_set(&enable_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
spin_unlock_irqrestore(&enable_lock, flags);
}
EXPORT_SYMBOL_GPL(clk_disable);
@@ -745,10 +787,29 @@ int clk_enable(struct clk *clk)
unsigned long flags;
int ret;
+ /* this call re-enters if it is from the same context */
+ if (spin_is_locked(&enable_lock)) {
+ if ((void *) atomic_read(&enable_context) == get_current()) {
+ ret = __clk_enable(clk);
+ goto out;
+ }
+ }
+
+ /* hold the framework-wide lock, context == NULL */
spin_lock_irqsave(&enable_lock, flags);
+
+ /* set context for any reentrant calls */
+ atomic_set(&enable_context, (int) get_current());
+
+ /* enable the clock(s) */
ret = __clk_enable(clk);
- spin_unlock_irqrestore(&enable_lock, flags);
+ /* clear the context */
+ atomic_set(&enable_context, 0);
+
+ /* release the framework-wide lock, context == NULL */
+ spin_unlock_irqrestore(&enable_lock, flags);
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_enable);
@@ -792,10 +853,17 @@ long clk_round_rate(struct clk *clk, unsigned long rate)
{
unsigned long ret;
- mutex_lock(&prepare_lock);
+ /* this call re-enters if it is from the same context */
+ if (clk_is_reentrant()) {
+ ret = __clk_round_rate(clk, rate);
+ goto out;
+ }
+
+ clk_fwk_lock();
ret = __clk_round_rate(clk, rate);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return ret;
}
EXPORT_SYMBOL_GPL(clk_round_rate);
@@ -877,6 +945,30 @@ static void __clk_recalc_rates(struct clk *clk, unsigned long msg)
__clk_recalc_rates(child, msg);
}
+unsigned long __clk_get_rate(struct clk *clk)
+{
+ unsigned long ret;
+
+ if (!clk) {
+ ret = 0;
+ goto out;
+ }
+
+ if (clk->flags & CLK_GET_RATE_NOCACHE)
+ __clk_recalc_rates(clk, 0);
+
+ ret = clk->rate;
+
+ if (clk->flags & CLK_IS_ROOT)
+ goto out;
+
+ if (!clk->parent)
+ ret = 0;
+
+out:
+ return ret;
+}
+
/**
* clk_get_rate - return the rate of clk
* @clk: the clk whose rate is being returned
@@ -889,14 +981,22 @@ unsigned long clk_get_rate(struct clk *clk)
{
unsigned long rate;
- mutex_lock(&prepare_lock);
+ /*
+ * FIXME - any locking here seems heavy weight
+ * can clk->rate be replaced with an atomic_t?
+ * same logic can likely be applied to prepare_count & enable_count
+ */
- if (clk && (clk->flags & CLK_GET_RATE_NOCACHE))
- __clk_recalc_rates(clk, 0);
+ if (clk_is_reentrant()) {
+ rate = __clk_get_rate(clk);
+ goto out;
+ }
+ clk_fwk_lock();
rate = __clk_get_rate(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return rate;
}
EXPORT_SYMBOL_GPL(clk_get_rate);
@@ -1073,6 +1173,39 @@ static void clk_change_rate(struct clk *clk)
clk_change_rate(child);
}
+int __clk_set_rate(struct clk *clk, unsigned long rate)
+{
+ int ret = 0;
+ struct clk *top, *fail_clk;
+
+ /* bail early if nothing to do */
+ if (rate == clk->rate)
+ return 0;
+
+ if ((clk->flags & CLK_SET_RATE_GATE) && clk->prepare_count) {
+ return -EBUSY;
+ }
+
+ /* calculate new rates and get the topmost changed clock */
+ top = clk_calc_new_rates(clk, rate);
+ if (!top)
+ return -EINVAL;
+
+ /* notify that we are about to change rates */
+ fail_clk = clk_propagate_rate_change(top, PRE_RATE_CHANGE);
+ if (fail_clk) {
+ pr_warn("%s: failed to set %s rate\n", __func__,
+ fail_clk->name);
+ clk_propagate_rate_change(top, ABORT_RATE_CHANGE);
+ return -EBUSY;
+ }
+
+ /* change the rates */
+ clk_change_rate(top);
+
+ return ret;
+}
+
/**
* clk_set_rate - specify a new rate for clk
* @clk: the clk whose rate is being changed
@@ -1096,44 +1229,18 @@ static void clk_change_rate(struct clk *clk)
*/
int clk_set_rate(struct clk *clk, unsigned long rate)
{
- struct clk *top, *fail_clk;
int ret = 0;
- /* prevent racing with updates to the clock topology */
- mutex_lock(&prepare_lock);
-
- /* bail early if nothing to do */
- if (rate == clk->rate)
- goto out;
-
- if ((clk->flags & CLK_SET_RATE_GATE) && clk->prepare_count) {
- ret = -EBUSY;
- goto out;
- }
-
- /* calculate new rates and get the topmost changed clock */
- top = clk_calc_new_rates(clk, rate);
- if (!top) {
- ret = -EINVAL;
- goto out;
- }
-
- /* notify that we are about to change rates */
- fail_clk = clk_propagate_rate_change(top, PRE_RATE_CHANGE);
- if (fail_clk) {
- pr_warn("%s: failed to set %s rate\n", __func__,
- fail_clk->name);
- clk_propagate_rate_change(top, ABORT_RATE_CHANGE);
- ret = -EBUSY;
+ if (clk_is_reentrant()) {
+ ret = __clk_set_rate(clk, rate);
goto out;
}
- /* change the rates */
- clk_change_rate(top);
+ clk_fwk_lock();
+ ret = __clk_set_rate(clk, rate);
+ clk_fwk_unlock();
out:
- mutex_unlock(&prepare_lock);
-
return ret;
}
EXPORT_SYMBOL_GPL(clk_set_rate);
@@ -1148,10 +1255,16 @@ struct clk *clk_get_parent(struct clk *clk)
{
struct clk *parent;
- mutex_lock(&prepare_lock);
+ if (clk_is_reentrant()) {
+ parent = __clk_get_parent(clk);
+ goto out;
+ }
+
+ clk_fwk_lock();
parent = __clk_get_parent(clk);
- mutex_unlock(&prepare_lock);
+ clk_fwk_unlock();
+out:
return parent;
}
EXPORT_SYMBOL_GPL(clk_get_parent);
@@ -1330,6 +1443,7 @@ out:
int clk_set_parent(struct clk *clk, struct clk *parent)
{
int ret = 0;
+ bool reenter;
if (!clk || !clk->ops)
return -EINVAL;
@@ -1337,8 +1451,10 @@ int clk_set_parent(struct clk *clk, struct clk *parent)
if (!clk->ops->set_parent)
return -ENOSYS;
- /* prevent racing with updates to the clock topology */
- mutex_lock(&prepare_lock);
+ reenter = clk_is_reentrant();
+
+ if (!reenter)
+ clk_fwk_lock();
if (clk->parent == parent)
goto out;
@@ -1367,7 +1483,8 @@ int clk_set_parent(struct clk *clk, struct clk *parent)
__clk_reparent(clk, parent);
out:
- mutex_unlock(&prepare_lock);
+ if (!reenter)
+ clk_fwk_unlock();
return ret;
}
--
1.7.10.4
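The re-entrancy pattern used throughout the patch above (check whether the calling context already owns the framework lock, and only take the lock if it does not) can be sketched in userspace roughly as follows. This is a minimal illustration, not the kernel implementation: fwk_lock, fwk_owner, fwk_acquire(), fwk_release(), get_rate() and set_rate() are hypothetical names, and a C11 atomic_flag spinlock stands in for the kernel's locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names come from the patch. */
static atomic_flag fwk_lock = ATOMIC_FLAG_INIT; /* framework-wide lock */
static _Atomic(void *) fwk_owner;               /* context holding the lock, or NULL */
static _Thread_local char self;                 /* its address identifies this thread */

static unsigned long cur_rate = 1000;

/* True if the calling thread is the one that currently holds the lock. */
static bool fwk_is_reentrant(void)
{
	return atomic_load(&fwk_owner) == (void *)&self;
}

static void fwk_acquire(void)
{
	while (atomic_flag_test_and_set(&fwk_lock))
		; /* spin until the lock is free */
	atomic_store(&fwk_owner, (void *)&self);
}

static void fwk_release(void)
{
	atomic_store(&fwk_owner, NULL);
	atomic_flag_clear(&fwk_lock);
}

/* Public getter: takes the lock unless this context already owns it. */
unsigned long get_rate(void)
{
	unsigned long rate;

	if (fwk_is_reentrant())
		return cur_rate;

	fwk_acquire();
	rate = cur_rate;
	fwk_release();
	return rate;
}

/* Public setter: runs a notifier under the lock; the notifier may call get_rate(). */
unsigned long set_rate(unsigned long rate, unsigned long (*notify)(void))
{
	unsigned long seen;

	assert(notify != NULL);
	fwk_acquire();
	cur_rate = rate;
	seen = notify(); /* re-enters the API from the owning context */
	fwk_release();
	return seen;
}
```

Calling set_rate(2000, get_rate) returns without deadlocking because get_rate() sees that the caller already owns the lock and skips the acquire; without the ownership check the nested call would spin forever.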
Guenter and Anton,
Only one minor update as Guenter mentioned for v6, thank you.
v6 -> v7 changes:
- move exporting symbols from [5/5] to [4/5], which was a mistake.
v5 -> v6 changes:
- add depend on AB8500_BM in Kconfig
- fix wrong usage of clamp_val()
- export symbols for module compiling
v4 -> v5 changes:
- split the old [2/3]-ab8500-re-arrange-ab8500-power-and-temperature-data into
new three [2/5], [3/5] and [4/5] patches.
- hwmon driver minor coding style clean ups:
- {} usage in if-else statement in ab8500_read_sensor function
- index error fix in gpadc_monitor function
- fix issue of clamp_val() usage
- remove unnecessary else in function abx500_attrs_visible
- remove redundant print message about irq set up
- return the calling function return value directly in probe function
v3 -> v4 changes:
for patch [3/3]
- define delays in HZ
- update ab8500_read_sensor function, returning temp by parameter
- remove ab8500_is_visible function
- use clamp_val in set_min and set_max callback
- remove unnecessary locks in remove and suspend functions
- let abx500 and ab8500 use its own data structure
for patch [2/3]
- move the data tables from driver/power/ab8500_bmdata.c to
include/linux/power/ab8500.h
- rename driver/power/ab8500_bmdata.c to driver/power/ab8500_bm.c
- rename these variable names to eliminate CamelCase warnings
- add const attribute to these data
v2 -> v3 changes:
- Add interface for converting voltage to temperature
- Remove temp5 sensor since we cannot offer temperature read interface of it
- Update hyst to use absolute temperature instead of a difference
- Add the 3/3 patch
v1 -> v2 changes:
- Add Documentation/hwmon/ab8500 and Documentation/hwmon/abx500
- Make devices which cannot report milli-Celsius invisible
- Add temp5_crit interface
- Re-work the old find_active_thresholds() to threshold_updated()
- Reset updated_min_alarm and updated_max_alarm at the end of each loop
- Update the hyst mechanism to make it work as real hyst
- Remove non-standard attributes
- Re-order the operations sequence inside probe and remove functions
- Update all the lock usages to eliminate race conditions
- Make attribute indexes start from 0
also changes:
- The old [1/2] "ARM: ux500: rename ab8500 to abx500 for hwmon driver"
has been merged by Samuel, so it won't be sent again.
- Add another new patch "ab8500_btemp: export two symbols" as [2/2] of this
patch set.
Hongbo Zhang (5):
ab8500_btemp: make ab8500_btemp_get* interfaces public
ab8500: power: eliminate CamelCase warning of some variables
ab8500: power: add const attributes to some data arrays
ab8500: power: export abx500_res_to_temp tables for hwmon
hwmon: add ST-Ericsson ABX500 hwmon driver
Documentation/hwmon/ab8500 | 22 ++
Documentation/hwmon/abx500 | 28 ++
drivers/hwmon/Kconfig | 13 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/ab8500.c | 206 +++++++++++++++
drivers/hwmon/abx500.c | 491 +++++++++++++++++++++++++++++++++++
drivers/hwmon/abx500.h | 69 +++++
drivers/power/ab8500_bmdata.c | 42 +--
drivers/power/ab8500_btemp.c | 5 +-
drivers/power/ab8500_fg.c | 4 +-
include/linux/mfd/abx500.h | 6 +-
include/linux/mfd/abx500/ab8500-bm.h | 5 +
include/linux/power/ab8500.h | 16 ++
13 files changed, 885 insertions(+), 23 deletions(-)
create mode 100644 Documentation/hwmon/ab8500
create mode 100644 Documentation/hwmon/abx500
create mode 100644 drivers/hwmon/ab8500.c
create mode 100644 drivers/hwmon/abx500.c
create mode 100644 drivers/hwmon/abx500.h
create mode 100644 include/linux/power/ab8500.h
--
1.8.0
It is not possible for init() to be called for any cpu other than cpu0. During
bootup, whichever cpu is used to boot the system is assigned as cpu0. Later on,
policy->cpu can only change if we hot-unplug all cpus first and then hotplug
them back in a different order, which isn't possible (the system requires at
least one cpu to be up always :)).
There is one situation, though, where policy->cpu can be different from zero:
- Hot-unplug cpu 0.
- rmmod cpufreq-cpu0 module
- insmod it back
- hotplug cpu 0 again.
Here, policy->cpu would be different. But the driver doesn't have any dependency
on cpu0 as such. We don't mind which cpu of a system is policy->cpu and so this
check is just not required.
Remove it.
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
drivers/cpufreq/cpufreq-cpu0.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/cpufreq/cpufreq-cpu0.c b/drivers/cpufreq/cpufreq-cpu0.c
index 0f16267..1cab820 100644
--- a/drivers/cpufreq/cpufreq-cpu0.c
+++ b/drivers/cpufreq/cpufreq-cpu0.c
@@ -124,9 +124,6 @@ static int cpu0_cpufreq_init(struct cpufreq_policy *policy)
{
int ret;
- if (policy->cpu != 0)
- return -EINVAL;
-
ret = cpufreq_frequency_table_cpuinfo(policy, freq_table);
if (ret) {
pr_err("invalid frequency table: %d\n", ret);
--
1.7.12.rc2.18.g61b472e
This patchset was called: "Create sched_select_cpu() and use it for workqueues"
for the first three versions.
Earlier discussions over v3, v2 and v1 can be found here:
https://lkml.org/lkml/2013/3/18/364
http://lists.linaro.org/pipermail/linaro-dev/2012-November/014344.html
http://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html
For power saving it is better to schedule work on cpus that aren't idle, as
bringing a cpu/cluster out of an idle state can be very costly (both
performance-wise and power-wise). Earlier we tried to use the timer
infrastructure to make this decision, but we found out later that the scheduler
gives even better results, and so we should use the scheduler to choose the cpu
on which work is scheduled.
In the workqueue subsystem, workqueues with the WQ_UNBOUND flag are the ones
which use the scheduler to select the target cpu.
Here we are migrating a few users of workqueues to WQ_UNBOUND. These drivers are
found to be very active on an idle or lightly busy system, and using WQ_UNBOUND
for these gave impressive results.
Setup:
-----
- ARM Vexpress TC2 - big.LITTLE CPU
- Core 0-1: A15, 2-4: A7
- rootfs: linaro-ubuntu-devel
This patchset has been tested on a big.LITTLE (heterogeneous) system but is
useful for homogeneous systems as well. During these tests audio was
played in the background using aplay.
Results:
-------
Energy in Joules:

                          Cluster A15   Cluster A7    Total
                          -----------   ----------    ---------
Without this patchset:
                          0.151162      2.183545      2.334707
                          0.223730      2.687067      2.910797
                          0.289687      2.732702      3.022389
                          0.454198      2.745908      3.200106
                          0.495552      2.746465      3.242017
                 Average: 0.322866      2.619137      2.942003

With this patchset:
                          0.226421      2.283658      2.510079
                          0.151361      2.236656      2.388017
                          0.197726      2.249849      2.447575
                          0.221915      2.229446      2.451361
                          0.347098      2.257707      2.604805
                 Average: 0.2289042     2.2514632     2.4803674
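
As a quick check of the averages above, the following sketch recomputes the mean total energy for both sets of runs and the relative saving, which comes out at about 15.7 percent of the baseline. The helper names mean() and saving_percent() are made up for this illustration; only the figures are taken from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Total energy in Joules per run, copied from the results table. */
static const double total_without[] = {
	2.334707, 2.910797, 3.022389, 3.200106, 3.242017
};
static const double total_with[] = {
	2.510079, 2.388017, 2.447575, 2.451361, 2.604805
};

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Saving of the patched runs, in percent of the baseline mean. */
double saving_percent(double baseline, double patched)
{
	return 100.0 * (baseline - patched) / baseline;
}
```

Calling saving_percent(mean(total_without, 5), mean(total_with, 5)) gives the overall figure. Note that the saving comes from the A7 cluster and from the reduced spread between runs; five samples per configuration is a small set, so the percentage should be read as indicative only.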
The above tests were repeated multiple times, and events were tracked using
trace-cmd and analysed using kernelshark. It was easily noticeable that idle
time for many cpus increased considerably, which eventually saved some power.
PS: All the earlier Acks we got for drivers are dropped here as the patches have
been updated significantly.
V3->V4:
-------
- Dropped changes to kernel/sched directory and hence
sched_select_non_idle_cpu().
- Dropped queue_work_on_any_cpu()
- Created system_freezable_unbound_wq
- Changed all patches accordingly.
V2->V3:
-------
- Dropped changes into core queue_work() API, rather create *_on_any_cpu()
APIs
- Dropped running timers migration patch as that was broken
- Migrated few users of workqueues to use *_on_any_cpu() APIs.
Viresh Kumar (4):
workqueue: Add system wide system_freezable_unbound_wq
PHYLIB: queue work on unbound wq
block: queue work on unbound wq
fbcon: queue work on unbound wq
block/blk-core.c | 3 ++-
block/blk-ioc.c | 2 +-
block/genhd.c | 10 ++++++----
drivers/net/phy/phy.c | 9 +++++----
drivers/video/console/fbcon.c | 2 +-
include/linux/workqueue.h | 4 ++++
kernel/workqueue.c | 7 ++++++-
7 files changed, 25 insertions(+), 12 deletions(-)
--
1.7.12.rc2.18.g61b472e
Hi Guys,
All patches are pushed here for others to apply (you can apply from mail too):
http://git.linaro.org/gitweb?p=people/vireshk/linux.git;a=shortlog;h=refs/h…
Currently, there can't be multiple instances of a single governor_type. If we
have a multi-package system, where we have multiple instances of struct policy
(per package), we can't have multiple instances of the same governor, i.e. we
can't have multiple instances of the ondemand governor for multiple packages.
The governor's directory in sysfs is created at
/sys/devices/system/cpu/cpufreq/governor-name/, which again reflects that there
can be only one instance of a governor_type in the system.
This is a bottleneck for multicluster systems, where we want different packages
to use the same governor type, but with different tunables.
This patchset aims to fix this issue. Now we will create the governor's
directory in cpu/cpu*/cpufreq/<gov> for platforms which have multiple
struct policy alive at any moment. Platform drivers requiring this feature must
set the have_governor_per_policy variable in their instance of cpufreq_driver.
For others the interface is kept the same: cpu/cpufreq/<gov>.
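The resulting choice of sysfs location can be sketched in userspace as below. This only illustrates the two layouts described above; governor_dir() is a hypothetical helper, not a function from the patchset, and the /sys/devices/system/ prefix is assumed from the path quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the governor's sysfs directory for one cpu.
 * per_policy plays the role of the driver's have_governor_per_policy flag.
 * Returns the length snprintf() reports for the full path.
 */
int governor_dir(char *buf, size_t len, bool per_policy,
		 unsigned int cpu, const char *gov)
{
	assert(buf != NULL && gov != NULL);

	if (per_policy)
		return snprintf(buf, len,
				"/sys/devices/system/cpu/cpu%u/cpufreq/%s",
				cpu, gov);

	/* Single global instance: the cpu number is irrelevant. */
	return snprintf(buf, len, "/sys/devices/system/cpu/cpufreq/%s", gov);
}
```

With per_policy set, cpu 2 and governor "ondemand" yield /sys/devices/system/cpu/cpu2/cpufreq/ondemand; without it, every cpu maps to /sys/devices/system/cpu/cpufreq/ondemand, which is why tunables could not differ per package before.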
This is V4 of this patchset. V3 is already applied by Rafael in his linux-next
branch. Jacob Shin reported some regressions with this patchset, and when I
tested it with his configuration I found more issues than he had
reported.
To test these over linux-next you need to revert following first:
db9baec cpufreq: Get rid of "struct global_attr"
86bd6f0 cpufreq: governor: Implement per policy instances of governors
8ae67b1 cpufreq: Add per policy governor-init/exit infrastructure
I have now tested this for the following cases and believe there are no more
regressions with it:
- platform with a single policy instance or single group of cpu
- platform with multiple policies but which don't want per policy instance of
governor
- platform with multiple policies and which want per policy instance of governor
I have tried with different settings and combinations of governors.
@Rafael: To simplify your life I have sorted out your branch and you can simply
pick up the complete branch that I have pushed.
V3->V4:
- We have two instances of all show/store routines for ondemand/conservative
governor. One for per-policy instance of governor and other for one governor
instance for all policies.
- Dropped: db9baec cpufreq: Get rid of "struct global_attr".
- Fixed cpufreq_governor_dbs for multiple policies using same governor instance.
- Implemented a few macros in cpufreq_governor.h to keep the above clean.
- Renamed have_multiple_policies to have_governor_per_policy
- Some more minor cleanups
Viresh Kumar (2):
cpufreq: Add per policy governor-init/exit infrastructure
cpufreq: governor: Implement per policy instances of governors
drivers/cpufreq/cpufreq.c | 36 ++++-
drivers/cpufreq/cpufreq_conservative.c | 193 ++++++++++++++----------
drivers/cpufreq/cpufreq_governor.c | 212 +++++++++++++++++---------
drivers/cpufreq/cpufreq_governor.h | 117 +++++++++++++--
drivers/cpufreq/cpufreq_ondemand.c | 263 ++++++++++++++++++++-------------
include/linux/cpufreq.h | 17 ++-
6 files changed, 562 insertions(+), 276 deletions(-)
--
1.7.12.rc2.18.g61b472e
At a few places in the documentation, cpufreq_frequency_table is written as
cpufreq_freq_table. Fix these.
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
Documentation/cpu-freq/cpu-drivers.txt | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.txt
index c94383f..a3585ea 100644
--- a/Documentation/cpu-freq/cpu-drivers.txt
+++ b/Documentation/cpu-freq/cpu-drivers.txt
@@ -185,10 +185,10 @@ the reference implementation in drivers/cpufreq/longrun.c
As most cpufreq processors only allow for being set to a few specific
frequencies, a "frequency table" with some functions might assist in
some work of the processor driver. Such a "frequency table" consists
-of an array of struct cpufreq_freq_table entries, with any value in
+of an array of struct cpufreq_frequency_table entries, with any value in
"index" you want to use, and the corresponding frequency in
"frequency". At the end of the table, you need to add a
-cpufreq_freq_table entry with frequency set to CPUFREQ_TABLE_END. And
+cpufreq_frequency_table entry with frequency set to CPUFREQ_TABLE_END. And
if you want to skip one entry in the table, set the frequency to
CPUFREQ_ENTRY_INVALID. The entries don't need to be in ascending
order.
--
1.7.12.rc2.18.g61b472e
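For readers unfamiliar with the table format described in the documentation hunk above, here is a minimal userspace sketch of such a table and a walk over it. The struct layout and the two marker values are local stand-ins chosen for this example (assumptions, not copied from the kernel headers), and table_max_khz() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel markers; the values are assumptions. */
#define CPUFREQ_ENTRY_INVALID	(~0u)
#define CPUFREQ_TABLE_END	(~1u)

/* Stand-in for the kernel structure described in the documentation. */
struct cpufreq_frequency_table {
	unsigned int index;	/* any value the driver wants to use */
	unsigned int frequency;	/* frequency in kHz, or one of the markers */
};

/* Entries need not be in ascending order; entry 1 is skipped. */
static const struct cpufreq_frequency_table example_table[] = {
	{ 0, 800000 },
	{ 1, CPUFREQ_ENTRY_INVALID },
	{ 2, 1200000 },
	{ 3, 400000 },
	{ 0, CPUFREQ_TABLE_END },
};

/* Hypothetical helper: highest valid frequency in the table, 0 if none. */
unsigned int table_max_khz(const struct cpufreq_frequency_table *t)
{
	unsigned int max = 0;

	assert(t != NULL);
	for (; t->frequency != CPUFREQ_TABLE_END; t++) {
		if (t->frequency == CPUFREQ_ENTRY_INVALID)
			continue;	/* skipped entry */
		if (t->frequency > max)
			max = t->frequency;
	}
	return max;
}
```

The terminating CPUFREQ_TABLE_END entry is what lets a walker like this stop without knowing the array length, which is why the documentation insists on it.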