On systems requiring isolation of cores (HPC, real time, networking, etc.), we need to migrate all unbound background kernel activities away from the isolated cores. After creating cpusets, you can write 1 or 0 to the cpuset.quiesce file.

In our case, we are working on a networking machine that wants to run time-critical data plane threads on some CPUs, i.e. a single thread per CPU. These CPUs shouldn't be interrupted at all by background kernel activities like timers/hrtimers/workqueues/etc.

Writing '1' to this file migrates unbound/unpinned timers and hrtimers away from the CPUs of the cpuset in question. It also disallows addition of any new unpinned timers & hrtimers to the isolated CPUs.

Writing '0' disables isolation of the CPUs in the current cpuset; unpinned timers/hrtimers will be allowed on these CPUs again.
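For illustration, toggling the file from userspace could look like this (just a sketch; the /dev/cpuset mount point and the 'dplane' cpuset name are assumptions and will vary per setup):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Assumed path: cpusets mounted at /dev/cpuset, cpuset 'dplane' */
            int fd = open("/dev/cpuset/dplane/cpuset.quiesce", O_WRONLY);

            if (fd < 0) {
                    perror("open cpuset.quiesce");
                    return 1;
            }

            /* "1": migrate unpinned timers/hrtimers away; "0": undo */
            if (write(fd, "1", 1) != 1)
                    perror("write");

            close(fd);
            return 0;
    }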
This patchset allows us to do this. The first few patches build basic infrastructure in timers and hrtimers, and the final patches use it from cpusets. It also updates get_nohz_timer_target() to stop adding new timers/hrtimers to these isolated CPUs.
V1: https://lkml.org/lkml/2014/3/20/319
Not many comments received.
Based on some other timers/hrtimers cleanup I did: http://comments.gmane.org/gmane.linux.kernel/1677797
Available here: git://git.linaro.org/people/viresh.kumar/linux.git isolate-cpusets
V1->V2:
- Add support to migrate hrtimers as well (V1 only had timers)
- cpuset.quiesce now supports writing 0 and reading as well
- Update get_nohz_timer_target() to stop adding new timers/hrtimers to these isolated CPUs
- Minor fixups that I noticed
Known issues:

1. Patch "timer: track pinned timers with TIMER_PINNED flag": the kbuild test system reported the following (I don't know how to fix this yet):
config: make ARCH=blackfin allyesconfig

Note: the vireshk/timer-cleanup-for-tglx HEAD ea63467ac9150cd86f4d960887116f99a2803b56 builds fine. It only hurts bisectability.

All error/warnings:

kernel/timer.c: In function 'init_timers':
>> kernel/timer.c:1683:2: error: call to '__compiletime_assert_1683'
>> declared with attribute error: BUILD_BUG_ON failed:
>> __alignof__(struct tvec_base) & TIMER_FLAG_MASK

2. Patch "timer: don't migrate pinned timers": the kbuild system reported the following (not really a problem created by this patch; it just highlights an existing bug, as pinned timers must be removed by their owners before the CPU goes down):
smpboot: CPU 1 is now offline
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1935 at kernel/timer.c:1621 migrate_timer_list+0xd6/0xf0()
migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, deactivating it
Modules linked in:
Viresh Kumar (8):
  timer: track pinned timers with TIMER_PINNED flag
  timer: don't migrate pinned timers
  timer: create timer_quiesce_cpu() to isolate CPU from timers
  hrtimer: update timer->state with 'pinned' information
  hrtimer: don't migrate pinned timers
  hrtimer: create hrtimer_quiesce_cpu() to isolate CPU from hrtimers
  cpuset: Create sysfs file: cpusets.quiesce to isolate CPUs
  sched: don't queue timers on quiesced CPUs

 Documentation/cgroups/cpusets.txt | 19 +++++++-
 include/linux/cpuset.h            |  8 ++++
 include/linux/hrtimer.h           |  6 +++
 include/linux/timer.h             | 13 ++++--
 kernel/cpuset.c                   | 76 ++++++++++++++++++++++++++++++++
 kernel/hrtimer.c                  | 69 ++++++++++++++++++++++++-----
 kernel/sched/core.c               |  9 ++--
 kernel/timer.c                    | 91 +++++++++++++++++++++++++++++--------
 8 files changed, 253 insertions(+), 38 deletions(-)
In order to quiesce a CPU on which isolation might be required, we need to move away all the timers queued on that CPU. There are two types of timers queued on any CPU: those pinned to that CPU, and those that can run on any CPU but happen to be queued on the CPU in question. Only the second type must be migrated away from a CPU entering the quiesce state.

For this we need some basic infrastructure in the timer core to identify which timers are pinned and which are not.

Hence, this patch adds another flag bit, TIMER_PINNED, which is set only for timers that are pinned to a CPU.
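The mechanics are plain pointer tagging. As a standalone illustration (userspace C, not the kernel code; the struct name is reused only for familiarity):

    #include <stdint.h>
    #include <stdio.h>

    #define TIMER_PINNED    0x4UL
    #define TIMER_FLAG_MASK 0x7UL

    /* Stand-in for the kernel's tvec_base; 8-byte alignment frees 3 low bits */
    struct tvec_base {
            long dummy;
    } __attribute__((aligned(8)));

    int main(void)
    {
            static struct tvec_base base;

            /* Set a flag in the (guaranteed zero) low bits of the pointer */
            uintptr_t tagged = (uintptr_t)&base | TIMER_PINNED;

            printf("pinned?  %s\n", (tagged & TIMER_PINNED) ? "yes" : "no");
            printf("base ok? %s\n",
                   (struct tvec_base *)(tagged & ~TIMER_FLAG_MASK) == &base ?
                   "yes" : "no");
            return 0;
    }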
It also removes the 'pinned' parameter of __mod_timer() as it is no longer required.
NOTE: one functional change worth mentioning:

Existing behavior: add_timer_on() followed by multiple mod_timer() calls wouldn't pin the timer on the CPU mentioned in add_timer_on().

New behavior: add_timer_on() followed by multiple mod_timer() calls would pin the timer on the CPU running mod_timer(). See the sketch below.

I haven't given much attention to this, as we should be calling mod_timer_pinned() for timers queued with add_timer_on(). Though, if required, we can simply clear the TIMER_PINNED flag in mod_timer().
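To spell the change out (a sketch; 'my_timer' and the CPU numbers are illustrative):

    /* Queue on CPU 2; with this patch, add_timer_on() also sets TIMER_PINNED */
    add_timer_on(&my_timer, 2);

    /*
     * Later, from CPU 0:
     * - old behavior: __mod_timer(..., TIMER_NOT_PINNED) could move the timer
     *   to whichever CPU get_nohz_timer_target() picked;
     * - new behavior: TIMER_PINNED is still set, so the timer is queued on
     *   the CPU running mod_timer(), i.e. CPU 0.
     */
    mod_timer(&my_timer, jiffies + HZ);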
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/timer.h | 10 ++++++----
 kernel/timer.c        | 27 ++++++++++++++++++++-------
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..2962403 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -49,7 +49,7 @@ extern struct tvec_base boot_tvec_bases;
 #endif

 /*
- * Note that all tvec_bases are at least 4 byte aligned and lower two bits
+ * Note that all tvec_bases are at least 8 byte aligned and lower three bits
  * of base in timer_list is guaranteed to be zero. Use them for flags.
  *
  * A deferrable timer will work normally when the system is busy, but
@@ -61,14 +61,18 @@ extern struct tvec_base boot_tvec_bases;
  * the completion of the running instance from IRQ handlers, for example,
  * by calling del_timer_sync().
  *
+ * A pinned timer is allowed to run only on the cpu mentioned and shouldn't be
+ * migrated to any other CPU.
+ *
  * Note: The irq disabled callback execution is a special case for
  * workqueue locking issues. It's not meant for executing random crap
  * with interrupts disabled. Abuse is monitored!
  */
 #define TIMER_DEFERRABLE	0x1LU
 #define TIMER_IRQSAFE		0x2LU
+#define TIMER_PINNED		0x4LU

-#define TIMER_FLAG_MASK		0x3LU
+#define TIMER_FLAG_MASK		0x7LU

 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
		.entry = { .prev = TIMER_ENTRY_STATIC },	\
@@ -179,8 +183,6 @@ extern int mod_timer_pinned(struct timer_list *timer, unsigned long expires);

 extern void set_timer_slack(struct timer_list *time, int slack_hz);

-#define TIMER_NOT_PINNED	0
-#define TIMER_PINNED		1
 /*
  * The jiffies value which is added to now, when there is no timer
  * in the timer wheel:
diff --git a/kernel/timer.c b/kernel/timer.c
index d13eb56..e8bcaff 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -104,6 +104,11 @@ static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
 	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
 }

+static inline unsigned int tbase_get_pinned(struct tvec_base *base)
+{
+	return ((unsigned int)(unsigned long)base & TIMER_PINNED);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
 	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
@@ -117,6 +122,13 @@ timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
 	timer->base = (struct tvec_base *)((unsigned long)(new_base) | flags);
 }

+static inline void
+timer_set_flags(struct timer_list *timer, unsigned int flags)
+{
+	timer->base = (struct tvec_base *)((unsigned long)(timer->base) |
+			flags);
+}
+
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
		bool force_up)
 {
@@ -742,8 +754,7 @@ static struct tvec_base *lock_timer_base(struct timer_list *timer,
 }

 static inline int
-__mod_timer(struct timer_list *timer, unsigned long expires,
-				bool pending_only, int pinned)
+__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
@@ -760,7 +771,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,

 	debug_activate(timer, expires);

-	cpu = get_nohz_timer_target(pinned);
+	cpu = get_nohz_timer_target(tbase_get_pinned(timer->base));
 	new_base = per_cpu(tvec_bases, cpu);

 	if (base != new_base) {
@@ -802,7 +813,7 @@ out_unlock:
  */
 int mod_timer_pending(struct timer_list *timer, unsigned long expires)
 {
-	return __mod_timer(timer, expires, true, TIMER_NOT_PINNED);
+	return __mod_timer(timer, expires, true);
 }
 EXPORT_SYMBOL(mod_timer_pending);

@@ -877,7 +888,7 @@ int mod_timer(struct timer_list *timer, unsigned long expires)
 	if (timer_pending(timer) && timer->expires == expires)
 		return 1;

-	return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
+	return __mod_timer(timer, expires, false);
 }
 EXPORT_SYMBOL(mod_timer);

@@ -905,7 +916,8 @@ int mod_timer_pinned(struct timer_list *timer, unsigned long expires)
 	if (timer->expires == expires && timer_pending(timer))
 		return 1;

-	return __mod_timer(timer, expires, false, TIMER_PINNED);
+	timer_set_flags(timer, TIMER_PINNED);
+	return __mod_timer(timer, expires, false);
 }
 EXPORT_SYMBOL(mod_timer_pinned);
@@ -944,6 +956,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
+	timer_set_flags(timer, TIMER_PINNED);
 	spin_lock_irqsave(&base->lock, flags);
 	timer_set_base(timer, base);
 	debug_activate(timer, timer->expires);
@@ -1493,7 +1506,7 @@ signed long __sched schedule_timeout(signed long timeout)
 	expire = timeout + jiffies;

 	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
-	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
+	__mod_timer(&timer, expire, false);
 	schedule();
 	del_singleshot_timer_sync(&timer);
migrate_timers() is called when a CPU goes down and its timers must be migrated to some other CPU. It's the responsibility of a timer's users to remove it before control reaches migrate_timers().

For pinned timers still queued at that point, the best we can do is skip migrating them, deactivate them, and warn the user.
That's all this patch does.
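For reference, the pattern expected from owners of pinned timers, sketched as a hypothetical hotplug notifier (all names here are made up):

    static DEFINE_PER_CPU(struct timer_list, my_pinned_timer);

    /* Owners must remove their pinned timers before the CPU goes away */
    static int my_cpu_notify(struct notifier_block *nb, unsigned long action,
                             void *hcpu)
    {
            int cpu = (long)hcpu;

            if (action == CPU_DOWN_PREPARE)
                    del_timer_sync(&per_cpu(my_pinned_timer, cpu));

            return NOTIFY_OK;
    }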
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/timer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/timer.c b/kernel/timer.c
index e8bcaff..6c3a371 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1606,11 +1606,21 @@ static int init_timers_cpu(int cpu)
 static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head)
 {
 	struct timer_list *timer;
+	int is_pinned;

 	while (!list_empty(head)) {
 		timer = list_first_entry(head, struct timer_list, entry);
 		/* We ignore the accounting on the dying cpu */
 		detach_timer(timer, false);
+
+		is_pinned = tbase_get_pinned(timer->base);
+
+		/* Check if CPU still has pinned timers */
+		if (unlikely(WARN(is_pinned,
+		    "%s: can't migrate pinned timer: %p, deactivating it\n",
+		    __func__, timer)))
+			continue;
+
 		timer_set_base(timer, new_base);
 		internal_add_timer(new_base, timer);
 	}
To isolate CPUs from timers via sysfs using cpusets, we need some support from the timer core: a routine, timer_quiesce_cpu(), which migrates away all unpinned timers but doesn't touch the pinned ones.
This patch creates this routine.
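For context, a later patch in this series ends up calling it from cpusets via smp_call_function_any(), roughly like this (condensed from the cpuset patch below):

    int from_cpu;
    cpumask_t cpumask;      /* online, non-isolated CPUs to migrate to */

    /*
     * timer_quiesce_cpu() pulls the unpinned timers of 'from_cpu' onto
     * whichever CPU from 'cpumask' it gets to run on.
     */
    for_each_cpu(from_cpu, cs->cpus_allowed)
            smp_call_function_any(&cpumask, timer_quiesce_cpu, &from_cpu, 1);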
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/timer.h |  3 +++
 kernel/timer.c        | 54 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 2962403..1588a4f 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -196,6 +196,9 @@ extern void set_timer_slack(struct timer_list *time, int slack_hz);
  */
 extern unsigned long get_next_timer_interrupt(unsigned long now);

+/* To be used from cpusets, only */
+extern void timer_quiesce_cpu(void *cpup);
+
 /*
  * Timer-statistics info:
  */
diff --git a/kernel/timer.c b/kernel/timer.c
index 6c3a371..4676a07 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1602,18 +1602,27 @@ static int init_timers_cpu(int cpu)
 	return 0;
 }

-#ifdef CONFIG_HOTPLUG_CPU
-static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head)
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_CPUSETS)
+static void migrate_timer_list(struct tvec_base *new_base,
+			       struct list_head *head, bool remove_pinned)
 {
 	struct timer_list *timer;
+	struct list_head pinned_list;
 	int is_pinned;

+	INIT_LIST_HEAD(&pinned_list);
+
 	while (!list_empty(head)) {
 		timer = list_first_entry(head, struct timer_list, entry);
-		/* We ignore the accounting on the dying cpu */
-		detach_timer(timer, false);

 		is_pinned = tbase_get_pinned(timer->base);
+		if (!remove_pinned && is_pinned) {
+			list_move_tail(&timer->entry, &pinned_list);
+			continue;
+		} else {
+			/* We ignore the accounting on the dying cpu */
+			detach_timer(timer, false);
+		}

 		/* Check if CPU still has pinned timers */
 		if (unlikely(WARN(is_pinned,
@@ -1624,15 +1633,18 @@ static void migrate_timer_list(struct tvec_base *new_base, struct list_head *hea
 		timer_set_base(timer, new_base);
 		internal_add_timer(new_base, timer);
 	}
+
+	if (!list_empty(&pinned_list))
+		list_splice_tail(&pinned_list, head);
 }

-static void migrate_timers(int cpu)
+/* Migrate timers from 'cpu' to this_cpu */
+static void __migrate_timers(int cpu, bool remove_pinned)
 {
 	struct tvec_base *old_base;
 	struct tvec_base *new_base;
 	int i;

-	BUG_ON(cpu_online(cpu));
 	old_base = per_cpu(tvec_bases, cpu);
 	new_base = get_cpu_var(tvec_bases);
 	/*
@@ -1645,20 +1657,40 @@ static void migrate_timers(int cpu)
 	BUG_ON(old_base->running_timer);

 	for (i = 0; i < TVR_SIZE; i++)
-		migrate_timer_list(new_base, old_base->tv1.vec + i);
+		migrate_timer_list(new_base, old_base->tv1.vec + i,
+				   remove_pinned);
 	for (i = 0; i < TVN_SIZE; i++) {
-		migrate_timer_list(new_base, old_base->tv2.vec + i);
-		migrate_timer_list(new_base, old_base->tv3.vec + i);
-		migrate_timer_list(new_base, old_base->tv4.vec + i);
-		migrate_timer_list(new_base, old_base->tv5.vec + i);
+		migrate_timer_list(new_base, old_base->tv2.vec + i,
+				   remove_pinned);
+		migrate_timer_list(new_base, old_base->tv3.vec + i,
+				   remove_pinned);
+		migrate_timer_list(new_base, old_base->tv4.vec + i,
+				   remove_pinned);
+		migrate_timer_list(new_base, old_base->tv5.vec + i,
+				   remove_pinned);
 	}

 	spin_unlock(&old_base->lock);
 	spin_unlock_irq(&new_base->lock);
 	put_cpu_var(tvec_bases);
 }
+#endif /* CONFIG_HOTPLUG_CPU || CONFIG_CPUSETS */
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void migrate_timers(int cpu)
+{
+	BUG_ON(cpu_online(cpu));
+	__migrate_timers(cpu, true);
+}
 #endif /* CONFIG_HOTPLUG_CPU */

+#ifdef CONFIG_CPUSETS
+void timer_quiesce_cpu(void *cpup)
+{
+	__migrate_timers(*(int *)cpup, false);
+}
+#endif /* CONFIG_CPUSETS */
+
 static int timer_cpu_notify(struct notifier_block *self,
			    unsigned long action, void *hcpu)
 {
'Pinned' information is now required in migrate_hrtimers(), as we can migrate unpinned timers away without a hotplug (i.e. with cpuset.quiesce), and so we must be able to identify pinned timers, which we can't migrate.

This patch reuses the timer->state variable to carry this flag, as enough free bits are available there. There is no point growing the struct by adding another field.
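Concretely, with 0x01/0x02/0x04 already taken, bit 3 carries the pinned state. The two key manipulations, mirroring the hunks below:

    /* On start: record whether HRTIMER_MODE_PINNED was requested */
    timer->state &= ~HRTIMER_STATE_PINNED;
    timer->state |= !!(mode & HRTIMER_MODE_PINNED) << HRTIMER_PINNED_SHIFT;

    /* On every state transition: carry the pinned bit along */
    timer->state = newstate | (timer->state & HRTIMER_STATE_PINNED);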
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/hrtimer.h |  3 +++
 kernel/hrtimer.c        | 12 ++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 435ac4c..9fdb67b 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -55,6 +55,7 @@ enum hrtimer_restart {
  * 0x01		enqueued into rbtree
  * 0x02		callback function running
  * 0x04		timer is migrated to another cpu
+ * 0x08		timer is pinned to a cpu
  *
  * Special cases:
  * 0x03		callback function running and enqueued
@@ -81,6 +82,8 @@ enum hrtimer_restart {
 #define HRTIMER_STATE_ENQUEUED	0x01
 #define HRTIMER_STATE_CALLBACK	0x02
 #define HRTIMER_STATE_MIGRATE	0x04
+#define HRTIMER_PINNED_SHIFT	3
+#define HRTIMER_STATE_PINNED	(1 << HRTIMER_PINNED_SHIFT)

 /**
  * struct hrtimer - the basic hrtimer structure
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index d62fe32..c5a4bf4 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -905,7 +905,11 @@ static void __remove_hrtimer(struct hrtimer *timer, unsigned long newstate,
 			hrtimer_force_reprogram(base->cpu_base, 1);
 	}
 #endif
-	timer->state = newstate;
+	/*
+	 * We need to preserve PINNED state here, otherwise we may end up
+	 * migrating pinned hrtimers as well.
+	 */
+	timer->state = newstate | (timer->state & HRTIMER_STATE_PINNED);
 }

 /* remove hrtimer, called with base lock held */
@@ -970,6 +974,10 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,

 	timer_stats_hrtimer_set_start_info(timer);

+	/* Update pinned state */
+	timer->state &= ~HRTIMER_STATE_PINNED;
+	timer->state |= !!(mode & HRTIMER_MODE_PINNED) << HRTIMER_PINNED_SHIFT;
+
 	enqueue_hrtimer(timer);

 	/*
@@ -1227,7 +1235,7 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
 	 */
 	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+		BUG_ON(!(timer->state & HRTIMER_STATE_CALLBACK));
 		enqueue_hrtimer(timer);
 	}
migrate_hrtimers() is called when a CPU goes down and its hrtimers must be migrated to some other CPU. It's the responsibility of an hrtimer's users to remove it before control reaches migrate_hrtimers().

For pinned hrtimers still queued at that point, the best we can do is skip migrating them, deactivate them, and warn the user.
That's all this patch does.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/hrtimer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index c5a4bf4..853dd8c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1640,6 +1640,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 {
 	struct hrtimer *timer;
 	struct timerqueue_node *node;
+	int is_pinned;

 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
@@ -1652,6 +1653,15 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		 * under us on another CPU */
 		__remove_hrtimer(timer, HRTIMER_STATE_MIGRATE, 0);
+
+		is_pinned = timer->state & HRTIMER_STATE_PINNED;
+
+		/* Check if CPU still has pinned timers */
+		if (unlikely(WARN(is_pinned,
+		    "%s: can't migrate pinned timer: %p, deactivating it\n",
+		    __func__, timer)))
+			continue;
+
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not
To isolate CPUs from hrtimers via sysfs using cpusets, we need some support from the hrtimer core: a routine, hrtimer_quiesce_cpu(), which migrates away all unpinned hrtimers but doesn't touch the pinned ones.
This patch creates this routine.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/hrtimer.h |  3 +++
 kernel/hrtimer.c        | 47 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 9fdb67b..0718753 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -350,6 +350,9 @@ DECLARE_PER_CPU(struct tick_device, tick_cpu_device);

 /* Exported timer functions: */

+/* To be used from cpusets, only */
+extern void hrtimer_quiesce_cpu(void *cpup);
+
 /* Initialize timers: */
 extern void hrtimer_init(struct hrtimer *timer, clockid_t which_clock,
			 enum hrtimer_mode mode);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 853dd8c..e8cd1db 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1633,17 +1633,21 @@ static void init_hrtimers_cpu(int cpu)
 	hrtimer_init_hres(cpu_base);
 }

-#ifdef CONFIG_HOTPLUG_CPU
-
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_CPUSETS)
 static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
-				struct hrtimer_clock_base *new_base)
+				struct hrtimer_clock_base *new_base,
+				bool remove_pinned)
 {
 	struct hrtimer *timer;
 	struct timerqueue_node *node;
+	struct timerqueue_head pinned;
 	int is_pinned;

+	timerqueue_init_head(&pinned);
+
 	while ((node = timerqueue_getnext(&old_base->active))) {
 		timer = container_of(node, struct hrtimer, node);
+
 		BUG_ON(hrtimer_callback_running(timer));
 		debug_deactivate(timer);

@@ -1655,6 +1659,10 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		__remove_hrtimer(timer, HRTIMER_STATE_MIGRATE, 0);

 		is_pinned = timer->state & HRTIMER_STATE_PINNED;
+		if (!remove_pinned && is_pinned) {
+			timerqueue_add(&pinned, &timer->node);
+			continue;
+		}

 		/* Check if CPU still has pinned timers */
 		if (unlikely(WARN(is_pinned,
@@ -1676,18 +1684,24 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		/* Clear the migration state bit */
 		timer->state &= ~HRTIMER_STATE_MIGRATE;
 	}
+
+	/* Re-queue pinned timers for non-hotplug usecase */
+	while ((node = timerqueue_getnext(&pinned))) {
+		timer = container_of(node, struct hrtimer, node);
+
+		timerqueue_del(&pinned, &timer->node);
+		enqueue_hrtimer(timer);
+		timer->state &= ~HRTIMER_STATE_MIGRATE;
+	}
 }

-static void migrate_hrtimers(int scpu)
+static void __migrate_hrtimers(int scpu, bool remove_pinned)
 {
 	struct hrtimer_cpu_base *old_base, *new_base;
 	struct hrtimer_clock_base *clock_base;
 	unsigned int active_bases;
 	int i;

-	BUG_ON(cpu_online(scpu));
-	tick_cancel_sched_timer(scpu);
-
 	local_irq_disable();
 	old_base = &per_cpu(hrtimer_bases, scpu);
 	new_base = &__get_cpu_var(hrtimer_bases);
@@ -1700,7 +1714,8 @@ static void migrate_hrtimers(int scpu)

 	for_each_active_base(i, clock_base, old_base, active_bases)
 		migrate_hrtimer_list(clock_base,
-				&new_base->clock_base[clock_base->index]);
+				&new_base->clock_base[clock_base->index],
+				remove_pinned);

 	raw_spin_unlock(&old_base->lock);
 	raw_spin_unlock(&new_base->lock);
@@ -1709,9 +1724,25 @@ static void migrate_hrtimers(int scpu)
 	__hrtimer_peek_ahead_timers();
 	local_irq_enable();
 }
+#endif /* CONFIG_HOTPLUG_CPU || CONFIG_CPUSETS */
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void migrate_hrtimers(int scpu)
+{
+	BUG_ON(cpu_online(scpu));
+	tick_cancel_sched_timer(scpu);
+
+	__migrate_hrtimers(scpu, true);
+}
 #endif /* CONFIG_HOTPLUG_CPU */

+#ifdef CONFIG_CPUSETS
+void hrtimer_quiesce_cpu(void *cpup)
+{
+	__migrate_hrtimers(*(int *)cpup, false);
+}
+#endif /* CONFIG_CPUSETS */
+
 static int hrtimer_cpu_notify(struct notifier_block *self,
			      unsigned long action, void *hcpu)
 {
For networking applications, platforms need to provide one CPU for each user space data plane thread. These CPUs shouldn't be interrupted by the kernel at all unless userspace has requested some functionality. Currently, background kernel activities such as timers/hrtimers/watchdogs/etc. run on almost every CPU, and these must be migrated to other CPUs.

To achieve that, this patch adds another cpuset file: 'quiesce'. Writing '1' to this file migrates unbound/unpinned timers/hrtimers away from the CPUs of the cpuset in question, and disallows addition of any new unpinned timers/hrtimers to the isolated CPUs (handled in the next patch). Writing '0' disables isolation of the CPUs in the current cpuset; unpinned timers/hrtimers will be allowed on these CPUs again.

Currently, only timers and hrtimers are migrated. Other kernel infrastructure can follow later if required.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 Documentation/cgroups/cpusets.txt | 19 ++++++++--
 include/linux/cpuset.h            |  8 +++++
 kernel/cpuset.c                   | 76 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 7740038..8c1078b 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,7 +22,8 @@ CONTENTS:
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.9 What is quiesce?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -581,7 +582,21 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.

-1.9 How do I use cpusets ?
+1.9 What is quiesce ?
+--------------------------------------
+We need to migrate away all the background kernel activities (Unbound) for
+systems requiring isolation of cores (HPC, Real time, networking, etc). After
+creating cpusets, you can write 1 or 0 to cpuset.quiesce file.
+
+Writing '1': on this file would migrate unbound/unpinned timers and hrtimers
+away from the CPUs of the cpuset in question. Also it would disallow addition of
+any new unpinned timers & hrtimers to isolated CPUs.
+
+Writing '0': will disable isolation of CPUs in current cpuset and unpinned
+timers/hrtimers would be allowed in future on these CPUs.
+
+
+1.10 How do I use cpusets ?
 --------------------------

 In order to minimize the impact of cpusets on critical kernel
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 3fe661f..1ce0775 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -15,6 +15,13 @@

 #ifdef CONFIG_CPUSETS

+extern cpumask_var_t cpuset_quiesced_cpus_mask;
+
+static inline bool cpu_quiesced(int cpu)
+{
+	return cpumask_test_cpu(cpu, cpuset_quiesced_cpus_mask);
+}
+
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */

 extern int cpuset_init(void);
@@ -123,6 +130,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)

 #else /* !CONFIG_CPUSETS */

+static inline bool cpu_quiesced(int cpu) { return 0; }
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4410ac6..256cf11 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -43,10 +43,12 @@
 #include <linux/pagemap.h>
 #include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
+#include <linux/tick.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
 #include <linux/security.h>
 #include <linux/slab.h>
+#include <linux/smp.h>
 #include <linux/spinlock.h>
 #include <linux/stat.h>
 #include <linux/string.h>
@@ -150,6 +152,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_QUIESCE,
 } cpuset_flagbits_t;

 /* convenient tests for these bits */
@@ -193,6 +196,14 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }

+static inline int is_cpu_quiesced(const struct cpuset *cs)
+{
+	return test_bit(CS_QUIESCE, &cs->flags);
+}
+
+/* Mask of CPUs which have requested isolation */
+cpumask_var_t cpuset_quiesced_cpus_mask;
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
@@ -1261,6 +1272,53 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
 }

 /**
+ * quiesce_cpuset - Move unbound timers/hrtimers away from cpuset.cpus
+ * @cs: cpuset to be quiesced
+ *
+ * For isolating a core with cpusets we require all unbound timers/hrtimers to
+ * move away from isolated core. We migrate these to one of the CPUs which
+ * hasn't isolated itself yet. And the CPU is selected by
+ * smp_call_function_any() routine.
+ *
+ * Currently we are only migrating timers and hrtimers away.
+ */
+static int quiesce_cpuset(struct cpuset *cs, int turning_on)
+{
+	int from_cpu;
+	cpumask_t cpumask;
+
+	/* Fail if we are already in the requested state */
+	if (!(is_cpu_quiesced(cs) ^ turning_on))
+		return -EINVAL;
+
+	if (!turning_on) {
+		cpumask_andnot(cpuset_quiesced_cpus_mask,
+			       cpuset_quiesced_cpus_mask, cs->cpus_allowed);
+		return 0;
+	}
+
+	cpumask_andnot(&cpumask, cpu_online_mask, cs->cpus_allowed);
+	cpumask_andnot(&cpumask, &cpumask, cpuset_quiesced_cpus_mask);
+
+	if (cpumask_empty(&cpumask)) {
+		pr_err("%s: Couldn't find a CPU to migrate to\n", __func__);
+		return -EPERM;
+	}
+
+	cpumask_or(cpuset_quiesced_cpus_mask, cpuset_quiesced_cpus_mask,
+		   cs->cpus_allowed);
+
+	for_each_cpu(from_cpu, cs->cpus_allowed) {
+		smp_call_function_any(&cpumask, hrtimer_quiesce_cpu, &from_cpu,
+				      1);
+		smp_call_function_any(&cpumask, timer_quiesce_cpu, &from_cpu,
+				      1);
+	}
+
+	return 0;
+}
+
+/**
  * cpuset_change_flag - make a task's spread flags the same as its cpuset's
  * @tsk: task to be updated
  * @data: cpuset to @tsk belongs to
@@ -1326,6 +1384,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	if (err < 0)
 		goto out;

+	if (bit == CS_QUIESCE && quiesce_cpuset(cs, turning_on))
+		goto out;
+
 	err = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
 	if (err < 0)
 		goto out;
@@ -1597,6 +1658,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_CPU_QUIESCE,
 } cpuset_filetype_t;

 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -1640,6 +1702,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+	case FILE_CPU_QUIESCE:
+		retval = update_flag(CS_QUIESCE, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1791,6 +1856,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_CPU_QUIESCE:
+		return is_cpu_quiesced(cs);
 	default:
 		BUG();
 	}
@@ -1908,6 +1975,13 @@ static struct cftype files[] = {
 		.private = FILE_MEMORY_PRESSURE_ENABLED,
 	},

+	{
+		.name = "quiesce",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_CPU_QUIESCE,
+	},
+
 	{ }	/* terminate */
 };

@@ -2065,6 +2139,8 @@ int __init cpuset_init(void)
 	if (!alloc_cpumask_var(&cpus_attach, GFP_KERNEL))
 		BUG();

+	BUG_ON(!zalloc_cpumask_var(&cpuset_quiesced_cpus_mask, GFP_KERNEL));
+
 	number_of_cpusets = 1;
 	return 0;
 }
Cpusets now have a cpuset.quiesce sysfs file, with which CPUs can opt to isolate themselves from background kernel activities like timers & hrtimers.

get_nohz_timer_target() is used to find a suitable CPU for firing a timer. To guarantee that new timers won't be queued on quiesced CPUs, this routine must be modified.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/core.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c0339e2..b235af2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -557,17 +557,18 @@ void resched_cpu(int cpu)
  */
 int get_nohz_timer_target(int pinned)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int cpu = smp_processor_id(), i;
 	struct sched_domain *sd;

-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (pinned || !get_sysctl_timer_migration() ||
+	    !(idle_cpu(cpu) || cpu_quiesced(cpu)))
 		return cpu;

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		for_each_cpu(i, sched_domain_span(sd)) {
-			if (!idle_cpu(i)) {
+			/* Don't push timers to quiesced CPUs */
+			if (!(cpu_quiesced(i) || idle_cpu(i))) {
 				cpu = i;
 				goto unlock;
 			}
On Fri, 2014-04-04 at 14:05 +0530, Viresh Kumar wrote:
> On systems requiring isolation of cores (HPC, real time, networking, etc.),
> we need to migrate all unbound background kernel activities away from the
> isolated cores. After creating cpusets, you can write 1 or 0 to the
> cpuset.quiesce file.
I wonder if adding a quiesce switch is really necessary.
Seems to me that if you don't have load balancing turned off, you can't be very concerned about perturbation, so this should be tied into the load balancing on/off switch as an extension to isolating cores from the #1 perturbation source, the scheduler.
I also didn't notice a check for is_cpu_exclusive() at a glance, which would be a bug, but one that would go away if this additional isolation were coupled to the existing isolation switch.
-Mike
Hi Mike,
On 6 April 2014 14:00, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> I wonder if adding a quiesce switch is really necessary.
>
> Seems to me that if you don't have load balancing turned off, you can't be
> very concerned about perturbation, so this should be tied into the load
> balancing on/off switch as an extension to isolating cores from the #1
> perturbation source, the scheduler.
It's more about not doing any avoidable background activity on these CPUs. So, even if add_timer() is issued from one of these isolated CPUs, the timer should go to the set of CPUs chosen for background activity, unless add_timer_on() has been issued, in which case the user wants that code to execute on the isolated core.
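i.e., conceptually (illustrative only; 't' and the CPU number are made up):

    /* Issued from isolated CPU 2: timer goes to a non-isolated CPU */
    add_timer(&t);

    /* Explicitly targeted at CPU 2: honour it, run it there */
    add_timer_on(&t, 2);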
Probably, yes, people would disable load balancing between these cpusets to avoid migration of tasks to the isolated cores as well. At least that's how we are using it :)
> I also didn't notice a check for is_cpu_exclusive() at a glance, which would
> be a bug, but one that would go away if this additional isolation were
> coupled to the existing isolation switch.
Yeah, there is no check for that. But I didn't get your point completely. Why do I need to check for exclusivity on the isolated CPUs? So that the same CPU isn't both isolated and non-isolated via two separate sets?
Thanks for your feedback.
-- viresh
On Mon, 2014-04-07 at 09:41 +0530, Viresh Kumar wrote:
> Hi Mike,
>
> On 6 April 2014 14:00, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> > I wonder if adding a quiesce switch is really necessary.
> >
> > Seems to me that if you don't have load balancing turned off, you can't be
> > very concerned about perturbation, so this should be tied into the load
> > balancing on/off switch as an extension to isolating cores from the #1
> > perturbation source, the scheduler.
>
> It's more about not doing any avoidable background activity on these CPUs.
> So, even if add_timer() is issued from one of these isolated CPUs, the timer
> should go to the set of CPUs chosen for background activity, unless
> add_timer_on() has been issued, in which case the user wants that code to
> execute on the isolated core.
>
> Probably, yes, people would disable load balancing between these cpusets to
> avoid migration of tasks to the isolated cores as well. At least that's how
> we are using it :)
Yes, that's the whole point I'm trying to make. If you do not have cores isolated, there's no point to quiesce, as timers and whatnot are the least of your worries. Conversely, why would you bother isolating cores if you didn't want a nice quiet environment. Seems to me full isolation ala killing sched domains implies that your goal is bare metal.. or as close as you can get to that anyway.
> > I also didn't notice a check for is_cpu_exclusive() at a glance, which
> > would be a bug, but one that would go away if this additional isolation
> > were coupled to the existing isolation switch.
>
> Yeah, there is no check for that. But I didn't get your point completely.
> Why do I need to check for exclusivity on the isolated CPUs? So that the
> same CPU isn't both isolated and non-isolated via two separate sets?
Yes, both on and off is kinda hard to do per cpu :)
-Mike
On 7 April 2014 10:34, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> Yes, both on and off is kinda hard to do per cpu :)
Correct... will get it fixed in V3.