[RFC 0/4] Migrate timers away from cpuset on setting cpuset.quiesce

List overview All Threads
Download

newer

older

Arm declared the current A53...

arm-soc boot: 48 pass, 0 fail...

Viresh Kumar

20 Mar 2014 20 Mar '14

1:48 p.m.

For networking platforms we need to provide one isolated CPU per user space data plane thread. These CPUs should not be interrupted by kernel at all unless userspace has requested for some syscalls. And so must live in isolated mode. Currently, there are background kernel activities that are running on almost every CPU, like: timers/hrtimers/watchdogs/etc, and these are required to be migrated to other CPUs.

This patchset tries to migrate un-pinned timers away in the first attempt. And support for migrating others like hrtimers/workqueues/etc would be added later.

This has only went through basic testing currently on ARM Samsung Exynos board which only has two CPUs. Separate cpusets were created for these two CPUs and then timers were migrated from one cpuset to other.

This option was earlier suggested by Peter Z. here.

https://lkml.org/lkml/2014/1/15/186

Please provide your inputs on how this can be improved..

Viresh Kumar (4): timer: track pinned timers with TIMER_PINNED flag timer: don't migrate pinned timers timer: create timer_quiesce_cpu() for cpusets.quiesce option cpuset: Add cpusets.quiesce option

include/linux/timer.h | 10 +++--- kernel/cpuset.c | 56 +++++++++++++++++++++++++++++++ kernel/timer.c | 92 +++++++++++++++++++++++++++++++++++++++++---------- 3 files changed, 136 insertions(+), 22 deletions(-)

-- 1.7.12.rc2.18.g61b472e

Show replies by date

Viresh Kumar

20 Mar 20 Mar

1:48 p.m.

New subject: [RFC 1/4] timer: track pinned timers with TIMER_PINNED flag

In order to quiesce a CPU on which Isolation might be required, we need to move away all the timers queued on that CPU. There are two types of timers queued on any CPU: ones that are pinned to that CPU and others can run on any CPU but are queued on CPU in question. And we need to migrate only the second type of timers away from the CPU entering quiesce state.

For this we need some basic infrastructure in timer core to identify which timers are pinned and which are not.

Hence, this patch adds another flag bit TIMER_PINNED which will be set only for the timers which are pinned to a CPU.

It also removes 'pinned' parameter of __mod_timer() as it is no more required.

NOTE: One functional change worth mentioning

Existing Behavior: add_timer_on() followed by multiple mod_timer() wouldn't pin the timer on CPU mentioned in add_timer_on()..

New Behavior: add_timer_on() followed by multiple mod_timer() would pin the timer on CPU running mod_timer().

I didn't gave much attention to this as we should call mod_timer_on() for the timers queued with add_timer_on(). Though if required we can simply clear the TIMER_PINNED flag in mod_timer().

Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- include/linux/timer.h | 10 ++++++---- kernel/timer.c | 27 ++++++++++++++++++++------- 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h index 8c5a197..2962403 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -49,7 +49,7 @@ extern struct tvec_base boot_tvec_bases; #endif

/* - * Note that all tvec_bases are at least 4 byte aligned and lower two bits + * Note that all tvec_bases are at least 8 byte aligned and lower three bits * of base in timer_list is guaranteed to be zero. Use them for flags. * * A deferrable timer will work normally when the system is busy, but @@ -61,14 +61,18 @@ extern struct tvec_base boot_tvec_bases; * the completion of the running instance from IRQ handlers, for example, * by calling del_timer_sync(). * + * A pinned timer is allowed to run only on the cpu mentioned and shouldn't be + * migrated to any other CPU. + * * Note: The irq disabled callback execution is a special case for * workqueue locking issues. It's not meant for executing random crap * with interrupts disabled. Abuse is monitored! */ #define TIMER_DEFERRABLE 0x1LU #define TIMER_IRQSAFE 0x2LU +#define TIMER_PINNED 0x4LU

-#define TIMER_FLAG_MASK 0x3LU +#define TIMER_FLAG_MASK 0x7LU

#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \ .entry = { .prev = TIMER_ENTRY_STATIC }, \ @@ -179,8 +183,6 @@ extern int mod_timer_pinned(struct timer_list *timer, unsigned long expires);

extern void set_timer_slack(struct timer_list *time, int slack_hz);

-#define TIMER_NOT_PINNED 0 -#define TIMER_PINNED 1 /* * The jiffies value which is added to now, when there is no timer * in the timer wheel: diff --git a/kernel/timer.c b/kernel/timer.c index 87bd529..fec4ab4 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -104,6 +104,11 @@ static inline unsigned int tbase_get_irqsafe(struct tvec_base *base) return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE); }

+static inline unsigned int tbase_get_pinned(struct tvec_base *base) +{ + return ((unsigned int)(unsigned long)base & TIMER_PINNED); +} + static inline struct tvec_base *tbase_get_base(struct tvec_base *base) { return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK)); @@ -117,6 +122,13 @@ timer_set_base(struct timer_list *timer, struct tvec_base *new_base) timer->base = (struct tvec_base *)((unsigned long)(new_base) | flags); }

+static inline void +timer_set_flags(struct timer_list *timer, unsigned int flags) +{ + timer->base = (struct tvec_base *)((unsigned long)(timer->base) | + flags); +} + static unsigned long round_jiffies_common(unsigned long j, int cpu, bool force_up) { @@ -742,8 +754,7 @@ static struct tvec_base *lock_timer_base(struct timer_list *timer, }

static inline int -__mod_timer(struct timer_list *timer, unsigned long expires, - bool pending_only, int pinned) +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) { struct tvec_base *base, *new_base; unsigned long flags; @@ -760,7 +771,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,

debug_activate(timer, expires);

- cpu = get_nohz_timer_target(pinned); + cpu = get_nohz_timer_target(!!tbase_get_pinned(timer->base)); new_base = per_cpu(tvec_bases, cpu);

if (base != new_base) { @@ -802,7 +813,7 @@ out_unlock: */ int mod_timer_pending(struct timer_list *timer, unsigned long expires) { - return __mod_timer(timer, expires, true, TIMER_NOT_PINNED); + return __mod_timer(timer, expires, true); } EXPORT_SYMBOL(mod_timer_pending);

@@ -877,7 +888,7 @@ int mod_timer(struct timer_list *timer, unsigned long expires) if (timer_pending(timer) && timer->expires == expires) return 1;

- return __mod_timer(timer, expires, false, TIMER_NOT_PINNED); + return __mod_timer(timer, expires, false); } EXPORT_SYMBOL(mod_timer);

@@ -905,7 +916,8 @@ int mod_timer_pinned(struct timer_list *timer, unsigned long expires) if (timer->expires == expires && timer_pending(timer)) return 1;

- return __mod_timer(timer, expires, false, TIMER_PINNED); + timer_set_flags(timer, TIMER_PINNED); + return __mod_timer(timer, expires, false); } EXPORT_SYMBOL(mod_timer_pinned);

@@ -944,6 +956,7 @@ void add_timer_on(struct timer_list *timer, int cpu)

timer_stats_timer_set_start_info(timer); BUG_ON(timer_pending(timer) || !timer->function); + timer_set_flags(timer, TIMER_PINNED); spin_lock_irqsave(&base->lock, flags); timer_set_base(timer, base); debug_activate(timer, timer->expires); @@ -1493,7 +1506,7 @@ signed long __sched schedule_timeout(signed long timeout) expire = timeout + jiffies;

setup_timer_on_stack(&timer, process_timeout, (unsigned long)current); - __mod_timer(&timer, expire, false, TIMER_NOT_PINNED); + __mod_timer(&timer, expire, false); schedule(); del_singleshot_timer_sync(&timer);

-- 1.7.12.rc2.18.g61b472e

Viresh Kumar

1:48 p.m.

New subject: [RFC 2/4] timer: don't migrate pinned timers

migrate_timer() is called when a CPU goes down and its timers are required to be migrated to some other CPU. Its the responsibility of the users of the timer to remove it before control reaches to migrate_timers().

As these were the pinned timers, the best we can do is: don't migrate these and report to the user as well.

That's all this patch does.

Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- kernel/timer.c | 11 +++++++++++ 1 file changed, 11 insertions(+)

diff --git a/kernel/timer.c b/kernel/timer.c index fec4ab4..a7f8b99 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1606,11 +1606,22 @@ static int init_timers_cpu(int cpu) static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head) { struct timer_list *timer; + int is_pinned;

while (!list_empty(head)) { timer = list_first_entry(head, struct timer_list, entry); /* We ignore the accounting on the dying cpu */ detach_timer(timer, false); + + is_pinned = tbase_get_pinned(timer->base); + + /* Check if CPU still has pinned timers */ + if (is_pinned) { + pr_warn("%s: can't migrate pinned timer: %p, removing it\n", + __func__, timer); + continue; + } + timer_set_base(timer, new_base); internal_add_timer(new_base, timer); }

-- 1.7.12.rc2.18.g61b472e

Kevin Hilman

31 Mar 31 Mar

3:56 p.m.

New subject: [RFC 2/4] timer: don't migrate pinned timers

Viresh Kumar viresh.kumar@linaro.org writes:

...

migrate_timer() is called when a CPU goes down and its timers are required to be migrated to some other CPU. Its the responsibility of the users of the timer to remove it before control reaches to migrate_timers().

As these were the pinned timers, the best we can do is: don't migrate these and report to the user as well.

That's all this patch does.

Signed-off-by: Viresh Kumar viresh.kumar@linaro.org

kernel/timer.c | 11 +++++++++++ 1 file changed, 11 insertions(+)

diff --git a/kernel/timer.c b/kernel/timer.c index fec4ab4..a7f8b99 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1606,11 +1606,22 @@ static int init_timers_cpu(int cpu) static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head) { struct timer_list *timer;

int is_pinned;

while (!list_empty(head)) { timer = list_first_entry(head, struct timer_list, entry); /* We ignore the accounting on the dying cpu */ detach_timer(timer, false);
is_pinned = tbase_get_pinned(timer->base);
/* Check if CPU still has pinned timers */
if (is_pinned) {
	pr_warn("%s: can't migrate pinned timer: %p, removing it\n",
			__func__, timer);

printk message will be confusing: removing it from what?

Kevin

...

	continue;
}
timer_set_base(timer, new_base); internal_add_timer(new_base, timer); }

Viresh Kumar

1 Apr 1 Apr

4:32 a.m.

New subject: [RFC 2/4] timer: don't migrate pinned timers

On 31 March 2014 21:26, Kevin Hilman khilman@linaro.org wrote:

...

Viresh Kumar viresh.kumar@linaro.org writes:

...

```
        if (is_pinned) {
```

                pr_warn("%s: can't migrate pinned timer: %p, removing it\n",

                                __func__, timer);

printk message will be confusing: removing it from what?

Hmm.. So, I am looking to do two modifications here. Just need inputs if that would be the right thing to do:

- do a WARN() here as these timers should have been already removed - change print to: "can't migrate pinned timer %p, deactivating timer"

Looks fine?

Kevin Hilman

4 Apr 4 Apr

3:15 p.m.

New subject: [RFC 2/4] timer: don't migrate pinned timers

Viresh Kumar viresh.kumar@linaro.org writes:

...

On 31 March 2014 21:26, Kevin Hilman khilman@linaro.org wrote:

...
Viresh Kumar viresh.kumar@linaro.org writes:

...
        if (is_pinned) {
                pr_warn("%s: can't migrate pinned timer: %p, removing it\n",
                                __func__, timer);
printk message will be confusing: removing it from what?
Hmm.. So, I am looking to do two modifications here. Just need inputs if that would be the right thing to do:

do a WARN() here as these timers should have been already removed

change print to: "can't migrate pinned timer %p, deactivating timer"

Looks fine?

Yes.

Viresh Kumar

20 Mar 20 Mar

1:49 p.m.

New subject: [RFC 3/4] timer: create timer_quiesce_cpu() for cpusets.quiesce option

Cpusets would be using timer migration code to isolate/quiese a CPU. And hence that should be changed a bit as the CPU in question would be online now. Also we need to make sure that we don't remove pinned timers.

Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- kernel/timer.c | 54 +++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c index a7f8b99..81d64e0 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1602,18 +1602,27 @@ static int init_timers_cpu(int cpu) return 0; }

-#ifdef CONFIG_HOTPLUG_CPU -static void migrate_timer_list(struct tvec_base *new_base, struct list_head *head) +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_CPUSETS) +static void migrate_timer_list(struct tvec_base *new_base, + struct list_head *head, bool remove_pinned) { struct timer_list *timer; + struct list_head pinned_list; int is_pinned;

+ INIT_LIST_HEAD(&pinned_list); + while (!list_empty(head)) { timer = list_first_entry(head, struct timer_list, entry); - /* We ignore the accounting on the dying cpu */ - detach_timer(timer, false);

is_pinned = tbase_get_pinned(timer->base); + if (!remove_pinned && is_pinned) { + list_move_tail(&timer->entry, &pinned_list); + continue; + } else { + /* We ignore the accounting on the dying cpu */ + detach_timer(timer, false); + }

/* Check if CPU still has pinned timers */ if (is_pinned) { @@ -1625,15 +1634,18 @@ static void migrate_timer_list(struct tvec_base *new_base, struct list_head *hea timer_set_base(timer, new_base); internal_add_timer(new_base, timer); } + + if (!list_empty(&pinned_list)) + list_splice_tail(&pinned_list, head); }

-static void migrate_timers(int cpu) +/* Migrate timers from 'cpu' to this_cpu */ +static void __migrate_timers(int cpu, bool remove_pinned) { struct tvec_base *old_base; struct tvec_base *new_base; int i;

- BUG_ON(cpu_online(cpu)); old_base = per_cpu(tvec_bases, cpu); new_base = get_cpu_var(tvec_bases); /* @@ -1646,20 +1658,40 @@ static void migrate_timers(int cpu) BUG_ON(old_base->running_timer);

for (i = 0; i < TVR_SIZE; i++) - migrate_timer_list(new_base, old_base->tv1.vec + i); + migrate_timer_list(new_base, old_base->tv1.vec + i, + remove_pinned); for (i = 0; i < TVN_SIZE; i++) { - migrate_timer_list(new_base, old_base->tv2.vec + i); - migrate_timer_list(new_base, old_base->tv3.vec + i); - migrate_timer_list(new_base, old_base->tv4.vec + i); - migrate_timer_list(new_base, old_base->tv5.vec + i); + migrate_timer_list(new_base, old_base->tv2.vec + i, + remove_pinned); + migrate_timer_list(new_base, old_base->tv3.vec + i, + remove_pinned); + migrate_timer_list(new_base, old_base->tv4.vec + i, + remove_pinned); + migrate_timer_list(new_base, old_base->tv5.vec + i, + remove_pinned); }

spin_unlock(&old_base->lock); spin_unlock_irq(&new_base->lock); put_cpu_var(tvec_bases); } +#endif /* CONFIG_HOTPLUG_CPU || CONFIG_CPUSETS */ + +#ifdef CONFIG_HOTPLUG_CPU +static void migrate_timers(int cpu) +{ + BUG_ON(cpu_online(cpu)); + __migrate_timers(cpu, true); +} #endif /* CONFIG_HOTPLUG_CPU */

+#ifdef CONFIG_CPUSETS +void timer_quiesce_cpu(void *data) +{ + __migrate_timers((int)data, false); +} +#endif /* CONFIG_CPUSETS */ + static int timer_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) {

-- 1.7.12.rc2.18.g61b472e

Viresh Kumar

1:49 p.m.

New subject: [RFC 4/4] cpuset: Add cpusets.quiesce option

For networking applications platforms need to provide one CPU per each user space data plane thread. These CPUs should not be interrupted by kernel at all unless userspace has requested for some syscalls. Currently, there are background kernel activities that are running on almost every CPU, like: timers/hrtimers/watchdogs/etc, and these are required to be migrated to other CPUs.

To achieve that, this patch adds another option to cpusets, i.e. 'quiesce'. Writing '1' on this file would migrate these unbound/unpinned timers/workqueues away from the CPUs of the cpuset in question. Writing '0' has no effect and this file can't be read from userspace as we aren't maintaining a state here.

Currently, only timers are migrated. This would be followed by other kernel infrastructure later.

Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- kernel/cpuset.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 3d54c41..1b79ae6 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -43,10 +43,12 @@ #include <linux/pagemap.h> #include <linux/proc_fs.h> #include <linux/rcupdate.h> +#include <linux/tick.h> #include <linux/sched.h> #include <linux/seq_file.h> #include <linux/security.h> #include <linux/slab.h> +#include <linux/smp.h> #include <linux/spinlock.h> #include <linux/stat.h> #include <linux/string.h> @@ -150,6 +152,7 @@ typedef enum { CS_SCHED_LOAD_BALANCE, CS_SPREAD_PAGE, CS_SPREAD_SLAB, + CS_QUIESCE, } cpuset_flagbits_t;

/* convenient tests for these bits */ @@ -1208,6 +1211,44 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val) return 0; }

+void timer_quiesce_cpu(void *cpu); + +/** + * quiesce_cpuset - Move unbound timers/workqueues away from cpuset.cpus + * @cs: cpuset to be quiesced + * + * For isolating a core with cpusets we require all unbound timers/workqueues to + * move away for isolated core. For simplicity, currently we migrate these to + * the first online CPU which is not part of tick_nohz_full_mask. + * + * Currently we are only migrating timers away. + */ +void quiesce_cpuset(struct cpuset *cs) +{ + int from_cpu, to_cpu; + cpumask_t cpumask; + + cpumask_andnot(&cpumask, cpu_online_mask, cs->cpus_allowed); + +#ifdef CONFIG_NO_HZ_FULL + cpumask_andnot(&cpumask, &cpumask, tick_nohz_full_mask); +#endif + + if (cpumask_empty(&cpumask)) { + pr_err("%s: Couldn't find a CPU to migrate to\n", __func__); + return; + } + + to_cpu = cpumask_first(&cpumask); + + for_each_cpu(from_cpu, cs->cpus_allowed) { + pr_debug("%s: Migrating from CPU:%d to CPU:%d\n", __func__, + from_cpu, to_cpu); + smp_call_function_single(to_cpu, timer_quiesce_cpu, + (void *)from_cpu, true); + } +} + /** * update_tasks_flags - update the spread flags of tasks in the cpuset. * @cs: the cpuset in which each task's spread flags needs to be changed @@ -1244,6 +1285,11 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int spread_flag_changed; int err;

+ if (bit == CS_QUIESCE && turning_on) { + quiesce_cpuset(cs); + return 0; + } + trialcs = alloc_trial_cpuset(cs); if (!trialcs) return -ENOMEM; @@ -1526,6 +1572,7 @@ typedef enum { FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, + FILE_CPU_QUIESCE, } cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, @@ -1569,6 +1616,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, case FILE_SPREAD_SLAB: retval = update_flag(CS_SPREAD_SLAB, cs, val); break; + case FILE_CPU_QUIESCE: + retval = update_flag(CS_QUIESCE, cs, val); + break; default: retval = -EINVAL; break; @@ -1837,6 +1887,12 @@ static struct cftype files[] = { .private = FILE_MEMORY_PRESSURE_ENABLED, },

+ { + .name = "quiesce", + .write_u64 = cpuset_write_u64, + .private = FILE_CPU_QUIESCE, + }, + { } /* terminate */ };

-- 1.7.12.rc2.18.g61b472e

Li Zefan

27 Mar 27 Mar

2:47 a.m.

New subject: [RFC 4/4] cpuset: Add cpusets.quiesce option

On 2014/3/20 21:49, Viresh Kumar wrote:

...

For networking applications platforms need to provide one CPU per each user space data plane thread. These CPUs should not be interrupted by kernel at all unless userspace has requested for some syscalls. Currently, there are background kernel activities that are running on almost every CPU, like: timers/hrtimers/watchdogs/etc, and these are required to be migrated to other CPUs.

To achieve that, this patch adds another option to cpusets, i.e. 'quiesce'. Writing '1' on this file would migrate these unbound/unpinned timers/workqueues away from the CPUs of the cpuset in question. Writing '0' has no effect and this file can't be read from userspace as we aren't maintaining a state here.

This doesn't look like a complete solution, because newer timers/workqueues can still run in those CPUs. Seems like the proposal discussed is to support setting cpu affinity for workqueues through sysfs. If so, we can migrate workqueues when affinity is set, so we don't need this cpuset.quiesce ?

...

Currently, only timers are migrated. This would be followed by other kernel infrastructure later.

Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org

Viresh Kumar

4:29 a.m.

New subject: [RFC 4/4] cpuset: Add cpusets.quiesce option

On 27 March 2014 08:17, Li Zefan lizefan@huawei.com wrote:

...

This doesn't look like a complete solution, because newer timers/workqueues can still run in those CPUs.

The initial idea was to disable load balance between CPUs and then do this. So, that new timers and workqueues from other CPUs would never get queued on this CPU..

But I think we can just modify get_nohz_timer_target() for making sure this for timers..

...

Seems like the proposal discussed is to support setting cpu affinity for workqueues through sysfs. If so, we can migrate workqueues when affinity is set, so we don't need this cpuset.quiesce ?

That was another thread just for workqueues, but this one is about migrating everything else as well.. Probably some more additions apart from timers/ hrtimers/wqs in future. So, for us it is still required :)

Daniel Sangorrin

24 Apr 24 Apr

7:25 a.m.

Dear Viresh Kumar,

I tried your set of patches for isolating particular CPU cores from unpinned timers. On x86_64 they were working fine, however I found out that on ARM they would fail under the following test:

# mount -t cpuset none /cpuset # cd /cpuset # mkdir rt # cd rt # echo 1 > cpus # echo 1 > cpu_exclusive # cd # taskset 0x2 ./setquiesce.sh <--- contains "echo 1 > /cpuset/rt/quiesce" [ 75.622375] ------------[ cut here ]------------ [ 75.627258] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:2595 __migrate_hrtimers+0x17c/0x1bc() [ 75.636840] DEBUG_LOCKS_WARN_ON(current->hardirq_context) [ 75.636840] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc1-37710-g23c8f02 #1 [ 75.649627] [<c0014d18>] (unwind_backtrace) from [<c00119e8>] (show_stack+0x10/0x14) [ 75.649627] [<c00119e8>] (show_stack) from [<c065b61c>] (dump_stack+0x78/0x94) [ 75.662689] [<c065b61c>] (dump_stack) from [<c003e9a4>] (warn_slowpath_common+0x60/0x84) [ 75.670410] [<c003e9a4>] (warn_slowpath_common) from [<c003ea24>] (warn_slowpath_fmt+0x30/0x40) [ 75.677673] [<c003ea24>] (warn_slowpath_fmt) from [<c005d7b0>] (__migrate_hrtimers+0x17c/0x1bc) [ 75.677673] [<c005d7b0>] (__migrate_hrtimers) from [<c009e004>] (generic_smp_call_function_single_interrupt+0x8c/0x104) [ 75.699645] [<c009e004>] (generic_smp_call_function_single_interrupt) from [<c00134d0>] (handle_IPI+0xa4/0x16c) [ 75.706970] [<c00134d0>] (handle_IPI) from [<c0008614>] (gic_handle_irq+0x54/0x5c) [ 75.715087] [<c0008614>] (gic_handle_irq) from [<c0012624>] (__irq_svc+0x44/0x5c) [ 75.725311] Exception stack(0xc08a3f58 to 0xc08a3fa0) ....

I also backported your patches to Linux 3.10.y and found the same problem both in ARM and x86_64. However, I think I figured out the reason for those errors. Please, could you check the patch below (it applies on the top of your tree, branch isolate-cpusets) and let me know what you think?

Thanks, Daniel

-------------------------PATCH STARTS HERE--------------------------------- cpuset: quiesce: change irq disable/enable by irq save/restore

The function __migrate_timers can be called under interrupt context or thread context depending on the core where the system call was executed. In case it executes under interrupt context, it seems a bad idea to leave interrupts enabled after migrating the timers. In fact, this caused kernel errors on the ARM architecture and on the x86_64 architecture with the 3.10 kernel (backported version of the cpuset-quiesce patch).

Signed-off-by: Daniel Sangorrin daniel.sangorrin@toshiba.co.jp Signed-off-by: Yoshitake Kobayashi yoshitake.kobayashi@toshiba.co.jp --- kernel/hrtimer.c | 5 +++-- kernel/timer.c | 5 +++-- 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index e8cd1db..abb1707 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1701,8 +1701,9 @@ static void __migrate_hrtimers(int scpu, bool remove_pinned) struct hrtimer_clock_base *clock_base; unsigned int active_bases; int i; + unsigned long flags;

- local_irq_disable(); + local_irq_save(flags); old_base = &per_cpu(hrtimer_bases, scpu); new_base = &__get_cpu_var(hrtimer_bases); /* @@ -1722,7 +1723,7 @@ static void __migrate_hrtimers(int scpu, bool remove_pinned)

/* Check, if we got expired work to do */ __hrtimer_peek_ahead_timers(); - local_irq_enable(); + local_irq_restore(flags); } #endif /* CONFIG_HOTPLUG_CPU || CONFIG_CPUSETS */

diff --git a/kernel/timer.c b/kernel/timer.c index 4676a07..2715b7d 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1644,6 +1644,7 @@ static void __migrate_timers(int cpu, bool remove_pinned) struct tvec_base *old_base; struct tvec_base *new_base; int i; + unsigned long flags;

old_base = per_cpu(tvec_bases, cpu); new_base = get_cpu_var(tvec_bases); @@ -1651,7 +1652,7 @@ static void __migrate_timers(int cpu, bool remove_pinned) * The caller is globally serialized and nobody else * takes two locks at once, deadlock is not possible. */ - spin_lock_irq(&new_base->lock); + spin_lock_irqsave(&new_base->lock, flags); spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

BUG_ON(old_base->running_timer); @@ -1671,7 +1672,7 @@ static void __migrate_timers(int cpu, bool remove_pinned) }

spin_unlock(&old_base->lock); - spin_unlock_irq(&new_base->lock); + spin_unlock_irqrestore(&new_base->lock, flags); put_cpu_var(tvec_bases); } #endif /* CONFIG_HOTPLUG_CPU || CONFIG_CPUSETS */

-- 1.8.1.1 -------------------------PATCH ENDS HERE--------------------------------- On 2014/03/20 22:48, Viresh Kumar wrote: > For networking platforms we need to provide one isolated CPU per user space data > plane thread. These CPUs should not be interrupted by kernel at all unless > userspace has requested for some syscalls. And so must live in isolated mode. > Currently, there are background kernel activities that are running on almost > every CPU, like: timers/hrtimers/watchdogs/etc, and these are required to be > migrated to other CPUs. > > This patchset tries to migrate un-pinned timers away in the first attempt. And > support for migrating others like hrtimers/workqueues/etc would be added later. > > This has only went through basic testing currently on ARM Samsung Exynos board > which only has two CPUs. Separate cpusets were created for these two CPUs and > then timers were migrated from one cpuset to other. > > This option was earlier suggested by Peter Z. here. > > https://lkml.org/lkml/2014/1/15/186 > > Please provide your inputs on how this can be improved.. > > Viresh Kumar (4): > timer: track pinned timers with TIMER_PINNED flag > timer: don't migrate pinned timers > timer: create timer_quiesce_cpu() for cpusets.quiesce option > cpuset: Add cpusets.quiesce option > > include/linux/timer.h | 10 +++--- > kernel/cpuset.c | 56 +++++++++++++++++++++++++++++++ > kernel/timer.c | 92 +++++++++++++++++++++++++++++++++++++++++---------- > 3 files changed, 136 insertions(+), 22 deletions(-) > -- Toshiba Corporate Software Engineering Center Daniel SANGORRIN E-mail: daniel.sangorrin@toshiba.co.jp

Viresh Kumar

7:43 a.m.

On 24 April 2014 12:55, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...

I tried your set of patches for isolating particular CPU cores from unpinned timers. On x86_64 they were working fine, however I found out that on ARM they would fail under the following test:

I am happy that these drew attention from somebody Atleast :)

...

# mount -t cpuset none /cpuset # cd /cpuset # mkdir rt # cd rt # echo 1 > cpus # echo 1 > cpu_exclusive # cd # taskset 0x2 ./setquiesce.sh <--- contains "echo 1 > /cpuset/rt/quiesce" [ 75.622375] ------------[ cut here ]------------ [ 75.627258] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:2595 __migrate_hrtimers+0x17c/0x1bc() [ 75.636840] DEBUG_LOCKS_WARN_ON(current->hardirq_context) [ 75.636840] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc1-37710-g23c8f02 #1 [ 75.649627] [<c0014d18>] (unwind_backtrace) from [<c00119e8>] (show_stack+0x10/0x14) [ 75.649627] [<c00119e8>] (show_stack) from [<c065b61c>] (dump_stack+0x78/0x94) [ 75.662689] [<c065b61c>] (dump_stack) from [<c003e9a4>] (warn_slowpath_common+0x60/0x84) [ 75.670410] [<c003e9a4>] (warn_slowpath_common) from [<c003ea24>] (warn_slowpath_fmt+0x30/0x40) [ 75.677673] [<c003ea24>] (warn_slowpath_fmt) from [<c005d7b0>] (__migrate_hrtimers+0x17c/0x1bc) [ 75.677673] [<c005d7b0>] (__migrate_hrtimers) from [<c009e004>] (generic_smp_call_function_single_interrupt+0x8c/0x104) [ 75.699645] [<c009e004>] (generic_smp_call_function_single_interrupt) from [<c00134d0>] (handle_IPI+0xa4/0x16c) [ 75.706970] [<c00134d0>] (handle_IPI) from [<c0008614>] (gic_handle_irq+0x54/0x5c) [ 75.715087] [<c0008614>] (gic_handle_irq) from [<c0012624>] (__irq_svc+0x44/0x5c) [ 75.725311] Exception stack(0xc08a3f58 to 0xc08a3fa0)

I couldn't understand why we went via a interrupt here ? Probably CPU1 was idle and was woken up with a IPI and then this happened. But in that case too, shouldn't the script run from process context instead ?

...

I also backported your patches to Linux 3.10.y and found the same problem both in ARM and x86_64.

There are very few changes in between 3.10 and latest for timers/hrtimers and so things are expected to be the same.

...

However, I think I figured out the reason for those errors. Please, could you check the patch below (it applies on the top of your tree, branch isolate-cpusets) and let me know what you think?

Okay, just to let you know, I have also found some issues and they are now pushed in my tree.. Also it is rebased over 3.15-rc2 now.

...

-------------------------PATCH STARTS HERE--------------------------------- cpuset: quiesce: change irq disable/enable by irq save/restore

The function __migrate_timers can be called under interrupt context or thread context depending on the core where the system call was executed. In case it executes under interrupt context, it

How exactly?

...

seems a bad idea to leave interrupts enabled after migrating the timers. In fact, this caused kernel errors on the ARM architecture and on the x86_64 architecture with the 3.10 kernel (backported version of the cpuset-quiesce patch).

I can't keep it as a separate patch and so would be required to merge it into my original patch..

Thanks for your inputs :)

Daniel Sangorrin

8:31 a.m.

On 2014/04/24 16:43, Viresh Kumar wrote:

...

On 24 April 2014 12:55, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...
I tried your set of patches for isolating particular CPU cores from unpinned timers. On x86_64 they were working fine, however I found out that on ARM they would fail under the following test:

I am happy that these drew attention from somebody Atleast :)

Thanks to you for your hard work.

...

...
# mount -t cpuset none /cpuset # cd /cpuset # mkdir rt # cd rt # echo 1 > cpus # echo 1 > cpu_exclusive # cd # taskset 0x2 ./setquiesce.sh <--- contains "echo 1 > /cpuset/rt/quiesce" [ 75.622375] ------------[ cut here ]------------ [ 75.627258] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:2595 __migrate_hrtimers+0x17c/0x1bc() [ 75.636840] DEBUG_LOCKS_WARN_ON(current->hardirq_context) [ 75.636840] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc1-37710-g23c8f02 #1 [ 75.649627] [<c0014d18>] (unwind_backtrace) from [<c00119e8>] (show_stack+0x10/0x14) [ 75.649627] [<c00119e8>] (show_stack) from [<c065b61c>] (dump_stack+0x78/0x94) [ 75.662689] [<c065b61c>] (dump_stack) from [<c003e9a4>] (warn_slowpath_common+0x60/0x84) [ 75.670410] [<c003e9a4>] (warn_slowpath_common) from [<c003ea24>] (warn_slowpath_fmt+0x30/0x40) [ 75.677673] [<c003ea24>] (warn_slowpath_fmt) from [<c005d7b0>] (__migrate_hrtimers+0x17c/0x1bc) [ 75.677673] [<c005d7b0>] (__migrate_hrtimers) from [<c009e004>] (generic_smp_call_function_single_interrupt+0x8c/0x104) [ 75.699645] [<c009e004>] (generic_smp_call_function_single_interrupt) from [<c00134d0>] (handle_IPI+0xa4/0x16c) [ 75.706970] [<c00134d0>] (handle_IPI) from [<c0008614>] (gic_handle_irq+0x54/0x5c) [ 75.715087] [<c0008614>] (gic_handle_irq) from [<c0012624>] (__irq_svc+0x44/0x5c) [ 75.725311] Exception stack(0xc08a3f58 to 0xc08a3fa0)

I couldn't understand why we went via a interrupt here ? Probably CPU1 was idle and was woken up with a IPI and then this happened. But in that case too, shouldn't the script run from process context instead ?

In kernel/cpuset.c:quiesce_cpuset() you are using the function 'smp_call_function_any' which asks CPU cores in 'cpumask' to execute the functions 'hrtimer_quiesce_cpu' and 'timer_quiesce_cpu'.

In the case above, 'cpumask' corresponds to core 0. Since I'm forcing the call to be executed from core 1 (by using taskset), an inter-processor interrupt is sent to core 0 for those functions to be executed.

...

...
I also backported your patches to Linux 3.10.y and found the same problem both in ARM and x86_64.

There are very few changes in between 3.10 and latest for timers/hrtimers and so things are expected to be the same.

...
However, I think I figured out the reason for those errors. Please, could you check the patch below (it applies on the top of your tree, branch isolate-cpusets) and let me know what you think?

Okay, just to let you know, I have also found some issues and they are now pushed in my tree.. Also it is rebased over 3.15-rc2 now.

Ok, thank you! I see that you have already fixed the problem. I tested your tree on ARM and now it seems to work correctly.

...

...
-------------------------PATCH STARTS HERE--------------------------------- cpuset: quiesce: change irq disable/enable by irq save/restore

The function __migrate_timers can be called under interrupt context or thread context depending on the core where the system call was executed. In case it executes under interrupt context, it

How exactly?

See my reply above.

...

...
seems a bad idea to leave interrupts enabled after migrating the timers. In fact, this caused kernel errors on the ARM architecture and on the x86_64 architecture with the 3.10 kernel (backported version of the cpuset-quiesce patch).

I can't keep it as a separate patch and so would be required to merge it into my original patch..

Thanks for your inputs :)

To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Thanks, Daniel

-- Toshiba Corporate Software Engineering Center Daniel SANGORRIN E-mail: daniel.sangorrin@toshiba.co.jp

Viresh Kumar

8:41 a.m.

On 24 April 2014 14:01, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...

In kernel/cpuset.c:quiesce_cpuset() you are using the function 'smp_call_function_any' which asks CPU cores in 'cpumask' to execute the functions 'hrtimer_quiesce_cpu' and 'timer_quiesce_cpu'.

In the case above, 'cpumask' corresponds to core 0. Since I'm forcing the call to be executed from core 1 (by using taskset), an inter-processor interrupt is sent to core 0 for those functions to be executed.

Ahh, I understood that now :) .. So we are setting cpuset.quiesce from CPU1 which will do a IPI to get migrate_timers called on CPU0.. I was setting quiesce from CPU0 only in my tests :)

But how does this work fine on x86 then? There we should have exactly same problem, isn't it?

...

Ok, thank you! I see that you have already fixed the problem. I tested your tree on ARM and now it seems to work correctly.

Yeah, I just pushed your changes as well at the time I wrote last mail :)

Daniel Sangorrin

9:24 a.m.

On 2014/04/24 17:41, Viresh Kumar wrote:

...

On 24 April 2014 14:01, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...
In kernel/cpuset.c:quiesce_cpuset() you are using the function 'smp_call_function_any' which asks CPU cores in 'cpumask' to execute the functions 'hrtimer_quiesce_cpu' and 'timer_quiesce_cpu'.

In the case above, 'cpumask' corresponds to core 0. Since I'm forcing the call to be executed from core 1 (by using taskset), an inter-processor interrupt is sent to core 0 for those functions to be executed.

Ahh, I understood that now :) .. So we are setting cpuset.quiesce from CPU1 which will do a IPI to get migrate_timers called on CPU0.. I was setting quiesce from CPU0 only in my tests :)

But how does this work fine on x86 then? There we should have exactly same problem, isn't it?

Yeah, I'm not sure why it is working on 3.15 x86_64 but not in 3.10 x86_64. Perhaps it's related to this patch: https://lkml.org/lkml/2013/9/19/349

...

...
Ok, thank you! I see that you have already fixed the problem. I tested your tree on ARM and now it seems to work correctly.

Yeah, I just pushed your changes as well at the time I wrote last mail :)

Oh, I see!

Why didn't you just apply the patch on top of your tree so that the information included in the git commit (e.g: my name and mail) remains?

This part:

cpuset: quiesce: change irq disable/enable by irq save/restore

Signed-off-by: Daniel Sangorrin daniel.sangorrin@toshiba.co.jp Signed-off-by: Yoshitake Kobayashi yoshitake.kobayashi@toshiba.co.jp

Thanks, Daniel

-- Toshiba Corporate Software Engineering Center Daniel SANGORRIN E-mail: daniel.sangorrin@toshiba.co.jp

Viresh Kumar

9:30 a.m.

On 24 April 2014 14:54, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...

Why didn't you just apply the patch on top of your tree so that the information included in the git commit (e.g: my name and mail) remains?

This part:

cpuset: quiesce: change irq disable/enable by irq save/restore

The function __migrate_timers can be called under interrupt context or thread context depending on the core where the system call was executed. In case it executes under interrupt context, it seems a bad idea to leave interrupts enabled after migrating the timers. In fact, this caused kernel errors on the ARM architecture and on the x86_64 architecture with the 3.10 kernel (backported version of the cpuset-quiesce patch).

Signed-off-by: Daniel Sangorrin daniel.sangorrin@toshiba.co.jp Signed-off-by: Yoshitake Kobayashi yoshitake.kobayashi@toshiba.co.jp

That's what I told you earlier when I said this:

...

I can't keep it as a separate patch and so would be required to merge it into my original patch..

And the reason being: "No patch is supposed to break things, otherwise git bisect wouldn't work smoothly".. And so git bisect would complain this issue after my patch and so I have to merge the fixes you gave into the original patch as its not yet merged.

Daniel Sangorrin

25 Apr 25 Apr

12:31 a.m.

...

...
I can't keep it as a separate patch and so would be required to merge it into my original patch..

And the reason being: "No patch is supposed to break things, otherwise git bisect wouldn't work smoothly".. And so git bisect would complain this issue after my patch and so I have to merge the fixes you gave into the original patch as its not yet merged. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Ok, no problem then.

Thanks, Daniel

-- Toshiba Corporate Software Engineering Center Daniel SANGORRIN E-mail: daniel.sangorrin@toshiba.co.jp

Viresh Kumar

4:51 a.m.

On 25 April 2014 06:01, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...

...
...
I can't keep it as a separate patch and so would be required to merge it into my original patch..

And the reason being: "No patch is supposed to break things, otherwise git bisect wouldn't work smoothly".. And so git bisect would complain this issue after my patch and so I have to merge the fixes you gave into the original patch as its not yet merged. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Ok, no problem then.

Do you want me to add your Tested-by for my next version ?

Daniel Sangorrin

5:21 a.m.

On 2014/04/25 13:51, Viresh Kumar wrote:

...

On 25 April 2014 06:01, Daniel Sangorrin daniel.sangorrin@toshiba.co.jp wrote:

...
...
...
I can't keep it as a separate patch and so would be required to merge it into my original patch..

And the reason being: "No patch is supposed to break things, otherwise git bisect wouldn't work smoothly".. And so git bisect would complain this issue after my patch and so I have to merge the fixes you gave into the original patch as its not yet merged. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Ok, no problem then.

Do you want me to add your Tested-by for my next version ?

Sure.

Tested-by: Daniel Sangorrin daniel.sangorrin@toshiba.co.jp

-- Toshiba Corporate Software Engineering Center Daniel SANGORRIN E-mail: daniel.sangorrin@toshiba.co.jp

4201

days inactive

4237

days old

linaro-kernel@lists.linaro.org

18 comments

participants

tags (0)

participants (4)

Daniel Sangorrin
Kevin Hilman
Li Zefan
Viresh Kumar