Hi Again,
I have now succeeded in completely isolating a CPU using cpusets, NO_HZ_FULL and CPU hotplug.
My setup and requirements for those who weren't following the earlier mails:
For networking machines it is required to run data-plane threads on some CPUs (i.e. one thread per CPU), and these CPUs shouldn't be interrupted by the kernel at all.
Earlier I tried cpusets with NO_HZ by creating two groups with load balancing disabled between them, and manually moved all tasks out to the CPU0 group. But even then interruptions kept arriving on CPU1 (the CPU I am trying to isolate): some workqueue events, some timers (like prandom), and timer overflow events (NO_HZ_FULL pushes the hrtimer far ahead into the future, 450 seconds, rather than disabling it completely, and the hardware timers on the Samsung Exynos board overflow their counters after 90 seconds).
So after creating the cpusets I hot-unplugged CPU1 and added it back immediately. This moved all these interruptions away, and now CPU1 is running my single thread ("stress") forever.
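For readers trying to reproduce this, the procedure above might look roughly like the sketch below. The cgroup mount point, group names (gp0/gp1) and CPU numbers are illustrative assumptions for a 2-CPU system, not my exact setup.

```shell
# Sketch of the isolation setup described above (paths and names are
# assumptions; a 2-CPU system with the cpuset controller is assumed).
mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset

mkdir gp0 gp1
echo 0 > gp0/cpuset.cpus             # housekeeping group on CPU0
echo 1 > gp1/cpuset.cpus             # group to isolate on CPU1
echo 0 > gp0/cpuset.mems
echo 0 > gp1/cpuset.mems
echo 0 > cpuset.sched_load_balance   # stop balancing across the two groups

# Move every movable task into the housekeeping group.
for pid in $(cat tasks); do
    echo "$pid" > gp0/tasks 2>/dev/null
done

# Bounce CPU1 through hotplug to shake off already-pinned timers and
# workqueues, then start the data-plane thread in the isolated group.
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo $$ > gp1/tasks
stress --cpu 1 &
```

Kernel threads and per-CPU workers cannot be moved this way, which is exactly why the hotplug bounce is needed.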
Now my question is: is there anything particularly wrong with using hotplug here? Will that lead to a disaster? :)
Thanks in Advance.
-- viresh
On Wed, Jan 15, 2014 at 02:57:36PM +0530, Viresh Kumar wrote:
Hi Again,
I have now succeeded in completely isolating a CPU using cpusets, NO_HZ_FULL and CPU hotplug.
My setup and requirements for those who weren't following the earlier mails:
For networking machines it is required to run data-plane threads on some CPUs (i.e. one thread per CPU), and these CPUs shouldn't be interrupted by the kernel at all.
Earlier I tried cpusets with NO_HZ by creating two groups with load balancing disabled between them, and manually moved all tasks out to the CPU0 group. But even then interruptions kept arriving on CPU1 (the CPU I am trying to isolate): some workqueue events, some timers (like prandom), and timer overflow events (NO_HZ_FULL pushes the hrtimer far ahead into the future, 450 seconds, rather than disabling it completely, and the hardware timers on the Samsung Exynos board overflow their counters after 90 seconds).
So after creating the cpusets I hot-unplugged CPU1 and added it back immediately. This moved all these interruptions away, and now CPU1 is running my single thread ("stress") forever.
Now my question is: is there anything particularly wrong with using hotplug here? Will that lead to a disaster? :)
Nah, it's just ugly and we should fix it. You need to be careful not to place tasks in a cpuset you're going to unplug though; that'll give funny results.
On 15 January 2014 16:08, Peter Zijlstra peterz@infradead.org wrote:
Nah, it's just ugly and we should fix it. You need to be careful not to place tasks in a cpuset you're going to unplug though; that'll give funny results.
Okay. So how do you suggest we get rid of cases like a work item that is queued on CPU1 initially and, because it gets queued again from its own work handler, stays on the same CPU forever?
And then there were the timer overflow events, which occur because the tick-sched code starts an hrtimer 450 seconds later in time.
-- viresh
On Wed, Jan 15, 2014 at 04:17:26PM +0530, Viresh Kumar wrote:
On 15 January 2014 16:08, Peter Zijlstra peterz@infradead.org wrote:
Nah, it's just ugly and we should fix it. You need to be careful not to place tasks in a cpuset you're going to unplug though; that'll give funny results.
Okay. So how do you suggest we get rid of cases like a work item that is queued on CPU1 initially and, because it gets queued again from its own work handler, stays on the same CPU forever?
We should have a cpuset.quiesce control or something that moves all timers out.
And then there were the timer overflow events, which occur because the tick-sched code starts an hrtimer 450 seconds later in time.
-ENOPARSE
On 15 January 2014 17:04, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jan 15, 2014 at 04:17:26PM +0530, Viresh Kumar wrote:
On 15 January 2014 16:08, Peter Zijlstra peterz@infradead.org wrote:
Nah, it's just ugly and we should fix it. You need to be careful not to place tasks in a cpuset you're going to unplug though; that'll give funny results.
Okay. So how do you suggest we get rid of cases like a work item that is queued on CPU1 initially and, because it gets queued again from its own work handler, stays on the same CPU forever?
We should have a cpuset.quiesce control or something that moves all timers out.
What should we do here if we have a valid base->running_timer for the CPU requesting the quiesce?
On Wed, Jan 15, 2014 at 02:57:36PM +0530, Viresh Kumar wrote:
Hi Again,
I have now succeeded in completely isolating a CPU using cpusets, NO_HZ_FULL and CPU hotplug.
My setup and requirements for those who weren't following the earlier mails:
For networking machines it is required to run data-plane threads on some CPUs (i.e. one thread per CPU), and these CPUs shouldn't be interrupted by the kernel at all.
Earlier I tried cpusets with NO_HZ by creating two groups with load balancing disabled between them, and manually moved all tasks out to the CPU0 group. But even then interruptions kept arriving on CPU1 (the CPU I am trying to isolate): some workqueue events, some timers (like prandom), and timer overflow events (NO_HZ_FULL pushes the hrtimer far ahead into the future, 450 seconds, rather than disabling it completely, and the hardware timers on the Samsung Exynos board overflow their counters after 90 seconds).
Are you sure about that? NO_HZ_FULL shouldn't touch hrtimers much. Those are independent of the tick.
Some of them do seem to rely on the softirq, but that seems to concern the tick hrtimer only.
On 15 January 2014 22:47, Frederic Weisbecker fweisbec@gmail.com wrote:
Are you sure about that? NO_HZ_FULL shouldn't touch hrtimers much. Those are independent of the tick.
Some of them do seem to rely on the softirq, but that seems to concern the tick hrtimer only.
To make it clear, I was talking about the hrtimer used by tick_sched_timer. I have cross-checked which timers are active on the isolated CPU via /proc/timer_list, and it showed only tick_sched_timer's hrtimer.
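For reference, a quick way to do this check is to print only the isolated CPU's section of /proc/timer_list. The "cpu: N" section headers assumed by the awk pattern below are version-dependent; the file's exact layout varies across kernels.

```shell
# Print only CPU1's section of /proc/timer_list so any timer still
# armed on the isolated CPU stands out. Assumes the per-CPU sections
# start with a "cpu: N" header line (format varies by kernel version).
awk '/^cpu:/ { incpu = ($2 == 1) } incpu' /proc/timer_list
```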
In the attached trace (dft.txt), see these locations:
- Line 252 (time 302.573881): we scheduled the hrtimer 300 seconds ahead of the current time.
- Lines 254, 258, 262, 330, 334: we got interruptions continuously after ~90 seconds, and this looked like a case of the timer's counter overflowing. Isn't it? (I have removed some lines towards the end of this file to make it shorter, though dft.dat is untouched.)
-- viresh
On Thu, 16 Jan 2014, Viresh Kumar wrote:
On 15 January 2014 22:47, Frederic Weisbecker fweisbec@gmail.com wrote:
Are you sure about that? NO_HZ_FULL shouldn't touch hrtimers much. Those are independent of the tick.
Some of them do seem to rely on the softirq, but that seems to concern the tick hrtimer only.
To make it clear, I was talking about the hrtimer used by tick_sched_timer. I have cross-checked which timers are active on the isolated CPU via /proc/timer_list, and it showed only tick_sched_timer's hrtimer.
In the attached trace (dft.txt), see these locations:
- Line 252 (time 302.573881): we scheduled the hrtimer 300 seconds ahead of the current time.
- Lines 254, 258, 262, 330, 334: we got interruptions continuously after ~90 seconds, and this looked like a case of the timer's counter overflowing. Isn't it? (I have removed some lines towards the end of this file to make it shorter, though dft.dat is untouched.)
Just do the math.
max reload value / timer freq = max time span
So:
0x7fffffff / 24MHz = 89.478485 sec
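The same arithmetic can be reproduced from the shell: with a signed 32-bit reload value and a 24 MHz input clock, the counter spans just under 90 seconds.

```shell
# Max time span of a timer with a signed 32-bit reload value clocked
# at 24 MHz: 0x7fffffff / 24000000 ~= 89.48 seconds.
max_reload=$((0x7fffffff))
timer_hz=24000000
echo "$max_reload / $timer_hz = $((max_reload / timer_hz)) seconds (integer part)"
```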
Nothing to do here except to get rid of the requirement to arm the timer at all.
Thanks,
tglx
On 16 January 2014 15:16, Thomas Gleixner tglx@linutronix.de wrote:
Just do the math.
max reload value / timer freq = max time span
Thanks.
So:
0x7fffffff / 24MHz = 89.478485 sec
Nothing to do here except to get rid of the requirement to arm the timer at all.
@Frederic: Any inputs on how to get rid of this timer here?
On Mon, Jan 20, 2014 at 05:00:20PM +0530, Viresh Kumar wrote:
On 16 January 2014 15:16, Thomas Gleixner tglx@linutronix.de wrote:
Just do the math.
max reload value / timer freq = max time span
Thanks.
So:
0x7fffffff / 24MHz = 89.478485 sec
Nothing to do here except to get rid of the requirement to arm the timer at all.
@Frederic: Any inputs on how to get rid of this timer here?
I fear you can't. If you schedule a timer 4 seconds away and your clock device can only count up to 2 seconds, you can't avoid the interrupt in the middle that copes with the overflow.
So you need to act on the source of the timer:
* identify what causes this timer
* try to turn that feature off
* if you can't, then move the timer to the housekeeping CPU
I'll have a look into the latter point, to affine global timers to the housekeeping CPU. Per-CPU timers need more inspection though: either we rework them so that they can be handled by remote/housekeeping CPUs, or we allow the associated feature to be turned off. All in all it's case-by-case work.
On 20 January 2014 21:21, Frederic Weisbecker fweisbec@gmail.com wrote:
I fear you can't. If you schedule a timer 4 seconds away and your clock device can only count up to 2 seconds, you can't avoid the interrupt in the middle that copes with the overflow.
So you need to act on the source of the timer:
- identify what causes this timer
- try to turn that feature off
- if you can't, then move the timer to the housekeeping CPU
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
I'll have a look into the latter point, to affine global timers to the housekeeping CPU. Per-CPU timers need more inspection though: either we rework them so that they can be handled by remote/housekeeping CPUs, or we allow the associated feature to be turned off. All in all it's case-by-case work.
Which CPUs are housekeeping CPUs? How do we declare them?
On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
On 20 January 2014 21:21, Frederic Weisbecker fweisbec@gmail.com wrote:
I fear you can't. If you schedule a timer 4 seconds away and your clock device can only count up to 2 seconds, you can't avoid the interrupt in the middle that copes with the overflow.
So you need to act on the source of the timer:
- identify what causes this timer
- try to turn that feature off
- if you can't, then move the timer to the housekeeping CPU
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
So it's scheduled 300 seconds later. It might be a pending timer_list timer. Enabling the timer tracepoints may give you some clues.
I'll have a look into the latter point, to affine global timers to the housekeeping CPU. Per-CPU timers need more inspection though: either we rework them so that they can be handled by remote/housekeeping CPUs, or we allow the associated feature to be turned off. All in all it's case-by-case work.
Which CPUs are housekeeping CPUs? How do we declare them?
It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to define some general policy on various periodic/async work affinity to enforce isolation.
The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping CPU. Given that this CPU must keep a periodic tick, let's move all the unbound timers and workqueues there, and also try to move some CPU-affine work as well. For example we could handle the scheduler tick of the full-dynticks CPUs on that housekeeping CPU, at a low frequency. This way we could remove that 1-second scheduler-tick max deferment per CPU. It may be overkill though to run all the scheduler ticks on a single CPU, so there may be other ways to cope with that.
And I would like to keep that housekeeping notion flexible enough to be extendable on more than one CPU, as I heard that some people plan to reserve one CPU per node on big NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
Of course, if some people help contributing in this area, some things may eventually move forward on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection, apply AMD range breakpoints patches, further clean up posix cpu timers, etc...).
Thanks.
On 23 January 2014 20:28, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
So it's scheduled 300 seconds later. It might be a pending timer_list timer. Enabling the timer tracepoints may give you some clues.
The trace was done with that enabled. /proc/timer_list confirms that an hrtimer is queued 300 seconds later for tick_sched_timer, and so I assumed this is part of the current NO_HZ_FULL implementation.
Just to confirm: when we decide that a CPU is running a single task and so can enter tickless mode, do we queue this tick_sched_timer 300 seconds ahead of time? If not, then who is doing this? :)
Which CPUs are housekeeping CPUs? How do we declare them?
It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to define some general policy on various periodic/async work affinity to enforce isolation.
The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping CPU. Given that this CPU must keep a periodic tick, let's move all the unbound timers and workqueues there, and also try to move some CPU-affine work as well. For example we could handle the scheduler tick of the full-dynticks CPUs on that housekeeping CPU, at a low frequency. This way we could remove that 1-second scheduler-tick max deferment per CPU. It may be overkill though to run all the scheduler ticks on a single CPU, so there may be other ways to cope with that.
And I would like to keep that housekeeping notion flexible enough to be extendable on more than one CPU, as I heard that some people plan to reserve one CPU per node on big NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
Of course, if some people help contributing in this area, some things may eventually move forward on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection, apply AMD range breakpoints patches, further clean up posix cpu timers, etc...).
I see. As I am currently working on the isolation stuff, which is very much required for my use case, I will try to do that as the second step of my work. The first step remains something like the cpuset.quiesce option that PeterZ suggested.
Any pointers to earlier discussions on this topic would be helpful for starting work on this.
On Fri, 2014-01-24 at 10:51 +0530, Viresh Kumar wrote:
On 23 January 2014 20:28, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
So it's scheduled 300 seconds later. It might be a pending timer_list timer. Enabling the timer tracepoints may give you some clues.
The trace was done with that enabled. /proc/timer_list confirms that an hrtimer is queued 300 seconds later for tick_sched_timer, and so I assumed this is part of the current NO_HZ_FULL implementation.
Just to confirm: when we decide that a CPU is running a single task and so can enter tickless mode, do we queue this tick_sched_timer 300 seconds ahead of time? If not, then who is doing this? :)
Which CPUs are housekeeping CPUs? How do we declare them?
It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to define some general policy on various periodic/async work affinity to enforce isolation.
The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping CPU. Given that this CPU must keep a periodic tick, let's move all the unbound timers and workqueues there, and also try to move some CPU-affine work as well. For example we could handle the scheduler tick of the full-dynticks CPUs on that housekeeping CPU, at a low frequency. This way we could remove that 1-second scheduler-tick max deferment per CPU. It may be overkill though to run all the scheduler ticks on a single CPU, so there may be other ways to cope with that.
And I would like to keep that housekeeping notion flexible enough to be extendable on more than one CPU, as I heard that some people plan to reserve one CPU per node on big NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
Of course, if some people help contributing in this area, some things may eventually move forward on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection, apply AMD range breakpoints patches, further clean up posix cpu timers, etc...).
I see. As I am currently working on the isolation stuff, which is very much required for my use case, I will try to do that as the second step of my work. The first step remains something like the cpuset.quiesce option that PeterZ suggested.
Any pointers to earlier discussions on this topic would be helpful for starting work on this.
All of that nohz_full stuff would be a lot more usable if it were dynamic via cpusets. As the thing sits, if you need a small group of tickless cores once in a while, you have to eat a truckload of overhead and a zillion threads all the time. The price is high.
I have a little hack for my -rt kernel that allows the user to turn the tick on/off (and cpupri) on a per fully-isolated-set basis, because jitter is lower with the tick than with nohz doing its thing. With that, you can set up whatever portion of the box to meet your needs on the fly. When you need very low jitter, turn all load balancing off in your critical set, turn nohz off, turn rt load balancing off, and 80-core boxen become usable for cool zillion-dollar realtime video games... the box becomes a militarized playstation.
Doing the same with nohz_full would be a _lot_ harder (my hacks are trivial), but would be a lot more attractive to users than always eating the high nohz_full cost whether using it or not. Poke buttons, threads are born or die, patch in/out expensive accounting goop and whatnot, play evil high-speed stock market bandit, or whatever else, at the poke of a couple of buttons.
-Mike
On Fri, Jan 24, 2014 at 10:51:14AM +0530, Viresh Kumar wrote:
On 23 January 2014 20:28, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
So it's scheduled 300 seconds later. It might be a pending timer_list timer. Enabling the timer tracepoints may give you some clues.
The trace was done with that enabled. /proc/timer_list confirms that an hrtimer is queued 300 seconds later for tick_sched_timer, and so I assumed this is part of the current NO_HZ_FULL implementation.
Just to confirm: when we decide that a CPU is running a single task and so can enter tickless mode, do we queue this tick_sched_timer 300 seconds ahead of time? If not, then who is doing this? :)
No, when a single task is running on a full-dynticks CPU, the tick is supposed to run every second. I'm actually surprised it doesn't happen in your traces; did you tweak something specific?
The 300-second timer is probably due to a timer_list timer; just enable the timer_start and timer_expire_entry events to get the name of the culprit.
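Enabling just those two tracepoints and filtering for the isolated CPU is enough to name the culprit. The tracefs path below is an assumption; depending on the kernel it may also be mounted at /sys/kernel/tracing.

```shell
# Enable only the two timer tracepoints mentioned above, capture a
# short window, and keep the entries logged on CPU1 (tagged [001]).
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo 1 > events/timer/timer_start/enable
echo 1 > events/timer/timer_expire_entry/enable
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
grep '\[001\]' trace
```

The timer_start events show which function armed each timer, which is exactly the information needed to track the source down.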
Which CPUs are housekeeping CPUs? How do we declare them?
It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to define some general policy on various periodic/async work affinity to enforce isolation.
The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping CPU. Given that this CPU must keep a periodic tick, let's move all the unbound timers and workqueues there, and also try to move some CPU-affine work as well. For example we could handle the scheduler tick of the full-dynticks CPUs on that housekeeping CPU, at a low frequency. This way we could remove that 1-second scheduler-tick max deferment per CPU. It may be overkill though to run all the scheduler ticks on a single CPU, so there may be other ways to cope with that.
And I would like to keep that housekeeping notion flexible enough to be extendable on more than one CPU, as I heard that some people plan to reserve one CPU per node on big NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
Of course, if some people help contributing in this area, some things may eventually move forward on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection, apply AMD range breakpoints patches, further clean up posix cpu timers, etc...).
I see. As I am currently working on the isolation stuff, which is very much required for my use case, I will try to do that as the second step of my work. The first step remains something like the cpuset.quiesce option that PeterZ suggested.
Cool!
Any pointers to earlier discussions on this topic would be helpful for starting work on this.
I think that being able to control the UNBOUND workqueue affinity may be a nice first step.
Thanks.
On Tue, Jan 28, 2014 at 5:23 AM, Frederic Weisbecker fweisbec@gmail.com wrote:
On Fri, Jan 24, 2014 at 10:51:14AM +0530, Viresh Kumar wrote:
On 23 January 2014 20:28, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
So, the main problem in my case was caused by this:
<...>-2147 [001] d..2 302.573881: hrtimer_start: hrtimer=c172aa50 function=tick_sched_timer expires=602075000000 softexpires=602075000000
I mentioned this earlier when I sent you the attachments. I think this is somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds after the current time.
How to get this out?
So it's scheduled 300 seconds later. It might be a pending timer_list timer. Enabling the timer tracepoints may give you some clues.
The trace was done with that enabled. /proc/timer_list confirms that an hrtimer is queued 300 seconds later for tick_sched_timer, and so I assumed this is part of the current NO_HZ_FULL implementation.
Just to confirm: when we decide that a CPU is running a single task and so can enter tickless mode, do we queue this tick_sched_timer 300 seconds ahead of time? If not, then who is doing this? :)
No, when a single task is running on a full-dynticks CPU, the tick is supposed to run every second. I'm actually surprised it doesn't happen in your traces; did you tweak something specific?
I think Viresh is using my patch/hack to configure/disable the 1Hz residual tick.
Kevin
On 28 January 2014 21:41, Kevin Hilman khilman@linaro.org wrote:
I think Viresh is using my patch/hack to configure/disable the 1Hz residual tick.
Yeah, I am using sched_tick_max_deferment, setting it to -1. Why do we currently need a timer every second for NO_HZ_FULL?
On 28 January 2014 18:53, Frederic Weisbecker fweisbec@gmail.com wrote:
No, when a single task is running on a full-dynticks CPU, the tick is supposed to run every second. I'm actually surprised it doesn't happen in your traces; did you tweak something specific?
Why do we need this 1-second tick currently? And what will happen if I hot-unplug that CPU and bring it back, would the tick timer move away from the CPU in question? I see that happen when I change this 1-second value to 300 seconds. But what would be the impact of that? Will things still work normally?
On Tue, Feb 11, 2014 at 02:22:43PM +0530, Viresh Kumar wrote:
On 28 January 2014 18:53, Frederic Weisbecker fweisbec@gmail.com wrote:
No, when a single task is running on a full-dynticks CPU, the tick is supposed to run every second. I'm actually surprised it doesn't happen in your traces; did you tweak something specific?
Why do we need this 1-second tick currently? And what will happen if I hot-unplug that CPU and bring it back, would the tick timer move away from the CPU in question? I see that happen when I change this 1-second value to 300 seconds. But what would be the impact of that? Will things still work normally?
So the problem resides in the gazillion accounting updates maintained in scheduler_tick() and current->sched_class->task_tick().
The scheduler's correctness depends on these being updated regularly. If you deactivate them, or increase the delay to very high values, the result is unpredictable. Just expect that at least some scheduler features will behave randomly, like load balancing for example, or simply local fairness issues.
So we have that 1 Hz max deferment, which makes sure that things keep moving forward, at a rate that should still be nice for HPC workloads. But we certainly want to find a way to remove the need for any tick altogether for extreme real-time workloads, which need guarantees rather than just optimizations.
I see two potential solutions for that:
1) Rework the scheduler accounting such that it is safe against full dynticks. That was the initial plan, but it's scary: the scheduler accounting is a huge maze, and I'm not sure it's actually worth the complication.
2) Offload the accounting. For example we could imagine that the timekeeping CPU handles the task_tick() calls on behalf of the full-dynticks CPUs, at a small rate like 1 Hz.
Hi Viresh,
On Wed, Jan 15, 2014 at 5:27 PM, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Again,
I have now succeeded in completely isolating a CPU using cpusets, NO_HZ_FULL and CPU hotplug.
My setup and requirements for those who weren't following the earlier mails:
For networking machines it is required to run data-plane threads on some CPUs (i.e. one thread per CPU), and these CPUs shouldn't be interrupted by the kernel at all.
Earlier I tried cpusets with NO_HZ by creating two groups with load balancing disabled between them, and manually moved all tasks out to the CPU0 group. But even then interruptions kept arriving on CPU1 (the CPU I am trying to isolate): some workqueue events, some timers (like prandom), and timer overflow events (NO_HZ_FULL pushes the hrtimer far ahead into the future, 450 seconds, rather than disabling it completely, and the hardware timers on the Samsung Exynos board overflow their counters after 90 seconds).
So after creating the cpusets I hot-unplugged CPU1 and added it back immediately. This moved all these interruptions away, and now CPU1 is running my single thread ("stress") forever.
I have one question regarding unbound-workqueue migration in your case. You use hotplug to migrate the unbound work to other CPUs, but its CPU mask would still be 0xf, since it cannot be changed by cpuset.
My question is: how could you prevent this unbound work from migrating back to your isolated CPU? It seems to me that there is no such mechanism in the kernel; do I understand it wrongly?
Thanks, Lei
Now my question is: is there anything particularly wrong with using hotplug here? Will that lead to a disaster? :)
Thanks in Advance.
-- viresh
linaro-kernel mailing list linaro-kernel@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-kernel
On 20 January 2014 19:29, Lei Wen adrian.wenl@gmail.com wrote:
Hi Viresh,
Hi Lei,
I have one question regarding unbound-workqueue migration in your case. You use hotplug to migrate the unbound work to other CPUs, but its CPU mask would still be 0xf, since it cannot be changed by cpuset.
My question is: how could you prevent this unbound work from migrating back to your isolated CPU? It seems to me that there is no such mechanism in the kernel; do I understand it wrongly?
These work items are normally queued back from their own work handler, and we normally queue them on the local CPU; that's the default behavior of the workqueue subsystem. And so they land on the same CPU again and again.
On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
On 20 January 2014 19:29, Lei Wen adrian.wenl@gmail.com wrote:
Hi Viresh,
Hi Lei,
I have one question regarding unbound-workqueue migration in your case. You use hotplug to migrate the unbound work to other CPUs, but its CPU mask would still be 0xf, since it cannot be changed by cpuset.
My question is: how could you prevent this unbound work from migrating back to your isolated CPU? It seems to me that there is no such mechanism in the kernel; do I understand it wrongly?
These work items are normally queued back from their own work handler, and we normally queue them on the local CPU; that's the default behavior of the workqueue subsystem. And so they land on the same CPU again and again.
But for workqueues having a global affinity, I think they can be rescheduled later on the old CPUs, although I'm not sure about that; I'm Cc'ing Tejun.
Also, one of the plans is to extend the sysfs interface of workqueues to override their affinity. If any of you guys wants to try something there, that would be welcome. We also want to work on the timer affinity. Perhaps we don't need a user interface for that, or maybe something on top of full dynticks to express that we want the unbound timers to run on housekeeping CPUs only.
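For workqueues already registered with WQ_SYSFS, an affinity knob is exposed under sysfs today. A sketch follows; which workqueues appear there, and whether "writeback" is among them, depends on the kernel version and configuration, so treat the names below as assumptions.

```shell
# List the workqueues that expose sysfs attributes (only those created
# with WQ_SYSFS show up here), then pin one of them to CPU0.
ls /sys/devices/virtual/workqueue/
echo 1 > /sys/devices/virtual/workqueue/writeback/cpumask   # mask 0x1 = CPU0
cat /sys/devices/virtual/workqueue/writeback/cpumask
```

Extending this interface to more (or all) unbound workqueues is essentially the plan described above.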
On Mon, Jan 20, 2014 at 11:41 PM, Frederic Weisbecker fweisbec@gmail.com wrote:
On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
On 20 January 2014 19:29, Lei Wen adrian.wenl@gmail.com wrote:
Hi Viresh,
Hi Lei,
I have one question regarding unbound-workqueue migration in your case. You use hotplug to migrate the unbound work to other CPUs, but its CPU mask would still be 0xf, since it cannot be changed by cpuset.
My question is: how could you prevent this unbound work from migrating back to your isolated CPU? It seems to me that there is no such mechanism in the kernel; do I understand it wrongly?
These work items are normally queued back from their own work handler, and we normally queue them on the local CPU; that's the default behavior of the workqueue subsystem. And so they land on the same CPU again and again.
But for workqueues having a global affinity, I think they can be rescheduled later on the old CPUs, although I'm not sure about that; I'm Cc'ing Tejun.
Agreed. Since the worker thread is allowed to run on all CPUs, it cannot prevent the scheduler from doing the migration.
But here is one point: I see Viresh already set up two cpusets with scheduler load balancing disabled between them, so shouldn't that stop task migration between those two groups, since the sched_domains changed?
What's more, I also did a similar test and found that when I set up two such cpuset groups, say cores 0-2 in cpuset1 and core 3 in cpuset2, and then hot-unplug core 3, the cpuset's cpus member becomes NULL even after I hotplug core 3 back again. So is it a bug?
Thanks, Lei
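The two-cpuset layout described above can be sketched against the legacy cpuset filesystem. The mount point and flat file names are assumptions (on cgroup mounts the files carry a `cpuset.` prefix); CPUSET_ROOT defaults to a scratch directory so the commands can be dry-run:

```shell
# Sketch of the housekeeping/isolated cpuset split. On a real system CS would
# be a mounted cpuset filesystem, e.g. mount -t cpuset none /dev/cpuset.
CS=${CPUSET_ROOT:-$(mktemp -d)}
mkdir -p "$CS/housekeeping" "$CS/isolated"
echo 0 > "$CS/housekeeping/mems"   # cpuset requires mems to be set too
echo 0 > "$CS/isolated/mems"
echo 0-2 > "$CS/housekeeping/cpus"
echo 3   > "$CS/isolated/cpus"
echo 0 > "$CS/sched_load_balance"              # no balancing at the root...
echo 1 > "$CS/housekeeping/sched_load_balance" # ...only inside housekeeping
cat "$CS/isolated/cpus"            # prints: 3
```

Disabling sched_load_balance at the root while enabling it only in the housekeeping group is what splits the sched_domains between the two groups.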
Also, one of the plans is to extend the sysfs interface of workqueues to override their affinity. If any of you guys want to try something there, that would be welcome. We also want to work on timer affinity. Perhaps we don't need a user interface for that, or maybe something on top of full dynticks to indicate that we want unbound timers to run on housekeeping CPUs only.
On 21 January 2014 07:37, Lei Wen adrian.wenl@gmail.com wrote:
What's more, I also did a similar test and found that when I set up two such cpuset groups, say cores 0-2 in cpuset1 and core 3 in cpuset2, and then hot-unplug core 3, the cpuset's cpus member becomes NULL even after I hotplug core 3 back again. So is it a bug?
I confirm the same :)
On Tue, Jan 21, 2014 at 10:07:58AM +0800, Lei Wen wrote:
On Mon, Jan 20, 2014 at 11:41 PM, Frederic Weisbecker fweisbec@gmail.com wrote:
On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
On 20 January 2014 19:29, Lei Wen adrian.wenl@gmail.com wrote:
Hi Viresh,
Hi Lei,
I have one question regarding unbound workqueue migration in your case. You use hotplug to migrate the unbound work to other CPUs, but its cpumask would still be 0xf, since it cannot be changed by cpuset.
My question is: how can you prevent this unbound work from migrating back to your isolated CPU? It seems to me there is no such mechanism in the kernel; do I understand it wrong?
These works are normally queued back from the work handler itself, and by default the workqueue subsystem queues them on the local CPU. So they land on the same CPU again and again.
But for workqueues having a global affinity, I think they can be rescheduled later on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.
Agreed. Since the worker thread is allowed to run on all CPUs, it cannot prevent the scheduler from doing the migration.
But here is one point: I see Viresh already set up two cpusets with scheduler load balancing disabled between them, so shouldn't that stop task migration between those two groups, since the sched_domains changed?
What's more, I also did a similar test and found that when I set up two such cpuset groups, say cores 0-2 in cpuset1 and core 3 in cpuset2, and then hot-unplug core 3, the cpuset's cpus member becomes NULL even after I hotplug core 3 back again. So is it a bug?
Not sure, you may need to check cpuset internals.
On 23 January 2014 19:24, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Jan 21, 2014 at 10:07:58AM +0800, Lei Wen wrote:
I find the cpuset's cpus member becomes NULL even after I hotplug core 3 back again. So is it a bug?
Not sure, you may need to check cpuset internals.
I think this is the correct behavior. Userspace must decide what to do with that CPU once it is back. Simply reverting to earlier cpusets configuration might not be the right approach.
Also, what if the cpusets have been reconfigured in between the hotplug events?
On 20 January 2014 21:11, Frederic Weisbecker fweisbec@gmail.com wrote:
But for workqueues having a global affinity, I think they can be rescheduled later on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.
Works queued on workqueues with the WQ_UNBOUND flag set can run on any CPU, as decided by the scheduler, whereas works queued on workqueues without this flag, and without a CPU number specified at queuing time, always run on the local CPU.
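The distinction also shows up in the worker thread names visible in ps: bound pool workers are named kworker/&lt;cpu&gt;:&lt;id&gt; while unbound pool workers are named kworker/u&lt;pool&gt;:&lt;id&gt;. A small sketch classifying a few made-up examples:

```shell
# Classify worker threads by name: a leading "u" after "kworker/" marks an
# unbound pool worker; otherwise the first field is the CPU it is bound to.
for w in kworker/0:1 kworker/2:3 kworker/u8:0; do
  rest=${w#kworker/}
  case "$rest" in
    u*) echo "$w -> unbound pool worker (scheduler picks the CPU)" ;;
    *)  echo "$w -> bound to CPU ${rest%%:*}" ;;
  esac
done
```

On a live system the same classification can be applied to the output of `ps -eo comm | grep '^kworker'` to see which workers could land on the isolated CPU.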
Also, one of the plans is to extend the sysfs interface of workqueues to override their affinity. If any of you guys want to try something there, that would be welcome. We also want to work on timer affinity. Perhaps we don't need a user interface for that, or maybe something on top of full dynticks to indicate that we want unbound timers to run on housekeeping CPUs only.
What about a quiesce option, as mentioned by PeterZ? With that we can move all UNBOUND timers and workqueues away. But to guarantee that they don't get queued there again later, we need similar updates in the workqueue/timer subsystems to disallow queuing any such work on those cpusets.
On Tue, Jan 21, 2014 at 03:19:36PM +0530, Viresh Kumar wrote:
On 20 January 2014 21:11, Frederic Weisbecker fweisbec@gmail.com wrote:
But for workqueues having a global affinity, I think they can be rescheduled later on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.
Works queued on workqueues with the WQ_UNBOUND flag set can run on any CPU, as decided by the scheduler, whereas works queued on workqueues without this flag, and without a CPU number specified at queuing time, always run on the local CPU.
Ok, so it is fine to migrate the latter kind I guess?
Also, one of the plans is to extend the sysfs interface of workqueues to override their affinity. If any of you guys want to try something there, that would be welcome. We also want to work on timer affinity. Perhaps we don't need a user interface for that, or maybe something on top of full dynticks to indicate that we want unbound timers to run on housekeeping CPUs only.
What about a quiesce option, as mentioned by PeterZ? With that we can move all UNBOUND timers and workqueues away. But to guarantee that they don't get queued there again later, we need similar updates in the workqueue/timer subsystems to disallow queuing any such work on those cpusets.
I haven't checked the details, but then this quiesce option would involve a dependency on cpuset for any workload involving workqueue affinity. I'm not sure we really want this. Besides, workqueues have an existing sysfs interface that can be easily extended.
Now indeed we may also want to enforce some policy to make sure that further created and queued workqueues are affine to a specific subset of CPUs. And then cpuset sounds like a good idea :)
On 23 January 2014 19:31, Frederic Weisbecker fweisbec@gmail.com wrote:
Ok, so it is fine to migrate the latter kind I guess?
Unless somebody has abused the API and used bound workqueues where he should have used unbound ones.
I haven't checked the details, but then this quiesce option would involve a dependency on cpuset for any workload involving workqueue affinity. I'm not sure we really want this. Besides, workqueues have an existing sysfs interface that can be easily extended.
Now indeed we may also want to enforce some policy to make sure that further created and queued workqueues are affine to a specific subset of CPUs. And then cpuset sounds like a good idea :)
Exactly. Cpuset would be more useful here. Probably we can keep both the cpuset and workqueue-sysfs interfaces.
I will try to add this option under cpuset, which will initially move timers and workqueues away from the cpuset in question.
linaro-kernel@lists.linaro.org