Hi Frederic/Kevin,
I was doing some work where I was required to use NO_HZ_FULL on core 1 on a dual core ARM machine.
I observed that I was able to isolate the second core using cpusets, but whenever the tick occurs, it occurs twice, i.e. the timer count gets updated by two every time my core is disturbed.
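For reference, my setup is roughly the following (just a sketch of what I am doing: the kernel is built with CONFIG_NO_HZ_FULL, nohz_full=1 is on the command line, the cgroup mount point and CPU numbers are from my dual-core board, and a complete setup also confines everything else to CPU 0):

mkdir -p /sys/fs/cgroup/cpuset
mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset
mkdir /sys/fs/cgroup/cpuset/isolated
echo 1 > /sys/fs/cgroup/cpuset/isolated/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/isolated/cpuset.mems
echo 0 > /sys/fs/cgroup/cpuset/isolated/cpuset.sched_load_balance
echo $$ > /sys/fs/cgroup/cpuset/isolated/tasks   # move this shell here, then exec the test task (stress)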
I tried to trace it (output attached) and found this sequence (talking only about core 1 here):
- Single task was running on Core 1 (using cpusets)
- got an arch_timer interrupt
- started servicing vmstat stuff
- so came out of NO_HZ_FULL domain as there is more than one task on the core
- queued work again and went to the existing single task (stress)
- again got arch_timer interrupt after 5 ms (HZ=200)
- got "tick_stop" event and went into NO_HZ_FULL domain again
- got isolated again for a long duration
So the query is: why don't we check that at the end of servicing vmstat stuff and migrating back to "stress" ??
Thanks.
-- viresh
On Tue, Dec 03, 2013 at 01:57:37PM +0530, Viresh Kumar wrote:
Hi Frederic/Kevin,
I was doing some work where I was required to use NO_HZ_FULL on core 1 on a dual core ARM machine.
I observed that I was able to isolate the second core using cpusets, but whenever the tick occurs, it occurs twice, i.e. the timer count gets updated by two every time my core is disturbed.
I tried to trace it (output attached) and found this sequence (Talking only about core 1 here):
- Single task was running on Core 1 (using cpusets)
- got an arch_timer interrupt
- started servicing vmstat stuff
- so came out of NO_HZ_FULL domain as there is more than one task on the core
- queued work again and went to the existing single task (stress)
- again got arch_timer interrupt after 5 ms (HZ=200)
Right, looking at the details, the 2nd interrupt is caused by workqueue delayed work bdi writeback.
- got "tick_stop" event and went into NO_HZ_FULL domain again..
- Got isolated again for long duration..
So the query is: why don't we check that at the end of servicing vmstat stuff and migrating back to "stress" ??
I fear I don't understand your question. Do you mean: why don't we prevent that bdi writeback work from running when we are in full dynticks mode?
We can't just ignore workqueue and timer callbacks when they are scheduled, otherwise the kernel is going to behave randomly.
OTOH what we can do is work on these per-cpu workqueues and timers and do what's necessary to prevent them from firing, as explained in detail in Documentation/kernel-per-CPU-kthreads.txt.
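For instance, the vmstat work that hit you in your trace has a tunable interval, so you can at least make it fire much less often (a sketch, the value is just an example; the default is 1 second):

# update vm statistics every 60 seconds instead of every second
echo 60 > /proc/sys/vm/stat_interval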
There is also the problem of unbound workqueues for which we don't have a solution yet. But the idea is that we could tweak their affinity from sysfs.
Thanks.
-- viresh
Hey, guys.
On Wed, Dec 11, 2013 at 02:22:14PM +0100, Frederic Weisbecker wrote:
I fear I don't understand your question. Do you mean: why don't we prevent that bdi writeback work from running when we are in full dynticks mode?
We can't just ignore workqueue and timer callbacks when they are scheduled, otherwise the kernel is going to behave randomly.
OTOH what we can do is work on these per-cpu workqueues and timers and do what's necessary to prevent them from firing, as explained in detail in Documentation/kernel-per-CPU-kthreads.txt.
Hmmm... some per-cpu workqueues can be turned into unbound ones and the writeback is one of those. Currently, this is used for powersaving on mobile but could also be useful for jitter control. In the long term, it could be beneficial to strictly distinguish the workqueues which really need per-cpu behavior and the ones which are per-cpu just for optimization.
There is also the problem of unbound workqueues for which we don't have a solution yet. But the idea is that we could tweak their affinity from sysfs.
Yes, this is a long term todo item but I'm currently a bit too swamped to tackle it myself. cc'ing Lai, who has pretty good knowledge of workqueue internals, and Bandan, who seemed interested in working on implementing default attrs.
Thanks.
Tejun Heo tj@kernel.org writes:
Hey, guys.
On Wed, Dec 11, 2013 at 02:22:14PM +0100, Frederic Weisbecker wrote:
I fear I don't understand your question. Do you mean: why don't we prevent that bdi writeback work from running when we are in full dynticks mode?
We can't just ignore workqueue and timer callbacks when they are scheduled, otherwise the kernel is going to behave randomly.
OTOH what we can do is work on these per-cpu workqueues and timers and do what's necessary to prevent them from firing, as explained in detail in Documentation/kernel-per-CPU-kthreads.txt.
Hmmm... some per-cpu workqueues can be turned into unbound ones and the writeback is one of those.
Ah, looks like the writeback one is already unbound, and configurable from sysfs.
Viresh, add this to your test script, and it should get this workqueue out of the way:
# pin the writeback workqueue to CPU0
echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
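You can read the same file back to confirm the new mask took effect:

cat /sys/bus/workqueue/devices/writeback/cpumask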
Kevin
Currently, this is used for powersaving on mobile but could also be useful for jitter control. In the long term, it could be beneficial to strictly distinguish the workqueues which really need per-cpu behavior and the ones which are per-cpu just for optimization.
There is also the problem of unbound workqueues for which we don't have a solution yet. But the idea is that we could tweak their affinity from sysfs.
Yes, this is a long term todo item but I'm currently a bit too swamped to tackle it myself. cc'ing Lai, who has pretty good knowledge of workqueue internals, and Bandan, who seemed interested in working on implementing default attrs.
Thanks.
Sorry for the delay, was on holidays..
On 11 December 2013 18:52, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Dec 03, 2013 at 01:57:37PM +0530, Viresh Kumar wrote:
- again got arch_timer interrupt after 5 ms (HZ=200)
Right, looking at the details, the 2nd interrupt is caused by workqueue delayed work bdi writeback.
I am not that great at reading traces or kernelshark output, but I still feel I haven't seen anything wrong. And I wasn't talking about the delayed workqueue here..
I am looking at the trace I attached with kernelshark after filtering out CPU0 events:
- Event 41, timestamp: 159.891973
- it ends at event 56, timestamp: 159.892043
And after that, the next event comes after 5 seconds.
And so I was talking about Event 41.
So the query is: why don't we check that at the end of servicing vmstat stuff and migrating back to "stress" ??
I fear I don't understand your question. Do you mean: why don't we prevent that bdi writeback work from running when we are in full dynticks mode?
No..
Viresh Kumar viresh.kumar@linaro.org writes:
Sorry for the delay, was on holidays..
On 11 December 2013 18:52, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Dec 03, 2013 at 01:57:37PM +0530, Viresh Kumar wrote:
- again got arch_timer interrupt after 5 ms (HZ=200)
Right, looking at the details, the 2nd interrupt is caused by workqueue delayed work bdi writeback.
I am not that great at reading traces or kernelshark output, but I still feel I haven't seen anything wrong. And I wasn't talking about the delayed workqueue here..
I am looking at the trace I attached with kernelshark after filtering out CPU0 events:
- Event 41, timestamp: 159.891973
- it ends at event 56, timestamp: 159.892043
For future reference, for generating email friendly trace output for discussion like this, you can use something like:
trace-cmd report --cpu=1 trace.dat
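The matching record step can be something as simple as (adjust the events and duration to whatever you need):

trace-cmd record -e sched -e irq -e timer sleep 30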
And after that, the next event comes after 5 seconds.
And so I was talking about Event 41.
That first event (Event 41) is an interrupt, and comes from the scheduler tick. The tick is happening because the writeback workqueue just ran and we're not in NO_HZ mode.
However, as soon as that IRQ (and the resulting softirqs) are finished, we enter NO_HZ mode again. But as you mention, it only lasts for ~5 sec before the timer fires again. Once again, it fires because of the writeback workqueue, and soon thereafter it switches back to NO_HZ mode.
So the solution to avoid this jitter on the NO_HZ CPU is to set the affinity of the writeback workqueue to CPU0:
# pin the writeback workqueue to CPU0
echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
I suspect by doing that, you will no longer see the jitter.
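To double check, you could trace just the events of interest and confirm that CPU1 stays quiet, something like:

trace-cmd record -e timer:tick_stop -e irq:irq_handler_entry -e workqueue:workqueue_execute_start sleep 60
trace-cmd report --cpu=1 trace.dat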
Kevin
On Tue, Dec 17, 2013 at 08:35:39AM -0800, Kevin Hilman wrote:
Viresh Kumar viresh.kumar@linaro.org writes:
Sorry for the delay, was on holidays..
On 11 December 2013 18:52, Frederic Weisbecker fweisbec@gmail.com wrote:
On Tue, Dec 03, 2013 at 01:57:37PM +0530, Viresh Kumar wrote:
- again got arch_timer interrupt after 5 ms (HZ=200)
Right, looking at the details, the 2nd interrupt is caused by workqueue delayed work bdi writeback.
I am not that great at reading traces or kernelshark output, but I still feel I haven't seen anything wrong. And I wasn't talking about the delayed workqueue here..
I am looking at the trace I attached with kernelshark after filtering out CPU0 events:
- Event 41, timestamp: 159.891973
- it ends at event 56, timestamp: 159.892043
For future reference, for generating email friendly trace output for discussion like this, you can use something like:
trace-cmd report --cpu=1 trace.dat
And after that, the next event comes after 5 seconds.
And so I was talking about Event 41.
That first event (Event 41) is an interrupt, and comes from the scheduler tick. The tick is happening because the writeback workqueue just ran and we're not in NO_HZ mode.
However, as soon as that IRQ (and the resulting softirqs) are finished, we enter NO_HZ mode again. But as you mention, it only lasts for ~5 sec before the timer fires again. Once again, it fires because of the writeback workqueue, and soon thereafter it switches back to NO_HZ mode.
So the solution to avoid this jitter on the NO_HZ CPU is to set the affinity of the writeback workqueue to CPU0:
# pin the writeback workqueue to CPU0
echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
Very interesting trick, I'm going to add it to my dyntick-testing suite.
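Probably just something like this in the setup phase (a sketch, guarded in case the kernel doesn't expose the attribute):

if [ -w /sys/bus/workqueue/devices/writeback/cpumask ]; then
        # keep bdi writeback work off the isolated CPUs
        echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
fi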
Thanks!
On 17 December 2013 22:05, Kevin Hilman khilman@linaro.org wrote:
For future reference, for generating email friendly trace output for discussion like this, you can use something like:
trace-cmd report --cpu=1 trace.dat
Okay..
And after that, the next event comes after 5 seconds.
And so I was talking about Event 41.
That first event (Event 41) is an interrupt, and comes from the scheduler tick. The tick is happening because the writeback workqueue just ran and we're not in NO_HZ mode.
This is what I was trying to ask: why can't we enter NO_HZ_FULL mode as soon as the writeback workqueue has run? That way we could go into NOHZ mode earlier.
However, as soon as that IRQ (and the resulting softirqs) are finished, we enter NO_HZ mode again. But as you mention, it only lasts for ~5 sec before the timer fires again. Once again, it fires because of the writeback workqueue, and soon thereafter it switches back to NO_HZ mode.
That's fine.. It wasn't part of my query :) .. But yes your trick would be useful for my usecase :)
Viresh Kumar viresh.kumar@linaro.org writes:
On 17 December 2013 22:05, Kevin Hilman khilman@linaro.org wrote:
For future reference, for generating email friendly trace output for discussion like this, you can use something like:
trace-cmd report --cpu=1 trace.dat
Okay..
And after that, the next event comes after 5 seconds.
And so I was talking about Event 41.
That first event (Event 41) is an interrupt, and comes from the scheduler tick. The tick is happening because the writeback workqueue just ran and we're not in NO_HZ mode.
This is what I was trying to ask: why can't we enter NO_HZ_FULL mode as soon as the writeback workqueue has run? That way we could go into NOHZ mode earlier.
Ah, I see. So you're basically asking why we can't evaluate whether to turn off the tick more often, for example right after the workqueues are done. I suppose Frederic may have some views on that, but there's likely additional overhead from those checks, and workqueues may not be the only thing keeping us out of NO_HZ.
Kevin
On 18 December 2013 19:21, Kevin Hilman khilman@linaro.org wrote:
Ah, I see. So you're basically asking why we can't evaluate whether to turn off the tick more often, for example right after the workqueues are done. I suppose Frederic may have some views on that, but there's likely additional overhead from those checks, and workqueues may not be the only thing keeping us out of NO_HZ.
I see that sched_switch is called at the end most of the time, so a check there might be useful?
Adding Ingo/Peter..
On 18 December 2013 20:03, Viresh Kumar viresh.kumar@linaro.org wrote:
On 18 December 2013 19:21, Kevin Hilman khilman@linaro.org wrote:
Ah, I see. So you're basically asking why we can't evaluate whether to turn off the tick more often, for example right after the workqueues are done. I suppose Frederic may have some views on that, but there's likely additional overhead from those checks, and workqueues may not be the only thing keeping us out of NO_HZ.
I see that sched_switch is called at the end most of the time, so a check there might be useful?
Probably a bad time, as many people are on vacation now. But I am working, so I will continue reporting my problems, in case somebody is around :)
My usecase: I am working on making ARM better for networking servers. In our usecase we need to isolate a few of the cores in our SoC, so that they run a single user space task per CPU, and userspace will take care of the data plane side of things for them.
Now, we want to use NO_HZ_FULL with cpusets (and this is what I have been trying for some time), so that we don't get any interruption at all on those cores. They should keep running that task unless the task itself tries to switch to kernel space.
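(For completeness, a sketch of the IRQ affinity setup that goes along with this kind of isolation; whether a given IRQ can actually be moved depends on the interrupt controller:)

# route new and already-requested device IRQs to CPU0 only
echo 1 > /proc/irq/default_smp_affinity
for irq in /proc/irq/[0-9]*; do
        echo 1 > $irq/smp_affinity 2>/dev/null
done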
I am getting interrupted by a few of the workqueues (other than the per-cpu ones). One of them was the bdi writeback one, which we discussed earlier.
I have done some work in the past on power-efficient workqueues (mentioned by Tejun a few mails back), which makes selected per-cpu workqueues behave as UNBOUND ones, so the scheduler decides which CPU their work items are queued on.
With an idle CPU, this works fine, as the scheduler doesn't wake up an idle CPU to service that work.
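(For reference, that behaviour is what the power-efficient workqueue switch enables for workqueues flagged WQ_POWER_EFFICIENT; a sketch of the relevant configuration, in case it is useful here:)

# either at build time:
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
# or on the kernel command line:
workqueue.power_efficient=1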
*But wouldn't it make sense if we could tell the scheduler not to queue these work items on a CPU that is running in NO_HZ_FULL mode?*
Also any suggestions on how to get rid of __prandom_timer events on such CPUs?
Thanks in Advance..
-- viresh
On 23 December 2013 13:48, Viresh Kumar viresh.kumar@linaro.org wrote:
Probably a bad time, as many people are on vacation now. But I am working, so I will continue reporting my problems, in case somebody is around :)
Ping!! (Probably most people are back from their vacations by now.)
On Mon, Dec 23, 2013 at 01:48:02PM +0530, Viresh Kumar wrote:
*But wouldn't it make sense if we could tell the scheduler not to queue these work items on a CPU that is running in NO_HZ_FULL mode?*
No,.. that's the wrong way around.
Also any suggestions on how to get rid of __prandom_timer events on such CPUs?
That looks to be a normal unpinned timer; it should migrate to a 'busy' CPU once the one it's running on goes idle.
ISTR people trying to make that active and also migrating on nohz full or somesuch, just like the workqueues. Forgot what happened with that; if it got dropped it should probably be resurrected.
On 7 January 2014 14:17, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Dec 23, 2013 at 01:48:02PM +0530, Viresh Kumar wrote:
*But wouldn't it make sense if we could tell the scheduler not to queue these work items on a CPU that is running in NO_HZ_FULL mode?*
No,.. that's the wrong way around.
Hmm.. Just to make it clear, I didn't mean that any input from workqueue code should go to the scheduler, but rather something like this:
The scheduler would check the following before pushing a task to any CPU:
- Is that CPU part of the NO_HZ_FULL cpu list?
- If yes, is that CPU currently running only one task, i.e. the best-performance case?
- If yes, then don't queue the new task to that CPU; whether the task belongs to a workqueue or not doesn't matter.
That looks to be a normal unpinned timer; it should migrate to a 'busy' CPU once the one it's running on goes idle.
ISTR people trying to make that active and also migrating on nohz full or somesuch, just like the workqueues. Forgot what happened with that; if it got dropped it should probably be resurrected.
I will search for that in the archives.