From: Steve Muckle <smuckle.linux@gmail.com>
The rate-limit tunable in the schedutil governor applies to transitions to both lower and higher frequencies. On several platforms it is not the ideal tunable though, as it is difficult to get the best power/performance figures using the same limit in both directions.
It is common on mobile platforms with demanding user interfaces, for example, to want to increase frequency rapidly but decrease it slowly.
One example is a case where we have short busy periods followed by similar or longer idle periods. If we keep the rate limit high, we will not go to higher frequencies soon enough. On the other hand, if we keep it too low, we will have too many frequency transitions, as we will always reduce the frequency after the busy period.
It would be very useful if we could set a low rate limit while increasing the frequency (so that we can respond to short busy periods quickly) and a high rate limit while decreasing frequency (so that we don't reduce the frequency immediately after a short busy period, which may avoid frequency transitions before the next busy period).
Implement separate up/down transition rate limits. Note that the governor avoids frequency recalculations for a period equal to the minimum of the up and down rate limits. A global mutex is also defined to protect updates to min_rate_limit_us via the two separate sysfs files.
Note that this doesn't change the behavior of the schedutil governor for platforms which wish to keep the same value for both up and down rate limits.
This was tested with rt-app [1] on an ARM Exynos, dual A15 processor platform.
Testcase: Run a SCHED_OTHER thread on CPU0 which emulates a workload that is busy for X ms out of a total period of Y ms, i.e. Y - X ms of idle period. The X/Y values used were 20/40, 20/50 and 20/70, i.e. idle periods of 20, 30 and 50 ms respectively. These were tested against up/down rate limits of 10/10 ms and 10/40 ms.
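(For reference, a minimal C approximation of this kind of workload is sketched below. This is only an illustration with assumed parameter values; the actual runs used rt-app configurations [1].)

#define _GNU_SOURCE
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Illustrative sketch only: a SCHED_OTHER task pinned to CPU0 that is busy
 * for busy_ms out of every period_ms (here the 20/40 case from above). */
static void busy_for_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(void)
{
	long busy_ms = 20, period_ms = 40;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pin to CPU0 */

	for (;;) {
		busy_for_ms(busy_ms);
		usleep((period_ms - busy_ms) * 1000);
	}
	return 0;
}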
For every test we noticed a performance increase of 5-10% with the schedutil governor, which was very much expected.
[Viresh]: Simplified user interface and introduced min_rate_limit_us + mutex, rewrote commit log and included test results.
[1] https://github.com/scheduler-tools/rt-app/
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 106 +++++++++++++++++++++++++++++++++------
 1 file changed, 90 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 42a220e78f00..7fae0dbfe4bd 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -22,7 +22,8 @@
 struct sugov_tunables {
 	struct gov_attr_set attr_set;
-	unsigned int rate_limit_us;
+	unsigned int up_rate_limit_us;
+	unsigned int down_rate_limit_us;
 };

 struct sugov_policy {
@@ -33,7 +34,9 @@ struct sugov_policy {
 	raw_spinlock_t update_lock;  /* For shared policies */
 	u64 last_freq_update_time;
-	s64 freq_update_delay_ns;
+	s64 min_rate_limit_ns;
+	s64 up_rate_delay_ns;
+	s64 down_rate_delay_ns;
 	unsigned int next_freq;

 	/* The next fields are only needed if fast switch cannot be used. */
@@ -84,7 +87,27 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	}

 	delta_ns = time - sg_policy->last_freq_update_time;
-	return delta_ns >= sg_policy->freq_update_delay_ns;
+
+	/* No need to recalculate next freq for min_rate_limit_us at least */
+	return delta_ns >= sg_policy->min_rate_limit_ns;
+}
+
+static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time,
+				     unsigned int next_freq)
+{
+	s64 delta_ns;
+
+	delta_ns = time - sg_policy->last_freq_update_time;
+
+	if (next_freq > sg_policy->next_freq &&
+	    delta_ns < sg_policy->up_rate_delay_ns)
+		return true;
+
+	if (next_freq < sg_policy->next_freq &&
+	    delta_ns < sg_policy->down_rate_delay_ns)
+		return true;
+
+	return false;
 }

 static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
@@ -92,6 +115,9 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 {
 	struct cpufreq_policy *policy = sg_policy->policy;

+	if (sugov_up_down_rate_limit(sg_policy, time, next_freq))
+		return;
+
 	sg_policy->last_freq_update_time = time;

 	if (policy->fast_switch_enabled) {
@@ -340,15 +366,32 @@ static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set)
 	return container_of(attr_set, struct sugov_tunables, attr_set);
 }

-static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+static DEFINE_MUTEX(min_rate_lock);
+
+static void update_min_rate_limit_us(struct sugov_policy *sg_policy)
+{
+	mutex_lock(&min_rate_lock);
+	sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns,
+					   sg_policy->down_rate_delay_ns);
+	mutex_unlock(&min_rate_lock);
+}
+
+static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->up_rate_limit_us);
+}
+
+static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
 {
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);

-	return sprintf(buf, "%u\n", tunables->rate_limit_us);
+	return sprintf(buf, "%u\n", tunables->down_rate_limit_us);
 }

-static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
-				   size_t count)
+static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set,
+				      const char *buf, size_t count)
 {
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
 	struct sugov_policy *sg_policy;
@@ -357,18 +400,42 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf,
 	if (kstrtouint(buf, 10, &rate_limit_us))
 		return -EINVAL;

-	tunables->rate_limit_us = rate_limit_us;
+	tunables->up_rate_limit_us = rate_limit_us;

-	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
-		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+		sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+		update_min_rate_limit_us(sg_policy);
+	}

 	return count;
 }

-static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set,
+					const char *buf, size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	struct sugov_policy *sg_policy;
+	unsigned int rate_limit_us;
+
+	if (kstrtouint(buf, 10, &rate_limit_us))
+		return -EINVAL;
+
+	tunables->down_rate_limit_us = rate_limit_us;
+
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+		sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+		update_min_rate_limit_us(sg_policy);
+	}
+
+	return count;
+}
+
+static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us);
+static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us);

 static struct attribute *sugov_attributes[] = {
-	&rate_limit_us.attr,
+	&up_rate_limit_us.attr,
+	&down_rate_limit_us.attr,
 	NULL
 };

@@ -512,10 +579,13 @@ static int sugov_init(struct cpufreq_policy *policy)
 		goto stop_kthread;
 	}

-	tunables->rate_limit_us = LATENCY_MULTIPLIER;
+	tunables->up_rate_limit_us = LATENCY_MULTIPLIER;
+	tunables->down_rate_limit_us = LATENCY_MULTIPLIER;
 	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
-	if (lat)
-		tunables->rate_limit_us *= lat;
+	if (lat) {
+		tunables->up_rate_limit_us *= lat;
+		tunables->down_rate_limit_us *= lat;
+	}

 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
@@ -574,7 +644,11 @@ static int sugov_start(struct cpufreq_policy *policy)
 	struct sugov_policy *sg_policy = policy->governor_data;
 	unsigned int cpu;

-	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->up_rate_delay_ns =
+		sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC;
+	sg_policy->down_rate_delay_ns =
+		sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC;
+	update_min_rate_limit_us(sg_policy);
 	sg_policy->last_freq_update_time = 0;
 	sg_policy->next_freq = UINT_MAX;
 	sg_policy->work_in_progress = false;
On 17-11-16, 10:48, Viresh Kumar wrote:
[...]
(Background story for others, from my discussion with Rafael on IRC: Rafael proposed that, instead of this patch, we add a down_rate_limit_delta_us (>= 0) which would be added to rate_limit_us (the rate limit while increasing frequency) to find the rate limit to be used in the downward direction. I raised the point that it looks much neater to have separate up and down rate_limit_us values. I also said that people may have a valid case where they want to keep down_rate_limit lower than up_rate_limit, and Rafael wasn't fully sure of any such cases.)
Hi Rafael,
I gathered inputs from a few people (including Google) and there are basically two cases we need to support:
- Performance mode (normal phone mode): We want higher values of down_rate_limit and lower values of up_rate_limit, as performance is more important here and we want to avoid unnecessary transitions to lower frequencies between short busy bursts.
- Power-saving mode (like the battery-saver mode in Android phones): We want to save power here but don't want performance to be too bad at the same time, so the platform may use schedutil rather than the powersave governor. In this case we may want to keep down_rate_limit lower than up_rate_limit, as we want to switch back to lower frequencies as soon as the load reduces. This also ensures that we move to higher frequencies at a slower pace, i.e. less aggressively.
On Mon, Nov 21, 2016 at 03:38:05PM +0530, Viresh Kumar wrote:
On 17-11-16, 10:48, Viresh Kumar wrote:
[...]
(Background story for others, from my discussion with Rafael on IRC: Rafael proposed that, instead of this patch, we add a down_rate_limit_delta_us (>= 0) which would be added to rate_limit_us (the rate limit while increasing frequency) to find the rate limit to be used in the downward direction. I raised the point that it looks much neater to have separate up and down rate_limit_us values. I also said that people may have a valid case where they want to keep down_rate_limit lower than up_rate_limit, and Rafael wasn't fully sure of any such cases.)
Urgh...
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
So NAK on everything tunable here.
On 21-11-16, 11:19, Peter Zijlstra wrote:
Urgh...
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
So NAK on everything tunable here.
Okay, as I told you on IRC, we already have a tunable: rate_limit_us in the schedutil governor, which defines the minimum time before the governor will try to update the frequency again. Perhaps 10-20 ms is the ideal value that everyone is using.
So eventually that should also die and we should get inputs from the PELT stuff?
On Mon, Nov 21, 2016 at 04:18:00PM +0530, Viresh Kumar wrote:
On 21-11-16, 11:19, Peter Zijlstra wrote:
Urgh...
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
So NAK on everything tunable here.
Okay, as I told you on IRC, we already have a tunable: rate_limit_us for the schedutil governor which defines the minimum time before which the governor wouldn't try to update the frequency again. Perhaps 10-20 ms is the ideal value for that everyone is using.
So eventually that should also die and we should get inputs from PELT stuff ?
I think it should be replaced by a value provided by the driver. It makes sense to have a rate-limit in so far as that it doesn't make sense to try and program the hardware faster than it can actually change frequencies and/or have a programming cost amortization. And this very clearly is a driver specific thing.
It however doesn't make sense to me to fudge with this in order to achieve ramp up/down differences.
On 21-11-16, 12:12, Peter Zijlstra wrote:
I think it should be replaced by a value provided by the driver. It makes sense to have a rate-limit in so far as that it doesn't make sense to try and program the hardware faster than it can actually change frequencies and/or have a programming cost amortization. And this very clearly is a driver specific thing.
We already have something called transition_latency for that (though it isn't used much currently).
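Just to illustrate the idea (a sketch only, not something in this patch; the 1 ms floor is an assumed value), the governor could derive the existing freq_update_delay_ns from the driver instead of a sysfs tunable:

static void sugov_set_update_delay(struct sugov_policy *sg_policy,
				   struct cpufreq_policy *policy)
{
	/* cpuinfo.transition_latency is in nanoseconds. */
	s64 lat_ns = policy->cpuinfo.transition_latency;

	/* Never reprogram the hardware faster than it can actually switch. */
	sg_policy->freq_update_delay_ns = max_t(s64, lat_ns, NSEC_PER_MSEC);
}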
It however doesn't make sense to me to fudge with this in order to achieve ramp up/down differences.
So if a platform, for example, can do DVFS in say 100-500 us, then the scheduler should try to re-evaluate frequency (and update it) after that short of a period? Wouldn't that scheme waste lots of time doing just freq updates? And that's the primary reason why cpufreq governors have some sort of sampling-rate or rate-limit until now.
On Mon, Nov 21, 2016 at 05:00:16PM +0530, Viresh Kumar wrote:
On 21-11-16, 12:12, Peter Zijlstra wrote:
I think it should be replaced by a value provided by the driver. It makes sense to have a rate-limit in so far as that it doesn't make sense to try and program the hardware faster than it can actually change frequencies and/or have a programming cost amortization. And this very clearly is a driver specific thing.
We already have something called as transition_latency for that (though it isn't used much currently).
It however doesn't make sense to me to fudge with this in order to achieve ramp up/down differences.
So if a platform, for example, can do DVFS in say 100-500 us, then the scheduler should try to re-evaluate frequency (and update it) after that short of a period? Wouldn't that scheme waste lots of time doing just freq updates? And that's the primary reason why cpufreq governors have some sort of sampling-rate or rate-limit until now.
Dunno.. there's of course the cost amortization, but by the time we've reached sugov_should_update_freq() most of the 'expensive' parts have already been done from the scheduler's POV and it's once again down to the driver.
On 21/11/16 11:19, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 03:38:05PM +0530, Viresh Kumar wrote:
On 17-11-16, 10:48, Viresh Kumar wrote:
[...]
(Background story for others, from my discussion with Rafael on IRC: Rafael proposed that, instead of this patch, we add a down_rate_limit_delta_us (>= 0) which would be added to rate_limit_us (the rate limit while increasing frequency) to find the rate limit to be used in the downward direction. I raised the point that it looks much neater to have separate up and down rate_limit_us values. I also said that people may have a valid case where they want to keep down_rate_limit lower than up_rate_limit, and Rafael wasn't fully sure of any such cases.)
Urgh...
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Do you mean we might want to change the decay (make it different from ramp-up) once for all, or maybe we make it tunable so that we can address different power/perf requirements?
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
Yep. That's an interesting one to look at, but it might require some time.
So NAK on everything tunable here.
On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
On 21/11/16 11:19, Peter Zijlstra wrote:
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Do you mean we might want to change the decay (make it different from ramp-up) once for all, or maybe we make it tunable so that we can address different power/perf requirements?
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
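Something along these lines, perhaps (a sketch only; decay_cap_ms is an assumed parameter, and decay_load() is the existing PELT helper that decays a value over n 1 ms periods):

static u64 capped_decay(u64 util, u64 periods, u64 decay_cap_ms)
{
	/* Pretend a long sleep lasted at most decay_cap_ms periods. */
	return decay_load(util, min(periods, decay_cap_ms));
}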
It could be an option, for some, to build the kernel with a PELT window of 16ms or so (half its current size), this of course means regenerating all the constants etc.. And this very much is a compile time thing.
We could fairly easy; if this is so desired; make the PELT window size a CONFIG option (hidden by default).
But like everything; patches should come with numbers justifying them etc..
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
Yep. That's an interesting one to look at, but it might require some time.
Sure, just saying that we should resist knobs until all other avenues have been explored. Never start with a knob.
On 21/11/16 13:26, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
On 21/11/16 11:19, Peter Zijlstra wrote:
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Do you mean we might want to change the decay (make it different from ramp-up) once for all, or maybe we make it tunable so that we can address different power/perf requirements?
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
Right, and since schedutil consumes the latter, we could still suffer from too frequent frequency switch events I guess (this is where the down threshold thing came as a quick and dirty fix). Maybe we can think of some smoothing applied to the accumulated signal, or make it decay slower (don't really know what this means in practice, though :) ?
It could be an option, for some, to build the kernel with a PELT window of 16ms or so (half its current size), this of course means regenerating all the constants etc.. And this very much is a compile time thing.
Right. I seem to remember that helped a bit for mobile type of workloads. But never did a thorough evaluation.
We could fairly easy; if this is so desired; make the PELT window size a CONFIG option (hidden by default).
But like everything; patches should come with numbers justifying them etc..
Sure. :)
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
Yep. That's an interesting one to look at, but it might require some time.
Sure, just saying that we should resist knobs until all other avenues have been explored. Never start with a knob.
On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
On 21/11/16 13:26, Peter Zijlstra wrote:
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
Oh, you mean ramp-up for bright spanking new tasks? I forgot the details, but I think we can fudge the 'history' such that those too ramp up quickly.
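Roughly along these lines, for instance (purely a sketch; init_util_pct is an assumed parameter and the helper name is made up, the real code would sit next to the existing post_init_entity_util_avg()):

static void fudge_new_task_util(struct sched_avg *sa, unsigned long cpu_cap,
				unsigned int init_util_pct)
{
	/* Give a brand new task optimistic "history" so it ramps up fast. */
	unsigned long floor = cpu_cap * init_util_pct / 100;

	if (sa->util_avg < floor)
		sa->util_avg = floor;
}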
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
Right, and since schedutil consumes the latter, we could still suffer from too frequent frequency switch events I guess (this is where the down threshold thing came as a quick and dirty fix). Maybe we can think of some smoothing applied to the accumulated signal, or make it decay slower (don't really know what this means in practice, though :) ?
Not sure I follow. So by limiting decay to the task value, the moment we add it back to the accumulated signal (wakeup), the accumulated signal jumps up quickly and ramp-up is achieved.
On 21/11/16 15:17, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
On 21/11/16 13:26, Peter Zijlstra wrote:
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
Oh, you mean ramp-up for bright spanking new tasks? I forgot the details, but I think we can fudge the 'history' such that those too ramp up quickly.
Right. I think Vincent had some ideas on this front already.
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
Right, and since schedutil consumes the latter, we could still suffer from too frequent frequency switch events I guess (this is where the down threshold thing came as a quick and dirty fix). Maybe we can think of some smoothing applied to the accumulated signal, or make it decay slower (don't really know what this means in practice, though :) ?
Not sure I follow. So by limiting decay to the task value, the moment we add it back to the accumulated signal (wakeup), the accumulated signal jumps up quickly and ramp-up is achieved.
This is true, but it seems that this potentially spiky behaviour (which in general depends on tasks composition and periodicity) might affect power savings (as in you don't generally want to switch between high and low freqs too often). So that's why I was just thinking that some sort of smoothing applied to the signal schedutil uses might help.
On Mon, Nov 21, 2016 at 02:37:27PM +0000, Juri Lelli wrote:
On 21/11/16 15:17, Peter Zijlstra wrote:
Not sure I follow. So by limiting decay to the task value, the moment we add it back to the accumulated signal (wakeup), the accumulated signal jumps up quickly and ramp-up is achieved.
This is true, but it seems that this potentially spiky behaviour (which in general depends on tasks composition and periodicity) might affect power savings (as in you don't generally want to switch between high and low freqs too often). So that's why I was just thinking that some sort of smoothing applied to the signal schedutil uses might help.
Hurm.. so during LPC it was said that fast ramp-up was desired. Note that we'll not ramp down this fast, the accumulated signal will decay slowly as per blocked-load PELT rules. So only ramp-up is spiky, but that is what was desired AFAIU.
On 21/11/16 15:43, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 02:37:27PM +0000, Juri Lelli wrote:
On 21/11/16 15:17, Peter Zijlstra wrote:
Not sure I follow. So by limiting decay to the task value, the moment we add it back to the accumulated signal (wakeup), the accumulated signal jumps up quickly and ramp-up is achieved.
This is true, but it seems that this potentially spiky behaviour (which in general depends on tasks composition and periodicity) might affect power savings (as in you don't generally want to switch between high and low freqs too often). So that's why I was just thinking that some sort of smoothing applied to the signal schedutil uses might help.
Hurm.. so during LPC it was said that fast ramp-up was desired. Note that we'll not ramp down this fast, the accumulated signal will decay slowly as per blocked-load PELT rules. So only ramp-up is spiky, but that is what was desired AFAIU.
Yep, fast ramp-up is quite crucial I'd say. And it's also true that we should in theory already ramp down slower. My worries originate mostly from our experience with Android devices, for which we ended up introducing thresholds as per the subject of this thread. But it's also true that the landscape is different there (e.g., slightly different governor, different utilization signal, etc.), so I guess we should now re-evaluate things in light of what we discussed here and at LPC.
On 21 November 2016 at 15:37, Juri Lelli Juri.Lelli@arm.com wrote:
On 21/11/16 15:17, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
On 21/11/16 13:26, Peter Zijlstra wrote:
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
Oh, you mean ramp-up for bright spanking new tasks? I forgot the details, but I think we can fudge the 'history' such that those too ramp up quickly.
Right. I think Vincent had some ideas on this front already.
You are probably referring to some properties linked to how the PELT signal evolves. As an example, an increase of 100 of the utilization during the running phase means that we have for sure run for more than 5 ms. We could probably use such properties when estimating the utilization level of the task or the CPU.
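(Back-of-the-envelope check of that figure, assuming the standard PELT constants, y^32 = 1/2 with 1 ms periods and 1024 as the maximum utilization: a task running continuously from utilization u0 follows approximately

	u(t) ~= u0 * y^t + 1024 * (1 - y^t)

so the fastest possible increase of 100 happens from u0 = 0:

	1024 * (1 - y^t) = 100  =>  y^t ~= 0.902  =>  t ~= ln(0.902)/ln(y) ~= 4.7 ms

and any non-zero starting utilization makes the same increase take longer, hence the "more than 5 ms" ballpark.)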
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
Right, and since schedutil consumes the latter, we could still suffer from too frequent frequency switch events I guess (this is where the down threshold thing came as a quick and dirty fix). Maybe we can think of some smoothing applied to the accumulated signal, or make it decay slower (don't really know what this means in practice, though :) ?
Not sure I follow. So by limiting decay to the task value, the moment we add it back to the accumulated signal (wakeup), the accumulated signal jumps up quickly and ramp-up is achieved.
This is true, but it seems that this potentially spiky behaviour (which in general depends on tasks composition and periodicity) might affect power savings (as in you don't generally want to switch between high and low freqs too often). So that's why I was just thinking that some sort of smoothing applied to the signal schedutil uses might help.
On 22-Nov 10:27, Vincent Guittot wrote:
On 21 November 2016 at 15:37, Juri Lelli Juri.Lelli@arm.com wrote:
On 21/11/16 15:17, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 01:53:08PM +0000, Juri Lelli wrote:
On 21/11/16 13:26, Peter Zijlstra wrote:
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
Oh, you mean ramp-up for bright spanking new tasks? I forgot the details, but I think we can fudge the 'history' such that those too ramp up quickly.
Right. I think Vincent had some ideas on this front already.
You are probably referring to some properties linked to how the PELT signal evolves. As an example, an increase of 100 of the utilization during the running phase means that we have for sure run for more than 5 ms. We could probably use such properties when estimating the utilization level of the task or the CPU.
I like the idea of using PELT features to derive richer information. However, to better evaluate the effect we can expect, we should always keep in mind the specific timings of the different scenarios we want to target the optimizations for.
Thus, for example, in the specific case of Android phones, the most important tasks for the user experience usually run every 16 ms and for a time in the range of 4 to 6 ms. This means that having to wait 5 ms to trigger an action can be too long.
I know that your example was intentionally simplified; however, it suggested to me that maybe we should try to start a "campaign" to collect descriptions of the use-cases we would like to optimize for.
Knowing the timings and desirable behaviours can also, in the end, help us design and implement better solutions.
On 21-Nov 13:53, Juri Lelli wrote:
On 21/11/16 13:26, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
On 21/11/16 11:19, Peter Zijlstra wrote:
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Do you mean we might want to change the decay (make it different from ramp-up) once for all, or maybe we make it tunable so that we can address different power/perf requirements?
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
Hmmm, AFAIU the limited decay will help not forgetting completely the contribution of tasks that sleep for a long time, but it won't modify the actual ramp-up of the signal. So, for new tasks we will need to play with a sensible initial value (trading off perf and power as usual).
A fundamental problem here, IMO, is that we are trying to use a "dynamic metric" to act as a "predictor".
PELT is a "dynamic metric" since it continuously changes while a task is running. Thus it does not really provide an answer to the question "how big is this task?" _while_ the task is running. Such information is available only when the task sleeps. Indeed, only when the task completes an activation and goes to sleep has PELT reached a value which represents how much CPU bandwidth has been required by that task.
For example, if we consider the simple yet interesting case of a periodic task, PELT is a wobbling signal which reports a correct measure of how much bandwidth is required only when a task completes its RUNNABLE status. To be more precise, the correct value is provided by the PELT average, and this also depends on the period of the task compared to the PELT rate constant. But still, to me a fundamental point is that the "raw PELT value" is not really meaningful at _each and every single point in time_.
All that considered, we should be aware that to properly drive schedutil and (in the future) energy-aware scheduler decisions, we perhaps instead need a "predictor". In the simple case of the periodic task, a good predictor should be something which always reports the same answer _at each point in time_.
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which ranges between [120,550] with an average of ~300, which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that the raw PELT signal is still wobbling and never provides a consistent response to drive decisions.
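(Rough check of the uncapped row, assuming the standard PELT constants, y^32 = 1/2 with 1 ms periods: in steady state, for a task running r ms out of every P ms,

	u_max = 1024 * (1 - y^r) / (1 - y^P)	and	u_min = u_max * y^(P - r)

With r = 30 and P = 100: y^30 ~= 0.52, y^70 ~= 0.22, y^100 ~= 0.11, giving u_max ~= 550 and u_min ~= 120, which matches the range above, while the 30% duty cycle gives the ~300 average.)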
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Should not such a "predictor" help on solving some of the issues related to PELT slow ramp-up or fast ramp-down?
It should provides benefits, similar to that of the proposed knobs, not only to schedutil but also to other clients of the PELT signal.
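As a strawman (a sketch only, not existing kernel code; the names and the 1/4 filter weight are assumptions): sample the per-task util_avg each time the task completes an activation (goes to sleep) and low-pass filter those samples, so that consumers see a stable per-task estimate instead of the wobbling raw value:

#define UTIL_EST_SHIFT	2	/* new sample weight = 1/4 */

struct util_est {
	unsigned long filtered;		/* low-pass filtered estimate */
};

/* Called when the task completes an activation and goes to sleep. */
static void util_est_update(struct util_est *est, unsigned long util_avg)
{
	long delta = (long)util_avg - (long)est->filtered;

	/* Move 1/2^UTIL_EST_SHIFT of the way towards the new sample. */
	est->filtered += delta / (1 << UTIL_EST_SHIFT);
}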
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
Right, and since schedutil consumes the latter, we could still suffer from too frequent frequency switch events I guess (this is where the down threshold thing came as a quick and dirty fix). Maybe we can think of some smoothing applied to the accumulated signal, or make it decay slower (don't really know what this means in practice, though :) ?
It could be an option, for some, to build the kernel with a PELT window of 16ms or so (half its current size), this of course means regenerating all the constants etc.. And this very much is a compile time thing.
Right. I seem to remember that helped a bit for mobile type of workloads. But never did a thorough evaluation.
We could fairly easy; if this is so desired; make the PELT window size a CONFIG option (hidden by default).
But like everything; patches should come with numbers justifying them etc..
Sure. :)
Also, there was the idea of; once the above ideas have all been explored; tying the freq ramp rate to the power curve.
Yep. That's an interesting one to look at, but it might require some time.
Sure, just saying that we should resist knobs until all other avenues have been explored. Never start with a knob.
On Mon, Nov 21, 2016 at 02:59:19PM +0000, Patrick Bellasi wrote:
A fundamental problem in IMO is that we are trying to use a "dynamic metric" to act as a "predictor".
PELT is a "dynamic metric" since it continuously change while a task is running. Thus it does not really provides an answer to the question "how big this task is?" _while_ the task is running. Such an information is available only when the task sleep. Indeed, only when the task completes an activation and goes to sleep PELT has reached a value which represents how much CPU bandwidth has been required by that task.
I'm not sure I agree with that. We can only tell how big a task is _while_ it's running, esp. since its behaviour is not steady-state. Tasks can change etc..
Also, as per the whole argument on why peak_util was bad, at the moment a task goes to sleep, the PELT signal is actually an over-estimate, since it hasn't yet had time to average out.
And a real predictor requires a crystal-ball instruction, but until such time that hardware people bring us that goodness, we'll have to live with predicting the near future based on the recent past.
For example, if we consider the simple yet interesting case of a periodic task, PELT is a wobbling signal which reports a correct measure of how much bandwidth is required only when a task completes its RUNNABLE status.
It's actually an over-estimate at that point, since it just added a sizable chunk to the signal (for having been runnable) that hasn't yet had time to decay back to the actual value.
To be more precise, the correct value is provided by the average PELT and this also depends on the period of the task compared to the PELT rate constant. But still, to me a fundamental point is that the "raw PELT value" is not really meaningful in _each and every single point in time_.
Agreed.
All that considered, we should be aware that to properly drive schedutil and (in the future) the energy aware scheduler decisions we perhaps need better instead a "predictor". In the simple case of the periodic task, a good predictor should be something which reports always the same answer _in each point in time_.
So the problem with this is that not many tasks are that periodic, and any filter you put on top will add, let's call it, momentum to the signal. A reluctance to change. This might negatively affect non-periodic tasks.
In any case, worth trying, see what happens.
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which range between [120,550] with an average of ~300 which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that still the raw PELT signal is wobbling and never provides a consistent response to drive decisions.
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Should not such a "predictor" help on solving some of the issues related to PELT slow ramp-up or fast ramp-down?
I think intel_pstate recently added a local PID filter, I asked at the time if something like that should live in generic code, looks like maybe it should.
On Mon, Nov 21, 2016 at 04:26:06PM +0100, Peter Zijlstra wrote:
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which range between [120,550] with an average of ~300 which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that still the raw PELT signal is wobbling and never provides a consistent response to drive decisions.
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Note that a simple low-pass IIR filter will only shrink the wobble, not take it out entirely. Or rather, if you make it so strong as to take the entire wobble out, it'll likely react like molasses.
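(To put rough numbers on that: for a first-order IIR filter y[n] = (1 - a) * y[n-1] + a * x[n], the response to a unit step is y[n] = 1 - (1 - a)^n, so reaching 90% of the step takes about ln(0.1)/ln(1 - a) samples. With a = 1/8 that is roughly 17 samples of lag; with a = 1/2 it settles in about 3 samples but passes half of every wobble straight through. Illustrative values only.)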
On 21-Nov 16:26, Peter Zijlstra wrote:
On Mon, Nov 21, 2016 at 02:59:19PM +0000, Patrick Bellasi wrote:
A fundamental problem in IMO is that we are trying to use a "dynamic metric" to act as a "predictor".
PELT is a "dynamic metric" since it continuously change while a task is running. Thus it does not really provides an answer to the question "how big this task is?" _while_ the task is running. Such an information is available only when the task sleep. Indeed, only when the task completes an activation and goes to sleep PELT has reached a value which represents how much CPU bandwidth has been required by that task.
I'm not sure I agree with that. We can only tell how big a task is _while_ its running, esp. since its behaviour is not steady-state. Tasks can change etc..
Sure, what I was saying is that while a task is running we can only know that it still needs more CPU bandwidth, but not how much it will consume by the end. PELT, while ramping up, measures how much bandwidth a task has consumed so far, and only at the end does it let us know how much we need, usually defined by the average between the initial decayed value and the final ramped-up value.
Also, as per the whole argument on why peak_util was bad, at the moment a task goes to sleep, the PELT signal is actually an over-estimate, since it hasn't yet had time to average out.
Right, but there are two main observations on that point: 1) how much we over-estimate depends on the task's periodicity compared to the PELT rate; 2) the peak_util was just an initial (quite oversimplified) bit of a more complete solution, which can allow tracking a better metric, like for example the average, and transparently exposing it in place of the raw PELT signal whenever that makes sense.
And a real predictor requires a crytal-ball instruction, but until such time that hardware people bring us that goodness, we'll have to live with predicting the near future based on the recent past.
That will definitely be a bright future :)
However, I agree that the only sensible and possible thing is to estimate based on recent past. The point is to decide which "past" provides the most useful information.
The PELT past is measured in terms of 1 ms periods; every few [ms] the task size PELT reports is different while the task is running.
Perhaps a better approach could be to consolidate the PELT information each time a task completes an activation. In this case the past would be measured in terms of "the last time this task executed".
For example, if we consider the simple yet interesting case of a periodic task, PELT is a wobbling signal which reports a correct measure of how much bandwidth is required only when a task completes its RUNNABLE status.
Its actually an over-estimate at that point, since it just added a sizable chunk to the signal (for having been runnable) that hasn't yet had time to decay back to the actual value.
I kind of disagree on "actual value": when the value is decayed, what we get is a lower estimate of the actually required bandwidth. Compared to the PELT average, the peak value over-estimates almost as much as the decayed value under-estimates, doesn't it?
To be more precise, the correct value is provided by the average PELT and this also depends on the period of the task compared to the PELT rate constant. But still, to me a fundamental point is that the "raw PELT value" is not really meaningful in _each and every single point in time_.
Agreed.
All that considered, we should be aware that to properly drive schedutil and (in the future) the energy aware scheduler decisions we perhaps need better instead a "predictor". In the simple case of the periodic task, a good predictor should be something which reports always the same answer _in each point in time_.
So the problem with this is that not many tasks are that periodic, and any filter you put on top will add, lets call it, momentum to the signal. A reluctance to change. This might negatively affect non-periodic tasks.
In mobile environments many "main" tasks are generally quite periodic, with limited variability from one activation to the next. We could argue that those tasks should be scheduled using a different class; however, we should also consider that sometimes this is not possible.
However, I agree that a generic solution should fit variable tasks as well. That's why the more complete and generic solution, wrt the peak_util posted by Morten, was something which allows transparently switching from the estimated value to the PELT one, for example when the PELT value ramps up above the estimated one.
In any case, worth trying, see what happens.
Are you saying that you would like to see the code which implements a more generic version of the peak_util "filter" on top of PELT?
IMO it could be a good exercise now that we agree we want to improve PELT without replacing it.
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which range between [120,550] with an average of ~300 which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that still the raw PELT signal is wobbling and never provides a consistent response to drive decisions.
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Should not such a "predictor" help on solving some of the issues related to PELT slow ramp-up or fast ramp-down?
I think intel_pstate recently added a local PID filter, I asked at the time if something like that should live in generic code, looks like maybe it should.
Isn't that PID filter "just" a software implementation of ACPI's Collaborative Processor Performance Control (CPPC) for when HWP hardware is not provided by a certain processor?
On Mon, Nov 21, 2016 at 04:24:24PM +0000, Patrick Bellasi wrote:
On 21-Nov 16:26, Peter Zijlstra wrote:
In any case, worth trying, see what happens.
Are you saying that you would like to see the code which implements a more generic version of the peak_util "filter" on top of PELT?
Not sure about peak_util, I was more thinking of an IIR/PID filter, as per the email thread referenced below. Doesn't make sense to hide that in intel_pstate if it appears to be universally useful etc..
IMO it could be a good exercise now that we agree we want to improve PELT without replacing it.
I think it would make sense to keep it inside sched_cpufreq for now.
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which range between [120,550] with an average of ~300 which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that still the raw PELT signal is wobbling and never provides a consistent response to drive decisions.
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Should not such a "predictor" help on solving some of the issues related to PELT slow ramp-up or fast ramp-down?
I think intel_pstate recently added a local PID filter, I asked at the time if something like that should live in generic code, looks like maybe it should.
That PID filter is not "just" a software implementation of the ACPI's Collaborative Processor Performance Control (CPPC) when HWP hardware is not provided by a certain processor?
I think it was this thread:
http://lkml.kernel.org/r/1572483.RZjvRFdxPx@vostro.rjw.lan
It never really made sense such a filter should live in individual drivers.
On Mon, Nov 21, 2016 at 5:46 PM, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Nov 21, 2016 at 04:24:24PM +0000, Patrick Bellasi wrote:
On 21-Nov 16:26, Peter Zijlstra wrote:
In any case, worth trying, see what happens.
Are you saying that you would like to see the code which implements a more generic version of the peak_util "filter" on top of PELT?
Not sure about peak_util, I was more thinking of an IIR/PID filter, as per the email thread referenced below. Doesn't make sense to hide that in intel_pstate if it appears to be universally useful etc..
IMO it could be a good exercise now that we agree we want to improve PELT without replacing it.
I think it would make sense to keep it inside sched_cpufreq for now.
For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg task. With PELT, we get a signal which range between [120,550] with an average of ~300 which is instead completely ignored. By capping the decay we will get:
decay_cap [ms]    range     average
      0          120:550      300
     64          140:560      310
     32          320:660      430
which means that still the raw PELT signal is wobbling and never provides a consistent response to drive decisions.
Thus, a "predictor" should be something which sample information from PELT to provide a more consistent view, a sort of of low-pass filter on top of the "dynamic metric" which is PELT.
Should not such a "predictor" help on solving some of the issues related to PELT slow ramp-up or fast ramp-down?
I think intel_pstate recently added a local PID filter, I asked at the time if something like that should live in generic code, looks like maybe it should.
That PID filter is not "just" a software implementation of the ACPI's Collaborative Processor Performance Control (CPPC) when HWP hardware is not provided by a certain processor?
I think it was this thread:
http://lkml.kernel.org/r/1572483.RZjvRFdxPx@vostro.rjw.lan
It never really made sense such a filter should live in individual drivers.
We don't use the IIR filter in intel_pstate after all.
We evaluated it, but it affected performance too much to be useful for us.
That said, in the "proportional" version of intel_pstate's P-state selection algorithm (without PID) we ramp up the P-state faster than we reduce it, but the approach used there depends on using the feedback registers.
And, of course, that's only used if HWP is not active.
Thanks, Rafael
2016-11-21 20:26 GMT+08:00 Peter Zijlstra peterz@infradead.org:
On Mon, Nov 21, 2016 at 12:14:32PM +0000, Juri Lelli wrote:
On 21/11/16 11:19, Peter Zijlstra wrote:
So no tunables and rate limits here at all please.
During LPC we discussed the ramp-up and decay issues and decided that we should very much first address them by playing with the PELT stuff. Morten was going to play with capping the decay on the util signal. This should greatly improve the ramp-up scenario and cure some other wobbles.
The decay can be set by changing the over-all pelt decay, if so desired.
Do you mean we might want to change the decay (make it different from ramp-up) once for all, or maybe we make it tunable so that we can address different power/perf requirements?
So the limited decay would be the dominant factor in ramp-up time, leaving the regular PELT period the dominant factor for ramp-down.
(Note that the decay limit would only be applied on the per-task signal, not the accumulated signal.)
What's the meaning of "signal" in this thread?
Regards, Wanpeng Li