On 22 April 2015 at 13:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Apr 21, 2015 at 05:58:03PM +0100, Mike Turquette wrote:
Quoting Juri Lelli (2015-04-16 09:46:47)
On 16/04/15 06:29, Michael Turquette wrote:
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by frequency, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain frequency, say the lowest one, you'll always get a usage for that CPU that corresponds to the CPU's current capacity at that frequency. Since you use the usage signal to decide when to ramp up, you will never ramp up in this situation: the signal won't cross the capacity at the lower frequency.
Juri & Morten,
Yes, the UP_THRESHOLD constant is a leftover.
We discussed the issue of usage being capped at the current capacity in our call yesterday, but I have some doubts. Let's forget big.LITTLE for a moment and talk about an SMP system. On my pandaboard I clearly see usage values taken directly from get_cpu_usage() that scale up and down through the whole range (and as a result the selected cpu frequencies cover the whole range).
Let me clarify that 'capped' was the wrong word. It is converging towards the current capacity. Sorry for the confusion.
cfs.utilization_load_avg is the sum of the PELT utilization of all tasks on the rq. Utilization tracks running time, which means that the sum can only temporarily, and under special circumstances (such as task migration and fork), go above 100% (1024) if we ignore frequency invariance. If it goes above, it will converge back to 100% over time. That happens fairly quickly for forked tasks, as their avg_period is small in the early life of a new task.
In Vincent's patch set, my patch 'sched: Make sched entity usage tracking scale-invariant' changes this a bit. In __update_entity_runnable_avg() we now scale the PELT utilization signal by freq_curr/freq_max. The sum (cfs.utilization_load_avg) therefore also converges towards freq_curr/freq_max (*1024). For example, running at 300 MHz with freq_max = 1000 MHz, the sum converges towards 307. Without any migrations or new tasks, the utilization will be in the range 0..307 no matter how many tasks are on the rq. Just as before, the sum may temporarily go above that if new tasks are forked or tasks are migrated to the rq.
Let's take an example where you have an existing task waking up with a low utilization, say 100. It could be a webpage rendering thread that did minor updates to some already-loaded webpage last time it was scheduled, but this time it is being scheduled to render a new webpage. The task's PELT utilization is added to cfs.utilization_load_avg when it is enqueued, so the sum is now 100. freq_curr = 300 MHz. The task will start rendering the webpage and run for quite a while, during which it will build up its PELT utilization. It will ramp up quickly in the beginning and converge towards 307 due to the freq_curr/freq_max scaling in __update_entity_runnable_avg(). Due to the properties of the geometric series it will converge more and more slowly the closer it gets to 307. PJT defined:
#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
where a period here is 1024 us. So if you don't have any other tasks causing any noise it may take quite a while to get to 307. Worst case 345 ms. If you do have noise you may not see this delay, but I wouldn't rely on it for determining when to increase the frequency.
In Vincent's patches get_cpu_usage() returns a somewhat modified metric.
static int get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long capacity = capacity_orig_of(cpu);

	if (usage >= SCHED_LOAD_SCALE)
		return capacity;

	return (usage * capacity) >> SCHED_LOAD_SHIFT;
}
The utilization is scaled and capped by cpu capacity. capacity_orig_of(cpu) is 1024 for non-SMT and non-big.LITTLE systems, in which case get_cpu_usage() just enforces an upper limit of 1024 on cfs.utilization_load_avg. For such systems get_cpu_usage() can be compared to normalized frequency (freq_curr*1024/freq_max). If you are running at 300 MHz, your normalized frequency is 300*1024/1000 = 307 and get_cpu_usage() will eventually return 307 if you have at least one always-running task on the cpu.
In mainline Linux, capacity != 1024 for SMT systems (determined by 1178/#hw_threads) and for big.LITTLE systems with the clock-frequency property set in DT (which enables Vincent's capacity scaling code in topology.c; it is enabled in exynos5420.dtsi). In this case get_cpu_usage() scales utilization to the range 0..capacity_orig_of(cpu).
If we take the example from before but now have an SMT system with two hw-threads per core, capacity_orig_of() = 589. If you have an always-running task and you are at 300 MHz, cfs.utilization_load_avg = 307 (as before), but get_cpu_usage() returns 307*589/1024 = 176. 307 is still the convergence target and the sum won't go above it unless other tasks show up, and due to the capacity scaling in get_cpu_usage() the returned usage will never go above 176. If you were running at 1000 MHz (freq_max), get_cpu_usage() would return 589. You would never go above 589 despite your normalized frequency being freq_curr*1024/freq_max = 1000*1024/1000 = 1024. So here you would be comparing usage on one scale (0..589) to frequency 'capacity' on another scale (0..1024). That is broken in my opinion. The same scaling must be applied on both sides: either apply capacity_orig_of() scaling to the frequency or have a non-scaling version of get_cpu_usage().
The issue is the same for big.LITTLE systems. If you enable Vincent's cpu_efficiency code for TC2 by setting the clock-frequency properties in the DT (as they are set in the LSK tree), the A7 capacity_orig_of() = 606.
While I don't want big.LITTLE to be part of the sched/dvfs integration discussion, IMHO, we are working towards a goal of better scheduling and power management on all systems including big.LITTLE. So I think we should keep those in mind too and avoid cutting corners where we know it will cause trouble for some systems. I'm not asking for you to do big.LITTLE specific modifications or even mention it in the patch set, I'm just asking for minor changes that allows us to extend this to work for big.LITTLE as well.
I agree with Morten that you have to use capacity_orig_of(CPU) instead of SCHED_CAPACITY_SCALE when you compare the compute capacity of a frequency with the current usage of the CPU.
get_cpu_usage() is in the range [0..capacity_orig_of(CPU)], so you have to scale the compute capacity of the frequency point to the same range. As Morten points out, in SMP systems capacity_orig_of(CPU) is SCHED_CAPACITY_SCALE, but directly using this default value is a shortcut.
Regards, Vincent
My current testing involves short running tasks that are quickly queued and dequeued, not a long running task as you suggest. Is there a different behavior in the way cfs.utilization_load_avg is used depending on task length?
PELT utilization tracks the running time of the tasks. cfs.utilization_load_avg is the sum of the PELT utilization of all tasks on the rq. The PELT utilization builds up when a task is running and decays when it is blocked/sleeping. Keep in mind that PELT utilization is initialized to max but with a very short history, so the utilization value is very sensitive in the early life of a task.
Can you please explain why you feel that the return value of get_cpu_usage will not exceed the current capacity? I do not observe this behavior. Do you see this when testing only my branch? Or do you see it when merging my branch with the eas v3 series?
I think it is covered above. I haven't tested the patches myself, but Juri has confirmed that get_cpu_usage() is converging towards freq_curr*1024/freq_max using the user-space governor.
Vincent,
The value of cfs.utilization_load_avg is already normalized against the max possible capacity, right? I do not believe that the return value of get_cpu_usage is capped at the current capacity, but please let me know if I have a misunderstanding.
As said above, it is not capped but converging towards freq_curr*capacity_orig_of(cpu)/freq_max.
I hope that answers your questions, please let me know if it doesn't.
Thanks, Morten