When logging a CPU's load and utilization, we should use the CPU's cfs_rq directly. Using the task's cfs_rq can report wrong values, since that may be a task_group's cfs_rq rather than the real CPU's cfs_rq.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ac9ea3..26f3f2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,7 +2825,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

 	if (entity_is_task(se))
 		trace_sched_load_avg_task(task_of(se), &se->avg);
-	trace_sched_load_avg_cpu(cpu, cfs_rq);
+	trace_sched_load_avg_cpu(cpu, &cpu_rq(cpu)->cfs);
 }

 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
--
1.9.1
On 03/30/2016 12:11 PM, Leo Yan wrote:
When logging a CPU's load and utilization, we should use the CPU's cfs_rq directly. Using the task's cfs_rq can report wrong values, since that may be a task_group's cfs_rq rather than the real CPU's cfs_rq.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ac9ea3..26f3f2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,7 +2825,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

 	if (entity_is_task(se))
 		trace_sched_load_avg_task(task_of(se), &se->avg);
-	trace_sched_load_avg_cpu(cpu, cfs_rq);
+	trace_sched_load_avg_cpu(cpu, &cpu_rq(cpu)->cfs);
 }

 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Yeah, this was a mistake. We're about to fix this by incorporating the 'tg->css.id' into the trace event so that we can trace tg-owned cfs_rq's as well. Patrick has the details because these trace events have to be in sync with the LISA test tool.
I'm aware of the two following changes in this area:
commit 6e83d44d8844d8ff21a965a913818e078a54b1b6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Jan 8 10:55:03 2016 +0000
Fix/DEBUG: sched: Move trace_sched_load_avg_cpu() to update_cfs_rq_load_avg()
In a system with periodic timer ticks (constant rate, no dynticks), in case a cpu becomes idle, the cfs_rq.avg.load_avg/util_avg signals are still updated via:
scheduler_tick() -> trigger_load_balance(rq)
-> run_rebalance_domains() -> rebalance_domains() -> update_blocked_averages()
The function update_blocked_averages() calls update_cfs_rq_load_avg() and not update_load_avg(), so in this situation cfs_rq.avg.load_avg/util_avg signal updates are not traced correctly.
!!! There are no updates via scheduler_tick() -> task_tick_fair() -> entity_tick() since current is the idle task. !!!
If accepted this patch has to be stashed into:
DEBUG: sched: add tracepoint for CPU load/util signals
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ac9ea36d2fb..c4b863d5a8d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2802,6 +2802,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif

+	trace_sched_load_avg_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq);
+
 	return decayed || removed;
 }

@@ -2825,7 +2827,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

 	if (entity_is_task(se))
 		trace_sched_load_avg_task(task_of(se), &se->avg);
-	trace_sched_load_avg_cpu(cpu, cfs_rq);
 }
commit 0c8a8fb2af52a29557e1b0aa3baf2b2d6da70334
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 16 14:18:09 2016 +0000
Fix/DEBUG: sched: Add the css.id to sched_load_avg_cpu trace_event
This patch allows us to distinguish between the load signals of the root cfs_rq (representing the root task group w/ CONFIG_FAIR_GROUP_SCHED=y, or simply the cpu) and the cfs_rq's representing other task groups, by printing out the related css.id, which is unique for each task group system wide.
In case of a system w/ CONFIG_FAIR_GROUP_SCHED not set we simply print a zero for the css.id.
Please note that the css.id numbering scheme starts with 1 for the root task group.
If accepted this patch has to be stashed into:
DEBUG: sched: add tracepoint for CPU load/util signals
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7ec9dcfc701a..0724bd03773d 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -641,18 +641,25 @@ TRACE_EVENT(sched_load_avg_cpu,

 	TP_STRUCT__entry(
 		__field( int,		cpu			)
+		__field( int,		id			)
 		__field( unsigned long,	load_avg		)
 		__field( unsigned long,	util_avg		)
 	),

 	TP_fast_assign(
 		__entry->cpu		= cpu;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		__entry->id		= cfs_rq->tg->css.id;
+#else
+		__entry->id		= 0;
+#endif
 		__entry->load_avg	= cfs_rq->avg.load_avg;
 		__entry->util_avg	= cfs_rq->avg.util_avg;
 	),

-	TP_printk("cpu=%d load_avg=%lu util_avg=%lu",
-		  __entry->cpu, __entry->load_avg, __entry->util_avg)
+	TP_printk("cpu=%d tg_css_id=%d load_avg=%lu util_avg=%lu",
+		  __entry->cpu, __entry->id, __entry->load_avg,
+		  __entry->util_avg)
 );
Hi Dietmar,
On Thu, Mar 31, 2016 at 12:32:32PM +0200, Dietmar Eggemann wrote:
On 03/30/2016 12:11 PM, Leo Yan wrote:
When logging a CPU's load and utilization, we should use the CPU's cfs_rq directly. Using the task's cfs_rq can report wrong values, since that may be a task_group's cfs_rq rather than the real CPU's cfs_rq.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ac9ea3..26f3f2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,7 +2825,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

 	if (entity_is_task(se))
 		trace_sched_load_avg_task(task_of(se), &se->avg);
-	trace_sched_load_avg_cpu(cpu, cfs_rq);
+	trace_sched_load_avg_cpu(cpu, &cpu_rq(cpu)->cfs);
 }

 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Yeah, this was a mistake. We're about to fix this by incorporating the 'tg->css.id' into the trace event so that we can trace tg-owned cfs_rq's as well. Patrick has the details because these trace events have to be in sync with the LISA test tool.
Thanks for confirmation. Please see below two questions.
I'm aware of the two following changes in this area:
commit 6e83d44d8844d8ff21a965a913818e078a54b1b6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Jan 8 10:55:03 2016 +0000
Fix/DEBUG: sched: Move trace_sched_load_avg_cpu() to update_cfs_rq_load_avg()

In a system with periodic timer ticks (constant rate, no dynticks), in case a cpu becomes idle, the cfs_rq.avg.load_avg/util_avg signals are still updated via:

scheduler_tick() -> trigger_load_balance(rq) -> run_rebalance_domains() -> rebalance_domains() -> update_blocked_averages()

The function update_blocked_averages() calls update_cfs_rq_load_avg() and not update_load_avg(), so in this situation cfs_rq.avg.load_avg/util_avg signal updates are not traced correctly.

!!! There are no updates via scheduler_tick() -> task_tick_fair() -> entity_tick() since current is the idle task. !!!
So the occasions for updating a CPU's cfs_rq.avg.load_avg/util_avg signals are:
- scheduler tick triggering load balance;
- scheduler tick when current is not the idle task;
- enqueuing/dequeuing a task on this CPU.

Am I missing other paths that update the CPU's utilization?
And for the "dequeue one task" case: though the scheduler updates the running CPU's (CPU_A's) utilization, that doesn't mean the task's utilization is removed from the task's previous CPU's (CPU_B's) utilization right away (CPU_A may be different from CPU_B). The scheduler just temporarily parks the task's util_avg value in cfs_rq->removed_util_avg, and only the next time the task's previous CPU (CPU_B) really wakes up will it update its own cfs_rq->avg.util_avg. So that means, for a period, CPU_B keeps the task's utilization value, right?
If accepted this patch has to be stashed into:

DEBUG: sched: add tracepoint for CPU load/util signals

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ac9ea36d2fb..c4b863d5a8d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2802,6 +2802,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif

+	trace_sched_load_avg_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq);
+
 	return decayed || removed;
 }

@@ -2825,7 +2827,6 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

 	if (entity_is_task(se))
 		trace_sched_load_avg_task(task_of(se), &se->avg);
-	trace_sched_load_avg_cpu(cpu, cfs_rq);
 }
commit 0c8a8fb2af52a29557e1b0aa3baf2b2d6da70334
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Wed Mar 16 14:18:09 2016 +0000
Fix/DEBUG: sched: Add the css.id to sched_load_avg_cpu trace_event

This patch allows us to distinguish between the load signals of the root cfs_rq (representing the root task group w/ CONFIG_FAIR_GROUP_SCHED=y, or simply the cpu) and the cfs_rq's representing other task groups, by printing out the related css.id, which is unique for each task group system wide.

In case of a system w/ CONFIG_FAIR_GROUP_SCHED not set we simply print a zero for the css.id.

Please note that the css.id numbering scheme starts with 1 for the root task group.

If accepted this patch has to be stashed into:

DEBUG: sched: add tracepoint for CPU load/util signals

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7ec9dcfc701a..0724bd03773d 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -641,18 +641,25 @@ TRACE_EVENT(sched_load_avg_cpu,

 	TP_STRUCT__entry(
 		__field( int,		cpu			)
+		__field( int,		id			)
 		__field( unsigned long,	load_avg		)
 		__field( unsigned long,	util_avg		)
 	),

 	TP_fast_assign(
 		__entry->cpu		= cpu;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		__entry->id		= cfs_rq->tg->css.id;
+#else
+		__entry->id		= 0;
+#endif
I tried to dump CPUs' task groups and saw that different CPUs may have the same css.id for their task groups.

So we need to use the CPU number together with the css.id to identify a unique task group, right? And the root cfs_rq's css.id equals 1.
Thanks, Leo Yan
 		__entry->load_avg	= cfs_rq->avg.load_avg;
 		__entry->util_avg	= cfs_rq->avg.util_avg;
 	),

-	TP_printk("cpu=%d load_avg=%lu util_avg=%lu",
-		  __entry->cpu, __entry->load_avg, __entry->util_avg)
+	TP_printk("cpu=%d tg_css_id=%d load_avg=%lu util_avg=%lu",
+		  __entry->cpu, __entry->id, __entry->load_avg,
+		  __entry->util_avg)
 );
Hi Leo,
On 01/04/16 07:32, Leo Yan wrote:
Hi Dietmar,
On Thu, Mar 31, 2016 at 12:32:32PM +0200, Dietmar Eggemann wrote:
On 03/30/2016 12:11 PM, Leo Yan wrote:
When logging a CPU's load and utilization, we should use the CPU's cfs_rq directly. Using the task's cfs_rq can report wrong values, since that may be a task_group's cfs_rq rather than the real CPU's cfs_rq.
[...]
commit 6e83d44d8844d8ff21a965a913818e078a54b1b6
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date:   Fri Jan 8 10:55:03 2016 +0000
Fix/DEBUG: sched: Move trace_sched_load_avg_cpu() to update_cfs_rq_load_avg()

In a system with periodic timer ticks (constant rate, no dynticks), in case a cpu becomes idle, the cfs_rq.avg.load_avg/util_avg signals are still updated via:

scheduler_tick() -> trigger_load_balance(rq) -> run_rebalance_domains() -> rebalance_domains() -> update_blocked_averages()

The function update_blocked_averages() calls update_cfs_rq_load_avg() and not update_load_avg(), so in this situation cfs_rq.avg.load_avg/util_avg signal updates are not traced correctly.

!!! There are no updates via scheduler_tick() -> task_tick_fair() -> entity_tick() since current is the idle task. !!!
So the occasions for updating CPU's cfs_rq.avg.load_avg/util_avg signals are:
- Scheduler tick to trigger load balance;
- Scheduler tick when current is not the idle task;
Do you mean the update_blocked_averages() calls in rebalance_domains() and idle_balance()?
- Enqueue/dequeue one task for this CPU;
Yes, enqueue_task_fair(*) and dequeue_task_fair(*). ( with (*) sched class fair entry points)
Am I missing other paths that update the CPU's utilization?
update_cfs_rq_load_avg() is called from update_load_avg() too, so there are plenty of other call sites as well (e.g. pick_next_task_fair(*), set_curr_task_fair(*), put_prev_task_fair(*) and task_tick_fair(*)).
Don't forget attach_task_cfs_rq() and detach_task_cfs_rq() from switched_to_fair(*), switched_from_fair(*) and task_move_group_fair(*)
So it's pretty much all over the place.
And for the "dequeue one task" case: though the scheduler updates the running CPU's (CPU_A's) utilization, that doesn't mean the task's utilization is removed from the task's previous CPU's (CPU_B's) utilization right away (CPU_A may be different from CPU_B). The scheduler just temporarily parks the task's util_avg value in cfs_rq->removed_util_avg, and only the next time the task's previous CPU (CPU_B) really wakes up will it update its own cfs_rq->avg.util_avg. So that means, for a period, CPU_B keeps the task's utilization value, right?
You're talking about remove_entity_load_avg() which is called in migrate_task_rq_fair(*) and task_dead_fair(*)? When called we don't have CPU_B's rq lock. We defer CPU_B's &cfs_rq->avg update till the next call to update_cfs_rq_load_avg() on CPU_B.
Also look where cfs_rq->avg.load_avg, cfs_rq->avg.util_avg are actually used in CFS in comparison to the non-blocked version cfs_rq->runnable_load_avg.
[..]
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 7ec9dcfc701a..0724bd03773d 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -641,18 +641,25 @@ TRACE_EVENT(sched_load_avg_cpu,
TP_STRUCT__entry( __field( int, cpu )
__field( int, id ) __field( unsigned long, load_avg ) __field( unsigned long, util_avg ) ), TP_fast_assign( __entry->cpu = cpu;
+#ifdef CONFIG_FAIR_GROUP_SCHED
__entry->id = cfs_rq->tg->css.id;
+#else
__entry->id = 0;
+#endif
I tried to dump CPUs' task groups and saw that different CPUs may have the same css.id for their task groups.
That's right.
So we need to use the CPU number together with the css.id to identify a unique task group, right? And the root cfs_rq's css.id equals 1.
Exactly. E.g. with pivot/filter or multiple filters in LISA/trappy it's easily done.
[..]