Hi,
This is based on the work done by Steve Muckle [1] before he left Linaro,
and most of the patches are still under his authorship. I have made a
couple of improvements (detailed in the individual patches) and removed the
late callback support [2], as I wasn't sure of the value it adds. We can
include it separately if others feel it is required. This series is
based on pm/linux-next with patches [3] and [4] applied on top of it.
With the Android UI and benchmarks, the latency of the cpufreq response to
certain scheduling events can become critical. Currently, callbacks
into schedutil are only made from the scheduler if the target CPU of the
event is the same as the current CPU. This means there are certain
situations where a target CPU may not run schedutil for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the above-mentioned limitation, though, this does not occur.
This was verified using ftrace with the sample application [5].
This patchset updates the scheduler to issue cpufreq callbacks for remote
CPUs as well and updates the schedutil governor to deal with them. An
additional flag is added to cpufreq policies to avoid sending IPIs to
remote CPUs for frequency updates, when any CPU on the platform can change
the frequency of any other CPU.
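As a rough sketch of the intended flow (illustrative only; the helper name
below is made up, while irq_work_queue_on() and the dvfs_possible_from_any_cpu
policy flag come from the patches in this series), a callback targeting a
remote CPU could be handled roughly as:

/*
 * Illustrative only: handling a schedutil update whose target CPU is
 * not the current CPU. If the policy says any CPU may perform DVFS
 * for any other CPU, process the update locally; otherwise queue the
 * irq_work on the target CPU so the update runs there.
 */
static void sugov_handle_remote(struct sugov_policy *sg_policy, int target_cpu)
{
	if (sg_policy->policy->dvfs_possible_from_any_cpu) {
		irq_work_queue(&sg_policy->irq_work);	/* no cross-CPU IPI */
		return;
	}

	irq_work_queue_on(&sg_policy->irq_work, target_cpu);
}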
This series was tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on an ARM Hikey
board (64-bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patchset only targets a corner case, where all of
the following must be true for performance to improve, and that doesn't
happen too often with these tests:
- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to
higher OPPs.
- And without this patchset, the target CPU doesn't call into schedutil
until the next tick.
--
viresh
[1] https://git.linaro.org/people/steve.muckle/kernel.git/log/?h=pmwg-integrati…
[2] https://git.linaro.org/people/steve.muckle/kernel.git/commit/?h=pmwg-integr…
[3] https://marc.info/?l=linux-kernel&m=148766093718487&w=2
[4] https://marc.info/?l=linux-kernel&m=148903231720432&w=2
[5] http://pastebin.com/7LkMSRxE
Steve Muckle (8):
sched: cpufreq: add cpu to update_util_data
irq_work: add irq_work_queue_on for !CONFIG_SMP
sched: cpufreq: extend irq work to support fast switches
sched: cpufreq: remove smp_processor_id() in remote paths
sched: cpufreq: detect, process remote callbacks
cpufreq: governor: support scheduler cpufreq callbacks on remote CPUs
intel_pstate: ignore scheduler cpufreq callbacks on remote CPUs
sched: cpufreq: enable remote sched cpufreq callbacks
Viresh Kumar (1):
cpufreq: Add dvfs_possible_from_any_cpu policy flag
drivers/cpufreq/cpufreq-dt.c | 1 +
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/intel_pstate.c | 3 ++
include/linux/cpufreq.h | 9 +++++
include/linux/irq_work.h | 7 ++++
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 80 +++++++++++++++++++++++++++++---------
kernel/sched/fair.c | 6 ++-
kernel/sched/sched.h | 3 +-
10 files changed, 90 insertions(+), 23 deletions(-)
--
2.7.1.410.g6faf27b
Hi,
we are only a week away from the OSPM-summit!
Pack your bags (or stay tuned for the live streaming).
Don't forget to subscribe to the summit mailing list to receive updates by
either following the instructions available at the following link
http://groups.google.com/group/ospm-summit-2017/boxsubscribe?email=<your_email>
or sending an email to ospm-summit-2017+subscribe(a)googlegroups.com
Archives are available at https://groups.google.com/forum/#!forum/ospm-summit-2017
More information about schedule and logistics follow.
---
Power Management and Scheduling in the Linux Kernel (OSPM-summit)
April 3-4, 2017
Scuola Superiore Sant'Anna (SSSA)
Pisa, Italy
http://retis.sssup.it/ospm-summit/
---
.:: FOCUS
Power management and scheduling techniques to reduce energy consumption while
meeting performance and latency requirements are receiving considerable
attention from the Linux Kernel development community.
The Power Management and Scheduling in the Linux Kernel summit (OSPM-summit)
aims to foster further interest and discussion.
.:: SCHEDULE
The summit is organized to cover two days of discussions and talks.
What follows is a tentative schedule, subject to last minute changes.
Find more info and real time updates on this shared document:
https://docs.google.com/spreadsheets/d/1B-IsUIGitvRa7ZzppEAJBMgGpIgRAAnu_Oh…
Monday (2017-04-03)
*******************
09:00AM - 09:30AM Welcome and Introduction (DAY 1)
---
09:30AM - 10:20AM Tooling/LISA
---
10:20AM - 11:10AM About The Need to Power Instrument The Linux Kernel
---
11:10AM - 11:20AM Break
---
11:20AM - 12:10PM What are the latest evolutions in PELT and what next
---
12:10PM - 01:00PM PELT decay clamping/UTIL_EST
---
01:00PM - 02:30PM Lunch
---
02:30PM - 03:20PM EAS where we are
---
03:20PM - 04:10PM Energy model/Exotic topologies
---
04:10PM - 04:20PM Break
---
04:20PM - 05:10PM Schedtune
---
05:10PM - 06:00PM SCHED_DEADLINE and reclaiming
Tuesday (2017-04-04)
********************
09:00AM - 09:30AM Welcome and Introduction (DAY 2)
---
09:30AM - 10:20AM Discussion about possible improvements in the schedutil governor
---
10:20AM - 11:10AM Schedutil for SCHED_DEADLINE
---
11:10AM - 11:20AM Break
---
11:20AM - 12:10PM Parameterizing CFS load balancing: nr_running/util/load
---
12:10PM - 01:00PM Tracepoints for PELT
---
01:00PM - 02:30PM Lunch
---
02:30PM - 03:20PM IRQ prediction
---
03:20PM - 04:10PM I/O scheduling and power management with storage devices
---
04:10PM - 04:20PM Break
---
04:20PM - 05:10PM SCHED_DEADLINE group scheduling
---
05:10PM - 06:00PM A Hierarchical Scheduling Model for Dynamic Soft-Realtime Systems
---
06:00PM - 06:30PM Closing Remarks
We are looking into setting up live streaming of the sessions. Details
will be soon shared through the shared doc mentioned above and the
event mailing list.
List of attendees is also available in the doc and on the event website.
.:: VENUE
The workshop will take place at ReTiS Lab*, Scuola Superiore Sant'Anna, Pisa,
Italy. Pisa is a small town: the venue is a 20-minute walk from the city
center, and the city center is a 30-minute walk from the airport.
More details are available from the summit web page:
http://retis.sssup.it/ospm-summit/
A map of the town with venue location, points of interest and
transportation information is available at:
https://drive.google.com/open?id=1ANKOXr2cuZkABXskDurgrGdl_js&usp=sharing
Bus from Airport to city centre takes about 10 min and costs 1.20 euros
(2 euros if bought on board). Large bills are usually not accepted.
Taxi from Airport to city centre costs 10/15 euros. Credit cards are not
accepted.
.:: ORGANIZERS (in alphabetical order)
Luca Abeni (SSSA)
Patrick Bellasi (ARM)
Tommaso Cucinotta (SSSA)
Dietmar Eggemann (ARM)
Sudeep Holla (ARM)
Juri Lelli (ARM)
Lorenzo Pieralisi (ARM)
Morten Rasmussen (ARM)
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is over-utilized. This patch introduces
an over-utilization flag per sched domain level instead of a single
system-wide flag. Load balancing is done at the sched domain where any
of the cpus is over-utilized. If energy aware scheduling is
enabled and no cpu in a sched domain is over-utilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
is placed in this structure so that all the sched domains at the
same level share it. When a cpu is overutilized, the flag gets set at
the lowest (level 1) sched_domain. The flag at the parent sched_domain
level gets set in either of the two following scenarios:
1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation can still have corner scenarios with respect to
misfit tasks. For example, consider a sched group with n cpus and
n+1 70%-utilized tasks. Ideally this is a case where load balancing should
happen in a parent sched domain. But the total group utilization is not
high enough for load balancing to be triggered in the parent domain, and
there is no cpu with a single overutilized task that would cause a load
balance to be triggered in a parent domain either. Then again, this could be
a purely academic scenario, as during task wake-up these tasks will be placed
more appropriately.
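For illustration only (not part of the patch), scenario 2 above boils down to
the utilization-versus-capacity check used later in update_sd_lb_stats();
assuming a per-cpu capacity of 1024 and the kernel's default capacity_margin
of 1280 (roughly an 80% threshold), it can be sketched as:

/*
 * Illustrative sketch of the scenario-2 condition, assuming
 * capacity_margin = 1280: the domain counts as overutilized once its
 * summed utilization exceeds ~80% of its summed capacity.
 *
 * For the corner case above with n = 8 cpus and 9 tasks at 70%:
 *   total_util     = 9 * 717  = 6453
 *   total_capacity = 8 * 1024 = 8192
 *   8192 * 1024 (8388608) is not less than 6453 * 1280 (8259840),
 * so the parent flag is indeed not set by this check.
 */
static bool sd_util_exceeds_capacity(unsigned long total_util,
				     unsigned long total_capacity)
{
	return total_capacity * 1024 < total_util * 1280;
}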
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is
triggered in a parent sched domain with asymmetric cpu capacities.
include/linux/sched.h | 1 +
kernel/sched/core.c | 7 ++-
kernel/sched/fair.c | 138 +++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 3 --
4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ bool overutilized;
};
struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..9d2bb07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ return sd->shared->overutilized;
+ else
+ return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = false;
+}
+
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
+ /* The RCU read lock must be held by the caller */
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
-
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload, bool *overutilized)
+ bool *overload, bool *overutilized, bool *misfit_task)
{
unsigned long load;
int i, nr_running;
@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*overutilized = true;
+ /*
+ * If the cpu is overutilized and if there is only one
+ * current task in cfs runqueue, it is potentially a misfit
+ * task.
+ */
+ if (rq->cfs.h_nr_running == 1)
+ *misfit_task = true;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -7825,11 +7868,11 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
*/
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
+ struct sched_domain *child = env->sd->child, *sd;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
+ bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload, &overutilized);
+ &overload, &overutilized,
+ &misfit_task);
if (local_group)
goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
+ }
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
+ if (overutilized)
+ set_sd_overutilized(env->sd);
+ else
+ clear_sd_overutilized(env->sd);
+
+ /*
+ * If there is a misfit task on one cpu in this sched_domain,
+ * it is likely that the imbalance cannot be sorted out among
+ * the cpus in this sched_domain. In this case set the
+ * overutilized flag at the parent sched_domain.
+ */
+ if (misfit_task) {
+
+ sd = env->sd->parent;
+
+ /*
+ * In case of a misfit task, load balancing at the parent
+ * sched domain level will make sense only if the cpus
+ * have different capacities. If the cpus at a domain level have
+ * the same capacity, the misfit task cannot be better
+ * accommodated on any of them and there is no point in
+ * trying a load balance at this level.
+ */
+ while (sd) {
+ if (sd->flags & SD_ASYM_CPUCAPACITY) {
+ set_sd_overutilized(sd);
+ break;
+ }
+ sd = sd->parent;
+ }
}
+
+ /* If the domain utilization is greater than the domain capacity,
+ * load balancing needs to be done at the next sched domain level as well.
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+ set_sd_overutilized(env->sd->parent);
}
/**
@@ -8122,8 +8198,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8981,6 +9059,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9363,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9373,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa98ab3..b24cefa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -563,9 +563,6 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;
- /* Indicate one or more cpus over-utilized (tipping point) */
- bool overutilized;
-
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
2.1.4
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor. However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency. The latter
is where the real overhead comes from. The former is much less
expensive in terms of execution time, and running it every time the
governor callback is invoked by the scheduler (after the rate_limit_us
interval has passed since the last frequency update) would not be a
problem.
For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).
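As a rough illustration (not part of this patch): with last_freq_update_time
now refreshed only when next_freq actually changes, the existing rate-limit
test in sugov_should_update_freq() ends up throttling driver invocations
rather than governor re-evaluations. A simplified form of that test, using
the existing sugov_policy fields, would be:

/*
 * Simplified sketch of the existing rate-limit check. After this patch
 * it effectively gates only calls into the scaling driver, because
 * last_freq_update_time is updated solely when the frequency changes.
 */
static bool sugov_rate_limited(struct sugov_policy *sg_policy, u64 time)
{
	s64 delta_ns = time - sg_policy->last_freq_update_time;

	return delta_ns < sg_policy->freq_update_delay_ns;
}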
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
V1->V2: Update $subject and commit log (Rafael)
kernel/sched/cpufreq_schedutil.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd4659313640..306d97e7b57c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,14 +92,13 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
{
struct cpufreq_policy *policy = sg_policy->policy;
- sg_policy->last_freq_update_time = time;
-
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
return;
}
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
next_freq = cpufreq_driver_fast_switch(policy, next_freq);
if (next_freq == CPUFREQ_ENTRY_INVALID)
return;
@@ -108,6 +107,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(next_freq, smp_processor_id());
} else if (sg_policy->next_freq != next_freq) {
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
--
2.7.1.410.g6faf27b
Sorry that I forgot to cc eas-dev list for this patch.
----- Forwarded message from Viresh Kumar <viresh.kumar(a)linaro.org> -----
Date: Wed, 15 Feb 2017 22:45:47 +0530
From: Viresh Kumar <viresh.kumar(a)linaro.org>
To: Rafael Wysocki <rjw(a)rjwysocki.net>, Ingo Molnar <mingo(a)redhat.com>, Peter Zijlstra <peterz(a)infradead.org>
Cc: linaro-kernel(a)lists.linaro.org, linux-pm(a)vger.kernel.org, linux-kernel(a)vger.kernel.org, Vincent Guittot <vincent.guittot(a)linaro.org>, Viresh Kumar <viresh.kumar(a)linaro.org>
Subject: [PATCH] cpufreq: schedutil: govern how frequently we change frequency with rate_limit
X-Mailer: git-send-email 2.7.1.410.g6faf27b
For an ideal system (where frequency change doesn't incur any penalty)
we would like to change the frequency as soon as the load changes for a
CPU. But the systems we have to work with are far from ideal and it
takes time to change the frequency of a CPU. For many ARM platforms
especially, it is at least 1 ms. In order not to spend too much time
changing the frequency, we earlier introduced a sysfs-controlled
tunable for the schedutil governor: rate_limit_us.
Currently, rate_limit_us controls how frequently we reevaluate frequency
for a set of CPUs controlled by a cpufreq policy. But that may not be
the ideal behavior we want.
Consider for example the following scenario. The rate_limit_us tunable
is set to 10 ms. The CPU has a constant load X and that requires the
frequency to be set to Y. The schedutil governor changes the frequency
to Y, updates last_freq_update_time and we wait for 10 ms to reevaluate
the frequency again. After 10 ms, the schedutil governor reevaluates the
load and finds it to be the same. And so it doesn't update the
frequency, but updates last_freq_update_time before returning. Right
after this point, the scheduler puts more load on the CPU and the CPU
needs to go to a higher frequency Z. Because last_freq_update_time was
updated just now, the schedutil governor waits for additional 10ms
before reevaluating the load again.
Normally, the time it takes to reevaluate the frequency is negligible
compared to the time it takes to change the frequency. And considering
that in the above scenario we hadn't updated the frequency for over
10 ms, we should have changed the frequency as soon as the load changed.
This patch changes the way rate_limit_us is used, i.e. It now governs
"How frequently we change the frequency" instead of "How frequently we
reevaluate the frequency".
One may think that this change may have increased the number of times we
reevaluate the frequency after a period of rate_limit_us has expired
since the last change, if the load isn't changing. But that is protected
by the scheduler as normally it doesn't call into the schedutil governor
before 1 ms (Hint: "decayed" in update_cfs_rq_load_avg()) since the
last call.
Tests were performed with this patch on a dual-cluster (same frequency
domain), octa-core ARM64 platform (Hikey). Hackbench (Debian) and
Vellamo/Galleryfling (Android) didn't show much difference in
performance with or without this patch.
It is difficult to create a test case (rt-app was tried as well) where this
patch shows a lot of improvement, as the target of this patch is a
real corner case: the current load is X (resulting in a frequency change),
the load after rate_limit_us is also X, but right after that the load becomes Y.
Undoubtedly this patch would improve responsiveness in such cases.
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
kernel/sched/cpufreq_schedutil.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd4659313640..306d97e7b57c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,14 +92,13 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
{
struct cpufreq_policy *policy = sg_policy->policy;
- sg_policy->last_freq_update_time = time;
-
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
return;
}
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
next_freq = cpufreq_driver_fast_switch(policy, next_freq);
if (next_freq == CPUFREQ_ENTRY_INVALID)
return;
@@ -108,6 +107,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(next_freq, smp_processor_id());
} else if (sg_policy->next_freq != next_freq) {
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
--
2.7.1.410.g6faf27b
----- End forwarded message -----
--
viresh
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is over-utilized. This patch introduces
an over-utilization flag per sched domain level instead of a single
system-wide flag. Load balancing is done at the sched domain where any
of the cpus is over-utilized. If energy aware scheduling is
enabled and no cpu in a sched domain is over-utilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
is placed in this structure so that all the sched domains at the
same level share it. When a cpu is overutilized, the flag gets set at
the lowest (level 1) sched_domain. The flag at the parent sched_domain
level gets set in either of the two following scenarios:
1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 7 ++-
kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++-----------
3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ bool overutilized;
};
struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..485f597 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ return sd->shared->overutilized;
+ else
+ return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = false;
+}
+
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
+ /* The RCU read lock must be held by the caller */
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
-
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload, bool *overutilized)
+ bool *overload, bool *overutilized, bool *misfit_task)
{
unsigned long load;
int i, nr_running;
@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*overutilized = true;
+ /*
+ * If the cpu is overutilized and if there is only one
+ * current task in cfs runqueue, it is potentially a misfit
+ * task.
+ */
+ if (rq->cfs.h_nr_running == 1)
+ *misfit_task = true;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
+ bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload, &overutilized);
+ &overload, &overutilized,
+ &misfit_task);
if (local_group)
goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
-
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
}
+
+ if (overutilized)
+ set_sd_overutilized(env->sd);
+ else
+ clear_sd_overutilized(env->sd);
+
+ /*
+ * If there is a misfit task on one cpu in this sched_domain,
+ * it is likely that the imbalance cannot be sorted out among
+ * the cpus in this sched_domain. In this case set the
+ * overutilized flag at the parent sched_domain.
+ */
+ if (misfit_task)
+ set_sd_overutilized(env->sd->parent);
+
+ /* If the domain utilization is greater than the domain capacity,
+ * load balancing needs to be done at the next sched domain level as well.
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+ set_sd_overutilized(env->sd->parent);
}
/**
@@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
/*
--
2.1.4
Power Management and Scheduling in the Linux Kernel (OSPM-summit)
April 3-4, 2017
Scuola Superiore Sant'Anna (SSSA)
Pisa, Italy
http://retis.sssup.it/ospm-summit/
---
.:: FOCUS
Power management and scheduling techniques to reduce energy consumption while
meeting performance and latency requirements are receiving considerable
attention from the Linux Kernel development community.
The Power Management and Scheduling in the Linux Kernel summit (OSPM-summit)
aims to foster further interest and discussion.
.:: FORMAT
The summit is organized to cover two days of discussions and talks.
The first day is mainly focused on discussion and hacking sessions about
topics/patches that are already under review on the Linux kernel mailing
lists, and on debating and planning development tasks for more
forward-looking work items centred around power management in the Linux
kernel. The list of topics includes (but is not limited to):
* Energy Aware Scheduling: next steps and energy model expression;
* SCHED_DEADLINE reclaiming of unused bandwidth, coupling with schedutil
cpufreq governor and group scheduling support;
* fix the load metric exposed to cpuidle;
* IRQ prediction;
* ACPI power management: kernel/firmware bindings and development model;
The second day instead welcomes presentations from both end users and
developers on topics related to power management and scheduling in Linux,
covering (but not limited to):
* Power management techniques
* Real-time and non real-time scheduling techniques
* Energy awareness
* Mobile/Server power management real-world use cases (successes and
failures)
* Power management and scheduling tooling (tracing, configuration,
integration testing, etc.)
Presentations can cover recently developed technologies, ongoing work and new
ideas. Please understand that this workshop is not intended for presenting
sales and marketing pitches.
.:: ATTENDING
Attending the OSPM-summit is free of charge, but registration to the event is
mandatory. The event can host a maximum of 50 people (so, be sure to register
early!). To register send an email to ospm-registration(a)retis.sssup.it. While
it is not strictly required to submit a topic/presentation, registrations with
a topic/presentation proposal will take precedence.
.:: VENUE
The workshop will take place at ReTiS Lab*, Scuola Superiore Sant'Anna, Pisa,
Italy. Pisa is a small town: the venue is a 20-minute walk from the city
center, and the city center is a 30-minute walk from the airport.
More details are available from the summit web page:
http://retis.sssup.it/ospm-summit/
* https://goo.gl/maps/2pPXG2v7Lfp
.:: SUBMIT A TOPIC/PRESENTATION
To submit a topic/presentation send an email to
ospm-registration(a)retis.sssup.it specifying:
subject
- [TOPIC] or [PRESENTATION]
- short title
body
- first name, family name
- abstract/topic of interest
- affiliation (if any)
- short biography
- expected duration (only for topics, presentations get 30min slots)
Deadline for submitting topics/presentations is 26th of February 2017.
Notifications for accepted topics/presentations will be sent out on 5th of
March 2017.
.:: ORGANIZERS (in alphabetical order)
Luca Abeni (SSSA)
Patrick Bellasi (ARM)
Tommaso Cucinotta (SSSA)
Dietmar Eggemann (ARM)
Sudeep Holla (ARM)
Juri Lelli (ARM)
Lorenzo Pieralisi (ARM)
Morten Rasmussen (ARM)
This patch series improves load balancing behaviour for misfit tasks.
The current code introduces the type 'group_misfit_task' to indicate
that a sched group has a misfit task, but before the misfit task can
actually be migrated onto a higher capacity CPU there are still some
barriers we need to clear up.
The first patch corrects task_fits_max() so it can properly filter
out misfit tasks on low capacity CPUs. Without this patch, it is
possible for this function to always return true on a system, so the
'misfit' task mechanism is never triggered.
The second patch fixes group_smaller_cpu_capacity(), so we can make
sure a sched group with type 'group_misfit_task' is not wrongly rolled
back to type 'group_other', which would cause all misfit related info
to be abandoned.
The third patch fixes nr_running accounting; without it the scheduler
wrongly considers that the destination CPU has a running task and skips
migrating a task onto it. This patch reports correctly that the
destination CPU has no running task when it is going into the idle
state, so the misfit task can be migrated during this idle balance.
The fourth patch is a temporary patch for when Vincent's patch series
"sched: reflect sched_entity move into task_group's load" [1] has not
been backported. Without that series, it is possible for a CPU not to
be overutilized even though a misfit task has been enqueued on it.
So we set sgs->group_misfit_task by checking rq->misfit_task rather
than relying on whether the cpu is overutilized.
The fifth patch selects an rq as busiest if it has a misfit task; we
give this kind of rq higher priority than the rq with the highest
weighted load. This criterion is only enabled for energy aware
scheduling.
The sixth patch aggressively kicks active load balancing for a misfit
task, so there is quite a high chance for a higher capacity CPU to
immediately pull the misfit task onto itself.
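As a purely illustrative sketch of the idea behind the fifth and sixth
patches (the helper name and exact conditions are assumptions, not the
posted code), the busiest-rq/active-balance preference could look roughly
like:

/*
 * Illustrative only: prefer pulling from an rq that carries a misfit
 * task when the pulling CPU has more capacity, and let the same
 * condition justify kicking an active load balance. Relies on the
 * rq->misfit_task flag mentioned above and on the EAS helpers
 * energy_aware() and capacity_of().
 */
static bool misfit_needs_active_balance(struct lb_env *env, struct rq *busiest)
{
	return energy_aware() && busiest->misfit_task &&
	       capacity_of(env->dst_cpu) > capacity_of(cpu_of(busiest));
}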
[1] https://lkml.org/lkml/2016/10/17/223
Leo Yan (6):
sched/fair: correct task_fits_max() for misfit task
sched/fair: fix for group_smaller_cpu_capacity()
sched/fair: fix nr_running accounting for new idle CPU
sched/fair: fix to set sgs->group_misfit_task
sched/fair: select busiest rq with misfit task
sched/fair: kick active load balance for misfit task
kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 46 insertions(+), 13 deletions(-)
--
2.7.4