On Wed, Feb 23, 2022 at 3:49 AM Feng Tang feng.tang@intel.com wrote:
Hi Rafael,
On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting callback from scheduler. Can we check that if scheduler woke up on those CPUs? We can run "trace-cmd -e sched" and check in kernel shark if there is similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought is related with the cpufreq callback is not called timeley from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72sec between two
updates: CPU 2 34412.597161 busy=78 freq=3232653 34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in scheduler (since last frequency was higher these small work will also run at higher frequency). But those threads are not CFS class, so scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
Thanks for checking this!
I understand the time limit as we are approaching the close of 5.17, but still I don't think reverting commit b50db7095fe0 is the best solution as:
b50db7095fe0 is not just an optimization, but solves some real problems found in servers from big CSP (Cloud Service Provider) and data center's server room.
IMHO, b50db7095fe0 itself hasn't done anything wrong.
This problem found by Doug is a rarely happened case, though it is an expected thing as shown in existing comments of cfs_rq_util_change():
/* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be * a real problem. * * It will not get called when we go idle, because the idle * thread is a different class (!fair), nor will the utilization * number include things like RT tasks. * * As is, the util number is not freq-invariant (we'd have to * implement arch_scale_freq_capacity() for that). * * See cpu_util_cfs(). */ cpufreq_update_util(rq, flags);
As this happens with HWP-disabled case and a very calm system, can we find a proper solution in 5.17/5.18 or later, and cc stable?
We can.