-----Original Message-----
From: Doug Smythies <dsmythies@telus.net>
Sent: Wednesday, February 09, 2022 2:23 PM
To: Tang, Feng <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>; paulmck@kernel.org; stable@vger.kernel.org; x86@kernel.org; linux-pm@vger.kernel.org; srinivas pandruvada <srinivas.pandruvada@linux.intel.com>; dsmythies <dsmythies@telus.net>
Subject: Re: CPU excessively long times between frequency scaling driver calls - bisected
On Tue, Feb 8, 2022 at 1:15 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platforms"):
There are now occasions where the times between calls to the driver can exceed hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed between these widely spaced calls, one can tell that the CPU has been running code, yet the driver never got called.
Attached are some graphs from trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time. Yet CPU10 never gets called, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, i.e. 126 seconds since the last call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with the cpufreq code.
One thing I can think of is that, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops over all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
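If you want to check whether those per-CPU timer interrupts are still firing, one quick way (just a suggestion; the 2-second sampling interval is arbitrary) is to watch the local timer interrupt counts in /proc/interrupts:

watch -n 2 'grep LOC: /proc/interrupts'

On a kernel with the watchdog active, every CPU's LOC count should tick up at least every 6 seconds or so; without it, an idle CPU's count may sit still much longer.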
Six seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel, the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel, the driver was called 110,241 times; 1,397 times the duration was longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
Thanks for the data, which shows this is related to the removal of the clocksource watchdog timers. Under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages, and some system settings, such as the tickless mode, so that other people can try to reproduce? Thanks.
I steal the kernel configuration file from the Ubuntu mainline PPA [1], what they call "lowlatency", i.e. the 1000 Hz tick. I make these changes before compiling:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
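For anyone following along, the rest of my build flow is nothing special; something like the below (the Debian package target is just my habit, not a requirement):

make olddefconfig
make -j"$(nproc)" bindeb-pkg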
I will also send you the config and dmesg files in an off-list email.
This is an idle, very-low-periodic-load type of system test. My test computer has no GUI and very few services running. Notice that I have not used the word "regression" yet in this thread, because I don't know for certain that it is one. In the end, we don't care about CPU frequency, we care about wasting energy. It is definitely a change, and I am able to measure small increases in energy use, but this is all at the low end of the power curve.
What do you use to measure the energy use? And what difference do you observe?
So far I have not found a significant example of increased power use, but I also have not looked very hard.
During any test, many monitoring tools might shorten durations. For example if I run turbostat, say:
sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
Well, yes, then the maximum duration would be 2.5 seconds, because turbostat wakes up each CPU to inquire about things, causing a call to the CPU scaling driver. (I tested this for about 900 seconds.)
For my power tests I use a sample interval of >= 300 seconds.
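A typical power test command, then, looks something like this (the column list is just what I happen to watch, not anything required):

sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgWatt,PkgTmp --interval 300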
So you use something like "turbostat sleep 900" for the power test, and the RAPL energy counters show the power difference? Can you paste the turbostat output both w/ and w/o the watchdog?
Thanks, rui
For duration only tests, turbostat is not run at the same time.
My grub line:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
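(For anyone reproducing: on Ubuntu, after editing /etc/default/grub, something like the following applies the change; details may differ on other distros:)

sudo update-grub
sudo reboot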
A typical pstate tracer command (with the script copied to the directory where I run this stuff):
sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
Can you try one test that keeps all the current settings but changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. A long time ago I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
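For reference, the steps I had in mind were roughly (the device name and irq number below are placeholders):

grep eth0 /proc/interrupts
echo fff > /proc/irq/xxx/smp_affinity

where eth0 is whatever your NIC is called and xxx is its irq number from the first command; as said above, a recent kernel may reject or ignore the multi-CPU mask.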
I did not try this yet.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/