Hi All,
Note before: I do not know if I have the e-mail address list correct, nor am I actually a member of the x86 distribution list. I am on the linux-pm email list.
When using the intel_pstate CPU frequency scaling driver with HWP disabled, active mode, powersave scaling governor, the times between calls to the driver have never exceeded 10 seconds.
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Why this particular configuration, i.e. no-hwp, active, powersave? Because it is, by far, the easiest to observe what is going on.
... Doug
Hi Doug,
Thanks for the report.
On Tue, Feb 08, 2022 at 09:40:14AM +0800, Doug Smythies wrote:
Hi All,
Note before: I do not know if I have the e-mail address list correct, nor am I actually a member of the x86 distribution list. I am on the linux-pm email list.
When using the intel_pstate CPU frequency scaling driver with HWP disabled, active mode, powersave scaling governor, the times between calls to the driver have never exceeded 10 seconds.
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
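The arithmetic behind this can be sketched as follows (a minimal sketch, assuming the 500 ms watchdog period and round-robin rotation across CPUs as described above):

```shell
# Pre-commit behavior (per the description above): the clocksource watchdog
# timer fires every 500 ms and rotates across all CPUs in turn, so each CPU
# sees a hardware timer interrupt roughly every period_ms * ncpus.
period_ms=500
ncpus=12
per_cpu_s=$(( period_ms * ncpus / 1000 ))
echo "each CPU gets a watchdog timer roughly every ${per_cpu_s} seconds"
```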
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
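For reference, a sketch of how such an affinity mask is formed and where it would be written. The IRQ number 24 below is purely hypothetical (real IRQ numbers are listed in /proc/interrupts), and as noted later in the thread the write may be rejected on recent kernels:

```shell
# Build a CPU affinity mask covering the first 12 CPUs: (1 << 12) - 1 = 0xfff.
ncpus=12
mask=$(printf '%x' $(( (1 << ncpus) - 1 )))
echo "mask=$mask"
# The (hypothetical) write itself would then be, for IRQ 24:
#   echo $mask | sudo tee /proc/irq/24/smp_affinity
```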
Thanks, Feng
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Why this particular configuration, i.e. no-hwp, active, powersave? Because it is, by far, the easiest to observe what is going on.
... Doug
Hi Feng,
Thank you for your reply.
On Mon, Feb 7, 2022 at 6:39 PM Feng Tang <feng.tang@intel.com> wrote:
Hi Doug,
Thanks for the report.
On Tue, Feb 08, 2022 at 09:40:14AM +0800, Doug Smythies wrote:
Hi All,
Note before: I do not know if I have the e-mail address list correct, nor am I actually a member of the x86 distribution list. I am on the linux-pm email list.
When using the intel_pstate CPU frequency scaling driver with HWP disabled, active mode, powersave scaling governor, the times between calls to the driver have never exceeded 10 seconds.
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
6 seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 of the durations were longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
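For a rough sense of scale, the two 9000-second runs can be compared like this (the counts are taken from the figures above):

```shell
# Compare the reverted and non-reverted 9000-second runs reported above.
awk 'BEGIN {
  total_s   = 9000
  rev_calls = 131743      # reverted kernel: calls to the driver
  new_calls = 110241      # non-reverted kernel: calls to the driver
  new_long  = 1397        # non-reverted: durations longer than 6.1 s
  printf "reverted: %.3f s average between calls\n", total_s / rev_calls
  printf "non-reverted: %.3f s average, %.2f%% of gaps > 6.1 s\n",
         total_s / new_calls, 100 * new_long / new_calls
}'
```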
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
Thanks, Feng
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
Why this particular configuration, i.e. no-hwp, active, powersave? Because it is, by far, the easiest to observe what is going on.
... Doug
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
6 seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 of the durations were longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
Thanks for the data, which shows it is related to the removal of the clocksource watchdog timers. And under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages, and some system settings (like tickless mode), so that other people can try to reproduce? Thanks.
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
Thanks, Feng
On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
6 seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 of the durations were longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
Thanks for the data, which shows it is related to the removal of the clocksource watchdog timers. And under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages, and some system settings (like tickless mode), so that other people can try to reproduce? Thanks.
I steal the kernel configuration file from the Ubuntu mainline PPA [1], which they call "lowlatency", or 1000 Hz tick. I make these changes before compiling:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I also sent you the config and dmesg files in an off-list email.
This is a test of an idle system with very low periodic loads. My test computer has no GUI and very few services running. Notice that I have not used the word "regression" yet in this thread, because I don't know for certain that it is one. In the end, we don't care about CPU frequency, we care about wasting energy. It is definitely a change, and I am able to measure small increases in energy use, but this is all at the low end of the power curve. So far I have not found a significant example of increased power use, but I also have not looked very hard.
During any test, many monitoring tools might shorten durations. For example if I run turbostat, say:
sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
Well, yes, then the maximum duration would be 2.5 seconds, because turbostat wakes up each CPU to inquire about things, causing a call to the CPU scaling driver. (I tested this, for about 900 seconds.)
For my power tests I use a sample interval of >= 300 seconds. For duration only tests, turbostat is not run at the same time.
My grub line:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
A typical pstate tracer command (with the script copied to the directory where I run this stuff):
sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
I did not try this yet.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
-----Original Message-----
From: Doug Smythies <dsmythies@telus.net>
Sent: Wednesday, February 09, 2022 2:23 PM
To: Tang, Feng <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>; paulmck@kernel.org; stable@vger.kernel.org; x86@kernel.org; linux-pm@vger.kernel.org; srinivas pandruvada <srinivas.pandruvada@linux.intel.com>; dsmythies <dsmythies@telus.net>
Subject: Re: CPU excessively long times between frequency scaling driver calls - bisected
On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
6 seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 of the durations were longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
Thanks for the data, which shows it is related to the removal of the clocksource watchdog timers. And under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages, and some system settings (like tickless mode), so that other people can try to reproduce? Thanks.
I steal the kernel configuration file from the Ubuntu mainline PPA [1], which they call "lowlatency", or 1000 Hz tick. I make these changes before compiling:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I also sent you the config and dmesg files in an off-list email.
This is a test of an idle system with very low periodic loads. My test computer has no GUI and very few services running. Notice that I have not used the word "regression" yet in this thread, because I don't know for certain that it is one. In the end, we don't care about CPU frequency, we care about wasting energy. It is definitely a change, and I am able to measure small increases in energy use, but this is all at the low end of the power curve.
What do you use to measure the energy use? And what difference do you observe?
So far I have not found a significant example of increased power use, but I also have not looked very hard.
During any test, many monitoring tools might shorten durations. For example if I run turbostat, say:
sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
Well, yes, then the maximum duration would be 2.5 seconds, because turbostat wakes up each CPU to inquire about things, causing a call to the CPU scaling driver. (I tested this, for about 900 seconds.)
For my power tests I use a sample interval of >= 300 seconds.
So you use something like "turbostat sleep 900" for power test, and the RAPL Energy counters show the power difference? Can you paste the turbostat output both w/ and w/o the watchdog?
Thanks, rui
For duration only tests, turbostat is not run at the same time.
My grub line:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
A typical pstate tracer command (with the script copied to the directory where I run this stuff):
sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
I did not try this yet.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
Hi Rui,
On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@intel.com> wrote:
On 2022.02.09 14:23 Doug wrote:
On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
commit b50db7095fe0 essentially disables the clocksource watchdog, which literally doesn't have much to do with cpufreq code.
One thing I can think of is, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops to run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
6 seconds is the longest duration I have ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and the duration was never longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 of the durations were longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
Thanks for the data, which shows it is related to the removal of the clocksource watchdog timers. And under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages, and some system settings (like tickless mode), so that other people can try to reproduce? Thanks.
I steal the kernel configuration file from the Ubuntu mainline PPA [1], which they call "lowlatency", or 1000 Hz tick. I make these changes before compiling:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I also sent you the config and dmesg files in an off-list email.
This is a test of an idle system with very low periodic loads. My test computer has no GUI and very few services running. Notice that I have not used the word "regression" yet in this thread, because I don't know for certain that it is one. In the end, we don't care about CPU frequency, we care about wasting energy. It is definitely a change, and I am able to measure small increases in energy use, but this is all at the low end of the power curve.
Please keep the above statement in mind. The differences were rather minor. Since Rui asks for data below, I have tried to find better examples.
What do you use to measure the energy use? And what difference do you observe?
I use both turbostat and a processor power monitoring tool of my own. Don't get me wrong, I love turbostat, but it has become somewhat heavy (lots of code per interval) over recent years. My own utility has minimal code per interval, only uses pre-allocated memory with no disk or terminal interaction during the sampling phase, resulting in minimal system perturbation due to itself.
So far I have not found a significant example of increased power use, but I also have not looked very hard.
I looked a little harder for this reply, searching all single-CPU loads for a few work/sleep frequencies (73, 113, 211, 347, 401 hertz), with a fixed work packet per work interval.
During any test, many monitoring tools might shorten durations. For example if I run turbostat, say:
sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
Well, yes, then the maximum duration would be 2.5 seconds, because turbostat wakes up each CPU to inquire about things, causing a call to the CPU scaling driver. (I tested this, for about 900 seconds.)
For my power tests I use a sample interval of >= 300 seconds.
So you use something like "turbostat sleep 900" for power test,
Typically I use 300 seconds for turbostat for this work. It was only the other day that I saw a duration of 302 seconds, so yes, I should try even longer, but it is becoming a tradeoff between experiments taking many, many hours and time available.
and the RAPL Energy counters show the power difference?
Yes.
Can you paste the turbostat output both w/ and w/o the watchdog?
Workflow:
Task 1: Purpose: main periodic load.
411 hertz work/sleep frequency and about 26.6% load at 4.8 GHz, max turbo (it would limit out at 100% duty cycle at about pstate 13).
Note: this is a higher load than I was using when I started this email thread.
Task 2: Purpose: To assist in promoting manifestation of the issue of potential concern with commit b50db7095fe0.
A short burst of work (about 30 milliseconds @ max turbo, longer at lower frequencies), then sleep for 45 seconds. Say, almost 0 load at 0.022 hertz. Greater than or equal to 30 milliseconds at full load ensures that the intel_pstate driver will be called at least 2 or 3 times, raising the requested pstate for that CPU.
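The Task 2 timing can be sketched as follows (a hypothetical sketch; the actual load-generator program is not shown in the thread):

```shell
# Task 2 timing, per the description above: ~30 ms burst, then a 45 s sleep.
# The generator itself would look roughly like (hypothetical):
#   while true; do
#     timeout 0.03 sh -c 'while :; do :; done'   # ~30 ms busy burst
#     sleep 45
#   done
awk 'BEGIN {
  burst_s = 0.030
  sleep_s = 45
  printf "work/sleep frequency: %.3f hertz\n", 1 / (burst_s + sleep_s)
  printf "duty cycle: %.3f%%\n", 100 * burst_s / (burst_s + sleep_s)
}'
```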
Task 3: Purpose: Acquire power data with minimum system perturbation due to this very monitoring task.
Both turbostat and my own power monitor were used, never at the same time (unless just to prove they reported the same power). This turbostat command:
turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
and my program also sampled at 300 seconds per sample, typically 5 samples, starting 1/2 hour after boot.
Kernels:
stock: 5.17-rc3 as is.
revert: 5.17-rc3 with b50db7095fe0 reverted.
Test 1: no-hwp/active/powersave:
Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
average: 4.7 watts (+9%)
minimum: 4.3 watts
maximum: 4.9 watts
Revert: 5 samples @ 300 seconds per sample:
average: 4.3 watts
minimum: 4.2 watts
maximum: 4.5 watts
Test 2: no-hwp/passive/schedutil:
In the beginning, active/powersave was used because it is the easiest (at least for me) to understand and interpret the intel_pstate_tracer.py results. Long durations are common in passive mode, because something higher up can decide not to call the driver because nothing changed. Very valuable information is lost, in my opinion.
I didn't figure it out for a while, but schedutil is bi-stable with this workflow, depending on whether it approaches steady state from a higher or lower previous load (i.e. hysteresis). With either kernel it might run for hours and hours at an average of 6.1 watts or 4.9 watts. This difference dominates, so trying to reveal whether there is any contribution from the commit of this thread is useless.
Test 3: Similar workflow as test1, but 347 Hertz and a little less work per work period.
Task 2 was changed to use more CPUs, to try to amplify manifestation of the effect.
Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
average: 4.2 watts (+31%)
minimum: 3.5 watts
maximum: 4.9 watts
Revert: 5 samples @ 300 seconds per sample:
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.2 watts
Re-do test 1, but with the improved task 2:
Turbostat:
Stock:
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.09    3335    282024  36      5.79    5.11    0.00    0.89
4.11    3449    283286  34      6.06    5.40    0.00    0.89
4.11    3504    284532  35      6.35    5.70    0.00    0.89
4.11    3493    283908  35      6.26    5.60    0.00    0.89
4.11    3499    284243  34      6.27    5.62    0.00    0.89
Revert:
doug@s19:~/freq-scalers/long_dur$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.12    3018    283114  34      4.57    3.91    0.00    0.89
4.12    3023    283482  34      4.59    3.94    0.00    0.89
4.12    3017    282609  34      4.71    4.05    0.00    0.89
4.12    3013    282898  34      4.64    3.99    0.00    0.89
4.12    3013    282956  36      4.56    3.90    0.00    0.89
4.12    3026    282807  34      4.61    3.95    0.00    0.89
4.12    3026    282738  35      4.50    3.84    0.00    0.89
Or an average of 6.2 watts versus 4.6 watts, +35%.
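Recomputing the averages from the PkgWatt samples above gives roughly 6.15 W versus 4.60 W (about +34%), consistent with the quoted round figures:

```shell
# Average the PkgWatt columns of the stock (5 samples) and reverted
# (7 samples) turbostat runs shown above.
awk 'BEGIN {
  n = split("5.79 6.06 6.35 6.26 6.27", stock, " ")
  for (i = 1; i <= n; i++) s += stock[i]
  m = split("4.57 4.59 4.71 4.64 4.56 4.61 4.50", revert, " ")
  for (i = 1; i <= m; i++) r += revert[i]
  printf "stock: %.2f W average; revert: %.2f W average\n", s / n, r / m
}'
```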
Several other tests had undetectable power differences between kernels, but I did not go back and re-test with the improved task 2.
Thanks, rui
For duration only tests, turbostat is not run at the same time.
My grub line:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
A typical pstate tracer command (with the script copied to the directory where I run this stuff):
sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
Can you try one test that keeps all the current settings and changes the irq affinity of the disk/network-card to 0xfff, to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
I did not try this yet.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting callbacks from the scheduler. Can we check whether the scheduler woke up on those CPUs? We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
Thanks, Srinivas
On Sun, 2022-02-13 at 10:54 -0800, Doug Smythies wrote:
Hi Rui,
On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@intel.com> wrote:
On 2022.02.09 14:23 Doug wrote:
On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@intel.com> wrote:
On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
Since kernel 5.16-rc4 and commit b50db7095fe002fa3e16605546cba66bf1b68a3e ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"), there are now occasions where the times between calls to the driver can be hundreds of seconds, which can result in the CPU frequency being left unnecessarily high for extended periods.
From the number of clock cycles executed during these long gaps, one can tell that the CPU has been running code, yet the driver was never called.
Attached are some graphs from some trace data acquired using intel_pstate_tracer.py, where one can observe an idle system between about 42 and well over 200 seconds of elapsed time, yet CPU10 never gets a call, which would have resulted in reducing its pstate request, until an elapsed time of 167.616 seconds, 126 seconds after the previous call. The CPU frequency never does go to minimum.
For reference, a similar CPU frequency graph is also attached, with the commit reverted. The CPU frequency drops to minimum over about 10 or 15 seconds.
Commit b50db7095fe0 essentially disables the clocksource watchdog, which doesn't have much to do with the cpufreq code.
One thing I can think of is that, without the patch, there is a periodic clocksource watchdog timer running every 500 ms, and it loops over all CPUs in turn. Your HW has 12 CPUs (from the graph), so each CPU will get a timer (HW timer interrupt backed) every 6 seconds. Could this affect the cpufreq governor's work flow? (I just quickly read some cpufreq code, and it seems there is irq_work/workqueue involved.)
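The 6-second figure can be sketched numerically with a toy model (illustration only, not kernel code): a single 500 ms watchdog timer that hops to the next CPU after each firing gives every CPU on a 12-CPU system one HW-timer-backed wakeup per 6 seconds.

```python
# Toy model (not kernel code) of the clocksource watchdog behavior described
# above: one 500 ms timer that moves to the next CPU after each firing, so
# on a 12-CPU system each CPU gets a hardware-timer wakeup every 6 seconds.
def watchdog_visits(num_cpus=12, period_ms=500, total_ms=30000):
    visits = {cpu: [] for cpu in range(num_cpus)}
    cpu, t = 0, 0
    while t < total_ms:
        t += period_ms          # timer fires every 500 ms
        visits[cpu].append(t)   # ... on one CPU at a time
        cpu = (cpu + 1) % num_cpus
    return visits

visits = watchdog_visits()
gaps = [b - a for times in visits.values() for a, b in zip(times, times[1:])]
print(max(gaps))  # 6000 ms: each CPU is visited once per 6 seconds
```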
6 seconds is the longest duration I had ever seen on this processor before commit b50db7095fe0.
I said "the times between calls to the driver have never exceeded 10 seconds" originally, but that involved other processors.
I also did longer, 9000 second tests:
For a reverted kernel the driver was called 131,743 times, and 0 times was the duration longer than 6.1 seconds.
For a non-reverted kernel the driver was called 110,241 times; 1,397 times the duration was longer than 6.1 seconds, and the maximum duration was 303.6 seconds.
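Restating those 9000-second test counts as rates makes the comparison concrete (simple arithmetic on the numbers above):

```python
# The 9000-second test results above, restated as rates.
revert = {"calls": 131_743, "long": 0}      # "long" = durations > 6.1 s
stock  = {"calls": 110_241, "long": 1_397}

for name, d in (("revert", revert), ("stock", stock)):
    pct = 100.0 * d["long"] / d["calls"]
    print(f"{name}: {d['calls']} driver calls, {pct:.2f}% with duration > 6.1 s")
```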
Thanks for the data, which shows it is related to the removal of the clocksource watchdog timers. Under this specific configuration, the cpufreq work flow has some dependence on those watchdog timers.
Also, could you share your kernel config, boot messages and some system settings, like the tickless mode, so that other people can try to reproduce it? Thanks.
I steal the kernel configuration file from the Ubuntu mainline PPA [1], what they call "lowlatency", or 1000Hz tick. I make these changes before compile:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I also send you the config and dmesg files in an off-list email.
This is a test of an idle system with very low periodic loads. My test computer has no GUI and very few services running. Notice that I have not used the word "regression" yet in this thread, because I don't know for certain that it is one. In the end, we don't care about CPU frequency, we care about wasting energy. It is definitely a change, and I am able to measure small increases in energy use, but this is all at the low end of the power curve.
Please keep the above statement in mind. The differences were rather minor. Since Rui asks for data below, I have tried to find better examples.
What do you use to measure the energy use? And what difference do you observe?
I use both turbostat and a processor power monitoring tool of my own. Don't get me wrong, I love turbostat, but it has become somewhat heavy (lots of code per interval) over recent years. My own utility has minimal code per interval, only uses pre-allocated memory with no disk or terminal interaction during the sampling phase, resulting in minimal system perturbation due to itself.
So far I have not found a significant example of increased power use, but I also have not looked very hard.
I looked a little harder for this reply, searching all single CPU loads for a few work/sleep frequencies (73, 113, 211, 347, 401 hertz) fixed work packet per work interval.
During any test, many monitoring tools might shorten durations. For example if I run turbostat, say:
sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
Well, yes, then the maximum duration would be 2.5 seconds, because turbostat wakes up each CPU to inquire about things, causing a call to the CPU scaling driver. (I tested this, for about 900 seconds.)
For my power tests I use a sample interval of >= 300 seconds.
So you use something like "turbostat sleep 900" for power test,
Typically I use 300 seconds for turbostat for this work. It was only the other day, that I saw a duration of 302 seconds, so yes I should try even longer, but it is becoming a tradeoff between experiments taking many many hours and time available.
and the RAPL Energy counters show the power difference?
Yes.
Can you paste the turbostat output both w/ and w/o the watchdog?
Workflow:
Task 1: Purpose: main periodic load.
411 hertz work/sleep frequency and about 26.6% load at 4.8 GHz, max turbo (it would limit out at 100% duty cycle at about pstate 13).
Note: this is a higher load than I was using when I started this email thread.
Task 2: Purpose: To assist in promoting manifestation of the issue of potential concern with commit b50db7095fe0.
A short burst of work (about 30 milliseconds @ max turbo, longer at lower frequencies), then sleep for 45 seconds. Say, almost 0 load at 0.022 hertz. Greater than or equal to 30 milliseconds at full load, ensures that the intel_pstate driver will be called at least 2 or 3 times, raising the requested pstate for that CPU.
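A minimal sketch of such a burst/sleep task (an assumed shape; Doug's actual workload generator is not shown in the thread) could look like:

```python
# Minimal sketch of "Task 2": a ~30 ms burst of busy work to pull the CPU's
# requested pstate up, followed by a long sleep during which the request may
# be left high. (Assumed shape; not the actual tool used in this thread.)
import time

def burst(ms=30):
    """Spin for roughly `ms` milliseconds of busy work."""
    end = time.perf_counter() + ms / 1000.0
    iterations = 0
    while time.perf_counter() < end:
        iterations += 1
    return iterations

def task2(cycles, sleep_s=45.0):
    for _ in range(cycles):
        burst(30)            # raises the requested pstate on this CPU
        time.sleep(sleep_s)  # ... which may then stay high while idle
```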
Task 3: Purpose: Acquire power data with minimum system perturbation due to this very monitoring task.
Both turbostat and my own power monitor were used, never at the same time (unless just to prove they reported the same power). This turbostat command:
turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
and my program also sampled at 300 seconds per sample, typically 5 samples, starting at least 1/2 hour after boot.
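For reference, a minimal package power monitor in the same spirit as the "cpu_power_mon" tool mentioned above (a hedged sketch assuming the Linux powercap/RAPL sysfs interface; the real tool is not shown) could be:

```python
# Sketch of a minimal RAPL-based package power monitor: read the powercap
# energy counter twice and divide by the interval. (Assumed design, not the
# actual cpu_power_mon; the sysfs path requires an Intel RAPL-capable CPU.)
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0 energy, in uJ

def read_energy_uj(path=RAPL):
    with open(path) as f:
        return int(f.read())

def average_watts(interval_s=300.0, path=RAPL):
    e0 = read_energy_uj(path)
    time.sleep(interval_s)   # no disk/terminal I/O during the sample
    e1 = read_energy_uj(path)
    return (e1 - e0) / 1e6 / interval_s  # counter wraparound ignored here
```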
Kernels:
stock: 5.17-rc3 as is.
revert: 5.17-rc3 with b50db7095fe0 reverted.
Test 1: no-hwp/active/powersave:
Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
average: 4.7 watts (+9%)
minimum: 4.3 watts
maximum: 4.9 watts

Revert: 5 samples @ 300 seconds per sample:
average: 4.3 watts
minimum: 4.2 watts
maximum: 4.5 watts
Test 2: no-hwp/passive/schedutil:
In the beginning active/powersave was used because it is the easiest (at least for me) to understand and interpret the intel_pstate_tracer.py results. Long durations are common in passive mode, because something higher up can decide not to call the driver because nothing changed. Very valuable lost information in my opinion.
I didn't figure it out for a while, but schedutil is bi-stable with this workflow, depending on whether it approaches steady state from a higher or lower previous load (i.e. hysteresis). With either kernel it might run for hours and hours at an average of 6.1 watts or 4.9 watts. This difference dominates, so trying to reveal whether there is any contribution from the commit of this thread is useless.
Test 3: Similar workflow as test1, but 347 Hertz and a little less work per work period.
Task 2 was changed to use more CPUs to try to potentially amplify manifestation of the effect.
Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
average: 4.2 watts (+31%)
minimum: 3.5 watts
maximum: 4.9 watts

Revert: 5 samples @ 300 seconds per sample:
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.2 watts
Re-do test 1, but with the improved task 2:
Turbostat:
Stock:
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  CorWatt  GFXWatt  RAMWatt
4.09   3335     282024  36      5.79     5.11     0.00     0.89
4.11   3449     283286  34      6.06     5.40     0.00     0.89
4.11   3504     284532  35      6.35     5.70     0.00     0.89
4.11   3493     283908  35      6.26     5.60     0.00     0.89
4.11   3499     284243  34      6.27     5.62     0.00     0.89
Revert:
doug@s19:~/freq-scalers/long_dur$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  CorWatt  GFXWatt  RAMWatt
4.12   3018     283114  34      4.57     3.91     0.00     0.89
4.12   3023     283482  34      4.59     3.94     0.00     0.89
4.12   3017     282609  34      4.71     4.05     0.00     0.89
4.12   3013     282898  34      4.64     3.99     0.00     0.89
4.12   3013     282956  36      4.56     3.90     0.00     0.89
4.12   3026     282807  34      4.61     3.95     0.00     0.89
4.12   3026     282738  35      4.50     3.84     0.00     0.89
Or average of 6.2 watts versus 4.6 watts, +35%.
Several other tests had undetectable power differences between kernels, but I did not go back and re-test with the improved task 2.
Thanks, rui
For duration only tests, turbostat is not run at the same time.
My grub line:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
A typical pstate tracer command (with the script copied to the directory where I run this stuff:):
sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
Can you try one test that keep all the current setting and change the irq affinity of disk/network-card to 0xfff to let interrupts from them be distributed to all CPUs?
I am willing to do the test, but I do not know how to change the irq affinity.
I may have said that too soon. I used to "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a device) to let interrupts be distributed to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand. It seems recent kernels only support one-CPU irq affinity.
You can still try that command, though it may not work.
I did not try this yet.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
Hi Srinivas,
On Mon, Feb 14, 2022 at 7:17 AM srinivas pandruvada srinivas.pandruvada@linux.intel.com wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL.
No. Here is the relevant excerpt from the kernel config file:
#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem
Here we are getting the callback from the scheduler. Can we check whether the scheduler woke up on those CPUs?
As far as I can determine, yes. But note that I am unfamiliar with this area.
We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
I do not use trace-cmd and had never heard of KernelShark. Nor do I actually run any desktop GUI on Linux, only servers. I attempted to acquire what you wanted with primitive trace commands.
Workflow: All as before (the 347 Hertz work/sleep frequency test), with a 20 minute trace in the middle. Powers were monitored again to confirm differences, and just in case trace itself modified the system response (it didn't).
Power averages (excluding the sample where the trace file was being written to disk):
Stock: 4.1 watts (+37%)
Revert: 3.0 watts
I only looked at a few of the CPUs' data files: the largest, the smallest, and a mid-range file size, excluding the main working CPU. Maximum times between "sched_wakeup" events, in seconds:
Stock:
CPU 2: 4.0
CPU 4: 4.0
CPU 9: 1.0

Revert:
CPU 1: 2.0
CPU 2: 4.0
CPU 7: 1.54
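The "maximum time between sched_wakeup" figures above can be computed from a per-CPU list of event timestamps with something like the following (assumed post-processing, not necessarily the method actually used):

```python
# Largest gap between consecutive events, given one CPU's sorted-or-unsorted
# list of sched_wakeup timestamps (in seconds).
def max_gap(timestamps):
    ts = sorted(timestamps)
    return max((b - a for a, b in zip(ts, ts[1:])), default=0.0)

print(max_gap([0.0, 1.0, 5.0, 6.5]))  # 4.0
```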
I do not know if other stuff in the files might be odd or not.
... Doug
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting the callback from the scheduler. Can we check whether the scheduler woke up on those CPUs? We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:
CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small bits of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and a call to a cpufreq/intel_pstate callback. But calling it for every class would be too many callbacks, and I am not sure we can even call it for the "stop" class, which these migration threads are using. "
Following this direction, I made a hacky debug patch which should help to restore the previous behavior.
Doug, could you help to try it? thanks
It basically tries to make sure the cpufreq-update-util callback gets called timely, even on a silent system with very few interrupts (even from the tick).
Thanks, Feng
From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Tue, 22 Feb 2022 22:59:00 +0800
Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
 cpufreq_update_util callback being called timely in silent system
---
 kernel/sched/idle.c  | 10 ++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..cc538acb3f1a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies(expire))) {
+		idle_update_util();
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+#endif
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..2a8d87988d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
 	if (data)
 		data->func(data, rq_clock(rq), flags);
 }
+
+static inline void idle_update_util(void)
+{
+	struct update_util_data *data;
+	struct rq *rq = cpu_rq(raw_smp_processor_id());
+
+	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
+						  cpu_of(rq)));
+	if (data)
+		data->func(data, rq_clock(rq), 0);
+}
+
 #else
 static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting the callback from the scheduler. Can we check whether the scheduler woke up on those CPUs? We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:
CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small bits of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
But calling it for every class would be too many callbacks, and I am not sure we can even call it for the "stop" class, which these migration threads are using. "
Calling it for RT/deadline may not be a bad idea.
schedutil takes these classes into account when computing the utilization now (see effective_cpu_util()), so doing callbacks only for CFS seems insufficient.
Another way to avoid the issue at hand may be to prevent entering deep idle via PM QoS if the CPUs are running at high frequencies.
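A rough userspace analogue of the PM QoS idea (illustrative only; the actual proposal would act from the kernel, per-CPU and conditionally on frequency): holding /dev/cpu_dma_latency open with a small value restricts deep idle states system-wide for as long as the file descriptor stays open.

```python
# Userspace PM QoS sketch: writing a small latency value to
# /dev/cpu_dma_latency (and keeping the fd open) prevents the cpuidle
# governor from selecting idle states with a higher exit latency.
# Opening the device requires root; this is illustration, not the proposal.
import os
import struct

def pack_latency(max_usecs):
    # The cpu_dma_latency interface expects a native 32-bit integer.
    return struct.pack("i", max_usecs)

def hold_cpu_latency(max_usecs=20):
    fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)  # needs root
    os.write(fd, pack_latency(max_usecs))
    return fd  # deep idle stays restricted until os.close(fd)
```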
Following this direction, I made a hacky debug patch which should help to restore the previous behavior.
Doug, could you help to try it? thanks
It basically tries to make sure the cpufreq-update-util callback gets called timely, even on a silent system with very few interrupts (even from the tick).
Thanks, Feng
From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Tue, 22 Feb 2022 22:59:00 +0800
Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
 cpufreq_update_util callback being called timely in silent system
 kernel/sched/idle.c  | 10 ++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..cc538acb3f1a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
Why? Doesn't this affect the other cpufreq governors?
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies(expire))) {
+		idle_update_util();
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+#endif
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..2a8d87988d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
 	if (data)
 		data->func(data, rq_clock(rq), flags);
 }
+
+static inline void idle_update_util(void)
+{
+	struct update_util_data *data;
+	struct rq *rq = cpu_rq(raw_smp_processor_id());
+
+	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
+						  cpu_of(rq)));
+	if (data)
+		data->func(data, rq_clock(rq), 0);
+}
+
 #else
 static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
The compiler complains:
kernel/sched/idle.c: In function ‘do_idle’:
./include/linux/typecheck.h:12:18: warning: comparison of distinct pointer types lacks a cast
   12 |   (void)(&__dummy == &__dummy2); \
      |                   ^~
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
      |                                          ^
./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
  106 |   typecheck(unsigned long, b) && \
      |   ^~~~~~~~~
./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
  154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
      |                                   ^~~~~~~~~~
kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
  274 |  if (unlikely(time_is_before_jiffies(expire))) {
Test 1: 347 Hertz work/sleep frequency on one CPU while others do virtually no load, but at around 0.02 hertz aggregate. Processor package power (Watts):
Kernel 5.17-rc3 + Feng patch (6 samples at 300 seconds per sample):
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.3 watts
Kernel 5.17-rc3 (Stock) re-stated:
Stock: 5 samples @ 300 seconds per sample:
average: 4.2 watts (+31%)
minimum: 3.5 watts
maximum: 4.9 watts
Kernel 5.17-rc3 with b50db7095fe0 reverted (Revert), re-stated:
Revert: 5 samples @ 300 seconds per sample:
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.2 watts
I also ran intel_pstate_tracer.py for 5 hours 33 minutes (20,000 seconds) on an idle system looking for long durations. (just being thorough.) There were 12 occurrences of durations longer than 6.1 seconds. Max: 6.8 seconds. (O.K.) I had expected about 3 seconds max, based on my understanding of the patch code.
Old results restated, but corrected for a stupid mistake:
Stock: 1712 occurrences of durations longer than 6.1 seconds Max: 303.6 seconds. (Bad)
Revert: 3 occurrences of durations longer than 6.1 seconds Max: 6.5 seconds (O.K.)
On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting the callback from the scheduler. Can we check whether the scheduler woke up on those CPUs? We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:
CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small bits of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
But calling it for every class would be too many callbacks, and I am not sure we can even call it for the "stop" class, which these migration threads are using. "
Calling it for RT/deadline may not be a bad idea.
schedutil takes these classes into account when computing the utilization now (see effective_cpu_util()), so doing callbacks only for CFS seems insufficient.
Another way to avoid the issue at hand may be to prevent entering deep idle via PM QoS if the CPUs are running at high frequencies.
Following this direction, I made a hacky debug patch which should help to restore the previous behavior.
Doug, could you help to try it? thanks
It basically tries to make sure the cpufreq-update-util callback gets called timely, even on a silent system with very few interrupts (even from the tick).
Thanks, Feng
From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Tue, 22 Feb 2022 22:59:00 +0800
Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
 cpufreq_update_util callback being called timely in silent system
 kernel/sched/idle.c  | 10 ++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..cc538acb3f1a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
Why? Doesn't this affect the other cpufreq governors?
Yes, I have the same question.
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies(expire))) {
+		idle_update_util();
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+#endif
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..2a8d87988d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
 	if (data)
 		data->func(data, rq_clock(rq), flags);
 }
+
+static inline void idle_update_util(void)
+{
+	struct update_util_data *data;
+	struct rq *rq = cpu_rq(raw_smp_processor_id());
+
+	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
+						  cpu_of(rq)));
+	if (data)
+		data->func(data, rq_clock(rq), 0);
+}
+
 #else
 static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive p-state. I think you already proved that with your results below showing 1W less average power ("Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
So we need to think about a proper solution for this.
Thanks, Srinivas
The compiler complains:
kernel/sched/idle.c: In function ‘do_idle’:
./include/linux/typecheck.h:12:18: warning: comparison of distinct pointer types lacks a cast
   12 |   (void)(&__dummy == &__dummy2); \
      |                   ^~
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
      |                                          ^
./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
  106 |   typecheck(unsigned long, b) && \
      |   ^~~~~~~~~
./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
  154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
      |                                   ^~~~~~~~~~
kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
  274 |  if (unlikely(time_is_before_jiffies(expire))) {
Test 1: 347 Hertz work/sleep frequency on one CPU while others do virtually no load, but at around 0.02 hertz aggregate. Processor package power (Watts):
Kernel 5.17-rc3 + Feng patch (6 samples at 300 seconds per sample):
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.3 watts
Kernel 5.17-rc3 (Stock) re-stated:
Stock: 5 samples @ 300 seconds per sample:
average: 4.2 watts (+31%)
minimum: 3.5 watts
maximum: 4.9 watts
Kernel 5.17-rc3 with b50db7095fe0 reverted (Revert), re-stated:
Revert: 5 samples @ 300 seconds per sample:
average: 3.2 watts
minimum: 3.1 watts
maximum: 3.2 watts
I also ran intel_pstate_tracer.py for 5 hours 33 minutes (20,000 seconds) on an idle system looking for long durations. (just being thorough.) There were 12 occurrences of durations longer than 6.1 seconds. Max: 6.8 seconds. (O.K.) I had expected about 3 seconds max, based on my understanding of the patch code.
Old results restated, but corrected for a stupid mistake:
Stock: 1712 occurrences of durations longer than 6.1 seconds Max: 303.6 seconds. (Bad)
Revert: 3 occurrences of durations longer than 6.1 seconds Max: 6.5 seconds (O.K.)
On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting the callback from the scheduler. Can we check whether the scheduler woke up on those CPUs? We can run "trace-cmd record -e sched" and check in KernelShark whether there are similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:
CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small bits of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
But calling it for every class would be too many callbacks, and I am not sure we can even call it for the "stop" class, which these migration threads are using. "
Calling it for RT/deadline may not be a bad idea.
schedutil takes these classes into account when computing the utilization now (see effective_cpu_util()), so doing callbacks only for CFS seems insufficient.
Another way to avoid the issue at hand may be to prevent entering deep idle via PM QoS if the CPUs are running at high frequencies.
Following this direction, I made a hacky debug patch which should help to restore the previous behavior.
Doug, could you help to try it? thanks
It basically tries to make sure the cpufreq-update-util callback gets called timely, even on a silent system with very few interrupts (even from the tick).
Thanks, Feng
From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Tue, 22 Feb 2022 22:59:00 +0800
Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
 cpufreq_update_util callback being called timely in silent system
 kernel/sched/idle.c  | 10 ++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..cc538acb3f1a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
Why? Doesn't this affect the other cpufreq governors?
Yes, I have the same question.
+ expire = __this_cpu_read(last_util_update_time) + HZ * 3; + if (unlikely(time_is_before_jiffies(expire))) { + idle_update_util(); + __this_cpu_write(last_util_update_time, get_jiffies_64()); + } +#endif
/* * If the arch has a polling bit, we maintain an invariant: * diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0e66749486e7..2a8d87988d1f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) if (data) data->func(data, rq_clock(rq), flags); }
+static inline void idle_update_util(void) +{ + struct update_util_data *data; + struct rq *rq = cpu_rq(raw_smp_processor_id());
+ data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data, + cpu_of(rq))); + if (data) + data->func(data, rq_clock(rq), 0); +}
 #else
 static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif /* CONFIG_CPU_FREQ */
--
On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive P-state. I think you already proved that with your results below showing 1 W less average power ("Kernel 5.17-rc3 - Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
Thanks, Feng
So we need to think about a proper solution for this.
Thanks, Srinivas
Hi Rafael,
On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting callback from scheduler. Can we check that if scheduler woke up on those CPUs? We can run "trace-cmd -e sched" and check in kernel shark if there is similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:

CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small pieces of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
Thanks for checking this!
I understand the time limit as we are approaching the close of 5.17, but still I don't think reverting commit b50db7095fe0 is the best solution as:
* b50db7095fe0 is not just an optimization, but solves some real problems found in servers from big CSPs (Cloud Service Providers) and data center server rooms.
* IMHO, b50db7095fe0 itself hasn't done anything wrong.
* The problem Doug found is a rare case, though it is expected behavior, as shown in the existing comments of cfs_rq_util_change():
	/*
	 * There are a few boundary cases this might miss but it should
	 * get called often enough that that should (hopefully) not be
	 * a real problem.
	 *
	 * It will not get called when we go idle, because the idle
	 * thread is a different class (!fair), nor will the utilization
	 * number include things like RT tasks.
	 *
	 * As is, the util number is not freq-invariant (we'd have to
	 * implement arch_scale_freq_capacity() for that).
	 *
	 * See cpu_util_cfs().
	 */
	cpufreq_update_util(rq, flags);
As this happens with HWP-disabled case and a very calm system, can we find a proper solution in 5.17/5.18 or later, and cc stable?
But calling it for every class will be too many callbacks, and I'm not sure we can even call it for the "stop" class, which these migration threads are using. "
Calling it for RT/deadline may not be a bad idea.
schedutil takes these classes into account when computing the utilization now (see effective_cpu_util()), so doing callbacks only for CFS seems insufficient.
Another way to avoid the issue at hand may be to prevent entering deep idle via PM QoS if the CPUs are running at high frequencies.
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+#ifdef CONFIG_X86_INTEL_PSTATE
Why? Doesn't this affect the other cpufreq governors?
You are right, this should be a generic thing.
I tried to limit its affect to other cases, though it's not necessary for a debug patch.
Thanks, Feng
Rafael,
On Tue, Feb 22 2022 at 19:04, Rafael J. Wysocki wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small pieces of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
No. This is just papering over the problem. Just because the clocksource watchdog has the side effect of making cpufreq "work", does not make it a prerequisite for cpufreq. The commit unearthed a problem in the cpufreq code, so it needs to be fixed there.
Even if we'd revert it then, you can produce the same effect by adding 'tsc=reliable' to the kernel command line which disables the clocksource watchdog too. The commit is there to deal with modern hardware without requiring people to add 'tsc=reliable' to the command line.
Thanks,
tglx
On Wed, Feb 23, 2022 at 3:49 AM Feng Tang feng.tang@intel.com wrote:
Hi Rafael,
On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
Hi Doug,
I think you use CONFIG_NO_HZ_FULL. Here we are getting callback from scheduler. Can we check that if scheduler woke up on those CPUs? We can run "trace-cmd -e sched" and check in kernel shark if there is similar gaps in activity.
Srinivas analyzed the scheduler trace data from trace-cmd, and thought it is related to the cpufreq callback not being called timely from scheduling events:
" I mean we ignore the callback when the target CPU is not a local CPU as we have to do IPI to adjust MSRs. This will happen many times when sched_wake will wake up a new CPU for the thread (we will get a callack for the target) but once the remote thread start executing "sched_switch", we will get a callback on local CPU, so we will adjust frequencies (provided 10ms interval from the last call).
From the trace file I see the scenario where it took 72 sec between two updates:

CPU 2
34412.597161 busy=78 freq=3232653
34484.450725 busy=63 freq=2606793
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small pieces of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
It is unlikely that we'll get a proper fix before -rc7 and we still need to test it properly.
Thanks for checking this!
I understand the time limit as we are approaching the close of 5.17, but still I don't think reverting commit b50db7095fe0 is the best solution as:
* b50db7095fe0 is not just an optimization, but solves some real problems found in servers from big CSPs (Cloud Service Providers) and data center server rooms.
* IMHO, b50db7095fe0 itself hasn't done anything wrong.
* The problem Doug found is a rare case, though it is expected behavior, as shown in the existing comments of cfs_rq_util_change():
	/*
	 * There are a few boundary cases this might miss but it should
	 * get called often enough that that should (hopefully) not be
	 * a real problem.
	 *
	 * It will not get called when we go idle, because the idle
	 * thread is a different class (!fair), nor will the utilization
	 * number include things like RT tasks.
	 *
	 * As is, the util number is not freq-invariant (we'd have to
	 * implement arch_scale_freq_capacity() for that).
	 *
	 * See cpu_util_cfs().
	 */
	cpufreq_update_util(rq, flags);
As this happens with HWP-disabled case and a very calm system, can we find a proper solution in 5.17/5.18 or later, and cc stable?
We can.
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive P-state. I think you already proved that with your results below showing 1 W less average power ("Kernel 5.17-rc3 - Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
In principle, PM QoS can be used for that from intel_pstate, but there is a problem with that approach, because it is not obvious what value to give to it and it is not always guaranteed to work (say when all of the C-states except for C1 are disabled).
So it looks like a new mechanism is needed for that.
On Wed, Feb 23, 2022 at 10:40 AM Thomas Gleixner tglx@linutronix.de wrote:
Rafael,
On Tue, Feb 22 2022 at 19:04, Rafael J. Wysocki wrote:
On Tue, Feb 22, 2022 at 8:41 AM Feng Tang feng.tang@intel.com wrote:
There is periodic activity in between, related to active load balancing in the scheduler (since the last frequency was higher, these small pieces of work will also run at the higher frequency). But those threads are not CFS class, so the scheduler callback will not be called for them.
So removing the patch removed a trigger which would have caused a sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
And so this behavior needs to be restored for the time being which means reverting the problematic commit for 5.17 if possible.
No. This is just papering over the problem. Just because the clocksource watchdog has the side effect of making cpufreq "work", does not make it a prerequisite for cpufreq. The commit unearthed a problem in the cpufreq code, so it needs to be fixed there.
Even if we'd revert it then, you can produce the same effect by adding 'tsc=reliable' to the kernel command line which disables the clocksource watchdog too. The commit is there to deal with modern hardware without requiring people to add 'tsc=reliable' to the command line.
All right (but I'll remember this exception).
On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive P-state. I think you already proved that with your results below showing 1 W less average power ("Kernel 5.17-rc3 - Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
I searched the cfs/deadline/rt code; all three classes have places that call cpufreq_update_util(), either in enqueue/dequeue or when changing running bandwidth. So I think entering idle also means the system load is undergoing a big change, and it is worth calling the cpufreq callback.
In principle, PM QoS can be used for that from intel_pstate, but there is a problem with that approach, because it is not obvious what value to give to it and it is not always guaranteed to work (say when all of the C-states except for C1 are disabled).
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
Thanks, Feng
On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive P-state. I think you already proved that with your results below showing 1 W less average power ("Kernel 5.17-rc3 - Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
Every four seconds? Could you please post your .config?
Thanx, Paul
I searched the cfs/deadline/rt code; all three classes have places that call cpufreq_update_util(), either in enqueue/dequeue or when changing running bandwidth. So I think entering idle also means the system load is undergoing a big change, and it is worth calling the cpufreq callback.
In principle, PM QoS can be used for that from intel_pstate, but there is a problem with that approach, because it is not obvious what value to give to it and it is not always guaranteed to work (say when all of the C-states except for C1 are disabled).
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
Thanks, Feng
On 2022.02.24 04:08:30 Paul E. McKenney wrote:
On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
Every four seconds? Could you please post your .config?
Thanx, Paul
I steal the kernel config from the Ubuntu mainline PPA. See also earlier on this thread:
https://lore.kernel.org/linux-pm/CAAYoRsXkyWf0vmEE2HvjF6pzCC4utxTF=7AFx1PJv4...
but relevant part copied here:
I steal the kernel configuration file from the Ubuntu mainline PPA [1], what they call "lowlatency", or 1000Hz tick. I make these changes before compile:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I [will] also send you the config and dmesg files in an off-list email.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
I put the same one I sent Feng on my web site where I was sharing stuff with Srinivas (coded to avoid the barrage of bots):
double u double u double u dot smythies dot com/~doug/linux/s18/hwp/srinivas/long_dur/
... Doug
On Thu, Feb 24, 2022 at 08:29:22AM -0800, Doug Smythies wrote:
On 2022.02.24 04:08:30 Paul E. McKenney wrote:
On Thu, Feb 24, 2022 at 04:08:30PM +0800, Feng Tang wrote:
On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
Every four seconds? Could you please post your .config?
Thanx, Paul
I steal the kernel config from the Ubuntu mainline PPA. See also earlier on this thread:
https://lore.kernel.org/linux-pm/CAAYoRsXkyWf0vmEE2HvjF6pzCC4utxTF=7AFx1PJv4...
but relevant part copied here:
I steal the kernel configuration file from the Ubuntu mainline PPA [1], what they call "lowlatency", or 1000Hz tick. I make these changes before compile:
scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS
I [will] also send you the config and dmesg files in an off-list email.
[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
I put the same one I sent Feng on my web site where I was sharing stuff with Srinivas (coded to avoid the barrage of bots):
double u double u double u dot smythies dot com/~doug/linux/s18/hwp/srinivas/long_dur/
Thank you!
I don't see CONFIG_FAST_NO_HZ=y in your .config, so that is not the reason for your every-four-second timers. ;-)
(CONFIG_FAST_NO_HZ is being removed, FYI.)
Thanx, Paul
Hi Paul,
On Thu, Feb 24, 2022 at 06:44:23AM -0800, Paul E. McKenney wrote: [...]
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
Every four seconds? Could you please post your .config?
Aha, I didn't make it clear: the timer interrupt was not always coming every 4 seconds; rather, when the system is silent, the maximum interval between 2 timer interrupts was 4 seconds.
When I initially checked this, I wondered whether the timer interrupts were too few on the system, so I used Doug's config and tried to make my desktop silent (like disabling the GUI). Following is some trace_printk log, though I figured out later that when the idle thread is running, the idle class's scheduler tick will not help, as it doesn't call the cpufreq callback.
<idle>-0 [009] d.h1. 235.980053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.981054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.982053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.983053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.984053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.985053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.986054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.987054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.988054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.989054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.990054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.991053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.992054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.993054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.994054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 236.331126: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 236.460130: hrtimer_interrupt: enter
<idle>-0 [009] d.s5. 236.460147: intel_pstate_update_util: old_state=48 new=27
<idle>-0 [009] d.h1. 238.380130: hrtimer_interrupt: enter
<idle>-0 [009] d.s5. 238.380147: intel_pstate_update_util: old_state=27 new=12
<idle>-0 [009] d.h1. 240.331133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 240.364133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 244.331135: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 248.331139: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 252.331138: hrtimer_interrupt: enter
<...>-1167 [009] d.h.. 254.860056: hrtimer_interrupt: enter
snapd-1128 [009] d.h.. 254.861054: hrtimer_interrupt: enter
snapd-1128 [009] d.h.. 254.862055: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.863056: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.864056: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.865055: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 256.331133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 260.331127: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 264.331135: hrtimer_interrupt: enter
Thanks, Feng
Thanx, Paul
On Fri, Feb 25, 2022 at 08:29:51AM +0800, Feng Tang wrote:
Hi Paul,
On Thu, Feb 24, 2022 at 06:44:23AM -0800, Paul E. McKenney wrote: [...]
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
Every four seconds? Could you please post your .config?
Aha, I didn't make it clear: the timer interrupt was not always coming every 4 seconds; rather, when the system is silent, the maximum interval between 2 timer interrupts was 4 seconds.
When I initially checked this, I wondered whether the timer interrupts were too few on the system, so I used Doug's config and tried to make my desktop silent (like disabling the GUI). Following is some trace_printk log, though I figured out later that when the idle thread is running, the idle class's scheduler tick will not help, as it doesn't call the cpufreq callback.
<idle>-0 [009] d.h1. 235.980053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.981054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.982053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.983053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.984053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.985053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.986054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.987054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.988054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.989054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.990054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.991053: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.992054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.993054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 235.994054: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 236.331126: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 236.460130: hrtimer_interrupt: enter
<idle>-0 [009] d.s5. 236.460147: intel_pstate_update_util: old_state=48 new=27
<idle>-0 [009] d.h1. 238.380130: hrtimer_interrupt: enter
<idle>-0 [009] d.s5. 238.380147: intel_pstate_update_util: old_state=27 new=12
<idle>-0 [009] d.h1. 240.331133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 240.364133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 244.331135: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 248.331139: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 252.331138: hrtimer_interrupt: enter
<...>-1167 [009] d.h.. 254.860056: hrtimer_interrupt: enter
snapd-1128 [009] d.h.. 254.861054: hrtimer_interrupt: enter
snapd-1128 [009] d.h.. 254.862055: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.863056: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.864056: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 254.865055: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 256.331133: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 260.331127: hrtimer_interrupt: enter
<idle>-0 [009] d.h1. 264.331135: hrtimer_interrupt: enter
Thank you for the clarification!
Thanx, Paul
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
On Wed, Feb 23, 2022 at 1:40 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
Hi Doug,
On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
Hi All,
I am about 1/2 way through testing Feng's "hacky debug patch", let me know if I am wasting my time, and I'll abort. So far, it works fine.
This just proves that if you add some callback during long idle, you will reach a less aggressive P-state. I think you already proved that with your results below showing 1 W less average power ("Kernel 5.17-rc3 - Feng patch (6 samples at 300 sec per").
Rafael replied with one possible option. Alternatively, when planning to enter deep idle, set the P-state to min with a callback, like we do in the offline callback.
Yes, if the system is going to idle, it makes sense to go to a lower cpufreq first (which is also what my debug patch will essentially lead to).
Given that cpufreq-util's normal running interval is 10 ms, doing this before entering idle is not a big extra burden.
But this is not related to idle as such, but to the fact that idle sometimes stops the scheduler tick which otherwise would run the cpufreq governor callback on a regular basis.
It is stopping the tick that gets us into trouble, so I would avoid doing it if the current performance state is too aggressive.
I've tried to simulate Doug's environment by using his kconfig and offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I can still see local timer interrupts when there is no active load, with the longest interval between two timer interrupts being 4 seconds. The idle class's task_tick_idle() does nothing, while CFS's task_tick_fair() will in turn call cfs_rq_util_change().
I searched the cfs/deadline/rt code; all three classes have places that call cpufreq_update_util(), either in enqueue/dequeue or when changing running bandwidth. So I think entering idle also means the system load is undergoing a big change, and it is worth calling the cpufreq callback.
In principle, PM QoS can be used for that from intel_pstate, but there is a problem with that approach, because it is not obvious what value to give to it and it is not always guaranteed to work (say when all of the C-states except for C1 are disabled).
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
I've been looking for something relatively low-overhead and taking all of the dependencies into account.
---
 drivers/cpufreq/intel_pstate.c     | 40 ++++++++++++++++++++++++++++---------
 drivers/cpuidle/governors/ladder.c |  6 +++--
 drivers/cpuidle/governors/menu.c   |  2 +
 drivers/cpuidle/governors/teo.c    |  3 ++
 include/linux/cpuidle.h            |  4 +++
 5 files changed, 44 insertions(+), 11 deletions(-)
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -284,6 +284,8 @@ static int menu_select(struct cpuidle_dr
 	if (unlikely(delta < 0)) {
 		delta = 0;
 		delta_tick = 0;
+	} else if (dev->retain_tick) {
+		delta = delta_tick;
 	}
 
 	data->next_timer_ns = delta;
Index: linux-pm/drivers/cpuidle/governors/teo.c =================================================================== --- linux-pm.orig/drivers/cpuidle/governors/teo.c +++ linux-pm/drivers/cpuidle/governors/teo.c @@ -308,6 +308,9 @@ static int teo_select(struct cpuidle_dri cpu_data->time_span_ns = local_clock();
duration_ns = tick_nohz_get_sleep_length(&delta_tick); + if (dev->retain_tick) + duration_ns = delta_tick; + cpu_data->sleep_length_ns = duration_ns;
/* Check if there is any choice in the first place. */ Index: linux-pm/include/linux/cpuidle.h =================================================================== --- linux-pm.orig/include/linux/cpuidle.h +++ linux-pm/include/linux/cpuidle.h @@ -93,6 +93,7 @@ struct cpuidle_device { unsigned int registered:1; unsigned int enabled:1; unsigned int poll_time_limit:1; + unsigned int retain_tick:1; unsigned int cpu; ktime_t next_hrtimer;
@@ -172,6 +173,8 @@ extern int cpuidle_play_dead(void); extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev); static inline struct cpuidle_device *cpuidle_get_device(void) {return __this_cpu_read(cpuidle_devices); } + +extern void cpuidle_update_retain_tick(bool val); #else static inline void disable_cpuidle(void) { } static inline bool cpuidle_not_available(struct cpuidle_driver *drv, @@ -211,6 +214,7 @@ static inline int cpuidle_play_dead(void static inline struct cpuidle_driver *cpuidle_get_cpu_driver( struct cpuidle_device *dev) {return NULL; } static inline struct cpuidle_device *cpuidle_get_device(void) {return NULL; } +static inline void cpuidle_update_retain_tick(bool val) { } #endif
#ifdef CONFIG_CPU_IDLE Index: linux-pm/drivers/cpufreq/intel_pstate.c =================================================================== --- linux-pm.orig/drivers/cpufreq/intel_pstate.c +++ linux-pm/drivers/cpufreq/intel_pstate.c @@ -19,6 +19,7 @@ #include <linux/list.h> #include <linux/cpu.h> #include <linux/cpufreq.h> +#include <linux/cpuidle.h> #include <linux/sysfs.h> #include <linux/types.h> #include <linux/fs.h> @@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set } #endif /* CONFIG_ACPI_CPPC_LIB */
+static void intel_pstate_update_perf_ctl(struct cpudata *cpu) +{ + int pstate = cpu->pstate.current_pstate; + + /* + * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo + * P-states to prevent them from getting back to the high frequency + * right away after getting out of deep idle. + */ + cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate); + wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate)); +} + +static void intel_pstate_update_perf_ctl_wrapper(void *cpu_data) +{ + intel_pstate_update_perf_ctl(cpu_data); +} + +static void intel_pstate_update_perf_ctl_on_cpu(struct cpudata *cpu) +{ + smp_call_function_single(cpu->cpu, intel_pstate_update_perf_ctl_wrapper, + cpu, 1); +} + static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate) { trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu); @@ -1979,8 +2004,7 @@ static void intel_pstate_set_pstate(stru * the CPU being updated, so force the register update to run on the * right CPU. */ - wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, - pstate_funcs.get_val(cpu, pstate)); + intel_pstate_update_perf_ctl_on_cpu(cpu); }
static void intel_pstate_set_min_pstate(struct cpudata *cpu) @@ -2256,7 +2280,7 @@ static void intel_pstate_update_pstate(s return;
cpu->pstate.current_pstate = pstate; - wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate)); + intel_pstate_update_perf_ctl(cpu); }
static void intel_pstate_adjust_pstate(struct cpudata *cpu) @@ -2843,11 +2867,9 @@ static void intel_cpufreq_perf_ctl_updat u32 target_pstate, bool fast_switch) { if (fast_switch) - wrmsrl(MSR_IA32_PERF_CTL, - pstate_funcs.get_val(cpu, target_pstate)); + intel_pstate_update_perf_ctl(cpu); else - wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, - pstate_funcs.get_val(cpu, target_pstate)); + intel_pstate_update_perf_ctl_on_cpu(cpu); }
static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy, @@ -2857,6 +2879,8 @@ static int intel_cpufreq_update_pstate(s int old_pstate = cpu->pstate.current_pstate;
target_pstate = intel_pstate_prepare_request(cpu, target_pstate); + cpu->pstate.current_pstate = target_pstate; + if (hwp_active) { int max_pstate = policy->strict_target ? target_pstate : cpu->max_perf_ratio; @@ -2867,8 +2891,6 @@ static int intel_cpufreq_update_pstate(s intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch); }
- cpu->pstate.current_pstate = target_pstate; - intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH : INTEL_PSTATE_TRACE_TARGET, old_pstate);
Index: linux-pm/drivers/cpuidle/governors/ladder.c =================================================================== --- linux-pm.orig/drivers/cpuidle/governors/ladder.c +++ linux-pm/drivers/cpuidle/governors/ladder.c @@ -61,10 +61,10 @@ static inline void ladder_do_selection(s * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU - * @dummy: not used + * @stop_tick: Whether or not to stop the scheduler tick */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev, bool *dummy) + struct cpuidle_device *dev, bool *stop_tick) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -73,6 +73,8 @@ static int ladder_select_state(struct cp s64 latency_req = cpuidle_governor_latency_req(dev->cpu); s64 last_residency;
+ *stop_tick = !dev->retain_tick; + /* Special case when user has set very strict latency requirement */ if (unlikely(latency_req == 0)) { ladder_do_selection(dev, ldev, last_idx, 0);
On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
Hi Rafael,
O.K., you gave fair warning.
The patch applied fine. It does not compile for me. The function cpuidle_update_retain_tick does not exist. Shouldn't it be somewhere in cpuidle.c? I used the function cpuidle_disable_device as a template for searching and comparing.
Because all of my baseline results are with kernel 5.17-rc3, that is what I am still using.
Error:
ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl':
intel_pstate.c:(.text+0x2520): undefined reference to `cpuidle_update_retain_tick'
... Doug
On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
Hi Rafael,
O.K., you gave fair warning.
The patch applied fine. It does not compile for me. The function cpuidle_update_retain_tick does not exist. Shouldn't it be somewhere in cpuidle.c? I used the function cpuidle_disable_device as a template for searching and comparing.
Because all of my baseline results are with kernel 5.17-rc3, that is what I am still using.
Error: ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl': intel_pstate.c:(.text+0x2520): undefined reference to `cpuidle_update_retain_tick'
Same here, seems the cpuidle_update_retain_tick()'s implementation is missing.
Thanks, Feng
... Doug
On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
Hi Rafael,
O.K., you gave fair warning.
The patch applied fine. It does not compile for me. The function cpuidle_update_retain_tick does not exist. Shouldn't it be somewhere in cpuidle.c? I used the function cpuidle_disable_device as a template for searching and comparing.
Because all of my baseline results are with kernel 5.17-rc3, that is what I am still using.
Error: ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl': intel_pstate.c:(.text+0x2520): undefined reference to `cpuidle_update_retain_tick'
Same here, seems the cpuidle_update_retain_tick()'s implementation is missing.
That's a patch generation issue on my part, sorry.
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
---
 drivers/cpufreq/intel_pstate.c     |   40 ++++++++++++++++++++++++++++---------
 drivers/cpuidle/governor.c         |   23 +++++++++++++++++++++
 drivers/cpuidle/governors/ladder.c |    6 +++--
 drivers/cpuidle/governors/menu.c   |    2 +
 drivers/cpuidle/governors/teo.c    |    3 ++
 include/linux/cpuidle.h            |    4 +++
 6 files changed, 67 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -284,6 +284,8 @@ static int menu_select(struct cpuidle_dr
 	if (unlikely(delta < 0)) {
 		delta = 0;
 		delta_tick = 0;
+	} else if (cpuidle_retain_local_tick()) {
+		delta = delta_tick;
 	}

 	data->next_timer_ns = delta;

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -308,6 +308,9 @@ static int teo_select(struct cpuidle_dri
 	cpu_data->time_span_ns = local_clock();

 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	if (cpuidle_retain_local_tick())
+		duration_ns = delta_tick;
+
 	cpu_data->sleep_length_ns = duration_ns;

 	/* Check if there is any choice in the first place. */

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -172,6 +172,9 @@ extern int cpuidle_play_dead(void);
 extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
 static inline struct cpuidle_device *cpuidle_get_device(void)
 	{return __this_cpu_read(cpuidle_devices); }
+
+extern void cpuidle_update_retain_tick(bool val);
+extern bool cpuidle_retain_local_tick(void);
 #else
 static inline void disable_cpuidle(void) { }
 static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
@@ -211,6 +214,7 @@ static inline int cpuidle_play_dead(void
 static inline struct cpuidle_driver *cpuidle_get_cpu_driver(
 	struct cpuidle_device *dev) {return NULL; }
 static inline struct cpuidle_device *cpuidle_get_device(void) {return NULL; }
+static inline void cpuidle_update_retain_tick(bool val) { }
 #endif

 #ifdef CONFIG_CPU_IDLE

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
 }
 #endif /* CONFIG_ACPI_CPPC_LIB */

+static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
+{
+	int pstate = cpu->pstate.current_pstate;
+
+	/*
+	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
+	 * P-states to prevent them from getting back to the high frequency
+	 * right away after getting out of deep idle.
+	 */
+	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+}
+
+static void intel_pstate_update_perf_ctl_wrapper(void *cpu_data)
+{
+	intel_pstate_update_perf_ctl(cpu_data);
+}
+
+static void intel_pstate_update_perf_ctl_on_cpu(struct cpudata *cpu)
+{
+	smp_call_function_single(cpu->cpu, intel_pstate_update_perf_ctl_wrapper,
+				 cpu, 1);
+}
+
 static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
 {
 	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
@@ -1979,8 +2004,7 @@ static void intel_pstate_set_pstate(stru
 	 * the CPU being updated, so force the register update to run on the
 	 * right CPU.
 	 */
-	wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-		      pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl_on_cpu(cpu);
 }

 static void intel_pstate_set_min_pstate(struct cpudata *cpu)
@@ -2256,7 +2280,7 @@ static void intel_pstate_update_pstate(s
 		return;

 	cpu->pstate.current_pstate = pstate;
-	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
+	intel_pstate_update_perf_ctl(cpu);
 }

 static void intel_pstate_adjust_pstate(struct cpudata *cpu)
@@ -2843,11 +2867,9 @@ static void intel_cpufreq_perf_ctl_updat
 				     u32 target_pstate, bool fast_switch)
 {
 	if (fast_switch)
-		wrmsrl(MSR_IA32_PERF_CTL,
-		       pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl(cpu);
 	else
-		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
+		intel_pstate_update_perf_ctl_on_cpu(cpu);
 }

 static int intel_cpufreq_update_pstate(struct cpufreq_policy *policy,
@@ -2857,6 +2879,8 @@ static int intel_cpufreq_update_pstate(s
 	int old_pstate = cpu->pstate.current_pstate;

 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	cpu->pstate.current_pstate = target_pstate;
+
 	if (hwp_active) {
 		int max_pstate = policy->strict_target ?
 					target_pstate : cpu->max_perf_ratio;
@@ -2867,8 +2891,6 @@ static int intel_cpufreq_update_pstate(s
 		intel_cpufreq_perf_ctl_update(cpu, target_pstate, fast_switch);
 	}

-	cpu->pstate.current_pstate = target_pstate;
-
 	intel_cpufreq_trace(cpu, fast_switch ?
 			    INTEL_PSTATE_TRACE_FAST_SWITCH :
 			    INTEL_PSTATE_TRACE_TARGET, old_pstate);

Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -61,10 +61,10 @@ static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
- * @dummy: not used
+ * @stop_tick: Whether or not to stop the scheduler tick
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-			       struct cpuidle_device *dev, bool *dummy)
+			       struct cpuidle_device *dev, bool *stop_tick)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct ladder_device_state *last_state;
@@ -73,6 +73,8 @@ static int ladder_select_state(struct cp
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	s64 last_residency;

+	*stop_tick = !cpuidle_retain_local_tick();
+
 	/* Special case when user has set very strict latency requirement */
 	if (unlikely(latency_req == 0)) {
 		ladder_do_selection(dev, ldev, last_idx, 0);

Index: linux-pm/drivers/cpuidle/governor.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governor.c
+++ linux-pm/drivers/cpuidle/governor.c
@@ -118,3 +118,26 @@ s64 cpuidle_governor_latency_req(unsigne

 	return (s64)device_req * NSEC_PER_USEC;
 }
+
+static DEFINE_PER_CPU(bool, cpuidle_retain_tick);
+
+/**
+ * cpuidle_update_retain_tick - Update the local CPU's retain_tick flag.
+ * @val: New value of the flag.
+ *
+ * The retain_tick flag controls whether or not cpuidle is allowed to stop
+ * the scheduler tick on the local CPU and it can be updated with the help of
+ * this function.
+ */
+void cpuidle_update_retain_tick(bool val)
+{
+	this_cpu_write(cpuidle_retain_tick, val);
+}
+
+/**
+ * cpuidle_retain_local_tick - Return the local CPU's retain_tick flag value.
+ */
+bool cpuidle_retain_local_tick(void)
+{
+	return this_cpu_read(cpuidle_retain_tick);
+}
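[Editor's note: for readers less familiar with the kernel's per-CPU primitives, the two new governor.c helpers amount to a flag that each CPU writes and reads only for itself, so no locking is needed. Below is a rough userspace analogue using C11 thread-local storage; it is an illustration only, not kernel code, and the names merely mirror the patch.]

```c
#include <stdbool.h>

/*
 * Userspace sketch of the per-CPU retain_tick flag: each "CPU" (here,
 * each thread) has a private copy, so updates never race with readers
 * on other CPUs. DEFINE_PER_CPU in the patch plays the role that
 * _Thread_local plays here.
 */
static _Thread_local bool cpuidle_retain_tick;

/* Analogue of cpuidle_update_retain_tick(): set the local flag. */
void cpuidle_update_retain_tick(bool val)
{
	cpuidle_retain_tick = val;
}

/* Analogue of cpuidle_retain_local_tick(): read the local flag. */
bool cpuidle_retain_local_tick(void)
{
	return cpuidle_retain_tick;
}
```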
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
Hi Rafael,
O.K., you gave fair warning.
The patch applied fine. It does not compile for me. The function cpuidle_update_retain_tick does not exist. Shouldn't it be somewhere in cpuidle.c? I used the function cpuidle_disable_device as a template for searching and comparing.
Because all of my baseline results are with kernel 5.17-rc3, that is what I am still using.
Error: ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl': intel_pstate.c:(.text+0x2520): undefined reference to `cpuidle_update_retain_tick'
Same here, seems the cpuidle_update_retain_tick()'s implementation is missing.
That's a patch generation issue on my part, sorry.
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting that I had seen cannot be reproduced after applying it.
As my test is a rough one which can't reproduce what Doug has seen (including the power meter data), it's better to wait for his test result.
And one minor question for the code.
drivers/cpufreq/intel_pstate.c | 40 ++++++++++++++++++++++++++++--------- drivers/cpuidle/governor.c | 23 +++++++++++++++++++++ drivers/cpuidle/governors/ladder.c | 6 +++-- drivers/cpuidle/governors/menu.c | 2 + drivers/cpuidle/governors/teo.c | 3 ++ include/linux/cpuidle.h | 4 +++ 6 files changed, 67 insertions(+), 11 deletions(-)
[SNIP]
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
 }
 #endif /* CONFIG_ACPI_CPPC_LIB */

+static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
+{
+	int pstate = cpu->pstate.current_pstate;
+
+	/*
+	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
+	 * P-states to prevent them from getting back to the high frequency
+	 * right away after getting out of deep idle.
+	 */
+	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
In our test, the workload will make the CPU go to the highest frequency before going idle, but should we also consider the case of a high but not highest cpufreq?
Thanks, Feng
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
On Monday, February 28, 2022 5:12:28 AM CET Feng Tang wrote:
On Fri, Feb 25, 2022 at 04:36:53PM -0800, Doug Smythies wrote:
On Fri, Feb 25, 2022 at 9:46 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
On Thursday, February 24, 2022 9:08:30 AM CET Feng Tang wrote:
...
So it looks like a new mechanism is needed for that.
If you think idle class is not the right place to solve it, I can also help testing new patches.
So I have the appended experimental patch to address this issue that's not been tested at all. Caveat emptor.
Hi Rafael,
O.K., you gave fair warning.
The patch applied fine. It does not compile for me. The function cpuidle_update_retain_tick does not exist. Shouldn't it be somewhere in cpuidle.c? I used the function cpuidle_disable_device as a template for searching and comparing.
Because all of my baseline results are with kernel 5.17-rc3, that is what I am still using.
Error: ld: drivers/cpufreq/intel_pstate.o: in function `intel_pstate_update_perf_ctl': intel_pstate.c:(.text+0x2520): undefined reference to `cpuidle_update_retain_tick'
Same here, seems the cpuidle_update_retain_tick()'s implementation is missing.
That's a patch generation issue on my part, sorry.
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
As my test is a rough one which can't reproduce what Doug has seen (including the power meter data), it's better to wait for his test result.
And one minor question for the code.
drivers/cpufreq/intel_pstate.c | 40 ++++++++++++++++++++++++++++--------- drivers/cpuidle/governor.c | 23 +++++++++++++++++++++ drivers/cpuidle/governors/ladder.c | 6 +++-- drivers/cpuidle/governors/menu.c | 2 + drivers/cpuidle/governors/teo.c | 3 ++ include/linux/cpuidle.h | 4 +++ 6 files changed, 67 insertions(+), 11 deletions(-)
[SNIP]
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -19,6 +19,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/fs.h>
@@ -1970,6 +1971,30 @@ static inline void intel_pstate_cppc_set
 }
 #endif /* CONFIG_ACPI_CPPC_LIB */

+static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
+{
+	int pstate = cpu->pstate.current_pstate;
+
+	/*
+	 * Avoid stopping the scheduler tick from cpuidle on CPUs in turbo
+	 * P-states to prevent them from getting back to the high frequency
+	 * right away after getting out of deep idle.
+	 */
+	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
In our test, the workload will make the CPU go to the highest frequency before going idle, but should we also consider the case of a high but not highest cpufreq?
This covers the entire turbo range (max_pstate is the highest non-turbo one).
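[Editor's note: the threshold being discussed reduces to a single comparison: the tick is retained only while the current P-state is strictly above max_pstate, the highest non-turbo P-state. A minimal sketch of that predicate follows; the P-state numbers used in it are illustrative only, not values read from any particular CPU.]

```c
#include <stdbool.h>

/*
 * Retain the scheduler tick only while the CPU is running in the turbo
 * range, i.e. strictly above the highest non-turbo P-state (max_pstate).
 * Any P-state at or below max_pstate lets cpuidle stop the tick as usual.
 */
bool retain_tick_for(int current_pstate, int max_pstate)
{
	return current_pstate > max_pstate;
}
```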
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, i.e. the inability of the task to complete its work before the next period. I have always collected overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
I do not know how to test any final solution thoroughly, as so far I have simply found a problematic example that is good enough to work with. We have many years of experience with the convenient multi-second NMI forcing the cleanup of lingering high pstates. I would keep it, and have it decide internally whether or not the TSC checks need to be executed.
Anyway...
Base kernel: 5.17-rc3.
"stock": unmodified.
"revert": with commit b50db7095fe reverted.
"rjw-2": with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 hertz work/sleep cycle on one CPU, while the other CPUs carry virtually no load, but enough to increase the requested pstate, at around 0.02 hertz aggregate.
It is important to note the main load is approximately 38.6% @ 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
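[Editor's note: as an arithmetic cross-check of the load figures above (my calculation, not from the thread): a 38.6% duty cycle at 2.422 GHz executes the same number of cycles per period as 100% load at roughly 0.935 GHz, since 0.386 * 2.422 is about 0.935.]

```c
/*
 * Equivalent full-load frequency for a duty-cycled load: the cycles
 * executed per period are duty_fraction * freq_ghz, so a 38.6% load
 * at 2.422 GHz matches 100% load at ~0.935 GHz.
 */
double equivalent_full_load_ghz(double duty_fraction, double freq_ghz)
{
	return duty_fraction * freq_ghz;
}
```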
Turbostat was used: ~10 samples at 300 seconds per sample. Processor package power (Watts):
The workflow was run for 1 hour each time, or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08 ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz overruns: 1042. max overrun time: 9,769 uSec.
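[Editor's note: the percentages in these results are average package power relative to the "revert" baseline of 3.00 W; 3.63 W is +21%, and 3.14 W is about +4.7%, rounded to +5% above. A small helper showing the calculation:]

```c
/*
 * Percent increase in average package power over the "revert"
 * baseline (3.00 W in the results above).
 */
double percent_over_baseline(double watts, double baseline_watts)
{
	return (watts / baseline_watts - 1.0) * 100.0;
}
```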
... Doug
On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
Thanks for testing it!
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or inability of the task to complete its work before the next period.
It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
I have always had overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
Agreed.
I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example. We have many years of experience with the convenient multi-second NMI watchdog forcing cleanup of lingering high pstates. I'd keep it, and let it decide internally whether the TSC checks need to be executed or not.
Anyway...
Base kernel: 5.17-rc3. "stock": unmodified. "revert": with commit b50db7095fe reverted. "rjw-2": with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 Hz work/sleep frequency on one CPU, while the others carry virtually no load, but enough to increase the requested pstate, at around 0.02 Hz aggregate.
It is important to note that the main load is approximately 38.6% at 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
Turbostat was used. ~10 samples at 300 seconds per. Processor package power (Watts):
Workflow was run for 1 hour each time or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08
I'm not sure what the above three numbers are.
ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
Thanks!
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
Thanks for testing it!
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or inability of the task to complete its work before the next period.
It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
It is already covered below. And a couple of further tests directly contradict my thinking.
I have always had overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
Agreed.
Just as an easy example and test I did this on top of this patch ("rjw-3"):
doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
	 * P-states to prevent them from getting back to the high frequency
	 * right away after getting out of deep idle.
	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));

	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example. We have many years of experience with the convenient multi-second NMI watchdog forcing cleanup of lingering high pstates. I'd keep it, and let it decide internally whether the TSC checks need to be executed or not.
Anyway...
Base kernel: 5.17-rc3. "stock": unmodified. "revert": with commit b50db7095fe reverted. "rjw-2": with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 Hz work/sleep frequency on one CPU, while the others carry virtually no load, but enough to increase the requested pstate, at around 0.02 Hz aggregate.
It is important to note that the main load is approximately 38.6% at 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
Turbostat was used. ~10 samples at 300 seconds per. Processor package power (Watts):
Workflow was run for 1 hour each time or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08
I'm not sure what the above three numbers are.
Processor package power, Watts.
ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs to go down based on such a localized look; the risk is that the higher periodic load might suffer overruns. Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected as this patch set wouldn't change durations by more than a few jiffies. 755 long durations (>6.1 seconds), and 327.7 seconds longest.
... Doug
On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
Thanks for testing it!
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or inability of the task to complete its work before the next period.
It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
It is already covered below. And a couple of further tests directly contradict my thinking.
I have always had overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
Agreed.
Just as an easy example and test I did this on top of this patch ("rjw-3"):
doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
	 * P-states to prevent them from getting back to the high frequency
	 * right away after getting out of deep idle.
	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
OK, but cpu->pstate.max_pstate / 2 may be almost cpu->pstate.min_pstate which would be better to use here IMO.
Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2. Can you try this one in particular?
wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
}
I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example. We have many years of experience with the convenient multi-second NMI watchdog forcing cleanup of lingering high pstates. I'd keep it, and let it decide internally whether the TSC checks need to be executed or not.
Anyway...
Base kernel: 5.17-rc3. "stock": unmodified. "revert": with commit b50db7095fe reverted. "rjw-2": with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 Hz work/sleep frequency on one CPU, while the others carry virtually no load, but enough to increase the requested pstate, at around 0.02 Hz aggregate.
It is important to note that the main load is approximately 38.6% at 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
Turbostat was used. ~10 samples at 300 seconds per. Processor package power (Watts):
Workflow was run for 1 hour each time or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08
I'm not sure what the above three numbers are.
Processor package power, Watts.
OK
ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts
Did you mean 3.17 here? It would be hard to get the average of 3.09 if the max was over 30 W.
ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
That's because the scheduler tick is allowed to run a lot more often in the given workload with the changed test above.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs to go down based on such a localized look; the risk is that the higher periodic load might suffer overruns. Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
It looks like the workload's behavior depends on updating the CPU performance scaling governor sufficiently often.
Previously, that happened through the watchdog workflow that is gone now. The rjw-2/3 patch is attempting to make up for that by letting the tick run more often.
Admittedly, it is somewhat unclear to me why there are not so many overruns in the "stock" kernel test. Perhaps that's because the CPUs generally run fast enough in that case, so the P-state governor need not be updated so often for the CPU frequency to stay high and that's what determines the performance (and in turn that decides whether or not there are overruns and how often they occur). The other side of the coin is that the updates don't occur often enough for the power draw to be reasonable, though.
Presumably, when the P-state governor gets updated more often (via the scheduler tick), but not often enough, it effectively causes the frequency (and consequently the performance) to drop over relatively long time intervals (and hence the increased occurrence of overruns). If it gets updated even more often (like in rjw-3), though, it causes the CPU frequency to follow the utilization more precisely and so the CPUs don't run "too fast" or "too slowly" for too long.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected as this patch set wouldn't change durations by more than a few jiffies. 755 long durations (>6.1 seconds), and 327.7 seconds longest.
On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
One thought: can we consider trying the previous debug patch, which calls the util update when entering idle (time limited)?
In the current code, the RT/CFS/Deadline classes all have places that call cpufreq_update_util(); the patch will make sure it is called in all four classes. It also follows the principle of 'schedutil' of not introducing more system cost. And surely I could be missing some details here.
Following is a cleaner version of the patch, and the code could be moved down to the internal loop of
while (!need_resched()) {
}
which will make it get called more frequently.
---
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;

 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);

+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
Thanks, Feng
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs to go down based on such a localized look; the risk is that the higher periodic load might suffer overruns. Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected as this patch set wouldn't change durations by more than a few jiffies. 755 long durations (>6.1 seconds), and 327.7 seconds longest.
... Doug
On Thu, Mar 3, 2022 at 6:27 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
One thought: can we consider trying the previous debug patch, which calls the util update when entering idle (time limited)?
In the current code, the RT/CFS/Deadline classes all have places that call cpufreq_update_util(); the patch will make sure it is called in all four classes. It also follows the principle of 'schedutil' of not introducing more system cost. And surely I could be missing some details here.
Following is a cleaner version of the patch, and the code could be moved down to the internal loop of
while (!need_resched()) { }
which will make it get called more frequently.
It will, but it's not necessary in all cases. It only is necessary if the tick is going to be stopped (because the tick will update the P-state governor anyway if it runs). However, at this point you don't know what's going to happen to the tick.
Moreover, updating the P-state governor before going idle doesn't really help, because the P-state programmed by it at this point may very well be stale after getting out of the idle state, so instead of doing a full update at this point, it should force a low P-state on the way from idle.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;

 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);

+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
And quite frankly I'm not sure if running cpufreq_update_util() from here is safe.
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote:
On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
However, it was a bit racy, so maybe it's good that it was not complete.
Below is a new version.
Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
OK, thanks!
I'll wait for feedback from Doug, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
Thanks for testing it!
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or inability of the task to complete its work before the next period.
It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
It is already covered below. And a couple of further tests directly contradict my thinking.
I have always had overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
Agreed.
Just as an easy example and test I did this on top of this patch ("rjw-3"):
doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
	 * P-states to prevent them from getting back to the high frequency
	 * right away after getting out of deep idle.
	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
OK, but cpu->pstate.max_pstate / 2 may be almost cpu->pstate.min_pstate which would be better to use here IMO.
For my processor, the i5-10600K, it works out to 2.05 GHz (4.1 GHz / 2), whereas the min pstate would be 0.8 GHz.
Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2. Can you try this one in particular?
Which I agree would be a much better general case to use. For my processor that would be 2.45 GHz: (4.1 + 0.8) / 2.
My code was just for testing, and I intended to go further than we might want to for the final solution.
I'm holding off on trying your suggestion, for now.
wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
}
I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example. We have many years of experience with the convenient multi-second NMI watchdog forcing cleanup of lingering high pstates. I'd keep it, and let it decide internally whether the TSC checks need to be executed or not.
Anyway...
Base kernel: 5.17-rc3. "stock": unmodified. "revert": with commit b50db7095fe reverted. "rjw-2": with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 Hz work/sleep frequency on one CPU, while the others carry virtually no load, but enough to increase the requested pstate, at around 0.02 Hz aggregate.
It is important to note that the main load is approximately 38.6% at 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
Turbostat was used. ~10 samples at 300 seconds per. Processor package power (Watts):
Workflow was run for 1 hour each time or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08
I'm not sure what the above three numbers are.
Processor package power, Watts.
OK
ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts
Did you mean 3.17 here? It would be hard to get the average of 3.09 if the max was over 30 W.
Yes, 3.17 watts was what I meant to write. Sorry for any confusion.
ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
That's because the scheduler tick is allowed to run a lot more often in the given workload with the changed test above.
Agreed.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs to go down based on such a localized look; the risk is that the higher periodic load might suffer overruns. Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
It looks like the workload's behavior depends on updating the CPU performance scaling governor sufficiently often.
Previously, that happened through the watchdog workflow that is gone now. The rjw-2/3 patch is attempting to make up for that by letting the tick run more often.
Admittedly, it is somewhat unclear to me why there are not so many overruns in the "stock" kernel test. Perhaps that's because the CPUs generally run fast enough in that case, so the P-state governor need not be updated so often for the CPU frequency to stay high and that's what determines the performance (and in turn that decides whether or not there are overruns and how often they occur). The other side of the coin is that the updates don't occur often enough for the power draw to be reasonable, though.
Agreed.
I am observing higher CPU frequencies than expected, and can not determine why in all cases. It is going to take several days before I can either reply with more detail or disprove what I think I am observing.
Presumably, when the P-state governor gets updated more often (via the scheduler tick), but not often enough, it effectively causes the frequency (and consequently the performance) to drop over relatively long time intervals (and hence the increased occurrence of overruns). If it gets updated even more often (like in rjw-3), though, it causes the CPU frequency to follow the utilization more precisely and so the CPUs don't run "too fast" or "too slowly" for too long.
Agreed.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected as this patch set wouldn't change durations by more than a few jiffies. 755 long durations (>6.1 seconds), and 327.7 seconds longest.
On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
On Thu, Mar 3, 2022 at 6:27 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) max overrun time: 621 uSec.
Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
One thought: can we consider trying the previous debug patch, which calls the util update when entering idle (time limited)?
In the current code, the RT/CFS/Deadline classes all have places that call cpufreq_update_util(); the patch will make sure it is called in all four classes. It also follows the principle of 'schedutil' of not introducing more system cost. And surely I could be missing some details here.
Following is a cleaner version of the patch, and the code could be moved down to the internal loop of
while (!need_resched()) { }
which will make it get called more frequently.
It will, but it's not necessary in all cases. It only is necessary if the tick is going to be stopped (because the tick will update the P-state governor anyway if it runs). However, at this point you don't know what's going to happen to the tick.
Moreover, updating the P-state governor before going idle doesn't really help,
From Doug's previous test, the power consumption and the delay both improved with the debug patch.
because the P-state programmed by it at this point may very well be stale after getting out of the idle state, so instead of doing a full update at this point, it should force a low P-state on the way from idle.
Makes sense.
Paul has asked about the timer interrupts, and here is some more info: when there is no active load in the system, the longest interval of 4 seconds between 2 timer interrupts comes from the kernel watchdog for the hardware/software lockup detector. Every CPU has this timer, which limits the maximum CPU idle duration to 4 seconds.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;

 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);

+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
And quite frankly I'm not sure if running cpufreq_update_util() from here is safe.
I had that concern too :). Do you mean this is called when the local irq is enabled, and could be interrupted causing some issue?
Thanks, Feng
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
On Thu, Mar 3, 2022 at 3:00 PM Doug Smythies dsmythies@telus.net wrote:
On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies dsmythies@telus.net wrote:
On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Tue, Mar 1, 2022 at 6:53 AM Feng Tang feng.tang@intel.com wrote: > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
...
> > > > However, it was a bit racy, so maybe it's good that it was not complete. > > > > Below is a new version. > > Thanks for the new version. I just gave it a try, and the occasional > long delay of cpufreq auto-adjusting I have seen can not be reproduced > after applying it.
OK, thanks!
I'll wait for feedback from Dough, though.
Hi Rafael,
Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
Thanks for testing it!
From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or inability of the task to complete its work before the next period.
It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
It is already covered below. And a couple of further tests directly contradict my thinking.
I have always had overrun information, but it has never been relevant before.
The other worry was if the threshold of turbo/not turbo frequency is enough.
Agreed.
Just as an easy example and test I did this on top of this patch ("rjw-3"):
doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
 	 * P-states to prevent them from getting back to the high frequency
 	 * right away after getting out of deep idle.
 	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
OK, but cpu->pstate.max_pstate / 2 may be almost cpu->pstate.min_pstate which would be better to use here IMO.
For my processor, i5-10600K, it works out to 2.05 GHz. 4.1 GHz / 2, whereas min pstate would be 0.8 GHz
Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2. Can you try this one in particular?
Which I agree would be a much better general case to use. For my processor that would be 2.45 GHz. (4.1 + 0.8) / 2.
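For reference, the two candidate cut-offs can be checked numerically (a sketch, not from the thread; the pstate-to-GHz conversion assumes this processor's 100 MHz bus clock, and note that the C integer arithmetic rounds down slightly relative to the GHz figures quoted above):

```python
# Compare the two proposed tick-retention thresholds for an i5-10600K
# (min pstate 8, max non-turbo pstate 41, 100 MHz bus clock).
BUS_CLOCK_MHZ = 100

def half_max(max_pstate: int) -> int:
    # cpu->pstate.max_pstate >> 1 (the test patch)
    return max_pstate >> 1

def mid_point(max_pstate: int, min_pstate: int) -> int:
    # (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2, C integer division
    return (max_pstate + min_pstate) // 2

print(half_max(41), half_max(41) * BUS_CLOCK_MHZ / 1000)        # pstate 20 -> 2.0 GHz
print(mid_point(41, 8), mid_point(41, 8) * BUS_CLOCK_MHZ / 1000)  # pstate 24 -> 2.4 GHz
```

With "pstate > 24" as the condition, the lowest pstate that keeps the tick running is 25, matching the threshold figure given later in the thread.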
My code was just for testing, and I intended to go further than we might want to for the final solution.
I'm holding off on trying your suggestion, for now.
 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
I do not know how to test any final solution thoroughly, as so far I have simply found one good problematic example. We have many years of experience with the convenient multi-second NMI watchdog forcing a cleanup of lingering high pstates. I'd keep it, and decide within it whether the TSC checking needs to be executed or not.
Anyway...
Base kernel: 5.17-rc3.
"stock" : unmodified.
"revert" : with commit b50db7095fe reverted.
"rjw-2" : with this version 2 patch added.
Test 1 (as before; there is no test 2 yet): a 347 hertz work/sleep frequency on one CPU, while the other CPUs carry virtually no load, but enough to increase the requested pstate, at around 0.02 hertz aggregate.
It is important to note that the main load is approximately 38.6% at 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states):
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
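The stated load equivalence is just work-rate arithmetic (utilization times frequency); a quick sanity check, not from the thread:

```python
# Work rate = utilization * frequency:
# 38.6% busy at 2.422 GHz should match ~100% busy at 0.935 GHz.
work_rate_ghz = 0.386 * 2.422
print(round(work_rate_ghz, 3))  # effective full-time frequency in GHz
```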
Turbostat was used: ~10 samples at 300 seconds per sample. Processor package power (Watts):
Workflow was run for 1 hour each time or 1249201 loops.
revert: ave: 3.00 min: 2.89 max: 3.08
I'm not sure what the above three numbers are.
Processor package power, Watts.
OK
ave freq: 2.422 GHz. overruns: 1. max overrun time: 113 uSec.
stock: ave: 3.63 (+21%) min: 3.28 max: 3.99 ave freq: 2.791 GHz. overruns: 2. max overrun time: 677 uSec.
rjw-2: ave: 3.14 (+5%) min: 2.97 max: 3.28 ave freq: 2.635 GHz
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts
Did you mean 3.17 here? It would be hard to get the average of 3.09 if the max was over 30 W.
Yes, 3.17 watts was what I meant to write. Sorry for any confusion.
ave freq: 2.42 GHz. overruns: 12 (I did not expect this). max overrun time: 621 uSec.
Note 1: IRQ's increased by 74%. i.e. it was going in and out of idle a lot more.
That's because the scheduler tick is allowed to run a lot more often in the given workload with the changed test above.
Agreed.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs to go down based on such a localized look, the risk being that the higher periodic load might suffer overruns. Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now I am not sure what is going on, and will need more time to acquire traces and dig into it.
It looks like the workload's behavior depends on updating the CPU performance scaling governor sufficiently often.
Previously, that happened through the watchdog workflow that is gone now. The rjw-2/3 patch is attempting to make up for that by letting the tick run more often.
Admittedly, it is somewhat unclear to me why there are not so many overruns in the "stock" kernel test.
The CPU frequency is incorrectly held considerably higher than it should be. From intel_pstate_tracer data it looks as though a CPU that is idle is not giving up its vote into the PLL as to what the CPU frequency should be. However, the culprit isn't actually that CPU, but rather the uncore, which seems to lock to that same CPU's pstate. It doesn't unlock from that state until that CPU calls the intel_pstate driver, sometimes tens of seconds later. Then things will proceed as normal for a short time until that same CPU does the same thing and the cycle repeats.
If I lock the uncore at its minimum pstate, then everything is fine and I can not detect any difference between the "stock" and "reverted" kernels. Both give better power results than all previous tests. Processor package power is approximately 2.75 watts (it is late in my time zone, exact numbers at a later date). No overruns.
The shorter maximum durations from the "reverted" kernel merely cleaned up the mess faster than the "stock" kernel resulting in less, but still excessive, power consumption.
I only heard about the uncore a couple of months ago, from Srinivas [1].
[1] https://lore.kernel.org/linux-pm/1b2be990d5c31f62d9ce33aa2eb2530708d5607a.ca...
The CPU that the uncore appears to lock to is not the same for each run of the test, but I know which CPU it was for that run of the test just by looking at the load graphs from the intel_pstate_tracer output. Thereafter, manual searching of the data reveals the anomalies.
... Doug
Perhaps that's because the CPUs
generally run fast enough in that case, so the P-state governor need not be updated so often for the CPU frequency to stay high and that's what determines the performance (and in turn that decides whether or not there are overruns and how often they occur). The other side of the coin is that the updates don't occur often enough for the power draw to be reasonable, though.
Agreed.
I am observing higher CPU frequencies than expected, and can not determine why in all cases. It is going to take several days before I can either reply with more detail or disprove what I think I am observing.
Presumably, when the P-state governor gets updated more often (via the scheduler tick), but not often enough, it effectively causes the frequency (and consequently the performance) to drop over relatively long time intervals (and hence the increased occurrence of overruns). If it gets updated even more often (like in rjw-3), though, it causes the CPU frequency to follow the utilization more precisely and so the CPUs don't run "too fast" or "too slowly" for too long.
Agreed.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected as this patch set wouldn't change durations by more than a few jiffies. 755 long durations (>6.1 seconds), and 327.7 seconds longest.
On Fri, Mar 04, 2022 at 01:13:44PM +0800, Feng Tang wrote:
On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
On Thu, Mar 3, 2022 at 6:27 AM Feng Tang feng.tang@intel.com wrote:
On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki rafael@kernel.org wrote:
I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
Yes, and for a test I did "rjw-3".
overruns: 1042. max overrun time: 9,769 uSec.
This would probably get worse then, though.
Yes, that was my expectation, but not what happened.
rjw-3: ave: 3.09 watts min: 3.01 watts max: 31.7 watts ave freq: 2.42 GHz. overruns: 12. (I did not expect this.) Max overruns time: 621 uSec.
Note 1: IRQ's increased by 74%. i.e. it was going in and out of idle a lot more.
Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
I retested both rjw-2 and rjw-3, but shorter tests and got 0 overruns in both cases.
One thought is can we consider trying the previous debug patch of calling the util_update when entering idle (time limited).
In the current code, the RT, CFS and Deadline classes all have places that call cpufreq_update_util(); the patch makes sure it is called in all four classes. It also follows the 'schedutil' principle of not introducing more system cost. And surely I could be missing some details here.
Following is a cleaner version of the patch. The code could also be moved down into the inner loop of
	while (!need_resched()) { }
which would make it be called more frequently.
It will, but it's not necessary in all cases. It only is necessary if the tick is going to be stopped (because the tick will update the P-state governor anyway if it runs). However, at this point you don't know what's going to happen to the tick.
Moreover, updating the P-state governor before going idle doesn't really help,
From Doug's previous test, the power consumption and the delay
both improved with the debug patch.
because the P-state programmed by it at this point may very well be stale after getting out of the idle state, so instead of doing a full update at this point, it should force a low P-state on the way from idle.
Makes sense.
Paul has asked about the timer interrupts, so here is some more info. When there is no active load on the system, the longest interval between two timer interrupts, 4 seconds, comes from the kernel watchdog for the hardware/software lockup detector. Every CPU has this timer, which limits the maximum CPU idle duration to 4 seconds.
And thank you for the info! One way to reduce this overhead is to boot with the watchdog_thresh kernel-boot parameter set to some value greater than 10. The downside is that HW/FW/NMI catastrophes will need to last for more than 10 seconds before the system complains. One approach is to run with a small watchdog_thresh during testing and a larger one in production.
Thanx, Paul
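The 4 second figure lines up with how kernel/watchdog.c derives the per-CPU softlockup hrtimer period from watchdog_thresh (the softlockup threshold is 2 * watchdog_thresh and the hrtimer fires 5 times per threshold window); a sketch of that arithmetic, which also shows the effect of Paul's suggestion:

```python
# Softlockup watchdog hrtimer period, per kernel/watchdog.c:
#   softlockup_thresh = 2 * watchdog_thresh
#   sample_period     = softlockup_thresh / 5
def watchdog_sample_period(watchdog_thresh: int) -> float:
    return (2 * watchdog_thresh) / 5  # seconds

print(watchdog_sample_period(10))  # default thresh of 10 -> 4.0 s per-CPU timer
print(watchdog_sample_period(30))  # a larger thresh stretches the period to 12.0 s
```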
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  * - Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;

 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);

+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
And quite frankly I'm not sure if running cpufreq_update_util() from here is safe.
I had that concern too :). Do you mean this is called when the local irq is enabled, and could be interrupted causing some issue?
Thanks, Feng
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.
I always intended to try/test: HWP disabled; intel_cpufreq; schedutil. There is an issue with the proposed patch and schedutil.
If any CPU ever requests a pstate above the max non-turbo pstate, it will stay at that request forever. Ultimately the idle power goes to about 5.7 watts (versus the 1.4 watts expected). IRQs go very high, as the tick never turns off. In fact, one can tell how many CPUs are stuck requesting a high pstate just by looking at the IRQ count.
Trace is useless because the driver virtually never gets called, so I have been reading the IA32_PERF_CTL MSR directly.
Example:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33    <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31    <<< Schedutil governor
0.12    4101    30153   4.48    <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76    <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36    <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46
And the related MSR reads:
3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 30 : 8 : 8 : 48 : 48 : 48 : 8 : 30 : 31 : 8 : 8 : 8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 44 : 30 : 31 : 48 : 48 : 48 : 8 : 8 : 8 : 8 : 48 : 8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 45 : 48 : 48 : 48 : 48 : 48 : 8 : 30 : 8 : 8 : 48 : 42 :
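For readers without the msr-decoder tool (Doug's own utility), the decode is straightforward; a hypothetical stand-in that assumes only the documented IA32_PERF_CTL layout, where the requested performance ratio sits in bits 15:8 of MSR 0x199:

```python
# Extract the requested pstate (target performance ratio) from a raw
# IA32_PERF_CTL (MSR 0x199) value: it lives in bits 15:8.
def perf_ctl_pstate(msr_value: int) -> int:
    return (msr_value >> 8) & 0xFF

print(perf_ctl_pstate(0x3000))  # ratio 0x30 -> pstate 48 (max turbo on this CPU)
print(perf_ctl_pstate(0x0800))  # ratio 0x08 -> pstate 8 (min pstate on this CPU)
```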
This issue is independent of the original patch or the suggested modification:
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..94018ac0b59b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
 	 * P-states to prevent them from getting back to the high frequency
 	 * right away after getting out of deep idle.
 	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
+					      cpu->pstate.min_pstate) / 2));
 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
... Doug
On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies dsmythies@telus.net wrote:
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.
I always intended to try/test: HWP disabled; intel_cpufreq; schedutil. There is an issue with the proposed patch and schedutil.
If any CPU ever requests a pstate above the max non-turbo pstate, it will stay at that request forever. Ultimately the idle power goes to about 5.7 watts (versus the 1.4 watts expected). IRQs go very high, as the tick never turns off. In fact, one can tell how many CPUs are stuck requesting a high pstate just by looking at the IRQ count.
That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
Please try to increase /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and see what difference this makes.
Trace is useless because it virtually never gets called. So I have been reading the IA32_PERF_CTL MSR directly.
Example:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33    <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31    <<< Schedutil governor
0.12    4101    30153   4.48    <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76    <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36    <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46
And the related MSR reads:
3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 30 : 8 : 8 : 48 : 48 : 48 : 8 : 30 : 31 : 8 : 8 : 8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 44 : 30 : 31 : 48 : 48 : 48 : 8 : 8 : 8 : 8 : 48 : 8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 45 : 48 : 48 : 48 : 48 : 48 : 8 : 30 : 8 : 8 : 48 : 42 :
This issue is independent of the original patch or the suggested modification:
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..94018ac0b59b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
 	 * P-states to prevent them from getting back to the high frequency
 	 * right away after getting out of deep idle.
 	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
+					      cpu->pstate.min_pstate) / 2));
 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
... Doug
On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies dsmythies@telus.net wrote:
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.
I always intended to try/test: HWP disabled; intel_cpufreq; schedutil. There is an issue with the proposed patch and schedutil.
If any CPU ever requests a pstate above the max non-turbo pstate, it will stay at that request forever. Ultimately the idle power goes to about 5.7 watts (versus the 1.4 watts expected). IRQs go very high, as the tick never turns off. In fact, one can tell how many CPUs are stuck requesting a high pstate just by looking at the IRQ count.
That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
Please try to increase /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and see what difference this makes.
Changing rate_limit_us to 10000, or even 20000, makes no difference.
See a slight clarification to yesterday's email, in-line below.
Trace is useless because it virtually never gets called. So I have been reading the IA32_PERF_CTL MSR directly.
Example:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33    <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31    <<< Schedutil governor
0.12    4101    30153   4.48    <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76    <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36    <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46
And the related MSR reads:
3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 30 : 8 : 8 : 48 : 48 : 48 : 8 : 30 : 31 : 8 : 8 : 8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 44 : 30 : 31 : 48 : 48 : 48 : 8 : 8 : 8 : 8 : 48 : 8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 45 : 48 : 48 : 48 : 48 : 48 : 8 : 30 : 8 : 8 : 48 : 42 :
This issue is independent of the original patch or the suggested modification:
Actually, the issue threshold is as defined by the greater than condition below.
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..94018ac0b59b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
 	 * P-states to prevent them from getting back to the high frequency
 	 * right away after getting out of deep idle.
 	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
For the above kernel the threshold is pstate 42.
+	cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
+					      cpu->pstate.min_pstate) / 2));
For the above kernel the threshold is pstate 25.
wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
}
... Doug
Hi Rafael,
Do you have any suggestions for the proposed patch? I have tried to figure out what is wrong but haven't been able to.
... Doug
On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies dsmythies@telus.net wrote:
On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies dsmythies@telus.net wrote:
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.
I always intended to try/test: HWP disabled; intel_cpufreq; schedutil. There is an issue with the proposed patch and schedutil.
If any CPU ever requests a pstate above the max non-turbo pstate, it will stay at that request forever. Ultimately the idle power goes to about 5.7 watts (versus the 1.4 watts expected). IRQs go very high, as the tick never turns off. In fact, one can tell how many CPUs are stuck requesting a high pstate just by looking at the IRQ count.
That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
Please try to increase /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and see what difference this makes.
Changing rate_limit_us to 10000, or even 20000, makes no difference.
see a slight clarification to yesterday's email in-line below.
Trace is useless because it virtually never gets called. So I have been reading the IA32_PERF_CTL MSR directly.
Example:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33    <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31    <<< Schedutil governor
0.12    4101    30153   4.48    <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76    <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36    <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46
And the related MSR reads:
3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 30 : 8 : 8 : 48 : 48 : 48 : 8 : 30 : 31 : 8 : 8 : 8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 44 : 30 : 31 : 48 : 48 : 48 : 8 : 8 : 8 : 8 : 48 : 8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 45 : 48 : 48 : 48 : 48 : 48 : 8 : 30 : 8 : 8 : 48 : 42 :
This issue is independent of the original patch or the suggested modification:
Actually, the issue threshold is as defined by the greater than condition below.
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..94018ac0b59b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
 	 * P-states to prevent them from getting back to the high frequency
 	 * right away after getting out of deep idle.
 	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
For the above kernel the threshold is pstate 42.
+	cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
+					      cpu->pstate.min_pstate) / 2));
For the above kernel the threshold is pstate 25.
wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
}
... Doug
Hi Doug,
On Thu, Mar 24, 2022 at 3:04 PM Doug Smythies dsmythies@telus.net wrote:
Hi Rafael,
Do you have any suggestions for the proposed patch?
Not really.
It looks like avoiding stopping the scheduler tick is sufficient to bump up the PELT signal for this workload in such a way that it never falls below a certain level, which in turn causes schedutil to ask for higher frequencies.
An alternative approach appears to be necessary, but I need some more time for that.
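Rafael's PELT explanation can be illustrated with the signal's geometric decay (a sketch; it assumes only the well-known 32 ms PELT half-life and a hypothetical 1024-scale utilization value, not any specific kernel internals):

```python
# PELT-style decay: the utilization signal halves every 32 ms of idleness.
HALF_LIFE_MS = 32
DECAY_PER_MS = 0.5 ** (1 / HALF_LIFE_MS)

def decayed_util(util: float, idle_ms: float) -> float:
    return util * DECAY_PER_MS ** idle_ms

# With the tick kept running (e.g. ~4 ms gaps at HZ=250), the signal barely
# drops between updates, so schedutil keeps seeing high utilization...
print(round(decayed_util(1024, 4)))
# ...whereas a multi-second idle stretch decays it to essentially zero.
print(decayed_util(1024, 2000) < 1)
```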
I have tried to figure out what is wrong but haven't been able to.
... Doug
On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies dsmythies@telus.net wrote:
On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies dsmythies@telus.net wrote:
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.
I always intended to try/test: HWP disabled; intel_cpufreq; schedutil. There is an issue with the proposed patch and schedutil.
If any CPU ever requests a pstate above the max non-turbo pstate, it will stay at that request forever. Ultimately the idle power goes to about 5.7 watts (versus the 1.4 watts expected). IRQs go very high, as the tick never turns off. In fact, one can tell how many CPUs are stuck requesting a high pstate just by looking at the IRQ count.
That may be because INTEL_CPUFREQ_TRANSITION_DELAY is too small.
Please try to increase /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us to 10000 and see what difference this makes.
Changing rate_limit_us to 10000, or even 20000, makes no difference.
see a slight clarification to yesterday's email in-line below.
Trace is useless because it virtually never gets called. So I have been reading the IA32_PERF_CTL MSR directly.
Example:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 12 CPUs
min pstate 8
max non-turbo pstate 41
max turbo pstate 48
The system is idle.
doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt --interval 10
Busy%   Bzy_MHz IRQ     PkgWatt
0.11    800     844     1.33
0.01    800     231     1.33
0.11    800     723     1.33    <<< Powersave governor
0.03    889     440     1.33
0.17    4418    21511   4.31    <<< Schedutil governor
0.12    4101    30153   4.48    <<< 3 CPUs are > pstate 41
0.22    4347    34226   4.75
0.17    4101    43554   4.78
0.29    4300    50565   4.94
0.21    4098    50297   4.76    <<< 5 CPUs are > pstate 41
0.29    4298    50532   4.84
0.20    4101    50126   4.63
0.20    4101    50149   4.62
0.29    4297    50623   4.76
0.20    4101    50203   4.72
0.29    4295    50642   4.78
0.20    4101    50223   4.68
0.29    4292    50597   4.88
0.20    4101    50208   4.73
0.29    4296    50519   4.84
0.20    4101    50167   4.80
0.20    4101    50242   4.76
0.29    4302    50625   4.94
0.20    4101    50233   4.73
0.29    4296    50613   4.78
0.20    4101    50231   4.70
0.29    4292    50802   4.93
1.46    4669    65610   8.36
0.41    4225    80701   5.48
0.33    4101    80219   5.36    <<< 8 CPUs are > pstate 41
0.34    4098    80313   5.38
0.41    4228    80689   5.56
0.33    4101    80252   5.46
And the related MSR reads:
3 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 30 : 8 : 8 : 48 : 48 : 48 : 8 : 30 : 31 : 8 : 8 : 8 :

5 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 44 : 30 : 31 : 48 : 48 : 48 : 8 : 8 : 8 : 8 : 48 : 8 :

8 CPUs are > pstate 41:
root@s19:/home/doug# c/msr-decoder | grep IA32_PERF_CTL
9.) 0x199: IA32_PERF_CTL : CPU 0-11 : 45 : 48 : 48 : 48 : 48 : 48 : 8 : 30 : 8 : 8 : 48 : 42 :
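For reference, the requested pstate can be extracted from a raw IA32_PERF_CTL value directly: per Intel's SDM, the target ratio sits in bits 15:8 of MSR 0x199. A minimal decode sketch (reading the MSR itself needs root and the msr module via /dev/cpu/N/msr, so sample raw values are decoded here instead):

```python
# Decode the requested pstate from a raw IA32_PERF_CTL (MSR 0x199) value.
# The target ratio lives in bits 15:8 of the MSR.

PERF_CTL_MSR = 0x199  # noted for reference; not read in this sketch

def requested_pstate(raw):
    """Extract the target performance-state ratio (bits 15:8)."""
    return (raw >> 8) & 0xFF

# A raw value of 0x3000 encodes pstate 0x30 = 48 (max turbo on this CPU);
# 0x0800 encodes pstate 8 (the min pstate).
print(requested_pstate(0x3000))  # 48
print(requested_pstate(0x0800))  # 8
```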
This issue occurs with both the original patch and the suggested modification; the only difference is the pstate threshold at which it appears, which is defined by the greater-than condition below.
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..94018ac0b59b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
	 * P-states to prevent them from getting back to the high frequency
	 * right away after getting out of deep idle.
	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);

For the above kernel the threshold is pstate 42.

+	cpuidle_update_retain_tick(pstate > ((cpu->pstate.max_pstate +
+					      cpu->pstate.min_pstate)/2));

For the above kernel the threshold is pstate 25.

 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
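The two thresholds quoted above can be checked directly against this machine's pstates (min 8, max non-turbo 41). A small sketch of the arithmetic, mirroring the C integer division:

```python
# Compute the tick-retention thresholds for the two conditions in the
# diff above, using this machine's pstates (min 8, max non-turbo 41).

min_pstate, max_pstate = 8, 41

# Original: cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate)
# The first pstate that satisfies "pstate > 41" is 42.
orig_threshold = max_pstate + 1

# Modified: pstate > (max_pstate + min_pstate) / 2, with C integer division:
# (41 + 8) / 2 = 24, so the first pstate retaining the tick is 25.
mod_threshold = (max_pstate + min_pstate) // 2 + 1

print(orig_threshold)  # 42
print(mod_threshold)   # 25
```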
... Doug
On Thu, Mar 24, 2022 at 11:17 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
Hi Doug,
On Thu, Mar 24, 2022 at 3:04 PM Doug Smythies <dsmythies@telus.net> wrote:
Hi Rafael,
Do you have any suggestions for the proposed patch?
Not really.
It looks like avoiding stopping the scheduler tick is sufficient to bump up the PELT signal for this workload in such a way that it never falls below a certain level, which in turn causes schedutil to ask for higher frequencies.
An alternative approach appears to be necessary, but I need some more time for that.
Hi Rafael,
O.K. thanks for the reply. This can be sidelined for now if you prefer. As mentioned in one of the off-list emails:
I was always aware that we might be heading toward a solution tailored to my specific test workflow. It is one thing to create an example workflow that exploits the issue, but quite another to claim that a proposed solution works for any workflow and hardware.
I have tried to figure out what is wrong but haven't been able to.
... Doug
On Thu, Mar 17, 2022 at 6:58 AM Doug Smythies <dsmythies@telus.net> wrote:
On Thu, Mar 17, 2022 at 5:30 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
On Wed, Mar 16, 2022 at 4:55 PM Doug Smythies <dsmythies@telus.net> wrote:
Readers: So that graphs and large attachments could be used, I have been on an off-list branch of this thread with Srinivas, and copied a couple of others. While now returning to this on-list thread, I'll only take up Rafael's proposed patch.
Hi Rafael,
So far all work has been done with: HWP disabled; intel_pstate; powersave. The reason was that it is, by far, the best way to obtain good trace data using the intel_pstate_tracer.py utility.