Hello,
Apologies in advance for a lengthy cover letter. Hopefully it has all the required information so you don't need to read the ACPI spec. ;)
This patchset introduces the ideas behind CPPC (Collaborative Processor Performance Control) and implements support for controlling CPU performance using the existing PID (Proportional-Integral-Derivative) controller (from intel_pstate.c) and some CPPC semantics.
The patchwork is not a final proposal of the CPPC implementation. I've had to hack some sections due to lack of hardware, details of which are in the Testing section.
There are several bits of information which are needed in order to make CPPC work great on Linux based platforms and I'm hoping to start a wider discussion on how to address the missing bits. The following sections briefly introduce CPPC and later highlight the information which is missing.
More importantly, I'm also looking for ideas on how to support CPPC in the short term, given that we will soon be seeing products based on ARM64 and X86 which support CPPC.[1] Although we may not have all the information, we could make it work with existing governors in a way this patchset demonstrates. Hopefully, this approach is acceptable for mainline inclusion in the short term.
Finer details about the CPPC spec are available in the latest ACPI 5.1 specification.[2]
If these issues are being discussed on some other thread or elsewhere, or if someone is already working on it, please let me know. Also, please correct me if I have misunderstood anything.
What is CPPC:
=============
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
* Platform enumerates supported performance range to OS
* OS requests desired performance level over some time window along with min and max instantaneous limits
* Platform is free to optimize power/performance within bounds provided by OS
* Platform provides telemetry back to OS on delivered performance
Communication with the OS is abstracted via another ACPI construct called Platform Communication Channel (PCC) which is essentially a generic shared memory channel with doorbell interrupts going back and forth. This abstraction allows the “platform” for CPPC to be a variety of different entities – driver, firmware, BMC, etc.
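For illustration, the start of each subspace's shared memory region carries a small header that both sides poll. The sketch below mirrors ACPICA's struct acpi_pcct_shared_memory (field widths per the ACPI 5.0 PCCT definition); plain stdint types are used here instead of kernel types, so treat this as a rough sketch rather than the driver's actual declaration:

```c
#include <stdint.h>

/*
 * Sketch of the header at the start of each PCC shared-memory region,
 * mirroring ACPICA's struct acpi_pcct_shared_memory. The OS writes the
 * subspace signature and a command code, clears the status word, rings
 * the doorbell, then polls status bit 0 (command complete).
 */
struct pcc_shmem_hdr {
	uint32_t signature;	/* identifies the target subspace */
	uint16_t command;	/* command code from OSPM to platform */
	uint16_t status;	/* bit 0 set by platform when cmd is done */
};
```

The command/status handshake in the send_pcc_cmd() patch below operates on exactly these three fields.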
CPPC describes the following registers:
* HighestPerformance: (read from platform)
Indicates the highest level of performance the processor is theoretically capable of achieving, given ideal operating conditions.
* Nominal Performance: (read from platform)
Indicates the highest sustained performance level of the processor. This is the highest operating performance level the CPU is expected to deliver continuously.
* LowestNonlinearPerformance: (read from platform)
Indicates the lowest performance level of the processor with non-linear power savings.
* LowestPerformance: (read from platform)
Indicates the lowest performance level of the processor.
* GuaranteedPerformanceRegister: (read from platform)
Optional. If supported, contains the register to read the current guaranteed performance from. This is the current max sustained performance of the CPU, taking into account all budgeting constraints. It can change at runtime, and the OS is notified of changes via the ACPI notification mechanism.
* DesiredPerformanceRegister: (write to platform)
Register to write desired performance level from the OS.
* MinimumPerformanceRegister: (write to platform)
Optional. This is the min allowable performance as requested by the OS.
* MaximumPerformanceRegister: (write to platform)
Optional. This is the max allowable performance as requested by the OS.
* PerformanceReductionToleranceRegister: (write to platform)
Optional. This is the deviation below the desired perf value as requested by the OS. If the Time Window register (below) is supported, then this value is the min performance on average over the time window that the OS desires.
* TimeWindowRegister: (write to platform)
Optional. The OS requests desired performance over this time window.
* CounterWraparoundTime: (read from platform)
Optional. Min time before the performance counters wrap around.
* ReferencePerformanceCounterRegister: (read from platform)
A counter that increments proportionally to the reference performance of the processor.
* DeliveredPerformanceCounterRegister: (read from platform)
Delivered perf = reference perf * delta(delivered perf ctr)/delta(ref perf ctr)
* PerformanceLimitedRegister: (read from platform)
This is set by the platform in the event that it has to limit available performance due to thermal or budgeting constraints.
* CPPCEnableRegister: (read/write from platform)
Enable/disable CPPC
* AutonomousSelectionEnable:
Platform decides CPU performance level w/o OS assist.
* AutonomousActivityWindowRegister:
This influences the increase or decrease in CPU performance of the platform's autonomous selection policy.
* EnergyPerformancePreferenceRegister:
Provides an energy or perf bias hint to the platform when in autonomous mode.
* Reference Performance: (read from platform)
Indicates the rate at which the reference counter increments.
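To make the two performance counters concrete, here is a hedged sketch of the delivered-performance computation described above. The helper name and the unsigned-wraparound handling are my own, not from the spec:

```c
#include <stdint.h>

/*
 * Hypothetical helper: delivered perf = reference perf *
 * delta(delivered perf ctr) / delta(reference perf ctr).
 * Unsigned subtraction tolerates a single counter wraparound
 * between the two snapshots.
 */
static uint64_t cppc_delivered_perf(uint64_t ref_perf,
				    uint64_t ref_prev, uint64_t ref_now,
				    uint64_t del_prev, uint64_t del_now)
{
	uint64_t d_ref = ref_now - ref_prev;
	uint64_t d_del = del_now - del_prev;

	if (!d_ref)		/* no reference ticks elapsed */
		return ref_perf;
	return ref_perf * d_del / d_ref;
}
```

E.g. if the delivered counter advanced 1.5x faster than the reference counter, the CPU ran at 1.5x the reference performance level over that interval.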
What's missing in CPPC:
=====================
Currently, CPPC makes no mention of power. However, this could be added in future versions of the spec. For example, although CPPC works off of a continuous range of CPU perf levels, we could discretize the scale such that we only extract points where the power level changes substantially between CPU perf levels, and export this information to the scheduler.
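Purely as an illustration of that discretization idea (CPPC defines no per-level power values today, so the power table, threshold, and function below are all hypothetical):

```c
#include <stddef.h>

/*
 * Illustrative only: given a (hypothetical) power value per perf level,
 * keep just the levels where power steps by more than 'step' over the
 * previously kept level. Returns the number of levels kept, writing
 * their indices into 'out'.
 */
static size_t discretize_perf_levels(const unsigned int *power, size_t n,
				     unsigned int step, size_t *out)
{
	size_t i, kept = 0;

	if (!n)
		return 0;
	out[kept++] = 0;		/* always keep the lowest level */
	for (i = 1; i < n; i++) {
		if (power[i] - power[out[kept - 1]] > step)
			out[kept++] = i;
	}
	return kept;
}
```

The scheduler would then only need to reason about the handful of perf levels where power actually changes substantially.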
What's missing in the kernel:
============================
We may have some of this information in the scheduler, but I couldn't see a good way to extract it for CPPC yet.
(1) An intelligent way to provide a min/max bound and a desired value for CPU performance.
(2) A timing window for the platform to deliver requested performance within bounds. This could be a kind of sampling interval between consecutive reads of delivered cpu performance.
(3) Centralized decision making by any CPU in a freq domain for all its siblings.
The last point needs some elaboration:
I see that the CPUfreq layer allows defining "related CPUs" and that we can have the same policy for CPUs in the same freq domain and one governor per policy. However, from what I could tell, there are at least 2 baked-in assumptions in this layer which break things, at least for platforms like ARM (please correct me if I'm wrong!):
(a) All CPUs run at the exact same max, min and cur freq.
(b) Any CPU always gets exactly the freq it asked for.
So, although the CPUFreq layer is capable of making somewhat centralized cpufreq decisions for CPUs under the same policy, it seems to be deciding things under wrong/inapplicable assumptions. Moreover, only one CPU is in charge of policy handling at a time, and policy handling shifts to another CPU in the domain only if the former CPU is hotplugged out.
Not having a proper centralized decision maker adversely affects power saving possibilities in platforms that can't distinguish when a CPU requests a specific freq and then goes to sleep. This potentially has the effect of keeping other CPUs in the domain running at a much higher frequency than required, while the initial requester is deep asleep.
So, for point (3), I'm not sure which path we should take among the following:
(I) Fix the cpufreq layer and add CPPC support as a cpufreq_driver.
    (a) Change every call to get freq to make it read h/w registers and then
        snap the value back to the freq table. This way, cpufreq can keep its
        idea of freq current. However, this may end up waking CPUs to read
        counters, unless they are mem mapped.
    (b) Allow any CPU in the "related_cpus" mask to make policy decisions on
        behalf of its siblings, so that policy-maker switching is not tied to
        hotplug.
(II) Not touch CPUfreq and use the PID algorithm instead, but change the busyness calculation to accumulate busyness values from all CPUs in common domain. Requires implementation of domain awareness.
(III) Address these issues in the upcoming CPUfreq/CPUidle integration layer(?)
(IV) Handle it in the platform or lose out. I understand this has some potential for adding latency to cpu freq requests so it may not be possible for all platforms.
(V) ..?
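If we went with something like option (II), the domain-wide busyness accumulation might look like the following sketch. This is my own illustration, not from the patchset, and using max() of the per-CPU busyness values is an assumption (sum or average are also conceivable policies):

```c
/*
 * Sketch: a shared freq domain must run fast enough for its busiest
 * CPU, so take the max of the per-CPU busyness percentages when
 * computing the desired performance for the whole domain. That way a
 * CPU that requested a high freq and then went to sleep stops driving
 * the domain once its busyness decays.
 */
static int domain_busyness(const int *cpu_busy_pct, int ncpus)
{
	int i, busiest = 0;

	for (i = 0; i < ncpus; i++)
		if (cpu_busy_pct[i] > busiest)
			busiest = cpu_busy_pct[i];
	return busiest;
}
```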
For points (1) and (2), the long term solution IMHO is to work them out along with the scheduler CPUFreq/CPUidle integration. But it's not clear to me what the best short term approach would be. I'd greatly appreciate any suggestions/comments. If anyone is already working on these issues, please CC me as well.
Test setup:
==========
For the sake of experiments, I used the Thinkpad X240 laptop, which advertises CPPC tables in its ACPI firmware. The PCC and CPPC drivers included in this patchset are able to parse the tables and get all the required addresses. However, it seems that this laptop doesn't implement the PCC doorbell and the firmware side of CPPC. The PCC doorbell calls would just wait forever. Not sure what's going on there. So, I had to hack it and emulate what the platform would've done to some extent.
I extracted the PID algo from intel_pstate.c and modified it with CPPC function wrappers. It shouldn't be hard to replace PID with anything else we think is suitable. In the long term, I hope we can make CPPC calls directly from the scheduler.
There are two versions of the low level CPPC accessors. The one included in the patchset is how I'd imagine it would work with platforms that completely implement CPPC in firmware.
The other version is here [5]. This should help with DT, platforms with broken firmware, enablement purposes, etc.
I ran a simple kernel compilation with intel_pstate.c and the CPPC modified version as the governors and saw no real difference in compile times. So no new overheads added. I verified that CPU freq requests were taken by reading out the PERF_STATUS register.
[1] - See the HWP section 14.4 http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32...
[2] - http://www.uefi.org/sites/default/files/resources/ACPI_5_1release.pdf
[3] - https://plus.google.com/+TheodoreTso/posts/2vEekAsG2QT
[4] - https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
[5] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31f...
Ashwin Chaugule (3):
  ACPI: Add support for Platform Communication Channel
  CPPC: Add support for Collaborative Processor Performance Control
  CPPC: Add ACPI accessors to CPC registers
 drivers/acpi/Kconfig        |  10 +
 drivers/acpi/Makefile       |   1 +
 drivers/acpi/pcc.c          | 301 +++++++++++++++
 drivers/cpufreq/Kconfig     |  19 +
 drivers/cpufreq/Makefile    |   2 +
 drivers/cpufreq/cppc.c      | 874 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cppc.h      | 181 +++++++++
 drivers/cpufreq/cppc_acpi.c |  80 ++++
 8 files changed, 1468 insertions(+)
 create mode 100644 drivers/acpi/pcc.c
 create mode 100644 drivers/cpufreq/cppc.c
 create mode 100644 drivers/cpufreq/cppc.h
 create mode 100644 drivers/cpufreq/cppc_acpi.c
The ACPI 5.0+ spec defines a generic mode of communication between the OS and a platform such as the BMC or external power controller. This medium (PCC) is typically used by CPPC (ACPI CPU Performance management), RAS (ACPI reliability protocol) and MPST (ACPI Memory power states).
This patch adds initial support for PCC to be usable by the aforementioned PCC clients.
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
---
 drivers/acpi/Kconfig  |  10 +++
 drivers/acpi/Makefile |   1 +
 drivers/acpi/pcc.c    | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 203 insertions(+)
 create mode 100644 drivers/acpi/pcc.c
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index a34a228..16d7c9a 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -364,6 +364,16 @@ config ACPI_REDUCED_HARDWARE_ONLY
 	  If you are unsure what to do, do not enable this option.
 
+config ACPI_PCC
+	bool "ACPI Platform Communication Channel"
+	def_bool n
+	depends on ACPI
+	help
+	  Enable this option if your platform supports PCC as defined in the
+	  ACPI spec 5.0a+. PCC is a generic mechanism for the OS to communicate
+	  with a platform such as a BMC. PCC is typically used by CPPC, RAS
+	  and MPST.
+
 source "drivers/acpi/apei/Kconfig"
 
 config ACPI_EXTLOG
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index ea55e01..35015fa 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -74,6 +74,7 @@ obj-$(CONFIG_ACPI_HED)		+= hed.o
 obj-$(CONFIG_ACPI_EC_DEBUGFS)	+= ec_sys.o
 obj-$(CONFIG_ACPI_CUSTOM_METHOD)+= custom_method.o
 obj-$(CONFIG_ACPI_BGRT)		+= bgrt.o
+obj-$(CONFIG_ACPI_PCC)		+= pcc.o
 # processor has its own "processor." module_param namespace
 processor-y			:= processor_driver.o processor_throttling.o
diff --git a/drivers/acpi/pcc.c b/drivers/acpi/pcc.c
new file mode 100644
index 0000000..105e11a
--- /dev/null
+++ b/drivers/acpi/pcc.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Ashwin Chaugule <ashwin.chaugule@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/acpi.h>
+#include <linux/io.h>
+#include <linux/uaccess.h>
+#include <linux/init.h>
+#include <linux/cpufreq.h>
+#include <linux/delay.h>
+#include <linux/ioctl.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+
+#include <acpi/actbl.h>
+
+#define MAX_PCC_SUBSPACES	256
+#define PCCS_SS_SIG_MAGIC	0x50434300
+#define PCC_CMD_COMPLETE	0x1
+#define PCC_VERSION		"0.1"
+
+struct pcc_ss_desc {
+	struct acpi_pcct_subspace *pcc_ss_ptr;
+	raw_spinlock_t lock;
+};
+
+/* Array of pointers to Type 0 Generic Communication Subspace Structures */
+struct pcc_ss_desc pcc_ss_arr[MAX_PCC_SUBSPACES];
+
+/* Total number of subspaces detected in PCCT. */
+static int total_ss;
+
+/*
+ * PCC clients call this function to get a base address of their
+ * Communication channel
+ */
+int get_pcc_comm_channel(u32 ss_idx, u64 __iomem *addr, int *len)
+{
+	struct acpi_pcct_subspace *pcct_subspace = pcc_ss_arr[ss_idx].pcc_ss_ptr;
+
+	if (pcct_subspace) {
+		*addr = pcct_subspace->base_address;
+		*len = pcct_subspace->length;
+	} else
+		return -EINVAL;
+
+	pr_debug("base addr: %llx\n", pcct_subspace->base_address);
+
+	return 0;
+}
+
+/* Send PCC cmd on behalf of this (subspace id) PCC client */
+u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr)
+{
+	struct acpi_pcct_subspace *pcct_subspace = pcc_ss_arr[ss_idx].pcc_ss_ptr;
+	struct acpi_pcct_shared_memory *generic_comm_base =
+		(struct acpi_pcct_shared_memory *)base_addr;
+	struct acpi_generic_address doorbell;
+	u64 doorbell_preserve;
+	u64 doorbell_val;
+	u64 doorbell_write;
+
+	/*
+	 * Min time in usec that OSPM is expected to wait
+	 * before sending the next PCC cmd.
+	 */
+	u16 cmd_delay = pcct_subspace->min_turnaround_time;
+
+	pr_debug("cmd: %d, ss_idx: %d, addr: %llx\n", cmd, ss_idx, (u64) base_addr);
+	if (!generic_comm_base) {
+		pr_err("No Generic Communication Channel provided.\n");
+		return -EINVAL;
+	}
+
+	raw_spin_lock(&pcc_ss_arr[ss_idx].lock);
+
+	/* Get doorbell details for this subspace. */
+	doorbell = pcct_subspace->doorbell_register;
+	doorbell_preserve = pcct_subspace->preserve_mask;
+	doorbell_write = pcct_subspace->write_mask;
+
+	/* Write to the shared comm region. */
+	iowrite16(cmd, &generic_comm_base->command);
+
+	/* Write Subspace MAGIC value so platform can identify destination. */
+	iowrite32((PCCS_SS_SIG_MAGIC | ss_idx), &generic_comm_base->signature);
+
+	/* Flip CMD COMPLETE bit */
+	iowrite16(0, &generic_comm_base->status);
+
+	/* Sync notification from OSPM to Platform. */
+	acpi_read(&doorbell_val, &doorbell);
+	acpi_write((doorbell_val & doorbell_preserve) | doorbell_write,
+			&doorbell);
+
+	/* Wait for Platform to consume. */
+	while (!(ioread16(&generic_comm_base->status) & PCC_CMD_COMPLETE))
+		udelay(cmd_delay);
+
+	raw_spin_unlock(&pcc_ss_arr[ss_idx].lock);
+
+	return generic_comm_base->status;
+}
+
+static int parse_pcc_subspace(struct acpi_subtable_header *header,
+		const unsigned long end)
+{
+	struct acpi_pcct_subspace *pcct_ss;
+
+	if (total_ss <= MAX_PCC_SUBSPACES) {
+		pcct_ss = (struct acpi_pcct_subspace *) header;
+
+		if (pcct_ss->header.type != ACPI_PCCT_TYPE_GENERIC_SUBSPACE) {
+			pr_err("Incorrect PCC Subspace type detected\n");
+			return -EINVAL;
+		}
+
+		pcc_ss_arr[total_ss].pcc_ss_ptr = pcct_ss;
+		pr_debug("(%s)PCCT base addr: %llx", __func__, pcct_ss->base_address);
+		raw_spin_lock_init(&pcc_ss_arr[total_ss].lock);
+
+		total_ss++;
+	} else {
+		pr_err("No more space for PCC subspaces.\n");
+		return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static int __init pcc_probe(void)
+{
+	acpi_status status = AE_OK;
+	acpi_size pcct_tbl_header_size;
+	struct acpi_table_pcct *pcct_tbl;
+
+	/* Search for PCCT */
+	status = acpi_get_table_with_size(ACPI_SIG_PCCT, 0,
+			(struct acpi_table_header **)&pcct_tbl,
+			&pcct_tbl_header_size);
+
+	if (ACPI_SUCCESS(status) && !pcct_tbl) {
+		pr_warn("PCCT header not found.\n");
+		status = AE_NOT_FOUND;
+		goto out_err;
+	}
+
+	status = acpi_table_parse_entries(ACPI_SIG_PCCT,
+			sizeof(struct acpi_table_pcct), ACPI_PCCT_TYPE_GENERIC_SUBSPACE,
+			parse_pcc_subspace, MAX_PCC_SUBSPACES);
+
+	if (ACPI_SUCCESS(status))
+		pr_err("Error parsing PCC subspaces from PCCT\n");
+
+	pr_info("Detected %d PCC Subspaces\n", total_ss);
+
+out_err:
+	return (ACPI_SUCCESS(status) ? 1 : 0);
+}
+
+static int __init pcc_init(void)
+{
+	int ret;
+
+	if (acpi_disabled)
+		return -ENODEV;
+
+	/* Check if PCC support is available. */
+	ret = pcc_probe();
+
+	if (ret) {
+		pr_debug("PCC probe failed.\n");
+		return -EINVAL;
+	}
+
+	return ret;
+}
+device_initcall(pcc_init);
+ Rafael [corrected email addr]
On 14 August 2014 15:57, Ashwin Chaugule ashwin.chaugule@linaro.org wrote:
Add support for parsing the CPC tables as described in the ACPI 5.1+ CPPC specification. Once the tables are successfully parsed and the low level register accessors are in place, enable the PID (proportional-integral-derivative) controller based algorithm to manage CPU performance.
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
---
 drivers/acpi/pcc.c       | 109 ++++++
 drivers/cpufreq/Kconfig  |  10 +
 drivers/cpufreq/Makefile |   1 +
 drivers/cpufreq/cppc.c   | 874 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cppc.h   | 181 ++++++++++
 5 files changed, 1175 insertions(+)
 create mode 100644 drivers/cpufreq/cppc.c
 create mode 100644 drivers/cpufreq/cppc.h
diff --git a/drivers/acpi/pcc.c b/drivers/acpi/pcc.c
index 105e11a..7743f12 100644
--- a/drivers/acpi/pcc.c
+++ b/drivers/acpi/pcc.c
@@ -31,6 +31,12 @@
 #define PCC_CMD_COMPLETE	0x1
 #define PCC_VERSION		"0.1"
 
+#define PCC_HACK 1
+
+#ifdef PCC_HACK
+static void *pcc_comm_addr;
+#endif
+
 struct pcc_ss_desc {
 	struct acpi_pcct_subspace *pcc_ss_ptr;
 	raw_spinlock_t lock;
@@ -51,8 +57,13 @@ int get_pcc_comm_channel(u32 ss_idx, u64 __iomem *addr, int *len)
 	struct acpi_pcct_subspace *pcct_subspace = pcc_ss_arr[ss_idx].pcc_ss_ptr;
 
 	if (pcct_subspace) {
+#ifndef PCC_HACK
 		*addr = pcct_subspace->base_address;
 		*len = pcct_subspace->length;
+#else
+		*addr = (u64 *)pcc_comm_addr;
+		*len = PAGE_SIZE;
+#endif
 	} else
 		return -EINVAL;
 
@@ -61,6 +72,7 @@ int get_pcc_comm_channel(u32 ss_idx, u64 __iomem *addr, int *len)
 	return 0;
 }
 
+#ifndef PCC_HACK
 /* Send PCC cmd on behalf of this (subspace id) PCC client */
 u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr)
 {
@@ -114,6 +126,93 @@ u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr)
 	return generic_comm_base->status;
 }
 
+#else
+
+#include <asm/msr.h>
+
+/* These offsets are from the SSDT9.asl table on the Thinkpad X240 */
+
+/* These are offsets per CPU from which its CPC table begins. */
+int cpu_base[] = {0, 0x64, 0xC8, 0x12C, 0x190, 0x1F4, 0x258, 0x2BC};
+
+/* These are offsets of the registers in each CPC table. */
+#define HIGHEST_PERF_OFFSET	0x0
+#define LOWEST_PERF_OFFSET	0xc
+#define DESIRED_PERF_OFFSET	0x14
+
+static int core_get_min(void)
+{
+	u64 val;
+	rdmsrl(MSR_PLATFORM_INFO, val);
+	return (val >> 40) & 0xff;
+}
+
+static int core_get_max(void)
+{
+	u64 val;
+	rdmsrl(MSR_PLATFORM_INFO, val);
+	return (val >> 8) & 0xff;
+}
+
+static int core_get_turbo(void)
+{
+	u64 value;
+	int nont, ret;
+
+	rdmsrl(MSR_NHM_TURBO_RATIO_LIMIT, value);
+	nont = core_get_max();
+	ret = ((value) & 255);
+	if (ret <= nont)
+		ret = nont;
+	return ret;
+}
+
+u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr)
+{
+	unsigned int cpu;
+	u64 desired_val;
+
+	raw_spin_lock(&pcc_ss_arr[ss_idx].lock);
+	/*XXX: Instead of waiting for platform to consume the cmd,
+	 * just do what the platform would've done.
+	 */
+	switch (cmd) {
+	case 0: //PCC_CMD_READ
+
+		/* XXX: Normally the Platform would need to update all the other CPPC registers as well.
+		 * But for this experiment, since we're not really using all of them, we'll only update
+		 * what we use.
+		 */
+		for_each_possible_cpu(cpu) {
+			*(char *)(pcc_comm_addr + cpu_base[cpu] + HIGHEST_PERF_OFFSET) = core_get_turbo();
+			*(char *)(pcc_comm_addr + cpu_base[cpu] + LOWEST_PERF_OFFSET) = core_get_min();
+		}
+		break;
+	case 1: //PCC_CMD_WRITE
+
+		/* XXX: All this hackery is very X86 Thinkpad X240 specific.
+		 * Normally, the cpc_write64() would have all the info on
+		 * how, where and what to write.
+		 */
+		for_each_possible_cpu(cpu) {
+			desired_val = *(u64 *)(pcc_comm_addr + cpu_base[cpu] + DESIRED_PERF_OFFSET);
+
+			if (desired_val) {
+				wrmsrl_on_cpu(cpu, MSR_IA32_PERF_CTL, desired_val << 8);
+				*(u64 *)(pcc_comm_addr + cpu_base[cpu] + DESIRED_PERF_OFFSET) = 0;
+			}
+		}
+		break;
+	default:
+		pr_err("Unknown PCC cmd from the OS\n");
+		return 0;
+	}
+
+	raw_spin_unlock(&pcc_ss_arr[ss_idx].lock);
+	return 1;
+}
+#endif
+
 static int parse_pcc_subspace(struct acpi_subtable_header *header,
 		const unsigned long end)
 {
@@ -185,6 +284,16 @@ static int __init pcc_init(void)
 		return -EINVAL;
 	}
 
+#ifdef PCC_HACK
+	pcc_comm_addr = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	if (!pcc_comm_addr) {
+		pr_err("Could not allocate mem for pcc hack\n");
+		return -ENOMEM;
+	}
+
+#endif
+
 	return ret;
 }
 device_initcall(pcc_init);
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index ffe350f..d8e8335 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -196,6 +196,16 @@ config GENERIC_CPUFREQ_CPU0
If in doubt, say N.
+config CPPC_CPUFREQ
+	bool "CPPC CPUFreq driver"
+	depends on ACPI && ACPI_PCC
+	default n
+	help
+	  CPPC is Collaborative Processor Performance Control. It allows the OS
+	  to request CPU performance with an abstract metric and lets the platform
+	  (e.g. BMC) interpret and optimize it for power and performance in a
+	  platform specific manner.
+
 menu "x86 CPU frequency scaling drivers"
 depends on X86
 source "drivers/cpufreq/Kconfig.x86"
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index db6d9a2..b392c8c 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 
 obj-$(CONFIG_GENERIC_CPUFREQ_CPU0)	+= cpufreq-cpu0.o
+obj-$(CONFIG_CPPC_CPUFREQ)		+= cppc.o
 ##################################################################################
 # x86 drivers.
diff --git a/drivers/cpufreq/cppc.c b/drivers/cpufreq/cppc.c
new file mode 100644
index 0000000..6917ce0
--- /dev/null
+++ b/drivers/cpufreq/cppc.c
@@ -0,0 +1,874 @@
+/*
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Ashwin Chaugule <ashwin.chaugule@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * PID algo bits are from intel_pstate.c and modified to use CPPC
+ * accessors.
+ *
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/acpi.h>
+#include <linux/errno.h>
+
+#include <acpi/processor.h>
+#include <acpi/actypes.h>
+
+#include <trace/events/power.h>
+
+#include <asm/div64.h>
+#include <asm/msr.h>
+
+#include "cppc.h"
+
+#define FRAC_BITS 8
+#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
+#define fp_toint(X) ((X) >> FRAC_BITS)
+
+#define CPPC_EN 1
+#define PCC_CMD_COMPLETE 1
+
+/* There is one CPC descriptor per CPU */
+static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
+
+/* PCC client specifics for the CPPC structure */
+/* Returned by the PCCT Subspace structure */
+static u64 pcc_comm_base_addr;
+
+/* ioremap the pcc_comm_base_addr */
+static void __iomem *comm_base_addr;
+
+/* The PCC subspace used by the CPC table */
+static s8 pcc_subspace_idx = -1;
+
+extern int get_pcc_comm_channel(u32 ss_idx, u64 *addr, int *len);
+extern u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 * __iomem base_addr);
+
+/*
+ * The low level platform specific accessors
+ * to the registers defined in the CPC table
+ */
+struct cpc_funcs *cppc_func_ops;
+
+static struct cpudata **all_cpu_data;
+static struct pstate_adjust_policy pid_params;
+
+/* PCC Commands used by CPPC */
+enum cppc_ppc_cmds {
+	PCC_CMD_READ,
+	PCC_CMD_WRITE,
+	RESERVED,
+};
+
+static struct perf_limits limits = {
+	.max_perf_pct = 100,
+	.max_perf = int_tofp(1),
+	.min_perf_pct = 0,
+	.min_perf = 0,
+	.max_policy_pct = 100,
+	.max_sysfs_pct = 100,
+};
+
+u64 cpc_read64(struct cpc_register_resource *reg, void __iomem *base_addr)
+{
+	u64 err = 0;
+	u64 val;
+
+	switch (reg->space_id) {
+	case ACPI_ADR_SPACE_PLATFORM_COMM:
+		err = readq((void *) (reg->address + *(u64 *)base_addr));
+		break;
+	case ACPI_ADR_SPACE_FIXED_HARDWARE:
+		rdmsrl(reg->address, val);
+		return val;
+		break;
+	default:
+		pr_err("unknown space_id detected in cpc reg: %d\n", reg->space_id);
+		break;
+	}
+
+	return err;
+}
+
+int cpc_write64(u64 val, struct cpc_register_resource *reg, void __iomem *base_addr)
+{
+	unsigned int err = 0;
+
+	switch (reg->space_id) {
+	case ACPI_ADR_SPACE_PLATFORM_COMM:
+		writeq(val, (void *)(reg->address + *(u64 *)base_addr));
+		break;
+	case ACPI_ADR_SPACE_FIXED_HARDWARE:
+		wrmsrl(reg->address, val);
+		break;
+	default:
+		pr_err("unknown space_id detected in cpc reg: %d\n", reg->space_id);
+		break;
+	}
+
+	return err;
+}
+
+static inline int32_t mul_fp(int32_t x, int32_t y)
+{
+	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
+}
+
+static inline int32_t div_fp(int32_t x, int32_t y)
+{
+	return div_s64((int64_t)x << FRAC_BITS, (int64_t)y);
+}
+
+static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
+			int deadband, int integral)
+{
+	pid->setpoint = setpoint;
+	pid->deadband = deadband;
+	pid->integral = int_tofp(integral);
+	pid->last_err = int_tofp(setpoint) - int_tofp(busy);
+}
+
+static inline void pid_p_gain_set(struct _pid *pid, int percent)
+{
+	pid->p_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_i_gain_set(struct _pid *pid, int percent)
+{
+	pid->i_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_d_gain_set(struct _pid *pid, int percent)
+{
+	pid->d_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static signed int pid_calc(struct _pid *pid, int32_t busy)
+{
+	signed int result;
+	int32_t pterm, dterm, fp_error;
+	int32_t integral_limit;
+
+	fp_error = int_tofp(pid->setpoint) - busy;
+
+	if (abs(fp_error) <= int_tofp(pid->deadband))
+		return 0;
+
+	pterm = mul_fp(pid->p_gain, fp_error);
+
+	pid->integral += fp_error;
+
+	/* limit the integral term */
+	integral_limit = int_tofp(30);
+	if (pid->integral > integral_limit)
+		pid->integral = integral_limit;
+	if (pid->integral < -integral_limit)
+		pid->integral = -integral_limit;
+
+	dterm = mul_fp(pid->d_gain, fp_error - pid->last_err);
+	pid->last_err = fp_error;
+
+	result = pterm + mul_fp(pid->integral, pid->i_gain) + dterm;
+	result = result + (1 << (FRAC_BITS-1));
+	return (signed int)fp_toint(result);
+}
+
+static inline void pstate_busy_pid_reset(struct cpudata *cpu)
+{
+	pid_p_gain_set(&cpu->pid, pid_params.p_gain_pct);
+	pid_d_gain_set(&cpu->pid, pid_params.d_gain_pct);
+	pid_i_gain_set(&cpu->pid, pid_params.i_gain_pct);
+
+	pid_reset(&cpu->pid,
+		pid_params.setpoint,
+		100,
+		pid_params.deadband,
+		0);
+}
+
+static inline void pstate_reset_all_pid(void)
+{
+	unsigned int cpu;
+	for_each_online_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			pstate_busy_pid_reset(all_cpu_data[cpu]);
+	}
+}
+
+/************************** debugfs begin ************************/
+static int pid_param_set(void *data, u64 val)
+{
+	*(u32 *)data = val;
pstate_reset_all_pid(); + return 0; +} + +static int pid_param_get(void *data, u64 *val) +{ + *val = *(u32 *)data; + return 0; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_pid_param, pid_param_get, + pid_param_set, "%llu\n"); + +struct pid_param { + char *name; + void *value; +}; + +static struct pid_param pid_files[] = { + {"sample_rate_ms", &pid_params.sample_rate_ms}, + {"d_gain_pct", &pid_params.d_gain_pct}, + {"i_gain_pct", &pid_params.i_gain_pct}, + {"deadband", &pid_params.deadband}, + {"setpoint", &pid_params.setpoint}, + {"p_gain_pct", &pid_params.p_gain_pct}, + {NULL, NULL} +}; + +static struct dentry *debugfs_parent; +static void cppc_pstate_debug_expose_params(void) +{ + int i = 0; + + debugfs_parent = debugfs_create_dir("pstate_snb", NULL); + if (IS_ERR_OR_NULL(debugfs_parent)) + return; + while (pid_files[i].name) { + debugfs_create_file(pid_files[i].name, 0660, + debugfs_parent, pid_files[i].value, + &fops_pid_param); + i++; + } +} + +/************************** debugfs end ************************/ + +/************************** sysfs begin ************************/ +#define show_one(file_name, object) \ + static ssize_t show_##file_name \ + (struct kobject *kobj, struct attribute *attr, char *buf) \ + { \ + return sprintf(buf, "%u\n", limits.object); \ + } + +static ssize_t store_max_perf_pct(struct kobject *a, struct attribute *b, + const char *buf, size_t count) +{ + unsigned int input; + int ret; + ret = sscanf(buf, "%u", &input); + if (ret != 1) + return -EINVAL; + + limits.max_sysfs_pct = clamp_t(int, input, 0 , 100); + limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct); + limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100)); + return count; +} + +static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b, + const char *buf, size_t count) +{ + unsigned int input; + int ret; + ret = sscanf(buf, "%u", &input); + if (ret != 1) + return -EINVAL; + limits.min_perf_pct = clamp_t(int, input, 0 , 100); + 
limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100)); + + return count; +} + +show_one(max_perf_pct, max_perf_pct); +show_one(min_perf_pct, min_perf_pct); + +define_one_global_rw(max_perf_pct); +define_one_global_rw(min_perf_pct); + +static struct attribute *cppc_pstate_attributes[] = { + &max_perf_pct.attr, + &min_perf_pct.attr, + NULL +}; + +static struct attribute_group cppc_pstate_attr_group = { + .attrs = cppc_pstate_attributes, +}; +static struct kobject *cppc_pstate_kobject; + +static void cppc_pstate_sysfs_expose_params(void) +{ + int rc; + + cppc_pstate_kobject = kobject_create_and_add("cppc_pstate", + &cpu_subsys.dev_root->kobj); + BUG_ON(!cppc_pstate_kobject); + rc = sysfs_create_group(cppc_pstate_kobject, + &cppc_pstate_attr_group); + BUG_ON(rc); +} + +/************************** sysfs end ************************/ + +static inline void pstate_calc_busy(struct cpudata *cpu) +{ + struct sample *sample = &cpu->sample; + int64_t core_pct; + int32_t rem; + + core_pct = int_tofp(sample->delivered) * int_tofp(100); + core_pct = div_u64_rem(core_pct, int_tofp(sample->reference), &rem); + + if ((rem << 1) >= int_tofp(sample->reference)) + core_pct += 1; + + sample->freq = fp_toint( + mul_fp(int_tofp(cpu->pstate.max_pstate * 1000), core_pct)); + + sample->core_pct_busy = (int32_t)core_pct; +} + +static inline void pstate_sample(struct cpudata *cpu) +{ + u64 delivered, reference; + unsigned int status; + /* + * If this platform has a PCCT, then + * send a command to the platform to update + * all PCC registers. 
+ */ + if (comm_base_addr) { + pr_debug("Sending PCC READ to update COMM space\n"); + status = send_pcc_cmd(PCC_CMD_READ, 0, pcc_subspace_idx, + comm_base_addr); + + if (!(status & PCC_CMD_COMPLETE)) { + pr_err("Err updating PCC comm space\n"); + return; + } + } + + reference = cppc_func_ops->get_ref_perf_ctr(cpu); + delivered = cppc_func_ops->get_delivered_ctr(cpu); + + delivered = delivered >> FRAC_BITS; + reference = reference >> FRAC_BITS; + + cpu->last_sample_time = cpu->sample.time; + cpu->sample.time = ktime_get(); + cpu->sample.delivered = delivered; + cpu->sample.reference = reference; + cpu->sample.delivered -= cpu->prev_delivered; + cpu->sample.reference -= cpu->prev_reference; + + pstate_calc_busy(cpu); + + cpu->prev_delivered = delivered; + cpu->prev_reference = reference; +} + +static inline int32_t pstate_get_scaled_busy(struct cpudata *cpu) +{ + int32_t core_busy, max_pstate, current_pstate, sample_ratio; + u32 duration_us; + u32 sample_time; + + core_busy = cpu->sample.core_pct_busy; + max_pstate = int_tofp(cpu->pstate.max_pstate); + current_pstate = int_tofp(cpu->pstate.current_pstate); + core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate)); + + sample_time = (pid_params.sample_rate_ms * USEC_PER_MSEC); + duration_us = (u32) ktime_us_delta(cpu->sample.time, + cpu->last_sample_time); + if (duration_us > sample_time * 3) { + sample_ratio = div_fp(int_tofp(sample_time), + int_tofp(duration_us)); + core_busy = mul_fp(core_busy, sample_ratio); + } + + return core_busy; +} + +static inline void pstate_set_sample_time(struct cpudata *cpu) +{ + int sample_time, delay; + + sample_time = pid_params.sample_rate_ms; + delay = msecs_to_jiffies(sample_time); + mod_timer_pinned(&cpu->timer, jiffies + delay); +} + +static void pstate_get_min_max(struct cpudata *cpu, int *min, int *max) +{ + int max_perf = cpu->pstate.max_pstate; + int max_perf_adj; + int min_perf; + + max_perf_adj = fp_toint(mul_fp(int_tofp(max_perf), limits.max_perf)); + *max = 
clamp_t(int, max_perf_adj, + cpu->pstate.min_pstate, cpu->pstate.max_pstate); + + min_perf = fp_toint(mul_fp(int_tofp(max_perf), limits.min_perf)); + *min = clamp_t(int, min_perf, + cpu->pstate.min_pstate, max_perf); +} + +static void set_pstate(struct cpudata *cpu, int pstate) +{ + int max_perf, min_perf; + unsigned int status; + + pstate_get_min_max(cpu, &min_perf, &max_perf); + + pstate = clamp_t(int, pstate, min_perf, max_perf); + + if (pstate == cpu->pstate.current_pstate) + return; + + trace_cpu_frequency(pstate * 100000, cpu->cpu); + + cpu->pstate.current_pstate = pstate; + + cppc_func_ops->set_desired_perf(cpu, pstate); + + /* + * Send a Write command to tell the platform that + * there is new data in the PCC registers. + */ + if (comm_base_addr) { + pr_debug("Sending PCC WRITE to update COMM space\n"); + status = send_pcc_cmd(PCC_CMD_WRITE, 0, pcc_subspace_idx, + comm_base_addr); + + if (!(status & PCC_CMD_COMPLETE)) { + pr_err("Err updating PCC comm space\n"); + return; + } + } +} + +static inline void pstate_pstate_increase(struct cpudata *cpu, int steps) +{ + int target; + target = cpu->pstate.current_pstate + steps; + + set_pstate(cpu, target); +} + +static inline void pstate_pstate_decrease(struct cpudata *cpu, int steps) +{ + int target; + target = cpu->pstate.current_pstate - steps; + set_pstate(cpu, target); +} + +static inline void pstate_adjust_busy_pstate(struct cpudata *cpu) +{ + int32_t busy_scaled; + struct _pid *pid; + signed int ctl = 0; + int steps; + + pid = &cpu->pid; + busy_scaled = pstate_get_scaled_busy(cpu); + + ctl = pid_calc(pid, busy_scaled); + + steps = abs(ctl); + + if (ctl < 0) + pstate_pstate_increase(cpu, steps); + else + pstate_pstate_decrease(cpu, steps); +} + +static void pstate_timer_func(unsigned long __data) +{ + struct cpudata *cpu = (struct cpudata *) __data; + struct sample *sample; + + pstate_sample(cpu); + + sample = &cpu->sample; + + pstate_adjust_busy_pstate(cpu); + + 
trace_pstate_sample(fp_toint(sample->core_pct_busy), + fp_toint(pstate_get_scaled_busy(cpu)), + cpu->pstate.current_pstate, + sample->reference, + sample->delivered, + sample->freq); + + pstate_set_sample_time(cpu); +} + +static int cppc_cpufreq_init(struct cpufreq_policy *policy) +{ + struct cpudata *cpu; + unsigned int cpunum = policy->cpu; + unsigned int status; + struct cpc_desc *current_cpu_cpc = per_cpu(cpc_desc_ptr, cpunum); + + all_cpu_data[cpunum] = kzalloc(sizeof(struct cpudata), GFP_KERNEL); + if (!all_cpu_data[cpunum]) + return -ENOMEM; + + cpu = all_cpu_data[cpunum]; + + cpu->cpu = cpunum; + + if (!cppc_func_ops) { + pr_err("CPPC is not supported on this platform\n"); + return -ENOTSUPP; + } + + if (!current_cpu_cpc) { + pr_err("Undefined CPC descriptor for CPU:%d\n", cpunum); + return -ENODEV; + } + + /* + * If this platform has a PCCT, then + * send a command to the platform to update + * all PCC registers. + */ + if (comm_base_addr) { + pr_debug("Sending PCC READ to update COMM space\n"); + status = send_pcc_cmd(PCC_CMD_READ, 0, pcc_subspace_idx, + comm_base_addr); + + if (!(status & PCC_CMD_COMPLETE)) { + pr_err("Err updating PCC comm space\n"); + return -EIO; + } + } + + cpu->cpc_desc = current_cpu_cpc; + cpu->pcc_comm_address = comm_base_addr; + cpu->pstate.min_pstate = cppc_func_ops->get_lowest_perf(cpu); + cpu->pstate.max_pstate = cppc_func_ops->get_highest_perf(cpu); + /* PCC reads/writes are made to offsets from this base address.*/ + + set_pstate(cpu, cpu->pstate.min_pstate); + + init_timer_deferrable(&cpu->timer); + cpu->timer.function = pstate_timer_func; + cpu->timer.data = + (unsigned long)cpu; + cpu->timer.expires = jiffies + HZ/100; + pstate_busy_pid_reset(cpu); + pstate_sample(cpu); + + add_timer_on(&cpu->timer, cpunum); + + pr_info("CPPC PID pstate controlling: cpu %d\n", cpunum); + + if (limits.min_perf_pct == 100 && limits.max_perf_pct == 100) + policy->policy = CPUFREQ_POLICY_PERFORMANCE; + else + policy->policy = 
CPUFREQ_POLICY_POWERSAVE; + + policy->min = cpu->pstate.min_pstate * 100000; + policy->max = cpu->pstate.max_pstate * 100000; + + /* cpuinfo and default policy values */ + policy->cpuinfo.min_freq = cpu->pstate.min_pstate * 100000; + policy->cpuinfo.max_freq = cpu->pstate.max_pstate * 100000; + policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL; + cpumask_set_cpu(policy->cpu, policy->cpus); + + return 0; +} + +static void cppc_stop_cpu(struct cpufreq_policy *policy) +{ + int cpu_num = policy->cpu; + struct cpudata *cpu = all_cpu_data[cpu_num]; + + pr_info("CPPC PID controller CPU %d exiting\n", cpu_num); + + del_timer_sync(&all_cpu_data[cpu_num]->timer); + set_pstate(cpu, cpu->pstate.min_pstate); + kfree(all_cpu_data[cpu_num]); + all_cpu_data[cpu_num] = NULL; + kfree(cpu->cpc_desc); +} + +static int cppc_verify_policy(struct cpufreq_policy *policy) +{ + cpufreq_verify_within_cpu_limits(policy); + + if ((policy->policy != CPUFREQ_POLICY_POWERSAVE) && + (policy->policy != CPUFREQ_POLICY_PERFORMANCE)) + return -EINVAL; + + return 0; +} + +static int cppc_set_policy(struct cpufreq_policy *policy) +{ + struct cpudata *cpu; + + cpu = all_cpu_data[policy->cpu]; + + if (!policy->cpuinfo.max_freq) + return -ENODEV; + + if (policy->policy == CPUFREQ_POLICY_PERFORMANCE) { + limits.min_perf_pct = 100; + limits.min_perf = int_tofp(1); + limits.max_perf_pct = 100; + limits.max_perf = int_tofp(1); + return 0; + } + limits.min_perf_pct = (policy->min * 100) / policy->cpuinfo.max_freq; + limits.min_perf_pct = clamp_t(int, limits.min_perf_pct, 0 , 100); + limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100)); + + limits.max_policy_pct = policy->max * 100 / policy->cpuinfo.max_freq; + limits.max_policy_pct = clamp_t(int, limits.max_policy_pct, 0 , 100); + limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct); + limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100)); + + return 0; +} + +static unsigned int cppc_get(unsigned int 
cpu_num) +{ + struct sample *sample; + struct cpudata *cpu; + + cpu = all_cpu_data[cpu_num]; + if (!cpu) + return 0; + sample = &cpu->sample; + return sample->freq; +} + +static struct cpufreq_driver cppc_cpufreq = { + .flags = CPUFREQ_CONST_LOOPS, + .verify = cppc_verify_policy, + .setpolicy = cppc_set_policy, + .get = cppc_get, + .init = cppc_cpufreq_init, + .stop_cpu = cppc_stop_cpu, + .name = "cppc_cpufreq", +}; + +static int cppc_processor_probe(void) +{ + struct acpi_buffer output = {ACPI_ALLOCATE_BUFFER, NULL}; + union acpi_object *out_obj, *cpc_obj; + struct cpc_desc *current_cpu_cpc; + struct cpc_register_resource *gas_t; + char proc_name[11]; + unsigned int num_ent, ret = 0, i, cpu, len; + acpi_handle handle; + acpi_status status; + + /*Parse the ACPI _CPC table for each CPU. */ + for_each_online_cpu(cpu) { + sprintf(proc_name, "\_PR.CPU%d", cpu); + + status = acpi_get_handle(NULL, proc_name, &handle); + if (ACPI_FAILURE(status)) { + ret = -ENODEV; + goto out_free; + } + + if (!acpi_has_method(handle, "_CPC")) { + ret = -ENODEV; + goto out_free; + } + + status = acpi_evaluate_object(handle, "_CPC", NULL, &output); + if (ACPI_FAILURE(status)) { + ret = -ENODEV; + goto out_free; + } + + out_obj = (union acpi_object *) output.pointer; + if (out_obj->type != ACPI_TYPE_PACKAGE) { + ret = -ENODEV; + goto out_free; + } + + current_cpu_cpc = kzalloc(sizeof(struct cpc_desc), GFP_KERNEL); + if (!current_cpu_cpc) { + pr_err("Could not allocate per cpu CPC descriptors\n"); + return -ENOMEM; + } + num_ent = out_obj->package.count; + current_cpu_cpc->num_entries = num_ent; + + pr_debug("num_ent in CPC table:%d\n", num_ent); + + /* Iterate through each entry in _CPC */ + for (i = 2; i < num_ent; i++) { + cpc_obj = &out_obj->package.elements[i]; + + if (cpc_obj->type != ACPI_TYPE_BUFFER) { + pr_err("Malformed PCC entry in CPC table\n"); + ret = -EINVAL; + goto out_free; + } + + gas_t = (struct cpc_register_resource *) cpc_obj->buffer.pointer; + + if (gas_t->space_id == 
ACPI_ADR_SPACE_PLATFORM_COMM) { + if (pcc_subspace_idx < 0) + pcc_subspace_idx = gas_t->access_width; + } + + current_cpu_cpc->cpc_regs[i-2] = (struct cpc_register_resource) { + .space_id = gas_t->space_id, + .length = gas_t->length, + .bit_width = gas_t->bit_width, + .bit_offset = gas_t->bit_offset, + .address = gas_t->address, + .access_width = gas_t->access_width, + }; + } + per_cpu(cpc_desc_ptr, cpu) = current_cpu_cpc; + } + + pr_debug("Completed parsing , now onto PCC init\n"); + + if (pcc_subspace_idx >= 0) { + ret = get_pcc_comm_channel(pcc_subspace_idx, &pcc_comm_base_addr, &len); + if (ret) { + pr_err("No PCC Communication Channel found\n"); + ret = -ENODEV; + goto out_free; + } + + //XXX: PCC HACK: The PCC hack in drivers/acpi/pcc.c just + //returns a kmallocd address, so no point in ioremapping + //it here. Instead we'll just use it directly. + //Normally, we'd ioremap the address specified in the PCCT + //header for this PCC subspace. + + comm_base_addr = &pcc_comm_base_addr; + + // comm_base_addr = ioremap_nocache(pcc_comm_base_addr, len); + + // if (!comm_base_addr) { + // pr_err("ioremapping pcc comm space failed\n"); + // ret = -ENOMEM; + // goto out_free; + // } + pr_debug("PCC ioremapd space:%p, PCCT addr: %lld\n", comm_base_addr, pcc_comm_base_addr); + + } else { + pr_err("No PCC subspace detected in any CPC structure!\n"); + ret = -EINVAL; + goto out_free; + } + + /* Everything looks okay */ + pr_info("Successfully parsed all CPC structs\n"); + pr_debug("Enable CPPC_EN\n"); + /*XXX: Send write cmd to enable CPPC */ + + kfree(output.pointer); + return 0; + +out_free: + for_each_online_cpu(cpu) { + current_cpu_cpc = per_cpu(cpc_desc_ptr, cpu); + if (current_cpu_cpc) + kfree(current_cpu_cpc); + } + + kfree(output.pointer); + return -ENODEV; +} + +static void copy_pid_params(struct pstate_adjust_policy *policy) +{ + pid_params.sample_rate_ms = policy->sample_rate_ms; + pid_params.p_gain_pct = policy->p_gain_pct; + pid_params.i_gain_pct = 
policy->i_gain_pct; + pid_params.d_gain_pct = policy->d_gain_pct; + pid_params.deadband = policy->deadband; + pid_params.setpoint = policy->setpoint; +} + +static int __init cppc_init(void) +{ + int ret = 0; + unsigned int cpu; + + /* + * Platform specific low level accessors should be + * initialized by now if CPPC is supported. + */ + if (!cppc_func_ops) { + pr_err("No CPPC low level accessors found\n"); + return -ENODEV; + } + + if(acpi_disabled || cppc_processor_probe()) { + pr_err("Err initializing CPC structures or ACPI is disabled\n"); + return -ENODEV; + } + + copy_pid_params(&cppc_func_ops->pid_policy); + + pr_info("CPPC PID driver initializing.\n"); + + all_cpu_data = vzalloc(sizeof(void *) * num_possible_cpus()); + if (!all_cpu_data) + return -ENOMEM; + + /* Now register with CPUfreq */ + ret = cpufreq_register_driver(&cppc_cpufreq); + if (ret) + goto out; + + cppc_pstate_debug_expose_params(); + cppc_pstate_sysfs_expose_params(); + + return ret; + +out: + get_online_cpus(); + for_each_online_cpu(cpu) { + if (all_cpu_data[cpu]) { + del_timer_sync(&all_cpu_data[cpu]->timer); + kfree(all_cpu_data[cpu]); + } + } + + put_online_cpus(); + vfree(all_cpu_data); + return -ENODEV; +} +device_initcall(cppc_init); diff --git a/drivers/cpufreq/cppc.h b/drivers/cpufreq/cppc.h new file mode 100644 index 0000000..3adbd3d --- /dev/null +++ b/drivers/cpufreq/cppc.h @@ -0,0 +1,181 @@ +/* + * Copyright (C) 2014 Linaro Ltd. + * Author: Ashwin Chaugule ashwin.chaugule@linaro.org + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * PID algo bits are from intel_pstate.c and modified to use CPPC + * accessors. + * + */ + +#ifndef _CPPC_H +#define _CPPC_H + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/ktime.h> +#include <linux/hrtimer.h> +/* + * The max number of Register entries + * in the CPC table + */ +#define MAX_CPC_REG_ENT 19 + +/* These are indexes into the per-cpu cpc_regs[]. Order is important. */ +enum cppc_pcc_regs { + HIGHEST_PERF, /* Highest Performance */ + NOMINAL_PERF, /* Nominal Performance */ + LOW_NON_LINEAR_PERF, /* Lowest Nonlinear Performance */ + LOWEST_PERF, /* Lowest Performance */ + GUARANTEED_PERF, /* Guaranteed Performance Register */ + DESIRED_PERF, /* Desired Performance Register */ + MIN_PERF, /* Minimum Performance Register */ + MAX_PERF, /* Maximum Performance Register */ + PERF_REDUC_TOLERANCE, /* Performance Reduction Tolerance Register */ + TIME_WINDOW, /* Time Window Register */ + CTR_WRAP_TIME, /* Counter Wraparound Time */ + REFERENCE_CTR, /* Reference Counter Register */ + DELIVERED_CTR, /* Delivered Counter Register */ + PERF_LIMITED, /* Performance Limited Register */ + ENABLE, /* Enable Register */ + AUTO_SEL_ENABLE, /* Autonomous Selection Enable */ + AUTO_ACT_WINDOW, /* Autonomous Activity Window */ + ENERGY_PERF, /* Energy Performance Preference Register */ + REFERENCE_PERF, /* Reference Performance */ +}; + +/* Each register in the CPC table has the following format */ +struct cpc_register_resource { + u8 descriptor; + u16 length; + u8 space_id; + u8 bit_width; + u8 bit_offset; + u8 access_width; + u64 __iomem address; +} __attribute__ ((packed)); + +struct cpc_desc { + unsigned int num_entries; + unsigned int version; + struct cpc_register_resource cpc_regs[MAX_CPC_REG_ENT]; +}; + +struct _pid { + int setpoint; + int32_t integral; + int32_t p_gain; + int32_t i_gain; + int32_t d_gain; + int deadband; + int32_t last_err; +}; + +struct sample { + int32_t 
core_pct_busy; + u64 delivered; + u64 reference; + int freq; + ktime_t time; +}; + +struct pstate_data { + int current_pstate; + int min_pstate; + int max_pstate; +}; + +struct cpudata { + int cpu; + + struct timer_list timer; + + struct pstate_data pstate; + struct _pid pid; + + ktime_t last_sample_time; + u64 prev_delivered; + u64 prev_reference; + struct sample sample; + struct cpc_desc *cpc_desc; + void __iomem *pcc_comm_address; +}; + +struct perf_limits { + int max_perf_pct; + int min_perf_pct; + int32_t max_perf; + int32_t min_perf; + int max_policy_pct; + int max_sysfs_pct; +}; + +struct pstate_adjust_policy { + int sample_rate_ms; + int deadband; + int setpoint; + int p_gain_pct; + int d_gain_pct; + int i_gain_pct; +}; + +struct cpc_funcs { + struct pstate_adjust_policy pid_policy; + + u32 (*get_highest_perf)(struct cpudata *); + u32 (*get_nominal_perf)(struct cpudata *); + u64 (*get_ref_perf_ctr)(struct cpudata *); + u32 (*get_lowest_nonlinear_perf)(struct cpudata *); + u32 (*get_lowest_perf)(struct cpudata *); + u32 (*get_guaranteed_perf)(struct cpudata *); + + u32 (*get_desired_perf)(struct cpudata *); + void (*set_desired_perf)(struct cpudata *, u32 val); + + u64 (*get_delivered_ctr)(struct cpudata *); + + /* Optional */ + u32 (*get_max_perf)(struct cpudata *); + void (*set_max_perf)(struct cpudata *, u32 val); + + u32 (*get_min_perf)(struct cpudata *); + void (*set_min_perf)(struct cpudata *, u32 val); + + u32 (*get_perf_reduc)(struct cpudata *); + void (*set_perf_reduc)(struct cpudata *, u32 val); + + u32 (*get_time_window)(struct cpudata *); + void (*set_time_window)(struct cpudata *, u32 msecs); + + u64 (*get_ctr_wraparound)(struct cpudata *); + void (*set_ctr_wraparound)(struct cpudata *, u32 secs); + + u8 (*get_perf_limit)(struct cpudata *); + void (*set_perf_limit)(struct cpudata *); + + void (*set_cppc_enable)(struct cpudata *); + + u8 (*get_auto_sel_en)(struct cpudata *); + void (*set_auto_sel_en)(struct cpudata *); + + void 
(*set_auto_activity)(struct cpudata *, u32 val); + + void (*set_energy_pref)(struct cpudata *, u32 val); + + u32 (*get_ref_perf_rate)(struct cpudata *); +}; + +extern struct cpc_funcs *cppc_func_ops; +extern u64 cpc_read64(struct cpc_register_resource *reg); +extern int cpc_write64(u64 val, struct cpc_register_resource *reg); + +#endif /* _CPPC_H */
+ Rafael [corrected email addr]
On 14 August 2014 15:57, Ashwin Chaugule ashwin.chaugule@linaro.org wrote:
Add support for parsing the CPC tables as described in the ACPI 5.1+ CPPC specification. Once the tables are successfully parsed and low level register accessors are available, the PID (proportional-integral-derivative) controller based algorithm is enabled to manage CPU performance.
Signed-off-by: Ashwin Chaugule ashwin.chaugule@linaro.org
drivers/acpi/pcc.c | 109 ++++++ drivers/cpufreq/Kconfig | 10 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/cppc.c | 874 +++++++++++++++++++++++++++++++++++++++++++++++ drivers/cpufreq/cppc.h | 181 ++++++++++ 5 files changed, 1175 insertions(+) create mode 100644 drivers/cpufreq/cppc.c create mode 100644 drivers/cpufreq/cppc.h
diff --git a/drivers/acpi/pcc.c b/drivers/acpi/pcc.c index 105e11a..7743f12 100644 --- a/drivers/acpi/pcc.c +++ b/drivers/acpi/pcc.c @@ -31,6 +31,12 @@ #define PCC_CMD_COMPLETE 0x1 #define PCC_VERSION "0.1"
+#define PCC_HACK 1
+#ifdef PCC_HACK +static void *pcc_comm_addr; +#endif
struct pcc_ss_desc { struct acpi_pcct_subspace *pcc_ss_ptr; raw_spinlock_t lock; @@ -51,8 +57,13 @@ int get_pcc_comm_channel(u32 ss_idx, u64 __iomem *addr, int *len) struct acpi_pcct_subspace *pcct_subspace = pcc_ss_arr[ss_idx].pcc_ss_ptr;
if (pcct_subspace) {
+#ifndef PCC_HACK
 		*addr = pcct_subspace->base_address;
 		*len = pcct_subspace->length;
+#else
+		*addr = (u64 *)pcc_comm_addr;
+		*len = PAGE_SIZE;
+#endif
 	} else
 		return -EINVAL;
@@ -61,6 +72,7 @@ int get_pcc_comm_channel(u32 ss_idx, u64 __iomem *addr, int *len) return 0; }
+#ifndef PCC_HACK /* Send PCC cmd on behalf of this (subspace id) PCC client */ u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr) { @@ -114,6 +126,93 @@ u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr) return generic_comm_base->status; }
+#else
+#include <asm/msr.h>
+/* These offsets are from the SSDT9.asl table on the Thinkpad X240 */
+/* These are offsets per CPU from which its CPC table begins. */ +int cpu_base[] = {0, 0x64, 0xC8, 0x12C, 0x190, 0x1F4, 0x258, 0x2BC};
+/* These are offsets of the registers in each CPC table. */ +#define HIGHEST_PERF_OFFSET 0x0 +#define LOWEST_PERF_OFFSET 0xc +#define DESIRED_PERF_OFFSET 0x14
+static int core_get_min(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_PLATFORM_INFO, val);
+	return (val >> 40) & 0xff;
+}
+
+static int core_get_max(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_PLATFORM_INFO, val);
+	return (val >> 8) & 0xff;
+}
+
+static int core_get_turbo(void)
+{
+	u64 value;
+	int nont, ret;
+
+	rdmsrl(MSR_NHM_TURBO_RATIO_LIMIT, value);
+	nont = core_get_max();
+	ret = value & 255;
+	if (ret <= nont)
+		ret = nont;
+	return ret;
+}
+u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr)
+{
+	unsigned int cpu;
+	u64 desired_val;
+
+	raw_spin_lock(&pcc_ss_arr[ss_idx].lock);
+
+	/* XXX: Instead of waiting for the platform to consume the cmd,
+	 * just do what the platform would've done.
+	 */
+	switch (cmd) {
+	case 0: /* PCC_CMD_READ */
+		/* XXX: Normally the platform would need to update all the
+		 * other CPPC registers as well. But for this experiment,
+		 * since we're not really using all of them, we'll only
+		 * update what we use.
+		 */
+		for_each_possible_cpu(cpu) {
+			*(char *)(pcc_comm_addr + cpu_base[cpu] + HIGHEST_PERF_OFFSET) = core_get_turbo();
+			*(char *)(pcc_comm_addr + cpu_base[cpu] + LOWEST_PERF_OFFSET) = core_get_min();
+		}
+		break;
+	case 1: /* PCC_CMD_WRITE */
+		/* XXX: All this hackery is very X86 Thinkpad X240 specific.
+		 * Normally, cpc_write64() would have all the info on
+		 * how, where and what to write.
+		 */
+		for_each_possible_cpu(cpu) {
+			desired_val = *(u64 *)(pcc_comm_addr + cpu_base[cpu] + DESIRED_PERF_OFFSET);
+			if (desired_val) {
+				wrmsrl_on_cpu(cpu, MSR_IA32_PERF_CTL, desired_val << 8);
+				*(u64 *)(pcc_comm_addr + cpu_base[cpu] + DESIRED_PERF_OFFSET) = 0;
+			}
+		}
+		break;
+	default:
+		pr_err("Unknown PCC cmd from the OS\n");
+		raw_spin_unlock(&pcc_ss_arr[ss_idx].lock);
+		return 0;
+	}
+
+	raw_spin_unlock(&pcc_ss_arr[ss_idx].lock);
+	return 1;
+}
+#endif
static int parse_pcc_subspace(struct acpi_subtable_header *header, const unsigned long end) { @@ -185,6 +284,16 @@ static int __init pcc_init(void) return -EINVAL; }
+#ifdef PCC_HACK
+	pcc_comm_addr = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!pcc_comm_addr) {
+		pr_err("Could not allocate mem for pcc hack\n");
+		return -ENOMEM;
+	}
+#endif
+
 	return ret;
} device_initcall(pcc_init); diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index ffe350f..d8e8335 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -196,6 +196,16 @@ config GENERIC_CPUFREQ_CPU0
If in doubt, say N.
+config CPPC_CPUFREQ
+	bool "CPPC CPUFreq driver"
+	depends on ACPI && ACPI_PCC
+	default n
+	help
+	  CPPC is Collaborative Processor Performance Control. It allows the OS
+	  to request CPU performance with an abstract metric and lets the platform
+	  (e.g. BMC) interpret and optimize it for power and performance in a
+	  platform-specific manner.
+
menu "x86 CPU frequency scaling drivers" depends on X86 source "drivers/cpufreq/Kconfig.x86" diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index db6d9a2..b392c8c 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -14,6 +14,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE) += cpufreq_conservative.o obj-$(CONFIG_CPU_FREQ_GOV_COMMON) += cpufreq_governor.o
obj-$(CONFIG_GENERIC_CPUFREQ_CPU0) += cpufreq-cpu0.o +obj-$(CONFIG_CPPC_CPUFREQ) += cppc.o
################################################################################## # x86 drivers. diff --git a/drivers/cpufreq/cppc.c b/drivers/cpufreq/cppc.c new file mode 100644 index 0000000..6917ce0 --- /dev/null +++ b/drivers/cpufreq/cppc.c @@ -0,0 +1,874 @@ +/*
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Ashwin Chaugule <ashwin.chaugule@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * PID algo bits are from intel_pstate.c and modified to use CPPC
+ * accessors.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/acpi.h>
+#include <linux/errno.h>
+
+#include <acpi/processor.h>
+#include <acpi/actypes.h>
+
+#include <trace/events/power.h>
+
+#include <asm/div64.h>
+#include <asm/msr.h>
+
+#include "cppc.h"
+#define FRAC_BITS 8
+#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
+#define fp_toint(X) ((X) >> FRAC_BITS)
+
+#define CPPC_EN 1
+#define PCC_CMD_COMPLETE 1
+
+/* There is one CPC descriptor per CPU */
+static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
+
+/* PCC client specifics for the CPPC structure */
+/* Returned by the PCCT Subspace structure */
+static u64 pcc_comm_base_addr;
+
+/* ioremap the pcc_comm_base_addr */
+static void __iomem *comm_base_addr;
+
+/* The PCC subspace used by the CPC table */
+static s8 pcc_subspace_idx = -1;
+
+extern int get_pcc_comm_channel(u32 ss_idx, u64 *addr, int *len);
+extern u16 send_pcc_cmd(u8 cmd, u8 sci, u32 ss_idx, u64 __iomem *base_addr);
+/*
+ * The low level platform specific accessors
+ * to the registers defined in the CPC table
+ */
+struct cpc_funcs *cppc_func_ops;
+
+static struct cpudata **all_cpu_data;
+static struct pstate_adjust_policy pid_params;
+
+/* PCC Commands used by CPPC */
+enum cppc_ppc_cmds {
PCC_CMD_READ,
PCC_CMD_WRITE,
RESERVED,
+};
+
+static struct perf_limits limits = {
.max_perf_pct = 100,
.max_perf = int_tofp(1),
.min_perf_pct = 0,
.min_perf = 0,
.max_policy_pct = 100,
.max_sysfs_pct = 100,
+};
+
+u64 cpc_read64(struct cpc_register_resource *reg, void __iomem *base_addr)
+{
u64 err = 0;
u64 val;
switch (reg->space_id) {
case ACPI_ADR_SPACE_PLATFORM_COMM:
err = readq((void *) (reg->address + *(u64 *)base_addr));
break;
	case ACPI_ADR_SPACE_FIXED_HARDWARE:
		rdmsrl(reg->address, val);
		return val;
default:
pr_err("unknown space_id detected in cpc reg: %d\n", reg->space_id);
break;
}
return err;
+}
+
+int cpc_write64(u64 val, struct cpc_register_resource *reg, void __iomem *base_addr)
+{
unsigned int err = 0;
switch (reg->space_id) {
case ACPI_ADR_SPACE_PLATFORM_COMM:
writeq(val, (void *)(reg->address + *(u64 *)base_addr));
break;
case ACPI_ADR_SPACE_FIXED_HARDWARE:
wrmsrl(reg->address, val);
break;
default:
pr_err("unknown space_id detected in cpc reg: %d\n", reg->space_id);
break;
}
return err;
+}
+
+static inline int32_t mul_fp(int32_t x, int32_t y)
+{
return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
+}
+
+static inline int32_t div_fp(int32_t x, int32_t y)
+{
return div_s64((int64_t)x << FRAC_BITS, (int64_t)y);
+}
+static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
int deadband, int integral) {
pid->setpoint = setpoint;
pid->deadband = deadband;
pid->integral = int_tofp(integral);
pid->last_err = int_tofp(setpoint) - int_tofp(busy);
+}
+static inline void pid_p_gain_set(struct _pid *pid, int percent) +{
pid->p_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+static inline void pid_i_gain_set(struct _pid *pid, int percent) +{
pid->i_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+static inline void pid_d_gain_set(struct _pid *pid, int percent) +{
pid->d_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+static signed int pid_calc(struct _pid *pid, int32_t busy) +{
signed int result;
int32_t pterm, dterm, fp_error;
int32_t integral_limit;
fp_error = int_tofp(pid->setpoint) - busy;
if (abs(fp_error) <= int_tofp(pid->deadband))
return 0;
pterm = mul_fp(pid->p_gain, fp_error);
pid->integral += fp_error;
/* limit the integral term */
integral_limit = int_tofp(30);
if (pid->integral > integral_limit)
pid->integral = integral_limit;
if (pid->integral < -integral_limit)
pid->integral = -integral_limit;
dterm = mul_fp(pid->d_gain, fp_error - pid->last_err);
pid->last_err = fp_error;
result = pterm + mul_fp(pid->integral, pid->i_gain) + dterm;
result = result + (1 << (FRAC_BITS-1));
return (signed int)fp_toint(result);
+}
+static inline void pstate_busy_pid_reset(struct cpudata *cpu) +{
pid_p_gain_set(&cpu->pid, pid_params.p_gain_pct);
pid_d_gain_set(&cpu->pid, pid_params.d_gain_pct);
pid_i_gain_set(&cpu->pid, pid_params.i_gain_pct);
pid_reset(&cpu->pid,
pid_params.setpoint,
100,
pid_params.deadband,
0);
+}
+static inline void pstate_reset_all_pid(void) +{
unsigned int cpu;
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu])
pstate_busy_pid_reset(all_cpu_data[cpu]);
}
+}
+/************************** debugfs begin ************************/
+static int pid_param_set(void *data, u64 val)
+{
*(u32 *)data = val;
pstate_reset_all_pid();
return 0;
+}
+static int pid_param_get(void *data, u64 *val) +{
*val = *(u32 *)data;
return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pid_param, pid_param_get,
+			pid_param_set, "%llu\n");
+struct pid_param {
char *name;
void *value;
+};
+static struct pid_param pid_files[] = {
{"sample_rate_ms", &pid_params.sample_rate_ms},
{"d_gain_pct", &pid_params.d_gain_pct},
{"i_gain_pct", &pid_params.i_gain_pct},
{"deadband", &pid_params.deadband},
{"setpoint", &pid_params.setpoint},
{"p_gain_pct", &pid_params.p_gain_pct},
{NULL, NULL}
+};
+static struct dentry *debugfs_parent;
+
+static void cppc_pstate_debug_expose_params(void)
+{
int i = 0;
debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
if (IS_ERR_OR_NULL(debugfs_parent))
return;
while (pid_files[i].name) {
debugfs_create_file(pid_files[i].name, 0660,
debugfs_parent, pid_files[i].value,
&fops_pid_param);
i++;
}
+}
+/************************** debugfs end ************************/
+/************************** sysfs begin ************************/
+#define show_one(file_name, object) \
static ssize_t show_##file_name \
(struct kobject *kobj, struct attribute *attr, char *buf) \
{ \
return sprintf(buf, "%u\n", limits.object); \
}
+static ssize_t store_max_perf_pct(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
+{
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
limits.max_sysfs_pct = clamp_t(int, input, 0 , 100);
limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct);
limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100));
return count;
+}
+static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
+{
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
limits.min_perf_pct = clamp_t(int, input, 0 , 100);
limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100));
return count;
+}
+show_one(max_perf_pct, max_perf_pct);
+show_one(min_perf_pct, min_perf_pct);
+
+define_one_global_rw(max_perf_pct);
+define_one_global_rw(min_perf_pct);
+static struct attribute *cppc_pstate_attributes[] = {
&max_perf_pct.attr,
&min_perf_pct.attr,
NULL
+};
+static struct attribute_group cppc_pstate_attr_group = {
.attrs = cppc_pstate_attributes,
+};
+
+static struct kobject *cppc_pstate_kobject;
+static void cppc_pstate_sysfs_expose_params(void) +{
int rc;
cppc_pstate_kobject = kobject_create_and_add("cppc_pstate",
&cpu_subsys.dev_root->kobj);
BUG_ON(!cppc_pstate_kobject);
rc = sysfs_create_group(cppc_pstate_kobject,
&cppc_pstate_attr_group);
BUG_ON(rc);
+}
+/************************** sysfs end ************************/
+static inline void pstate_calc_busy(struct cpudata *cpu) +{
struct sample *sample = &cpu->sample;
int64_t core_pct;
int32_t rem;
core_pct = int_tofp(sample->delivered) * int_tofp(100);
core_pct = div_u64_rem(core_pct, int_tofp(sample->reference), &rem);
if ((rem << 1) >= int_tofp(sample->reference))
core_pct += 1;
sample->freq = fp_toint(
mul_fp(int_tofp(cpu->pstate.max_pstate * 1000), core_pct));
sample->core_pct_busy = (int32_t)core_pct;
+}
+static inline void pstate_sample(struct cpudata *cpu) +{
u64 delivered, reference;
unsigned int status;
/*
* If this platform has a PCCT, then
* send a command to the platform to update
* all PCC registers.
*/
if (comm_base_addr) {
pr_debug("Sending PCC READ to update COMM space\n");
status = send_pcc_cmd(PCC_CMD_READ, 0, pcc_subspace_idx,
comm_base_addr);
if (!(status & PCC_CMD_COMPLETE)) {
pr_err("Err updating PCC comm space\n");
return;
}
}
reference = cppc_func_ops->get_ref_perf_ctr(cpu);
delivered = cppc_func_ops->get_delivered_ctr(cpu);
delivered = delivered >> FRAC_BITS;
reference = reference >> FRAC_BITS;
cpu->last_sample_time = cpu->sample.time;
cpu->sample.time = ktime_get();
cpu->sample.delivered = delivered;
cpu->sample.reference = reference;
cpu->sample.delivered -= cpu->prev_delivered;
cpu->sample.reference -= cpu->prev_reference;
pstate_calc_busy(cpu);
cpu->prev_delivered = delivered;
cpu->prev_reference = reference;
+}
+static inline int32_t pstate_get_scaled_busy(struct cpudata *cpu) +{
int32_t core_busy, max_pstate, current_pstate, sample_ratio;
u32 duration_us;
u32 sample_time;
core_busy = cpu->sample.core_pct_busy;
max_pstate = int_tofp(cpu->pstate.max_pstate);
current_pstate = int_tofp(cpu->pstate.current_pstate);
core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));
sample_time = (pid_params.sample_rate_ms * USEC_PER_MSEC);
duration_us = (u32) ktime_us_delta(cpu->sample.time,
cpu->last_sample_time);
if (duration_us > sample_time * 3) {
sample_ratio = div_fp(int_tofp(sample_time),
int_tofp(duration_us));
core_busy = mul_fp(core_busy, sample_ratio);
}
return core_busy;
+}
+static inline void pstate_set_sample_time(struct cpudata *cpu) +{
int sample_time, delay;
sample_time = pid_params.sample_rate_ms;
delay = msecs_to_jiffies(sample_time);
mod_timer_pinned(&cpu->timer, jiffies + delay);
+}
+static void pstate_get_min_max(struct cpudata *cpu, int *min, int *max) +{
int max_perf = cpu->pstate.max_pstate;
int max_perf_adj;
int min_perf;
max_perf_adj = fp_toint(mul_fp(int_tofp(max_perf), limits.max_perf));
*max = clamp_t(int, max_perf_adj,
cpu->pstate.min_pstate, cpu->pstate.max_pstate);
min_perf = fp_toint(mul_fp(int_tofp(max_perf), limits.min_perf));
*min = clamp_t(int, min_perf,
cpu->pstate.min_pstate, max_perf);
+}
+static void set_pstate(struct cpudata *cpu, int pstate) +{
int max_perf, min_perf;
unsigned int status;
pstate_get_min_max(cpu, &min_perf, &max_perf);
pstate = clamp_t(int, pstate, min_perf, max_perf);
if (pstate == cpu->pstate.current_pstate)
return;
trace_cpu_frequency(pstate * 100000, cpu->cpu);
cpu->pstate.current_pstate = pstate;
cppc_func_ops->set_desired_perf(cpu, pstate);
/*
* Send a Write command to tell the platform that
* there is new data in the PCC registers.
*/
if (comm_base_addr) {
pr_debug("Sending PCC WRITE to update COMM space\n");
status = send_pcc_cmd(PCC_CMD_WRITE, 0, pcc_subspace_idx,
comm_base_addr);
if (!(status & PCC_CMD_COMPLETE)) {
pr_err("Err updating PCC comm space\n");
return;
}
}
+}
+static inline void pstate_pstate_increase(struct cpudata *cpu, int steps) +{
int target;
target = cpu->pstate.current_pstate + steps;
set_pstate(cpu, target);
+}
+static inline void pstate_pstate_decrease(struct cpudata *cpu, int steps) +{
int target;
target = cpu->pstate.current_pstate - steps;
set_pstate(cpu, target);
+}
+static inline void pstate_adjust_busy_pstate(struct cpudata *cpu) +{
int32_t busy_scaled;
struct _pid *pid;
signed int ctl = 0;
int steps;
pid = &cpu->pid;
busy_scaled = pstate_get_scaled_busy(cpu);
ctl = pid_calc(pid, busy_scaled);
steps = abs(ctl);
if (ctl < 0)
pstate_pstate_increase(cpu, steps);
else
pstate_pstate_decrease(cpu, steps);
+}
+static void pstate_timer_func(unsigned long __data) +{
struct cpudata *cpu = (struct cpudata *) __data;
struct sample *sample;
pstate_sample(cpu);
sample = &cpu->sample;
pstate_adjust_busy_pstate(cpu);
trace_pstate_sample(fp_toint(sample->core_pct_busy),
fp_toint(pstate_get_scaled_busy(cpu)),
cpu->pstate.current_pstate,
sample->reference,
sample->delivered,
sample->freq);
pstate_set_sample_time(cpu);
+}
+static int cppc_cpufreq_init(struct cpufreq_policy *policy) +{
struct cpudata *cpu;
unsigned int cpunum = policy->cpu;
unsigned int status;
	struct cpc_desc *current_cpu_cpc = per_cpu(cpc_desc_ptr, cpunum);

	/* Validate before allocating, so the error paths don't leak cpudata */
	if (!cppc_func_ops) {
		pr_err("CPPC is not supported on this platform\n");
		return -ENOTSUPP;
	}

	if (!current_cpu_cpc) {
		pr_err("Undefined CPC descriptor for CPU:%d\n", cpunum);
		return -ENODEV;
	}

	all_cpu_data[cpunum] = kzalloc(sizeof(struct cpudata), GFP_KERNEL);
	if (!all_cpu_data[cpunum])
		return -ENOMEM;

	cpu = all_cpu_data[cpunum];
	cpu->cpu = cpunum;
/*
* If this platform has a PCCT, then
* send a command to the platform to update
* all PCC registers.
*/
if (comm_base_addr) {
pr_debug("Sending PCC READ to update COMM space\n");
status = send_pcc_cmd(PCC_CMD_READ, 0, pcc_subspace_idx,
comm_base_addr);
		if (!(status & PCC_CMD_COMPLETE)) {
			pr_err("Err updating PCC comm space\n");
			kfree(all_cpu_data[cpunum]);
			all_cpu_data[cpunum] = NULL;
			return -EIO;
		}
}
cpu->cpc_desc = current_cpu_cpc;
cpu->pcc_comm_address = comm_base_addr;
cpu->pstate.min_pstate = cppc_func_ops->get_lowest_perf(cpu);
cpu->pstate.max_pstate = cppc_func_ops->get_highest_perf(cpu);
/* PCC reads/writes are made to offsets from this base address.*/
set_pstate(cpu, cpu->pstate.min_pstate);
init_timer_deferrable(&cpu->timer);
cpu->timer.function = pstate_timer_func;
cpu->timer.data =
(unsigned long)cpu;
cpu->timer.expires = jiffies + HZ/100;
pstate_busy_pid_reset(cpu);
pstate_sample(cpu);
add_timer_on(&cpu->timer, cpunum);
pr_info("CPPC PID pstate controlling: cpu %d\n", cpunum);
if (limits.min_perf_pct == 100 && limits.max_perf_pct == 100)
policy->policy = CPUFREQ_POLICY_PERFORMANCE;
else
policy->policy = CPUFREQ_POLICY_POWERSAVE;
policy->min = cpu->pstate.min_pstate * 100000;
policy->max = cpu->pstate.max_pstate * 100000;
/* cpuinfo and default policy values */
policy->cpuinfo.min_freq = cpu->pstate.min_pstate * 100000;
policy->cpuinfo.max_freq = cpu->pstate.max_pstate * 100000;
policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
cpumask_set_cpu(policy->cpu, policy->cpus);
return 0;
+}
+static void cppc_stop_cpu(struct cpufreq_policy *policy) +{
int cpu_num = policy->cpu;
struct cpudata *cpu = all_cpu_data[cpu_num];
pr_info("CPPC PID controller CPU %d exiting\n", cpu_num);
	del_timer_sync(&all_cpu_data[cpu_num]->timer);
	set_pstate(cpu, cpu->pstate.min_pstate);
	/* free cpc_desc before cpu, which points into all_cpu_data[cpu_num] */
	kfree(cpu->cpc_desc);
	per_cpu(cpc_desc_ptr, cpu_num) = NULL;
	kfree(all_cpu_data[cpu_num]);
	all_cpu_data[cpu_num] = NULL;
+}
+static int cppc_verify_policy(struct cpufreq_policy *policy) +{
cpufreq_verify_within_cpu_limits(policy);
if ((policy->policy != CPUFREQ_POLICY_POWERSAVE) &&
(policy->policy != CPUFREQ_POLICY_PERFORMANCE))
return -EINVAL;
return 0;
+}
+static int cppc_set_policy(struct cpufreq_policy *policy) +{
struct cpudata *cpu;
cpu = all_cpu_data[policy->cpu];
if (!policy->cpuinfo.max_freq)
return -ENODEV;
if (policy->policy == CPUFREQ_POLICY_PERFORMANCE) {
limits.min_perf_pct = 100;
limits.min_perf = int_tofp(1);
limits.max_perf_pct = 100;
limits.max_perf = int_tofp(1);
return 0;
}
limits.min_perf_pct = (policy->min * 100) / policy->cpuinfo.max_freq;
limits.min_perf_pct = clamp_t(int, limits.min_perf_pct, 0 , 100);
limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100));
limits.max_policy_pct = policy->max * 100 / policy->cpuinfo.max_freq;
limits.max_policy_pct = clamp_t(int, limits.max_policy_pct, 0 , 100);
limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct);
limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100));
return 0;
+}
+static unsigned int cppc_get(unsigned int cpu_num) +{
struct sample *sample;
struct cpudata *cpu;
cpu = all_cpu_data[cpu_num];
if (!cpu)
return 0;
sample = &cpu->sample;
return sample->freq;
+}
+static struct cpufreq_driver cppc_cpufreq = {
.flags = CPUFREQ_CONST_LOOPS,
.verify = cppc_verify_policy,
.setpolicy = cppc_set_policy,
.get = cppc_get,
.init = cppc_cpufreq_init,
.stop_cpu = cppc_stop_cpu,
.name = "cppc_cpufreq",
+};
+static int cppc_processor_probe(void) +{
struct acpi_buffer output = {ACPI_ALLOCATE_BUFFER, NULL};
union acpi_object *out_obj, *cpc_obj;
struct cpc_desc *current_cpu_cpc;
struct cpc_register_resource *gas_t;
char proc_name[11];
	unsigned int num_ent, i, cpu;
	int ret = 0, len;
acpi_handle handle;
acpi_status status;
/*Parse the ACPI _CPC table for each CPU. */
for_each_online_cpu(cpu) {
sprintf(proc_name, "\\_PR.CPU%d", cpu);
status = acpi_get_handle(NULL, proc_name, &handle);
if (ACPI_FAILURE(status)) {
ret = -ENODEV;
goto out_free;
}
if (!acpi_has_method(handle, "_CPC")) {
ret = -ENODEV;
goto out_free;
}
status = acpi_evaluate_object(handle, "_CPC", NULL, &output);
if (ACPI_FAILURE(status)) {
ret = -ENODEV;
goto out_free;
}
out_obj = (union acpi_object *) output.pointer;
if (out_obj->type != ACPI_TYPE_PACKAGE) {
ret = -ENODEV;
goto out_free;
}
		current_cpu_cpc = kzalloc(sizeof(struct cpc_desc), GFP_KERNEL);
		if (!current_cpu_cpc) {
			pr_err("Could not allocate per cpu CPC descriptors\n");
			ret = -ENOMEM;
			goto out_free;
		}
num_ent = out_obj->package.count;
current_cpu_cpc->num_entries = num_ent;
pr_debug("num_ent in CPC table:%d\n", num_ent);
/* Iterate through each entry in _CPC */
for (i = 2; i < num_ent; i++) {
cpc_obj = &out_obj->package.elements[i];
if (cpc_obj->type != ACPI_TYPE_BUFFER) {
pr_err("Malformed PCC entry in CPC table\n");
ret = -EINVAL;
goto out_free;
}
gas_t = (struct cpc_register_resource *) cpc_obj->buffer.pointer;
if (gas_t->space_id == ACPI_ADR_SPACE_PLATFORM_COMM) {
if (pcc_subspace_idx < 0)
pcc_subspace_idx = gas_t->access_width;
}
current_cpu_cpc->cpc_regs[i-2] = (struct cpc_register_resource) {
.space_id = gas_t->space_id,
.length = gas_t->length,
.bit_width = gas_t->bit_width,
.bit_offset = gas_t->bit_offset,
.address = gas_t->address,
.access_width = gas_t->access_width,
};
}
per_cpu(cpc_desc_ptr, cpu) = current_cpu_cpc;
}
pr_debug("Completed parsing , now onto PCC init\n");
if (pcc_subspace_idx >= 0) {
ret = get_pcc_comm_channel(pcc_subspace_idx, &pcc_comm_base_addr, &len);
if (ret) {
pr_err("No PCC Communication Channel found\n");
ret = -ENODEV;
goto out_free;
}
//XXX: PCC HACK: The PCC hack in drivers/acpi/pcc.c just
//returns a kmallocd address, so no point in ioremapping
//it here. Instead we'll just use it directly.
//Normally, we'd ioremap the address specified in the PCCT
//header for this PCC subspace.
comm_base_addr = &pcc_comm_base_addr;
// comm_base_addr = ioremap_nocache(pcc_comm_base_addr, len);
// if (!comm_base_addr) {
// pr_err("ioremapping pcc comm space failed\n");
// ret = -ENOMEM;
// goto out_free;
// }
pr_debug("PCC ioremapd space:%p, PCCT addr: %lld\n", comm_base_addr, pcc_comm_base_addr);
} else {
pr_err("No PCC subspace detected in any CPC structure!\n");
ret = -EINVAL;
goto out_free;
}
/* Everything looks okay */
pr_info("Successfully parsed all CPC structs\n");
pr_debug("Enable CPPC_EN\n");
/*XXX: Send write cmd to enable CPPC */
kfree(output.pointer);
return 0;
+out_free:
	for_each_online_cpu(cpu) {
		current_cpu_cpc = per_cpu(cpc_desc_ptr, cpu);
		kfree(current_cpu_cpc);
		per_cpu(cpc_desc_ptr, cpu) = NULL;
	}
	kfree(output.pointer);
	return ret;
+}
+static void copy_pid_params(struct pstate_adjust_policy *policy) +{
pid_params.sample_rate_ms = policy->sample_rate_ms;
pid_params.p_gain_pct = policy->p_gain_pct;
pid_params.i_gain_pct = policy->i_gain_pct;
pid_params.d_gain_pct = policy->d_gain_pct;
pid_params.deadband = policy->deadband;
pid_params.setpoint = policy->setpoint;
+}
+static int __init cppc_init(void) +{
int ret = 0;
unsigned int cpu;
/*
* Platform specific low level accessors should be
* initialized by now if CPPC is supported.
*/
if (!cppc_func_ops) {
pr_err("No CPPC low level accessors found\n");
return -ENODEV;
}
	if (acpi_disabled || cppc_processor_probe()) {
		pr_err("Err initializing CPC structures or ACPI is disabled\n");
		return -ENODEV;
	}
copy_pid_params(&cppc_func_ops->pid_policy);
pr_info("CPPC PID driver initializing.\n");
all_cpu_data = vzalloc(sizeof(void *) * num_possible_cpus());
if (!all_cpu_data)
return -ENOMEM;
/* Now register with CPUfreq */
ret = cpufreq_register_driver(&cppc_cpufreq);
if (ret)
goto out;
cppc_pstate_debug_expose_params();
cppc_pstate_sysfs_expose_params();
return ret;
+out:
get_online_cpus();
for_each_online_cpu(cpu) {
if (all_cpu_data[cpu]) {
del_timer_sync(&all_cpu_data[cpu]->timer);
kfree(all_cpu_data[cpu]);
}
}
put_online_cpus();
vfree(all_cpu_data);
return -ENODEV;
+}
+device_initcall(cppc_init);

diff --git a/drivers/cpufreq/cppc.h b/drivers/cpufreq/cppc.h
new file mode 100644
index 0000000..3adbd3d
--- /dev/null
+++ b/drivers/cpufreq/cppc.h
@@ -0,0 +1,181 @@
+/*
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Ashwin Chaugule <ashwin.chaugule@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * PID algo bits are from intel_pstate.c and modified to use CPPC
+ * accessors.
+ */
+
+#ifndef _CPPC_H
+#define _CPPC_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+
+/*
+ * The max number of Register entries
+ * in the CPC table
+ */
+#define MAX_CPC_REG_ENT 19
+/* These are indexes into the per-cpu cpc_regs[]. Order is important. */
+enum cppc_pcc_regs {
HIGHEST_PERF, /* Highest Performance */
NOMINAL_PERF, /* Nominal Performance */
LOW_NON_LINEAR_PERF, /* Lowest Nonlinear Performance */
LOWEST_PERF, /* Lowest Performance */
GUARANTEED_PERF, /* Guaranteed Performance Register */
DESIRED_PERF, /* Desired Performance Register */
MIN_PERF, /* Minimum Performance Register */
MAX_PERF, /* Maximum Performance Register */
PERF_REDUC_TOLERANCE, /* Performance Reduction Tolerance Register */
TIME_WINDOW, /* Time Window Register */
CTR_WRAP_TIME, /* Counter Wraparound Time */
REFERENCE_CTR, /* Reference Counter Register */
DELIVERED_CTR, /* Delivered Counter Register */
PERF_LIMITED, /* Performance Limited Register */
ENABLE, /* Enable Register */
AUTO_SEL_ENABLE, /* Autonomous Selection Enable */
AUTO_ACT_WINDOW, /* Autonomous Activity Window */
ENERGY_PERF, /* Energy Performance Preference Register */
REFERENCE_PERF, /* Reference Performance */
+};
+/* Each register in the CPC table has the following format */
+struct cpc_register_resource {
u8 descriptor;
u16 length;
u8 space_id;
u8 bit_width;
u8 bit_offset;
u8 access_width;
u64 __iomem address;
+} __attribute__ ((packed));
+struct cpc_desc {
unsigned int num_entries;
unsigned int version;
struct cpc_register_resource cpc_regs[MAX_CPC_REG_ENT];
+};
+struct _pid {
int setpoint;
int32_t integral;
int32_t p_gain;
int32_t i_gain;
int32_t d_gain;
int deadband;
int32_t last_err;
+};
+struct sample {
int32_t core_pct_busy;
u64 delivered;
u64 reference;
int freq;
ktime_t time;
+};
+struct pstate_data {
int current_pstate;
int min_pstate;
int max_pstate;
+};
+struct cpudata {
int cpu;
struct timer_list timer;
struct pstate_data pstate;
struct _pid pid;
ktime_t last_sample_time;
u64 prev_delivered;
u64 prev_reference;
struct sample sample;
struct cpc_desc *cpc_desc;
void __iomem *pcc_comm_address;
+};
+struct perf_limits {
int max_perf_pct;
int min_perf_pct;
int32_t max_perf;
int32_t min_perf;
int max_policy_pct;
int max_sysfs_pct;
+};
+struct pstate_adjust_policy {
int sample_rate_ms;
int deadband;
int setpoint;
int p_gain_pct;
int d_gain_pct;
int i_gain_pct;
+};
+struct cpc_funcs {
struct pstate_adjust_policy pid_policy;
u32 (*get_highest_perf)(struct cpudata *);
u32 (*get_nominal_perf)(struct cpudata *);
u64 (*get_ref_perf_ctr)(struct cpudata *);
u32 (*get_lowest_nonlinear_perf)(struct cpudata *);
u32 (*get_lowest_perf)(struct cpudata *);
u32 (*get_guaranteed_perf)(struct cpudata *);
u32 (*get_desired_perf)(struct cpudata *);
void (*set_desired_perf)(struct cpudata *, u32 val);
u64 (*get_delivered_ctr)(struct cpudata *);
/* Optional */
u32 (*get_max_perf)(struct cpudata *);
void (*set_max_perf)(struct cpudata *, u32 val);
u32 (*get_min_perf)(struct cpudata *);
void (*set_min_perf)(struct cpudata *, u32 val);
u32 (*get_perf_reduc)(struct cpudata *);
void (*set_perf_reduc)(struct cpudata *, u32 val);
u32 (*get_time_window)(struct cpudata *);
void (*set_time_window)(struct cpudata *, u32 msecs);
u64 (*get_ctr_wraparound)(struct cpudata *);
void (*set_ctr_wraparound)(struct cpudata *, u32 secs);
u8 (*get_perf_limit)(struct cpudata *);
void (*set_perf_limit)(struct cpudata *);
void (*set_cppc_enable)(struct cpudata *);
u8 (*get_auto_sel_en)(struct cpudata *);
void (*set_auto_sel_en)(struct cpudata *);
void (*set_auto_activity)(struct cpudata *, u32 val);
void (*set_energy_pref)(struct cpudata *, u32 val);
u32 (*get_ref_perf_rate)(struct cpudata *);
+};
+extern struct cpc_funcs *cppc_func_ops;
+extern u64 cpc_read64(struct cpc_register_resource *reg);
+extern int cpc_write64(u64 val, struct cpc_register_resource *reg);
+#endif /* _CPPC_H */
--
1.9.1
If the firmware supports CPPC natively, then we can use the ACPI defined semantics to access the CPPC specific registers.
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
---
 drivers/cpufreq/Kconfig     |  9 +++++
 drivers/cpufreq/Makefile    |  1 +
 drivers/cpufreq/cppc.h      |  4 +--
 drivers/cpufreq/cppc_acpi.c | 80 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 92 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cpufreq/cppc_acpi.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index d8e8335..c5f3c0b 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -206,6 +206,15 @@ config CPPC_CPUFREQ
 	  (e.g. BMC) interpret and optimize it for power and performance in a
 	  platform specific manner.
+config CPPC_ACPI
+	bool "CPPC with ACPI accessors"
+	depends on ACPI && ACPI_PCC
+	default n
+	help
+	  This driver implements the low level accessors to the registers as
+	  described in the ACPI 5.1 spec. Select this driver if you know your
+	  platform supports CPPC and PCC in the firmware.
+
 menu "x86 CPU frequency scaling drivers"
 depends on X86
 source "drivers/cpufreq/Kconfig.x86"

diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index b392c8c..d49a999 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_COMMON)	+= cpufreq_governor.o
 obj-$(CONFIG_GENERIC_CPUFREQ_CPU0)	+= cpufreq-cpu0.o
 obj-$(CONFIG_CPPC_CPUFREQ)	+= cppc.o
+obj-$(CONFIG_CPPC_ACPI)	+= cppc_acpi.o
##################################################################################
# x86 drivers.

diff --git a/drivers/cpufreq/cppc.h b/drivers/cpufreq/cppc.h
index 3adbd3d..a119c3b 100644
--- a/drivers/cpufreq/cppc.h
+++ b/drivers/cpufreq/cppc.h
@@ -175,7 +175,7 @@ struct cpc_funcs {
 };
 extern struct cpc_funcs *cppc_func_ops;
-extern u64 cpc_read64(struct cpc_register_resource *reg);
-extern int cpc_write64(u64 val, struct cpc_register_resource *reg);
+extern u64 cpc_read64(struct cpc_register_resource *reg, void __iomem *base_addr);
+extern int cpc_write64(u64 val, struct cpc_register_resource *reg, void __iomem *base_addr);
 #endif /* _CPPC_H */
diff --git a/drivers/cpufreq/cppc_acpi.c b/drivers/cpufreq/cppc_acpi.c
new file mode 100644
index 0000000..1835fe7
--- /dev/null
+++ b/drivers/cpufreq/cppc_acpi.c
@@ -0,0 +1,80 @@
+/*
+ * Copyright (C) 2014 Linaro Ltd.
+ * Author: Ashwin Chaugule <ashwin.chaugule@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/acpi.h>
+
+#include "cppc.h"
+
+static u32 acpi_get_highest_perf(struct cpudata *cpu)
+{
+	struct cpc_register_resource *high_perf = &cpu->cpc_desc->cpc_regs[HIGHEST_PERF];
+
+	return cpc_read64(high_perf, cpu->pcc_comm_address);
+}
+
+static u64 acpi_get_ref_perf_ctr(struct cpudata *cpu)
+{
+	struct cpc_register_resource *ref_perf = &cpu->cpc_desc->cpc_regs[REFERENCE_CTR];
+
+	return cpc_read64(ref_perf, cpu->pcc_comm_address);
+}
+
+static u32 acpi_get_lowest_perf(struct cpudata *cpu)
+{
+	struct cpc_register_resource *low_perf = &cpu->cpc_desc->cpc_regs[LOWEST_PERF];
+
+	return cpc_read64(low_perf, cpu->pcc_comm_address);
+}
+
+static void acpi_set_desired_perf(struct cpudata *cpu, u32 val)
+{
+	struct cpc_register_resource *desired_perf = &cpu->cpc_desc->cpc_regs[DESIRED_PERF];
+
+	cpc_write64(val, desired_perf, cpu->pcc_comm_address);
+}
+
+static u64 acpi_get_delivered_ctr(struct cpudata *cpu)
+{
+	struct cpc_register_resource *delivered_ctr = &cpu->cpc_desc->cpc_regs[DELIVERED_CTR];
+
+	return cpc_read64(delivered_ctr, cpu->pcc_comm_address);
+}
+
+struct cpc_funcs acpi_cppc_func_ops = {
+	.pid_policy = {
+		.sample_rate_ms = 10,
+		.deadband = 0,
+		.setpoint = 97,
+		.p_gain_pct = 20,
+		.d_gain_pct = 0,
+		.i_gain_pct = 0,
+	},
+	.get_highest_perf = acpi_get_highest_perf,
+	.get_ref_perf_ctr = acpi_get_ref_perf_ctr,
+	.get_lowest_perf = acpi_get_lowest_perf,
+	.set_desired_perf = acpi_set_desired_perf,
+	.get_delivered_ctr = acpi_get_delivered_ctr,
+};
+
+static int __init acpi_cppc_init(void)
+{
+	if (acpi_disabled)
+		return 0;
+
+	cppc_func_ops = &acpi_cppc_func_ops;
+
+	pr_info("Registered ACPI CPPC function ops\n");
+
+	return 0;
+}
+early_initcall(acpi_cppc_init);
--
1.9.1
+ Rafael [corrected email addr]
Hello,
Apologies in advance for a lengthy cover letter. Hopefully it has all the required information so you don't need to read the ACPI spec. ;)
This patchset introduces the ideas behind CPPC (Collaborative Processor Performance Control) and implements support for controlling CPU performance using the existing PID (Proportional-Integral-Derivative) controller (from intel_pstate.c) and some CPPC semantics.
The patchwork is not a final proposal of the CPPC implementation. I've had to hack some sections due to lack of hardware, details of which are in the Testing section.
There are several bits of information which are needed in order to make CPPC work great on Linux based platforms and I'm hoping to start a wider discussion on how to address the missing bits. The following sections briefly introduce CPPC and later highlight the information which is missing.
More importantly, I'm also looking for ideas on how to support CPPC in the short term, given that we will soon be seeing products based on ARM64 and X86 which support CPPC.[1] Although we may not have all the information, we could make it work with existing governors in a way this patchset demonstrates. Hopefully, this approach is acceptable for mainline inclusion in the short term.
Finer details about the CPPC spec are available in the latest ACPI 5.1 specification.[2]
If these issues are being discussed on some other thread or elsewhere, or if someone is already working on it, please let me know. Also, please correct me if I have misunderstood anything.
What is CPPC:
=============
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
* Platform enumerates supported performance range to OS
* OS requests desired performance level over some time window along with min and max instantaneous limits
* Platform is free to optimize power/performance within bounds provided by OS
* Platform provides telemetry back to OS on delivered performance
Communication with the OS is abstracted via another ACPI construct called Platform Communication Channel (PCC) which is essentially a generic shared memory channel with doorbell interrupts going back and forth. This abstraction allows the “platform” for CPPC to be a variety of different entities – driver, firmware, BMC, etc.
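To make the doorbell handshake concrete, here is a minimal userspace sketch of a PCC-style shared-memory command channel. The struct layout, names, and the synchronous stub "platform" are all made up for illustration; the real PCC subspace format and doorbell registers are defined by the ACPI spec:

```c
#include <stdint.h>

/* Hypothetical layout of a PCC-like shared-memory region; the real PCC
 * subspace header format is defined in the ACPI 5.0+ spec. */
struct pcc_shmem {
	uint16_t command;	/* command code written by the OS */
	uint16_t status;	/* completion bit set by the platform */
};

#define PCC_STATUS_CMD_COMPLETE 0x1

static struct pcc_shmem chan;

/* Stand-in for the platform: a real doorbell write would interrupt
 * firmware or a BMC, which services the command and sets the
 * completion bit. Here we complete the command synchronously. */
static void ring_doorbell(void)
{
	chan.status = PCC_STATUS_CMD_COMPLETE;
}

/* OS side: write a command, ring the doorbell, poll for completion. */
static int pcc_send_command(uint16_t cmd)
{
	chan.status = 0;
	chan.command = cmd;
	ring_doorbell();
	while (!(chan.status & PCC_STATUS_CMD_COMPLETE))
		;
	return 0;
}
```

In a real driver the completion wait must be bounded by a timeout; an unbounded poll is exactly where a platform that never services the doorbell leaves you hanging forever.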
CPPC describes the following registers:
- HighestPerformance: (read from platform)
Indicates the highest level of performance the processor is theoretically capable of achieving, given ideal operating conditions.
- Nominal Performance: (read from platform)
Indicates the highest sustained performance level of the processor. This is the highest operating performance level the CPU is expected to deliver continuously.
- LowestNonlinearPerformance: (read from platform)
Indicates the lowest performance level of the processor with non-linear power savings.
- LowestPerformance: (read from platform)
Indicates the lowest performance level of the processor.
- GuaranteedPerformanceRegister: (read from platform)
Optional. If supported, contains the register to read the current guaranteed performance from. This is the current max sustained performance of the CPU taking into account all budgeting constraints. This can change at runtime and is notified to the OS via ACPI notification mechanisms.
- DesiredPerformanceRegister: (write to platform)
Register to write desired performance level from the OS.
- MinimumPerformanceRegister: (write to platform)
Optional. This is the min allowable performance as requested by the OS.
- MaximumPerformanceRegister: (write to platform)
Optional. This is the max allowable performance as requested by the OS.
- PerformanceReductionToleranceRegister (write to platform)
Optional. This is the deviation below the desired perf value as requested by the OS. If the Time Window register (below) is supported, then this value is the min performance on average over the time window that the OS desires.
- TimeWindowRegister: (write to platform)
Optional. The OS requests desired performance over this time window.
- CounterWraparoundTime: (read from platform)
Optional. Min time before the performance counters wrap around.
- ReferencePerformanceCounterRegister: (read from platform)
A counter that increments proportionally to the reference performance of the processor.
- DeliveredPerformanceCounterRegister: (read from platform)
Delivered perf = reference perf * delta(delivered perf ctr)/delta(ref perf ctr)
- PerformanceLimitedRegister: (read from platform)
This is set by the platform in the event that it has to limit available performance due to thermal or budgeting constraints.
- CPPCEnableRegister: (read/write from platform)
Enable/disable CPPC
- AutonomousSelectionEnable:
Platform decides CPU performance level w/o OS assist.
- AutonomousActivityWindowRegister:
This influences the rate of increase or decrease of CPU performance in the platform's autonomous selection policy.
- EnergyPerformancePreferenceRegister:
Provides an energy or performance bias hint to the platform when in autonomous mode.
- Reference Performance: (read from platform)
Indicates the rate at which the reference counter increments.
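Since much of the discussion below hinges on the delivered-performance feedback, here is the counter arithmetic from the DeliveredPerformanceCounterRegister description above as a standalone sketch. The function name and the values in the comments are made up; it also ignores counter wraparound, which the CounterWraparoundTime register exists to bound:

```c
#include <stdint.h>

/* Delivered perf = reference perf * delta(delivered ctr) / delta(ref ctr).
 * ref_perf is the abstract performance level at which the reference
 * counter accumulates; the _t0/_t1 pairs are two counter snapshots
 * taken at the start and end of a sampling window. */
static uint64_t cppc_delivered_perf(uint64_t ref_perf,
				    uint64_t ref_ctr_t0, uint64_t ref_ctr_t1,
				    uint64_t del_ctr_t0, uint64_t del_ctr_t1)
{
	uint64_t ref_delta = ref_ctr_t1 - ref_ctr_t0;
	uint64_t del_delta = del_ctr_t1 - del_ctr_t0;

	if (ref_delta == 0)
		return 0;	/* no elapsed reference time: nothing to report */

	return ref_perf * del_delta / ref_delta;
}
```

If the delivered counter advanced twice as fast as the reference counter over the window, the CPU ran at twice the reference performance level; this is the same idea as the aperf/mperf ratio on x86.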
What's missing in CPPC:
=======================
Currently CPPC makes no mention of power. However, this could be added in future versions of the spec. For example, although CPPC works off a continuous range of CPU perf levels, we could discretize the scale so that we only extract points where the power level changes substantially between CPU perf levels, and export this information to the scheduler.
What's missing in the kernel:
=============================
We may have some of this information in the scheduler, but I couldn't see a good way to extract it for CPPC yet.
(1) An intelligent way to provide a min/max bound and a desired value for CPU performance.
(2) A timing window for the platform to deliver requested performance within bounds. This could be a kind of sampling interval between consecutive reads of delivered cpu performance.
(3) Centralized decision making by any CPU in a freq domain for all its siblings.
The last point needs some elaboration:
I see that the CPUfreq layer allows defining "related CPUs" and that we can have the same policy for CPUs in the same freq domain and one governor per policy. However, from what I could tell, there are at least 2 baked-in assumptions in this layer which break things at least for platforms like ARM (please correct me if I'm wrong!):
(a) All CPUs run at the exact same max, min and cur freq.
(b) Any CPU always gets exactly the freq it asked for.
So, although the CPUFreq layer is capable of making somewhat centralized cpufreq decisions for CPUs under the same policy, it seems to be deciding things under wrong/inapplicable assumptions. Moreover, only one CPU is in charge of policy handling at a time, and the policy handling is shifted to another CPU in the domain only if the former CPU is hotplugged out.
Not having a proper centralized decision maker adversely affects power saving possibilities in platforms that can't distinguish when a CPU requests a specific freq and then goes to sleep. This potentially has the effect of keeping other CPUs in the domain running at a much higher frequency than required, while the initial requester is deep asleep.
So, for point (3), I'm not sure which path we should take among the following:
(I) Fix the cpufreq layer and add CPPC support as a cpufreq_driver.
    (a) Change every call that gets the current freq to make it read h/w registers and then snap the value back to the freq table. This way, cpufreq can keep its idea of freq current. However, this may end up waking CPUs to read counters, unless they are mem mapped.
    (b) Allow any CPU in the "related_cpus" mask to make policy decisions on behalf of its siblings, so that policy-maker switching is not tied to hotplug.
(II) Not touch CPUfreq and use the PID algorithm instead, but change the busyness calculation to accumulate busyness values from all CPUs in common domain. Requires implementation of domain awareness.
(III) Address these issues in the upcoming CPUfreq/CPUidle integration layer(?)
(IV) Handle it in the platform or lose out. I understand this has some potential for adding latency to cpu freq requests so it may not be possible for all platforms.
(V) ..?
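For what option (I)(b) could look like, here is a toy sketch of picking a policy-handling CPU from the intersection of a "related_cpus"-style mask and an online mask, so the policy maker follows whichever siblings are currently online rather than being tied to hotplug of one owner. Plain uint32_t bitmasks stand in for the kernel's cpumasks, and the function name is hypothetical:

```c
#include <stdint.h>

/* Pick the lowest-numbered CPU that is both in the freq domain
 * (related_mask) and currently online. Returns -1 if the whole
 * domain is offline. */
static int pick_policy_cpu(uint32_t related_mask, uint32_t online_mask)
{
	uint32_t candidates = related_mask & online_mask;
	int cpu;

	if (!candidates)
		return -1;

	for (cpu = 0; cpu < 32; cpu++)
		if (candidates & (1u << cpu))
			return cpu;

	return -1;
}
```

Re-running this on every hotplug event (not just removal of the current owner) is what decouples policy handling from any single CPU.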
For points (1) and (2), the long term solution IMHO is to work it out along with the scheduler CPUFreq/CPUidle integration. But it's not clear to me what would be the best short term approach. I'd greatly appreciate any suggestions/comments. If anyone is already working on these issues, please CC me as well.
Test setup:
===========
For the sake of experiments, I used the Thinkpad x240 laptop, which advertises CPPC tables in its ACPI firmware. The PCC and CPPC drivers included in this patchset are able to parse the tables and get all the required addresses. However, it seems that this laptop doesn't implement the PCC doorbell and the firmware side of CPPC. The PCC doorbell calls would just wait forever. Not sure what's going on there. So, I had to hack it and emulate what the platform would've done to some extent.
I extracted the PID algo from intel_pstate.c and modified it with CPPC function wrappers. It shouldn't be hard to replace PID with anything else we think is suitable. In the long term, I hope we can make CPPC calls directly from the scheduler.
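For reference, the controller step borrowed from intel_pstate reduces to a textbook PID update. Below is a self-contained sketch using the gains from the pid_policy table in this patchset (setpoint 97, P gain 20%, I and D gains zero, which degenerates to a pure proportional controller); the struct and function names are mine, not intel_pstate's exact code:

```c
/* Minimal PID step in the style of intel_pstate's controller.
 * Gains are percentages; the return value is the adjustment to
 * apply to the performance request. */
struct pid {
	int setpoint;			/* target busyness, percent */
	int p_gain_pct;
	int i_gain_pct;
	int d_gain_pct;
	int integral;			/* accumulated error */
	int last_err;			/* for the derivative term */
};

static int pid_calc(struct pid *pid, int busy_pct)
{
	int err = pid->setpoint - busy_pct;
	int deriv = err - pid->last_err;

	pid->integral += err;
	pid->last_err = err;

	return (pid->p_gain_pct * err +
		pid->i_gain_pct * pid->integral +
		pid->d_gain_pct * deriv) / 100;
}
```

With P=20 and the rest zero, a CPU measured at 50% busy against a 97% setpoint yields a positive correction of (20 * 47) / 100 = 9, and a CPU exactly at the setpoint yields no correction.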
There are two versions of the low level CPPC accessors. The one included in the patchset is how I'd imagine it would work with platforms that completely implement CPPC in firmware.
The other version is here [5]. This should help with DT or platforms with broken firmware, enablement purposes etc.
I ran a simple kernel compilation with intel_pstate.c and the CPPC-modified version as the governors and saw no real difference in compile times. So no new overheads added. I verified that CPU freq requests were taken by reading out the PERF_STATUS register.
[1] - See the HWP section 14.4 http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32...
[2] - http://www.uefi.org/sites/default/files/resources/ACPI_5_1release.pdf
[3] - https://plus.google.com/+TheodoreTso/posts/2vEekAsG2QT
[4] - https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
[5] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31f...
Ashwin Chaugule (3):
  ACPI: Add support for Platform Communication Channel
  CPPC: Add support for Collaborative Processor Performance Control
  CPPC: Add ACPI accessors to CPC registers
 drivers/acpi/Kconfig        |  10 +
 drivers/acpi/Makefile       |   1 +
 drivers/acpi/pcc.c          | 301 +++++++++++++++
 drivers/cpufreq/Kconfig     |  19 +
 drivers/cpufreq/Makefile    |   2 +
 drivers/cpufreq/cppc.c      | 874 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/cpufreq/cppc.h      | 181 +++++++++
 drivers/cpufreq/cppc_acpi.c |  80 ++++
 8 files changed, 1468 insertions(+)
 create mode 100644 drivers/acpi/pcc.c
 create mode 100644 drivers/cpufreq/cppc.c
 create mode 100644 drivers/cpufreq/cppc.h
 create mode 100644 drivers/cpufreq/cppc_acpi.c
-- 1.9.1
On Thu, Aug 14, 2014 at 03:57:07PM -0400, Ashwin Chaugule wrote:
What is CPPC:
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
Why do we want this? Typically we've ignored ACPI and gone straight to MSR access, intel_pstate and intel_idle were created especially to avoid ACPI, so why return to it.
Also, the whole interface sounds like a trainwreck (one would not expect anything else from ACPI).
So _why_?
Hi Peter,
On 14 August 2014 16:51, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Aug 14, 2014 at 03:57:07PM -0400, Ashwin Chaugule wrote:
What is CPPC:
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
Why do we want this? Typically we've ignored ACPI and gone straight to MSR access, intel_pstate and intel_idle were created especially to avoid ACPI, so why return to it.
Also, the whole interface sounds like trainwreck (one would not expect anything else from ACPI).
So _why_?
The overall idea is that tying the notion of CPU performance to CPU frequency is no longer accurate these days.[1] So, using some direction from the OS, platforms want to be able to decide how to adjust CPU performance using knowledge that may be very platform specific, e.g. through the use of performance counters, thermal budgets and other system specific constraints. So, CPPC describes a way for the OS to request performance within certain bounds and then lets the platform optimize it within those constraints. Expressing CPU performance in an abstract way should also help keep things uniform across various architecture implementations.
I don't see CPPC as necessarily an ACPI specific thing. If the platform can provide the information which CPPC lays out via MSRs or other system IO, then the higher algorithms should still work. The CPPC table itself is nothing but register descriptions. It describes how and where to access the registers. The registers can be anything, from system I/O addresses, MSRs, CP15, or Mailbox type addresses (PCC). CPPC doesn't even tell you what to do with that information. So it's really just a descriptor.
If you see the example in [2], the aperf and mperf reads directly go to MSR addresses as parsed from the tables. If the platform does not have ACPI, but knows how to provide the same information, then it can directly read its MSRs or other sys regs. e.g. as in the case of core_get_{min,max}_pstate(), which are used to get highest and lowest performance values.
But I think the problem is that we don't have an algorithm that can make use of the information that CPPC supported platforms can provide, nor can we provide CPPC related information back to the platform.
Cheers, Ashwin
[1]- https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL [2] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31f...
On Thu, Aug 14, 2014 at 05:56:10PM -0400, Ashwin Chaugule wrote:
Hi Peter,
On 14 August 2014 16:51, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Aug 14, 2014 at 03:57:07PM -0400, Ashwin Chaugule wrote:
What is CPPC:
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
Why do we want this? Typically we've ignored ACPI and gone straight to MSR access, intel_pstate and intel_idle were created especially to avoid ACPI, so why return to it.
Also, the whole interface sounds like trainwreck (one would not expect anything else from ACPI).
So _why_?
The overall idea is that tying the notion of CPU performance to CPU frequency is no longer true these days.[1]. So, using some direction from an OS , the platforms want to be able to decide how to adjust CPU performance by using knowledge that may be very platform specific. e.g. through the use of performance counters, thermal budgets and other system specific constraints. So, CPPC describes a way for the OS to request performance within certain bounds and then letting the platform optimize it within those constraints. Expressing CPU performance in an abstract way, should also help keep things uniform across various architecture implementations.
[1]- https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL [2] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31f...
Yeah, I'm so not clicking in that; if you want to make an argument make it here.
In any case; that's all nice and shiny that the 'hardware' works like that. But have these people considered how we're supposed to use it?
How should we know what to do with a new task? Do we stack it on a busy CPU, do we wake an idle cpu and how are we going to tell which is the 'best' option.
How are we going to do DVFS like accounting if we don't know wtf the hardware can or will do.
And how can you design these interfaces and hardware without at least partially knowing the answer to these questions.
Hi Peter,
On 15 August 2014 02:19, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Aug 14, 2014 at 05:56:10PM -0400, Ashwin Chaugule wrote:
On 14 August 2014 16:51, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Aug 14, 2014 at 03:57:07PM -0400, Ashwin Chaugule wrote:
What is CPPC:
CPPC is the new interface for CPU performance control between the OS and the platform defined in ACPI 5.0+. The interface is built on an abstract representation of CPU performance rather than raw frequency. Basic operation consists of:
Why do we want this? Typically we've ignored ACPI and gone straight to MSR access, intel_pstate and intel_idle were created especially to avoid ACPI, so why return to it.
Also, the whole interface sounds like trainwreck (one would not expect anything else from ACPI).
So _why_?
The overall idea is that tying the notion of CPU performance to CPU frequency is no longer true these days.[1]. So, using some direction from an OS , the platforms want to be able to decide how to adjust CPU performance by using knowledge that may be very platform specific. e.g. through the use of performance counters, thermal budgets and other system specific constraints. So, CPPC describes a way for the OS to request performance within certain bounds and then letting the platform optimize it within those constraints. Expressing CPU performance in an abstract way, should also help keep things uniform across various architecture implementations.
[1]- https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL [2] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31f...
Yeah, I'm so not clicking in that; if you want to make an argument make it here.
In any case; that's all nice and shiny that the 'hardware' works like that. But have these people considered how we're supposed to use it?
How should we know what to do with a new task? Do we stack it on a busy CPU, do we wake an idle cpu and how are we going to tell which is the 'best' option.
How are we going to do DVFS like accounting if we don't know wtf the hardware can or will do.
And how can you design these interfaces and hardware without at least partially knowing the answer to these questions.
Although the CPPC descriptor table and the spec don't describe the platform's algorithm, they still give a good enough idea of how the platform would react. I'll try to summarize it briefly. I have a few more register specific details in the cover letter if needed.
e.g.:
(1) The OS can read from the platform what each CPU is capable of at the moment: the Highest and Lowest CPU performance bounds, which are essentially the thresholds between which this CPU can deliver. The platform can even tell us a "guaranteed performance value" at that moment. This is the level the CPU is expected to deliver taking into account all the possible constraints (e.g. thermal, power budgets etc.). If the "guaranteed" value changes for some reason, the platform raises a notification, so the OS can reevaluate.
(2) When an OS requests a specific performance value, it supplies a Max, Min and Desired value. The platform is expected to deliver CPU performance within this range. The Delivered performance register should reflect what the platform decided.
(3) If the OS knows that it needs to step up or lower the CPU performance value for a specific period of time, then it sets the Time Window and Performance Reduction Tolerance registers in addition to Max, Min, and Desired. This requires the platform to deliver CPU performance which, on average over the Time Window, is no lower than the value in the Performance Reduction Tolerance register.
So, it's not as though the OS is left completely blind. The platform maintains updated information about each CPU's performance capabilities, relies on hints from the OS to make decisions, and also feeds back what it decides.
If the OS only looks at the Highest, Lowest, and Delivered registers and only writes to Desired, then we're not really any different from how we do things today in the CPUFreq layer. Or even in the case of intel_pstate, if you map Desired to PERF_CTL and get the value of Delivered by using aperf/mperf ratios (as my experimental driver does), then we can still maintain the existing system performance. It seems like if an OS can make use of the additional information then it should be a net win for overall power savings and performance enhancement. Also, using the CPPC descriptors, we should be able to have one driver across X86 and ARM64 (possibly others too).
So I'm still learning about the scheduler and don't have enough knowledge yet. Hence this discussion with you guys. Hopefully with the above flow, you can see that:
(a) we can plug the cppc driver to the existing infrastructure and not change anything really. (except the freq domain awareness issues I mentioned earlier) (short term)
(b) we come up with ways to provide the bounds around a Desired value using the information from the platform. (long term)
I briefly looked at the x86 HWP (Hardware-Controlled Performance States) section in the s/w manual again. It's essentially an implementation of CPPC. It seems like X86 has implemented most if not all of these registers as MSRs. I'm really interested in knowing if anyone there is/has been working on using them and what they found.
Cheers, Ashwin
On 8/15/2014 6:08 AM, Ashwin Chaugule wrote:
(b) we come up with ways to provide the bounds around a Desired value using the information from the platform. (long term)
I briefly looked at the x86 HWP (Hardware Performance States) in the s/w manual again. Its essentially an implementation of CPPC. It seems like X86 has implemented most if not all these registers as MSRs. I'm really interested in knowing if anyone there is/has been working on using them and what they found.
we've found that so far there are two reasonable options:
1) Let the OS decide (old style)
2) Let the hardware decide (new style)
2) is there in practice today in the turbo range (which is increasingly the whole thing), and the hardware can make decisions about power budgeting on a timescale the OS can never even dream of, so once you give control to the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
On 8/15/2014 6:42 AM, Arjan van de Ven wrote:
On 8/15/2014 6:08 AM, Ashwin Chaugule wrote:
(b) we come up with ways to provide the bounds around a Desired value using the information from the platform. (long term)
I briefly looked at the x86 HWP (Hardware Performance States) in the s/w manual again. Its essentially an implementation of CPPC. It seems like X86 has implemented most if not all these registers as MSRs. I'm really interested in knowing if anyone there is/has been working on using them and what they found.
we've found that so far there are two reasonable options:
1) Let the OS decide (old style)
2) Let the hardware decide (new style)
2) is there in practice today in the turbo range (which is increasingly the whole thing), and the hardware can make decisions about power budgeting on a timescale the OS can never even dream of, so once you give control to the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
I should clarify this; with an increasing number of cores, you end up with much more dynamic maximums (e.g. turbo in Intel speak) and the hardware already controls this (yes the OS gives hints, but that's mostly a lot of OS work for little value)
in a single or even dual core situation you may have a different ballgame with a small turbo range, but even in phones octocores seem to be the norm nowadays ;-) (just finished reading the Samsung announcement)
Hi Arjan,
On 15 August 2014 09:53, Arjan van de Ven arjan@linux.intel.com wrote:
On 8/15/2014 6:42 AM, Arjan van de Ven wrote:
On 8/15/2014 6:08 AM, Ashwin Chaugule wrote:
(b) we come up with ways to provide the bounds around a Desired value using the information from the platform. (long term)
I briefly looked at the x86 HWP (Hardware Performance States) in the s/w manual again. Its essentially an implementation of CPPC. It seems like X86 has implemented most if not all these registers as MSRs. I'm really interested in knowing if anyone there is/has been working on using them and what they found.
we've found that so far that there are two reasonable options
Let the OS device (old style)
Let the hardware decide (new style)
is there in practice today in the turbo range (which is increasingly
the whole thing) and the hardware can make decisions about power budgetting on a timescale the OS can never even dream of, so once you give control the the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
Interesting. This sounds like X86 plans to use the Autonomous bits that got added to the CPPC spec (v5.1)? I agree that the platform can make decisions on a much finer timescale. But even in the non-Autonomous mode, by providing the bounds around a Desired value, the OS can get out of the way knowing that the platform would deliver something in the range it requested. If the OS can provide bounds, it seems to me that the platform can make more optimal decisions, rather than trying to guess what's running (or not).
I should clarify this; with increasing number of cores, you end up with much more dynamic maximums (e.g. turbo in Intel speak) and the hardware already controls this (yes the OS gives hints, but that's mostly a lot of OS work for little value)
Going completely Autonomous (i.e. letting the platform decide what's best w/o OS hints) may be possible only for some platforms. There will be several that don't have that kind of logic built into the hardware. In such cases getting OS hints would be the only option.
in a single or even dual core situation you may have a different ballgame with a small turbo range, but even in phones octocores seem to be the norm nowadays ;-) (just finished reading the Samsung announcement)
Yea, its getting crazy out there.
Cheers, Ashwin
On 8/15/2014 7:24 AM, Ashwin Chaugule wrote:
we've found that so far there are two reasonable options:
1) Let the OS decide (old style)
2) Let the hardware decide (new style)
2) is there in practice today in the turbo range (which is increasingly the whole thing), and the hardware can make decisions about power budgeting on a timescale the OS can never even dream of, so once you give control to the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
Interesting. This sounds like X86 plans to use the Autonomous bits that got added to the CPPC spec. (v5.1)?
if and when x86/Intel implement that, we will certainly evaluate it to see how it behaves... but based on today's use of the hw control of the actual p-state... I would expect that evaluation to pass.
note that on today's multi-core x96 systems, in practice you operate mostly in the turbo range (I am ignoring mostly-idle workloads since there the p-state isn't nearly as relevant anyway); all it takes is for one of the cores to request a turbo-range state, and the whole chip operates in turbo mode... and in turbo mode the hardware already picks the frequency/voltage.
with the current (and more so, past) Linux behavior, even at moderate loads you end up there; the more cores you have the more true that becomes.
I agree that the platform can make decisions on a much finer timescale. But even in the non-Autonomous mode, by providing the bounds around a Desired Value, the OS can get out of the way knowing that the platform would deliver something in the range it requested. If the OS can provide bounds, it seems to me that the platform can make more optimum decisions, rather than trying to guess whats running (or not).
I highly question that the OS can provide intelligent bounds.
When are you going to request an upper bound that is lower than maximum? (don't say thermals, there are other mechanisms for controlling thermals that work much better than direct P state control). Are you still going to do that even if sometimes lower frequencies end up costing more battery? (race-to-halt and all that)
I can see cases where you bump the minimum for QoS reasons, but even there I would dare to say that at best the OS will be doing wild-ass guesses.
Hello,
On 15 August 2014 11:47, Arjan van de Ven arjan@linux.intel.com wrote:
On 8/15/2014 7:24 AM, Ashwin Chaugule wrote:
we've found that so far there are two reasonable options:
1) Let the OS decide (old style)
2) Let the hardware decide (new style)
2) is there in practice today in the turbo range (which is increasingly the whole thing), and the hardware can make decisions about power budgeting on a timescale the OS can never even dream of, so once you give control to the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
Interesting. This sounds like X86 plans to use the Autonomous bits that got added to the CPPC spec. (v5.1)?
if and when x86/Intel implement that, we will certainly evaluate it to see how it behaves... but based on todays use of the hw control of the actual p-state... I would expect that evaluation to pass.
note that on todays multi-core x96 systems, in practice you operate mostly in the turbo range (I am ignoring mostly-idle workloads since there the p-state isn't nearly as relevant anyway); all it takes for one of the cores to request a turbo-range state, and the whole chip operates in turbo mode.. and in turbo mode the hardware already picks the frequency/voltage.
x96 - Wonder what that has! ;)
So, this I think brings back my point of freq domain awareness (or lack thereof) in today's governors. On X86, it seems as though the h/w can take care of "freq voting rights" among CPUs, and it knows to ignore a request after the requestor goes to sleep. That way the other CPUs in the domain don't unnecessarily operate at a higher freq/voltage and their vote can become current. Also, on X86, all CPUs are assumed to have the same min and max operating points?
This may not be true on ARM (or others). So if the h/w isn't capable of automatically updating freq/voltage for a domain, then the OS needs to provide that. I think we can achieve it through knowledge of the system topology and a centralized CPU policy governor for each domain. If each CPU in the domain is capable of making decisions on behalf of everyone in that domain, then we can at least get past the problem of "stale CPU freq votes" (replace freq with performance in CPPC terms).
E.g., to make my point clear, assume there are 3 CPUs in the system: C0 and C1 are in one domain, and C2 is in another.
If C0 asks for 3GHz and C1 asks for 1GHz, the h/w delivers 3GHz. But now C0 goes to sleep. With today's governors, we don't reevaluate, so C1 continues to get 3GHz even though it doesn't need it. Maybe X86 can figure out that C0 is asleep and that it should now deliver 1GHz, but ARM does not have that AFAIK. So we need the governor to reevaluate between C0 and C1 (preferably through aperf/mperf-like ratios, rather than the broken p-state assumptions) and send a new request asking for 1GHz.
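The three-CPU example above can be sketched as a tiny per-domain governor. This is an illustrative model only, assuming the domain simply takes the max vote over its awake CPUs; the struct and function names are invented, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: none of these names are kernel APIs.
 * A domain's delivered level is the max request among awake CPUs,
 * so a sleeping CPU's stale vote is dropped on reevaluation. */

#define MAX_CPUS_PER_DOMAIN 4

struct cpu_vote {
	unsigned int requested_khz;	/* last frequency this CPU asked for */
	bool awake;			/* false once the CPU goes to sleep */
};

struct freq_domain {
	struct cpu_vote cpus[MAX_CPUS_PER_DOMAIN];
	int nr_cpus;
};

/* Reevaluate the domain: highest vote among awake CPUs, 0 if all idle. */
static unsigned int domain_reevaluate(const struct freq_domain *d)
{
	unsigned int winner = 0;

	for (int i = 0; i < d->nr_cpus; i++)
		if (d->cpus[i].awake && d->cpus[i].requested_khz > winner)
			winner = d->cpus[i].requested_khz;
	return winner;
}
```

With C0 voting 3GHz and C1 voting 1GHz, the domain runs at 3GHz; once C0's awake flag drops, reevaluation brings the domain down to C1's 1GHz, which is exactly the "stale vote" fix being argued for.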
with the current (and more so, past) Linux behavior, even at moderate loads you end up there; the more cores you have the more true that becomes.
I agree that the platform can make decisions on a much finer timescale. But even in the non-Autonomous mode, by providing bounds around a Desired value, the OS can get out of the way knowing that the platform will deliver something in the range it requested. If the OS can provide bounds, it seems to me that the platform can make better decisions than it could by trying to guess what's running (or not).
I highly question that the OS can provide intelligent bounds.
Agreed. This is a challenging problem. Hence the wider discussion.
When are you going to request an upper bound that is lower than maximum? (don't say thermals, there are other mechanisms for controlling thermals that work much better than direct P state control). Are you still going to do that even if sometimes lower frequencies end up costing more battery? (race-to-halt and all that)
Maybe the answer is that, in the short term, we always request MAX in the (Max, Min, Desired) tuple. Although I suspect some platforms will still use P-state controls for thermal mitigation.
I can see cases where you bump the minimum for QoS reasons, but even there I would dare to say that at best the OS will be doing wild-ass guesses.
Right. I see Min being used for QoS too.
Cheers, Ashwin
The Linux team at Intel did not implement ACPI CPPC support because we see no benefit to it over the native hardware interface on x86.
cheers, Len Brown Intel Open Source Technology Center
Hi Len,
On 15 August 2014 18:11, Len Brown lenb@kernel.org wrote:
The Linux team at Intel did not implement ACPI CPPC support because we see no benefit to it over the native hardware interface on x86.
Thanks for sharing Intel's observations. I had looked at the SDM [1] and found all the CPPC MSRs. I suppose X86 will not use a majority of it.
Cheers, Ashwin
[1] - HWP section 14.4 http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32...
On Fri, Aug 15, 2014 at 06:42:51AM -0700, Arjan van de Ven wrote:
On 8/15/2014 6:08 AM, Ashwin Chaugule wrote:
(b) we come up with ways to provide the bounds around a Desired value using the information from the platform. (long term)
I briefly looked at the x86 HWP (Hardware Performance States) section in the s/w manual again. It's essentially an implementation of CPPC. It seems like X86 has implemented most if not all of these registers as MSRs. I'm really interested in knowing if anyone there is, or has been, working on using them and what they found.
we've found so far that there are two reasonable options:
Let the OS decide (old style)
Let the hardware decide (new style)
Hardware control is there in practice today in the turbo range (which is increasingly the whole thing)
and the hardware can make decisions about power budgeting on a timescale the OS can never even dream of, so once you give control to the hardware (with CPPC or native) it's normally better to just get out of the way as OS.
OK, so we should just forget about 'power aware scheduling' for Intel?
On Fri, Aug 15, 2014 at 09:08:50AM -0400, Ashwin Chaugule wrote:
If the OS only looks at Highest, Lowest, Delivered registers and only writes to Desired, then we're not really any different than how we do things today in the CPUFreq layer.
The thing is; we're already struggling to make 'sense' of x86 as it stands today. And it looks like this CPPC stuff makes the behaviour even less certain.
Or even in the case of intel_pstate, if you map Desired to PERF_CTL and get the value of Delivered by using aperf/mperf ratios (as my experimental driver does), then we can still maintain the existing system performance. It seems like if an OS can make use of the additional information, then it should be a net win for overall power savings and performance. Also, using the CPPC descriptors, we should be able to have one driver across X86 and ARM64 (possibly others too).
Yikes, so aaargh64 will go do creative power management too?
And worse; it will go do ACPI? Welcome to the world of guaranteed BIOS fail :-(
Hi Peter,
On 15 August 2014 10:07, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Aug 15, 2014 at 09:08:50AM -0400, Ashwin Chaugule wrote:
If the OS only looks at Highest, Lowest, Delivered registers and only writes to Desired, then we're not really any different than how we do things today in the CPUFreq layer.
The thing is; we're already struggling to make 'sense' of x86 as it stands today. And it looks like this CPPC stuff makes the behaviour even less certain.
I think it's still better than the "p-state" thing we have going today, where the algorithms are making their decisions based on the incorrect assumption that the CPU got what it requested (among other things listed earlier). CPPC at least gives you a guarantee that the delivered performance will be within a range you requested. It can even force the platform to deliver a specific performance value, if you choose, over a specific time window.
Or even in the case of intel_pstate, if you map Desired to PERF_CTL and get the value of Delivered by using aperf/mperf ratios (as my experimental driver does), then we can still maintain the existing system performance. It seems like if an OS can make use of the additional information, then it should be a net win for overall power savings and performance. Also, using the CPPC descriptors, we should be able to have one driver across X86 and ARM64 (possibly others too).
Yikes, so aaargh64 will go do creative power management too?
Nice use of aaargh. ;) Strangely, I hadn't seen that before.
And worse; it will go do ACPI? Welcome to the world of guaranteed BIOS fail :-(
ACPI or not, I think, leads to a rather different kind of debate. :) If all ARM implementations could include the CP15 equivalents of the CPPC MSRs that Intel has, then we wouldn't need this CPPC table. But that'll remain a pipe dream. So I'd prefer to think of CPPC as a simple wrapper of register descriptions which allows implementations to choose how and where to get their CPPC information. However they choose to implement the registers, I think there's a lot of potential power savings and performance optimization on the table which can be had through CPPC. Giving platforms some freedom to optimize things in a platform-specific way helps them differentiate themselves (a key thing with ARM especially), and keeping the idea of CPU performance abstract rather than tied to frequency should help keep things unified in the OS, so we avoid the driver bloat that ARM has historically had.
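As a rough sketch of what "a simple wrapper of register descriptions" means in practice: each CPPC register is described by where it lives, so one driver can back it with an x86 MSR, a memory-mapped ARM register, or a PCC shared-memory channel. The types below are illustrative placeholders loosely modeled on ACPI's Generic Address Structure, not the actual ACPI encoding, and the address value is invented:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder types, not the ACPI table format: the point is only
 * that the OS-side code reads "a register" while the descriptor
 * decides the access mechanism per platform. */
enum cppc_reg_space {
	SPACE_MSR,		/* x86 model-specific register */
	SPACE_SYSTEM_MEMORY,	/* memory-mapped platform register */
	SPACE_PCC,		/* Platform Communication Channel */
};

struct cppc_reg_desc {
	enum cppc_reg_space space;
	uint64_t address;	/* MSR index or physical address */
	uint8_t bit_width;	/* register width in bits */
	uint8_t bit_offset;	/* field offset within the register */
};
```

A generic driver would dispatch on `space` when reading or writing, which is what lets one driver serve both MSR-based and memory-mapped implementations.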
Cheers, Ashwin
On Fri, Aug 15, 2014 at 10:37:32AM -0400, Ashwin Chaugule wrote:
Hi Peter,
On 15 August 2014 10:07, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Aug 15, 2014 at 09:08:50AM -0400, Ashwin Chaugule wrote:
If the OS only looks at Highest, Lowest, Delivered registers and only writes to Desired, then we're not really any different than how we do things today in the CPUFreq layer.
The thing is; we're already struggling to make 'sense' of x86 as it stands today. And it looks like this CPPC stuff makes the behaviour even less certain.
I think it's still better than the "p-state" thing we have going today, where the algorithms are making their decisions based on the incorrect assumption that the CPU got what it requested (among other things listed earlier). CPPC at least gives you a guarantee that the delivered performance will be within a range you requested. It can even force the platform to deliver a specific performance value, if you choose, over a specific time window.
Maybe; the guarantee and interrupt-on-change might be useful indeed. But whichever way, we need aperf/mperf ratios somewhere.
Hello,
On 15 August 2014 10:41, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Aug 15, 2014 at 10:37:32AM -0400, Ashwin Chaugule wrote:
On 15 August 2014 10:07, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Aug 15, 2014 at 09:08:50AM -0400, Ashwin Chaugule wrote:
If the OS only looks at Highest, Lowest, Delivered registers and only writes to Desired, then we're not really any different than how we do things today in the CPUFreq layer.
The thing is; we're already struggling to make 'sense' of x86 as it stands today. And it looks like this CPPC stuff makes the behaviour even less certain.
I think it's still better than the "p-state" thing we have going today, where the algorithms are making their decisions based on the incorrect assumption that the CPU got what it requested (among other things listed earlier). CPPC at least gives you a guarantee that the delivered performance will be within a range you requested. It can even force the platform to deliver a specific performance value, if you choose, over a specific time window.
Maybe; the guarantee and interrupt-on-change might be useful indeed. But whichever way, we need aperf/mperf ratios somewhere.
Right. Regarding aperf/mperf: some ARMs will reuse existing performance counters, while others may have dedicated registers to count the equivalent. So I'm suggesting the use of the CPPC descriptors to unify the various implementations. In CPPC terms, aperf maps to the Delivered counter and mperf maps to the Nominal counter. Until we have the rest of the plumbing in the scheduler to make use of the other CPPC-specified regs, we can at least start off with a common lower-level driver as suggested in this patchset, reusing the PID controller as the "governor". That's the only one that actually makes use of aperf/mperf today AFAIK. The driver can then be modified to include the freq domain awareness as well, so that it works well on non-X86 platforms. Unless anyone strongly thinks that we should fix the shortcomings in the CPUFreq layer (broken p-state and freq domain assumptions) rather than enhancing PID? Thoughts?
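To make the aperf/mperf mapping concrete: the Delivered counter advances at the actual rate while the Nominal (reference) counter advances at a fixed known rate, so the ratio of their deltas, scaled by the nominal performance level, gives the average delivered performance over the sample window. The helper below is an illustrative sketch with invented names, not the patchset's code:

```c
#include <assert.h>
#include <stdint.h>

/* Snapshot of the two CPPC feedback counters (names illustrative). */
struct cppc_fb_sample {
	uint64_t delivered;	/* counts at the actual delivered rate */
	uint64_t reference;	/* counts at a fixed nominal rate */
};

/* Average delivered performance over the window between two snapshots:
 * nominal_perf * (delta delivered / delta reference). */
static unsigned int
cppc_delivered_perf(const struct cppc_fb_sample *prev,
		    const struct cppc_fb_sample *curr,
		    unsigned int nominal_perf)
{
	uint64_t d = curr->delivered - prev->delivered;
	uint64_t r = curr->reference - prev->reference;

	if (!r)			/* no reference time elapsed; assume nominal */
		return nominal_perf;
	return (unsigned int)(d * nominal_perf / r);
}
```

A delivered/reference ratio above 1 means the window averaged above nominal (turbo range); below 1 means it ran slower, exactly the signal a PID-style governor can feed back into its next Desired request.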
Cheers, Ashwin
I verified that CPU freq requests were taken by reading out the PERF_STATUS register.
Don't use the x86 PERF_STATUS register -- it will not tell you what you want to know.
If you want to see the actual frequency, you need to watch how many cycles elapse per a known time interval, which is how intel_pstate and the turbostat utility do it.
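A minimal sketch of the cycle-counting approach Len describes, in the style intel_pstate and turbostat use: scale a known base frequency by the ratio of unhalted cycles (aperf) to fixed-rate reference cycles (mperf) accumulated over the sample interval. The helper name and units here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Average effective frequency over a sample window, derived from
 * counter deltas rather than a status register: aperf advances with
 * actual unhalted cycles, mperf at the base (TSC) rate, so
 * freq = base * aperf_delta / mperf_delta.  Illustrative helper. */
static unsigned int avg_freq_khz(uint64_t aperf_delta, uint64_t mperf_delta,
				 unsigned int base_khz)
{
	if (!mperf_delta)
		return 0;	/* CPU was idle for the whole window */
	return (unsigned int)((uint64_t)base_khz * aperf_delta / mperf_delta);
}
```

E.g. on a 2GHz base part, an aperf/mperf delta ratio of 3:2 over the window means the CPU averaged 3GHz, regardless of what any status register claims at the sampling instant.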
cheers, Len Brown Intel Open Source Technology Center