Summary
----------
Hi everyone,

This patch set introduces an extensible cpuidle governor framework based on BPF struct_ops, enabling idle-state selection policies to be implemented dynamically by BPF programs.
Motivation
----------
CPUs support multiple idle states (e.g., C0, C1, C2, ...). Deeper states reduce power consumption but incur longer wakeup latency, which can hurt performance. The existing generic cpuidle governors work well in common scenarios but behave suboptimally in certain Android phone use cases.
Our testing shows that in low-utilization scenarios (e.g., screen-off background tasks such as music playback with CPU utilization below 10%), the C0 state accounts for ~50% of idle time, which is significantly energy-inefficient. Reducing the C0 share to ≤20% could yield ≥5% power savings on mobile phones.
To address this, we want:
1. Dynamic switching to power-saving policies in low-CPU-utilization scenarios (e.g., screen-off mode)
2. Dynamic switching to alternate governors in high-performance scenarios (e.g., gaming)
Overview
----------
The BPF cpuidle ext governor registers at postcore_initcall() but remains inactive by default due to its low priority rating (1). Activating it requires raising its rating above that of the other governors from within a BPF program.
Core components:

1. struct cpuidle_gov_ext_ops - BPF-overridable operations:
   - ops.enable()/ops.disable(): enable/disable callbacks for a CPU
   - ops.select(): CPU idle-state selection logic
   - ops.set_stop_tick(): scheduler tick management after state selection
   - ops.reflect(): feedback about the previously entered idle state
   - ops.init()/ops.exit(): initialization and cleanup
2. Critical kfuncs for kernel state access:
   - bpf_cpuidle_ext_gov_update_rating(): activates the ext governor by raising its rating; must be called from ops.init()
   - bpf_cpuidle_ext_gov_latency_req(): gets the idle-state latency constraint for a CPU
   - bpf_tick_nohz_get_sleep_length(): gets the expected CPU sleep duration in tickless mode
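To make the ops/kfunc interplay concrete, a minimal BPF-side governor could consist of nothing but an ops.init() that raises the rating. This is a sketch only (the struct_ops name `minimal_ops` and rating value 60 are illustrative; the section names and kfunc declaration follow the selftest in patch 2):

```c
// Minimal sketch of a BPF cpuidle ext governor that only activates itself.
// Assumes vmlinux.h carries struct cpuidle_gov_ext_ops from this series.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* kfunc exported by drivers/cpuidle/governors/ext.c */
int bpf_cpuidle_ext_gov_update_rating(unsigned int rating) __ksym __weak;

SEC("struct_ops.s/init")
int BPF_PROG(minimal_init)
{
	/* Raise the rating above the in-kernel governors so the
	 * ext governor becomes the current governor. */
	return bpf_cpuidle_ext_gov_update_rating(60);
}

SEC(".struct_ops.link")
struct cpuidle_gov_ext_ops minimal_ops = {
	.init = (void *)minimal_init,
	.name = "minimal_bpf_gov",
};
```

With no select()/reflect() overrides, the in-kernel default implementations in ext.c would still drive state selection; the BPF program only takes over governor priority.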
Future work
----------
1. Scenario detection: identifying low-utilization states (e.g., screen off + background music)
2. Policy optimization: tuning state-selection algorithms for specific scenarios
Lin Yikai (2):
  cpuidle: Implement BPF extensible cpuidle class
  selftests/bpf: Add selftests
 drivers/cpuidle/Kconfig                            |  12 +
 drivers/cpuidle/governors/Makefile                 |   1 +
 drivers/cpuidle/governors/ext.c                    | 537 ++++++++++++++++++
 .../bpf/prog_tests/test_cpuidle_gov_ext.c          |  28 +
 .../selftests/bpf/progs/cpuidle_gov_ext.c          | 208 +++++++
 5 files changed, 786 insertions(+)
 create mode 100644 drivers/cpuidle/governors/ext.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
 create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
The BPF cpuidle ext governor registers at postcore_initcall() but remains inactive by default due to its low priority rating (1). Activating it requires raising its rating above that of the other governors from within a BPF program.
Core components:

1. struct cpuidle_gov_ext_ops - BPF-overridable operations:
   - ops.enable()/ops.disable(): enable/disable callbacks for a CPU
   - ops.select(): CPU idle-state selection logic
   - ops.set_stop_tick(): scheduler tick management after state selection
   - ops.reflect(): feedback about the previously entered idle state
   - ops.init()/ops.exit(): initialization and cleanup
2. Critical kfuncs for kernel state access:
   - bpf_cpuidle_ext_gov_update_rating(): activates the ext governor by raising its rating; must be called from ops.init()
   - bpf_cpuidle_ext_gov_latency_req(): gets the idle-state latency constraint for a CPU
   - bpf_tick_nohz_get_sleep_length(): gets the expected CPU sleep duration in tickless mode
Signed-off-by: Lin Yikai <yikai.lin@vivo.com>
---
 drivers/cpuidle/Kconfig            |  12 +
 drivers/cpuidle/governors/Makefile |   1 +
 drivers/cpuidle/governors/ext.c    | 537 +++++++++++++++++++++++++++++
 3 files changed, 550 insertions(+)
 create mode 100644 drivers/cpuidle/governors/ext.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..4f2eac531b0b 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -44,6 +44,18 @@ config CPU_IDLE_GOV_HALTPOLL
 
 	  Some virtualized workloads benefit from using it.
 
+config CPU_IDLE_GOV_EXT
+	bool "bpf cpuidle ext governor"
+	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+	default y
+	help
+	  This governor implements a simple cpuidle ext governor,
+	  which can be customized by a BPF program without modifying
+	  kernel code.
+
+	  Some scenarios benefit where CPUidle policy needs
+	  to be customized based on user-space requirements.
+
 config DT_IDLE_STATES
 	bool
diff --git a/drivers/cpuidle/governors/Makefile b/drivers/cpuidle/governors/Makefile index 63abb5393a4d..cd5eaf9f275f 100644 --- a/drivers/cpuidle/governors/Makefile +++ b/drivers/cpuidle/governors/Makefile @@ -7,3 +7,4 @@ obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o obj-$(CONFIG_CPU_IDLE_GOV_HALTPOLL) += haltpoll.o +obj-$(CONFIG_CPU_IDLE_GOV_EXT) += ext.o diff --git a/drivers/cpuidle/governors/ext.c b/drivers/cpuidle/governors/ext.c new file mode 100644 index 000000000000..9968ae482899 --- /dev/null +++ b/drivers/cpuidle/governors/ext.c @@ -0,0 +1,537 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * ext.c - the cpuidle ext governor used by BPF + * + * Copyright (C) Yikai Lin yikai.lin@vivo.com + */ + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/cpuidle.h> +#include <linux/percpu.h> +#include <linux/ktime.h> +#include <linux/cpumask.h> +#include <linux/slab.h> +#include <linux/sysfs.h> +#include <linux/tick.h> + +#define EXT_GOV_NAME "ext" + +/******************************************************************************** + * Helpers that can be called from the BPF cpuidle gov. + */ +#include <linux/btf_ids.h> +#include <linux/btf.h> + +#include "../cpuidle.h" + +static struct cpuidle_governor *cpuidle_last_governor; + +/** + * restore_cpuidle_last_governor - restore last governor after bpf ext gov exiting. + */ +static void restore_cpuidle_last_governor(void) +{ + bool enabled = false; + + if (cpuidle_curr_governor) + enabled = !strncasecmp(cpuidle_curr_governor->name, EXT_GOV_NAME, CPUIDLE_NAME_LEN); + + mutex_lock(&cpuidle_lock); + if (enabled && cpuidle_last_governor) + if (cpuidle_switch_governor(cpuidle_last_governor)) + cpuidle_last_governor = NULL; + mutex_unlock(&cpuidle_lock); +} + +__bpf_kfunc_start_defs(); + +/** + * bpf_cpuidle_ext_gov_update_rating - update rating of bpf cpuidle ext governor. 
+ * @rating: target rating + * + * The BPF cpuidle ext governor is registered by default + * but remains inactive due to its default @rating being set to 1 + * which is significantly lower than that of other governors. + * + * To activate it, adjust @rating to a higher value within the BPF program. + * + * This function should be called from ops.init(). + */ +__bpf_kfunc int bpf_cpuidle_ext_gov_update_rating(unsigned int rating) +{ + int ret = -EINVAL; + struct cpuidle_governor *ext_gov; + + ext_gov = cpuidle_find_governor(EXT_GOV_NAME); + if (!ext_gov) { + ret = -EEXIST; + goto clean_up; + } + mutex_lock(&cpuidle_lock); + if (!cpuidle_curr_governor || cpuidle_curr_governor->rating < rating) { + cpuidle_last_governor = cpuidle_curr_governor; + ret = cpuidle_switch_governor(ext_gov); + } + mutex_unlock(&cpuidle_lock); + +clean_up: + return ret; +} + +/** + * bpf_cpuidle_ext_gov_latency_req - get target cpu's latency constraint + * @cpu: Target CPU + * + * The BPF program may require this info. + */ +__bpf_kfunc s64 bpf_cpuidle_ext_gov_latency_req(unsigned int cpu) +{ + return cpuidle_governor_latency_req(cpu); +} + +/** + * bpf_tick_nohz_get_sleep_length - return the expected length of the current sleep + * + * The BPF program may require this info. 
+ */ +__bpf_kfunc s64 bpf_tick_nohz_get_sleep_length(void) +{ + ktime_t delta_tick; + + return (s64)tick_nohz_get_sleep_length(&delta_tick); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(cpuidle_ext_gov_kfuncs) +BTF_ID_FLAGS(func, bpf_cpuidle_ext_gov_update_rating) +BTF_ID_FLAGS(func, bpf_cpuidle_ext_gov_latency_req) +BTF_ID_FLAGS(func, bpf_tick_nohz_get_sleep_length) +BTF_KFUNCS_END(cpuidle_ext_gov_kfuncs) + +static const struct btf_kfunc_id_set cpuidle_ext_gov_kfuncs_set = { + .owner = THIS_MODULE, + .set = &cpuidle_ext_gov_kfuncs, +}; + +static int cpuidle_gov_kfuncs_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &cpuidle_ext_gov_kfuncs_set); +} + +/******************************************************************************** + * bpf_struct_ops plumbing. + */ +#include <linux/bpf_verifier.h> +#include <linux/bpf.h> + +#define CPUIDLE_GOV_EXT_NAME_LEN 128 +enum ops_enable_state { + OPS_ENABLED, + OPS_DISABLED, +}; + +static const struct btf_type *cpuidle_device_type; +static u32 cpuidle_device_type_id; +static struct cpuidle_gov_ext_ops *ops; + +static DEFINE_MUTEX(ops_mutex); +DEFINE_STATIC_KEY_FALSE(ops_enabled_key); +static atomic_t ops_enable_state_var = ATOMIC_INIT(OPS_DISABLED); + +struct cpuidle_gov_ext_ops { + /** + * enable - cpuidle ext governor enable + * @drv: cpuidle driver containing state data. + * @dev: target cpu + */ + int (*enable)(struct cpuidle_driver *drv, struct cpuidle_device *dev); + + /** + * disable - cpuidle ext governor disable + * @drv: cpuidle driver containing state data. + * @dev: target cpu + */ + void (*disable)(struct cpuidle_driver *drv, struct cpuidle_device *dev); + + /* + * select - select the next cpu idle state to enter + * @drv: cpuidle driver containing state data. 
+ * @dev: target cpu + */ + int (*select)(struct cpuidle_driver *drv, struct cpuidle_device *dev); + + /* + * set_stop_tick - whether or not to stop the scheduler tick + * automatically called after selecting cpuidle state + */ + bool (*set_stop_tick)(void); + + /* + * reflect - Give the governor an opportunity to reflect on the outcome + * @dev: target cpu + * @index: last idle state which target cpu has entered + */ + void (*reflect)(struct cpuidle_device *dev, int index); + + /** + * init - Initialize the BPF cpuidle governor + */ + int (*init)(void); + + /** + * exit - Clean up after the BPF cpuidle governor + */ + void (*exit)(void); + + /** + * name - BPF cpuidle governor name + */ + char name[CPUIDLE_GOV_EXT_NAME_LEN]; +}; + +static enum ops_enable_state get_ops_enable_state(void) +{ + return atomic_read(&ops_enable_state_var); +} + +static enum ops_enable_state +set_ops_enable_state(enum ops_enable_state to) +{ + return atomic_xchg(&ops_enable_state_var, to); +} + +static int enable_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) { return 0; } +static void disable_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) {} +static int select_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) { return 0; } +static bool set_stop_tick_stub(void) {return false; } +static void reflect_stub(struct cpuidle_device *dev, int index) {} +static int init_stub(void) { return 0; } +static void exit_stub(void) {} + +static struct cpuidle_gov_ext_ops __bpf_ops_cpuidle_gov_ext_ops = { + .enable = enable_stub, + .disable = disable_stub, + .select = select_stub, + .set_stop_tick = set_stop_tick_stub, + .reflect = reflect_stub, + .init = init_stub, + .exit = exit_stub, +}; + +static int ext_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, int off, + int size) +{ + const struct btf_type *t; + + t = btf_type_by_id(reg->btf, reg->btf_id); + if (t == cpuidle_device_type) { + for (int i = 0; i < CPUIDLE_STATE_MAX; i++) { 
+ size_t base_offset = offsetof(struct cpuidle_device, states_usage[i]); + + if (off >= base_offset + offsetof(struct cpuidle_state_usage, disable) && + off + size <= base_offset + offsetofend(struct cpuidle_state_usage, disable)) { + return SCALAR_VALUE; + } + } + } + + return -EACCES; +} + +static const struct bpf_verifier_ops ops_verifier = { + .get_func_proto = bpf_base_func_proto, + .is_valid_access = btf_ctx_access, + .btf_struct_access = ext_btf_struct_access, +}; + +static void ops_disable(void) +{ + restore_cpuidle_last_governor(); + WARN_ON_ONCE(set_ops_enable_state(OPS_DISABLED) != OPS_ENABLED); + static_branch_disable(&ops_enabled_key); + if (ops->exit) + ops->exit(); + memset(&ops, 0, sizeof(ops)); +} + +static void ops_unreg(void *kdata, struct bpf_link *link) +{ + mutex_lock(&ops_mutex); + ops_disable(); + mutex_unlock(&ops_mutex); +} + +static int ops_reg(void *kdata, struct bpf_link *link) +{ + mutex_lock(&ops_mutex); + if (get_ops_enable_state() != OPS_DISABLED) { + mutex_unlock(&ops_mutex); + return -EEXIST; + } + /* + * Set ops, call ops.init(), and set enable state flag + */ + ops = (struct cpuidle_gov_ext_ops *)kdata; + if (ops->init && ops->init()) { + ops_disable(); + mutex_unlock(&ops_mutex); + return -EINVAL; + } + WARN_ON_ONCE(set_ops_enable_state(OPS_ENABLED) != OPS_DISABLED); + static_branch_enable(&ops_enabled_key); + + mutex_unlock(&ops_mutex); + return 0; +} + +static int ops_check_member(const struct btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + u32 moff = __btf_member_bit_offset(t, member) / 8; + + switch (moff) { + case offsetof(struct cpuidle_gov_ext_ops, enable): + case offsetof(struct cpuidle_gov_ext_ops, disable): + case offsetof(struct cpuidle_gov_ext_ops, select): + case offsetof(struct cpuidle_gov_ext_ops, set_stop_tick): + case offsetof(struct cpuidle_gov_ext_ops, reflect): + case offsetof(struct cpuidle_gov_ext_ops, init): + case offsetof(struct cpuidle_gov_ext_ops, exit): + break; + 
default: + return -EINVAL; + } + return 0; +} + +static int ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + const struct cpuidle_gov_ext_ops *uops = udata; + struct cpuidle_gov_ext_ops *ops = kdata; + u32 moff = __btf_member_bit_offset(t, member) / 8; + int ret; + + switch (moff) { + case offsetof(struct cpuidle_gov_ext_ops, name): + ret = bpf_obj_name_cpy(ops->name, uops->name, + sizeof(ops->name)); + if (ret < 0) + return ret; + if (ret == 0) + return -EINVAL; + return 1; + } + return 0; +} + +static int ops_init(struct btf *btf) +{ + s32 type_id; + + type_id = btf_find_by_name_kind(btf, "cpuidle_device", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + cpuidle_device_type = btf_type_by_id(btf, type_id); + cpuidle_device_type_id = type_id; + + return 0; +} + +static int ops_update(void *kdata, void *old_kdata, struct bpf_link *link) +{ + /* + * Not support updating the actively-loaded BPF cpuidle governor + */ + return -EOPNOTSUPP; +} + +static int ops_validate(void *kdata) +{ + return 0; +} + +static struct bpf_struct_ops bpf_cpuidle_gov_ext_ops = { + .verifier_ops = &ops_verifier, + .reg = ops_reg, + .unreg = ops_unreg, + .check_member = ops_check_member, + .init_member = ops_init_member, + .init = ops_init, + .update = ops_update, + .validate = ops_validate, + .name = "cpuidle_gov_ext_ops", + .owner = THIS_MODULE, + .cfi_stubs = &__bpf_ops_cpuidle_gov_ext_ops +}; + +/******************************************************************************** + * default cpuidle ext governor implementations + */ +#define ALPHA_SCALE 100 +#define FIT_FACTOR 90 + +struct cpuidle_gov_ext { + int cpu; + int last_idx; + u64 last_duration; + u64 next_pred; +}; + +DEFINE_PER_CPU(struct cpuidle_gov_ext, cpuidle_gov_ext_data); + +static void update_predict_duration(struct cpuidle_gov_ext *data, + struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + int idx; + struct cpuidle_state *target; + + if 
(!data || !drv || !dev) + return; + idx = data->last_idx; + data->last_duration = dev->last_residency_ns; + if (idx > 0) { + target = &drv->states[idx]; + if (data->last_duration > target->exit_latency) + data->last_duration -= target->exit_latency; + } + data->next_pred = data->last_duration; +} + +static void ext_reflect_dfl(struct cpuidle_device *dev, int index) +{ + struct cpuidle_gov_ext *data = this_cpu_ptr(&cpuidle_gov_ext_data); + + if (!data) + return; + data->last_idx = index; +} + +static int ext_select_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev, + bool *stop_tick) +{ + int i, selected; + struct cpuidle_gov_ext *data; + ktime_t delta_tick; + s64 delta = tick_nohz_get_sleep_length(&delta_tick); + s64 latency_req = cpuidle_governor_latency_req(dev->cpu); + + data = this_cpu_ptr(&cpuidle_gov_ext_data); + if (!data) + return 0; + + /* + * We aim to achieve function redefinition through BPF ops.select(), + * so we do not use complex algorithm here. + */ + update_predict_duration(data, drv, dev); + for (i = drv->state_count - 1; i > 0; i--) { + struct cpuidle_state *s = &drv->states[i]; + struct cpuidle_state_usage *su = &dev->states_usage[i]; + + if (su->disable) + continue; + + if (latency_req < s->exit_latency_ns) + continue; + + if (delta < s->target_residency_ns) + continue; + + if (data->next_pred / FIT_FACTOR * ALPHA_SCALE < s->target_residency_ns) + continue; + break; + } + selected = i; + return selected; +} + +static int ext_enable_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + struct cpuidle_gov_ext *data = &per_cpu(cpuidle_gov_ext_data, dev->cpu); + + memset(data, 0, sizeof(struct cpuidle_gov_ext)); + data->cpu = dev->cpu; + return 0; +} + +static void ext_disable_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev) { } + +/******************************************************************************** + * Register and init cpuidle governor + */ +static int ext_enable(struct cpuidle_driver *drv, struct 
cpuidle_device *dev) +{ + if (static_branch_likely(&ops_enabled_key)) + return ops->enable(drv, dev); + return ext_enable_dfl(drv, dev); +} + +static void ext_disable(struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + if (static_branch_likely(&ops_enabled_key)) + return ops->disable(drv, dev); + return ext_disable_dfl(drv, dev); +} + +static int ext_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, + bool *stop_tick) +{ + int state = 0; + + if (static_branch_likely(&ops_enabled_key)) { + state = ops->select(drv, dev); + *stop_tick = ops->set_stop_tick(); + } else { + state = ext_select_dfl(drv, dev, stop_tick); + } + return state; +} + +static void ext_reflect(struct cpuidle_device *dev, int index) +{ + if (static_branch_likely(&ops_enabled_key)) + ops->reflect(dev, index); + ext_reflect_dfl(dev, index); +} + +static struct cpuidle_governor ext_governor = { + .name = EXT_GOV_NAME, + .rating = 1, + .enable = ext_enable, + .disable = ext_disable, + .select = ext_select, + .reflect = ext_reflect, +}; + +static int __init init_ext(void) +{ + int ret; + + ret = cpuidle_register_governor(&ext_governor); + if (ret) + return ret; + + ret = register_bpf_struct_ops(&bpf_cpuidle_gov_ext_ops, cpuidle_gov_ext_ops); + if (ret) { + pr_err("bpf_cpuidle_gov_ext_ops register fail: %d\n", ret); + return ret; + } + + ret = cpuidle_gov_kfuncs_init(); + if (ret) { + pr_err("bpf cpuidle_gov_kfuncs_init register fail: %d\n", ret); + return ret; + } + + return ret; +} + +postcore_initcall(init_ext); +MODULE_LICENSE("GPL");
Add a test to verify the cpuidle ext governor's load, attach, and kfuncs.
This patch also provides a simple demonstration of `cpuidle_gov_ext_ops` usage:
- In `ops.init()`, the rating is set to 60, significantly exceeding the other governors' ratings, to activate `cpuidle_gov_ext`.
- For specific scenarios (e.g., screen-off music playback on mobile devices), `expect_deeper` can be enabled to transition to deeper idle states.
This implementation serves as a foundation, not a final solution. Further exploration of cpuidle strategies optimized for various usage scenarios can build on it.
Signed-off-by: Lin Yikai <yikai.lin@vivo.com>
---
 .../bpf/prog_tests/test_cpuidle_gov_ext.c     |  28 +++
 .../selftests/bpf/progs/cpuidle_gov_ext.c     | 208 ++++++++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
 create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c new file mode 100644 index 000000000000..8b35771ada44 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * test_cpuidle_gov_ext.c - test cpuidle governor ext's load, attach and kfuncs + * + * Copyright (C) Yikai Lin yikai.lin@vivo.com + */ + +#include <test_progs.h> +#include "cpuidle_gov_ext.skel.h" + +void test_test_cpuidle_gov_ext(void) +{ + struct cpuidle_gov_ext *skel; + int err; + + skel = cpuidle_gov_ext__open_and_load(); + if (!ASSERT_OK_PTR(skel, "cpuidle_gov_ext__open_and_load")) + return; + + skel->bss->expect_deeper = 1; + err = cpuidle_gov_ext__attach(skel); + if (!ASSERT_OK(err, "cpuidle_gov_ext__attach")) + goto cleanup; + +cleanup: + cpuidle_gov_ext__destroy(skel); +} + diff --git a/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c new file mode 100644 index 000000000000..62d5a9bc8cb3 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * cpuidle_gov_ext.c - test to use cpuidle governor ext by bpf + * + * Copyright (C) Yikai Lin yikai.lin@vivo.com + */ + +#include "vmlinux.h" + +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> + +char LICENSE[] SEC("license") = "GPL"; + +#ifndef ARRAY_SIZE +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) +#endif +#ifndef max +#define max(a, b) ((a) > (b) ? (a) : (b)) +#endif +#ifndef min +#define min(a, b) ((a) < (b) ? 
(a) : (b)) +#endif + +#define ALPHA 10 +#define ALPHA_SCALE 100 +#define FIT_FACTOR 90 + +/* + * For some low-power scenarios, + * such as the screen off scenario of mobile devices + * (which will be determined by the user-space BPF program), + * we aim to choose a deeper state + * At this point, we will somewhat disregard the impact on CPU performance. + */ +int expect_deeper = 0; + +int bpf_cpuidle_ext_gov_update_rating(unsigned int rating) __ksym __weak; +s64 bpf_cpuidle_ext_gov_latency_req(unsigned int cpu) __ksym __weak; +s64 bpf_tick_nohz_get_sleep_length(void) __ksym __weak; + +struct cpuidle_gov_data { + int cpu; + int last_idx; + u64 last_pred; + u64 last_duration; + u64 next_pred; +}; + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct cpuidle_gov_data); +} cpuidle_gov_data_map SEC(".maps"); + +static u64 calculate_ewma(u64 last, u64 new, u32 alpha, u32 alpha_scale) +{ + return (alpha * new + (alpha_scale - alpha) * last) / alpha_scale; +} + +static void update_predict_duration(struct cpuidle_gov_data *data, + struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + int idx; + struct cpuidle_state target; + + if (!data || !drv || !dev) + return; + idx = data->last_idx; + data->last_duration = dev->last_residency_ns; + if (idx > 0) { + bpf_core_read(&target, sizeof(target), &drv->states[idx]); + if (data->last_duration > target.exit_latency) + data->last_duration -= target.exit_latency; + } + data->last_pred = data->next_pred; + data->next_pred = calculate_ewma(data->next_pred, + data->last_duration, ALPHA, ALPHA_SCALE); +} + +/* Enable the cpuidle governor */ +SEC("struct_ops.s/enable") +int BPF_PROG(bpf_cpuidle_enable, struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + u32 key = 0; + struct cpuidle_gov_data *data; + + bpf_printk("cpuidle_gov_ext: enabled"); + data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu); + if (!data) + return 0; + + 
__builtin_memset(data, 0, sizeof(struct cpuidle_gov_data)); + data->cpu = dev->cpu; + return 0; +} + +/* Disable the cpuidle governor */ +SEC("struct_ops.s/disable") +void BPF_PROG(bpf_cpuidle_disable, struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + bpf_printk("cpuidle_gov_ext: disabled"); +} + +/* Select the next idle state */ +SEC("struct_ops.s/select") +int BPF_PROG(bpf_cpuidle_select, struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + u32 key = 0; + s64 delta, latency_req, residency_ns; + int i, selected; + unsigned long long disable = 0; + struct cpuidle_gov_data *data; + struct cpuidle_state cs; + + data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu); + if (!data) { + bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__); + return 0; + } + latency_req = bpf_cpuidle_ext_gov_latency_req(dev->cpu); + delta = bpf_tick_nohz_get_sleep_length(); + + update_predict_duration(data, drv, dev); + + for (i = ARRAY_SIZE(drv->states)-1; i > 0; i--) { + if (i > drv->state_count-1) + continue; + bpf_core_read(&cs, sizeof(cs), &drv->states[i]); + bpf_core_read(&disable, sizeof(disable), &dev->states_usage[i]); + + if (disable) + continue; + + if (latency_req < cs.exit_latency_ns) + continue; + + if (delta < cs.target_residency_ns) + continue; + + if (data->next_pred / FIT_FACTOR * ALPHA_SCALE < cs.target_residency_ns) + continue; + + break; + } + residency_ns = drv->states[i].target_residency_ns; + if (expect_deeper && + i < drv->state_count - 1 && + data->last_pred >= residency_ns && + data->next_pred < residency_ns && + data->next_pred / FIT_FACTOR * ALPHA_SCALE >= residency_ns && + data->next_pred / FIT_FACTOR * ALPHA_SCALE >= data->last_duration && + delta > residency_ns) { + i++; + } + + selected = i; + return selected; +} + +//enable or disable scheduling tick after selecting cpuidle state +SEC("struct_ops.s/set_stop_tick") +bool BPF_PROG(bpf_cpuidle_set_stop_tick) +{ + return false; +} + +/* Reflect 
function called after entering an idle state */ +SEC("struct_ops.s/reflect") +void BPF_PROG(bpf_cpuidle_reflect, struct cpuidle_device *dev, int index) +{ + u32 key = 0; + struct cpuidle_gov_data *data; + + data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu); + if (!data) { + bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__); + return; + } + data->last_idx = index; +} + +/* Initialize the BPF cpuidle governor */ +SEC("struct_ops.s/init") +int BPF_PROG(bpf_cpuidle_init) +{ + int ret = bpf_cpuidle_ext_gov_update_rating(60); + return ret; +} + +/* Cleanup after the BPF cpuidle governor */ +SEC("struct_ops.s/exit") +void BPF_PROG(bpf_cpuidle_exit) { } + +/* Struct_ops linkage for cpuidle governor */ +SEC(".struct_ops.link") +struct cpuidle_gov_ext_ops ops = { + .enable = (void *)bpf_cpuidle_enable, + .disable = (void *)bpf_cpuidle_disable, + .select = (void *)bpf_cpuidle_select, + .set_stop_tick = (void *)bpf_cpuidle_set_stop_tick, + .reflect = (void *)bpf_cpuidle_reflect, + .init = (void *)bpf_cpuidle_init, + .exit = (void *)bpf_cpuidle_exit, + .name = "BPF_cpuidle_gov" +};
On 8/29/25 11:11, Lin Yikai wrote:
Add test to verify cpuidle governor ext's load, attach, and kfuncs.
This patch also provides a simple demonstration of `cpuidle_gov_ext_ops` usage:
- In `ops.init()`, we set the "rating" value to 60 - significantly exceeding other governors' ratings - to activate `cpuidle_gov_ext`.
- For specific scenarios (e.g., screen-off music playback on mobile devices), we can enable "expect_deeper" to transition to deeper idle states.
This implementation serves as a foundation, not a final solution. Further exploration of cpuidle strategies optimized for various usage scenarios can build on it.
This is the example cpuidle BPF governor, I guess; just some very first comments.
The patch title should mention "cpuidle governor" somehow.
Signed-off-by: Lin Yikai yikai.lin@vivo.com
.../bpf/prog_tests/test_cpuidle_gov_ext.c | 28 +++ .../selftests/bpf/progs/cpuidle_gov_ext.c | 208 ++++++++++++++++++ 2 files changed, 236 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c new file mode 100644 index 000000000000..8b35771ada44 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- test_cpuidle_gov_ext.c - test cpuidle governor ext's load, attach and kfuncs
- Copyright (C) Yikai Lin yikai.lin@vivo.com
- */
+#include <test_progs.h> +#include "cpuidle_gov_ext.skel.h"
+void test_test_cpuidle_gov_ext(void) +{
- struct cpuidle_gov_ext *skel;
- int err;
- skel = cpuidle_gov_ext__open_and_load();
- if (!ASSERT_OK_PTR(skel, "cpuidle_gov_ext__open_and_load"))
return;
- skel->bss->expect_deeper = 1;
- err = cpuidle_gov_ext__attach(skel);
- if (!ASSERT_OK(err, "cpuidle_gov_ext__attach"))
goto cleanup;
+cleanup:
- cpuidle_gov_ext__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c new file mode 100644 index 000000000000..62d5a9bc8cb3 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- cpuidle_gov_ext.c - test to use cpuidle governor ext by bpf
- Copyright (C) Yikai Lin yikai.lin@vivo.com
- */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h>
+char LICENSE[] SEC("license") = "GPL";
+#ifndef ARRAY_SIZE +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) +#endif +#ifndef max +#define max(a, b) ((a) > (b) ? (a) : (b)) +#endif +#ifndef min +#define min(a, b) ((a) < (b) ? (a) : (b)) +#endif
+#define ALPHA 10 +#define ALPHA_SCALE 100 +#define FIT_FACTOR 90
+/*
- For some low-power scenarios,
- such as the screen off scenario of mobile devices
- (which will be determined by the user-space BPF program),
- we aim to choose a deeper state
- At this point, we will somewhat disregard the impact on CPU performance.
- */
+int expect_deeper = 0;
+int bpf_cpuidle_ext_gov_update_rating(unsigned int rating) __ksym __weak; +s64 bpf_cpuidle_ext_gov_latency_req(unsigned int cpu) __ksym __weak; +s64 bpf_tick_nohz_get_sleep_length(void) __ksym __weak;
These would need proper headers, every BPF cpuidle governor would need them, no?
+struct cpuidle_gov_data {
- int cpu;
- int last_idx;
- u64 last_pred;
- u64 last_duration;
- u64 next_pred;
+};
+struct {
- __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
- __uint(max_entries, 1);
- __type(key, u32);
- __type(value, struct cpuidle_gov_data);
+} cpuidle_gov_data_map SEC(".maps");
+static u64 calculate_ewma(u64 last, u64 new, u32 alpha, u32 alpha_scale) +{
- return (alpha * new + (alpha_scale - alpha) * last) / alpha_scale;
+}
+static void update_predict_duration(struct cpuidle_gov_data *data,
struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
- int idx;
- struct cpuidle_state target;
- if (!data || !drv || !dev)
return;
- idx = data->last_idx;
- data->last_duration = dev->last_residency_ns;
- if (idx > 0) {
bpf_core_read(&target, sizeof(target), &drv->states[idx]);
if (data->last_duration > target.exit_latency)
data->last_duration -= target.exit_latency;
- }
- data->last_pred = data->next_pred;
- data->next_pred = calculate_ewma(data->next_pred,
data->last_duration, ALPHA, ALPHA_SCALE);
+}
> +/* Enable the cpuidle governor */
> +SEC("struct_ops.s/enable")
> +int BPF_PROG(bpf_cpuidle_enable, struct cpuidle_driver *drv,
> +	     struct cpuidle_device *dev)
> +{
> +	u32 key = 0;
> +	struct cpuidle_gov_data *data;
> +
> +	bpf_printk("cpuidle_gov_ext: enabled");
> +	data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
> +	if (!data)
> +		return 0;
> +	__builtin_memset(data, 0, sizeof(struct cpuidle_gov_data));
> +	data->cpu = dev->cpu;
> +	return 0;
> +}
> +
> +/* Disable the cpuidle governor */
> +SEC("struct_ops.s/disable")
> +void BPF_PROG(bpf_cpuidle_disable, struct cpuidle_driver *drv,
> +	      struct cpuidle_device *dev)
> +{
> +	bpf_printk("cpuidle_gov_ext: disabled");
> +}
> +
> +/* Select the next idle state */
> +SEC("struct_ops.s/select")
> +int BPF_PROG(bpf_cpuidle_select, struct cpuidle_driver *drv,
> +	     struct cpuidle_device *dev)
> +{
> +	u32 key = 0;
> +	s64 delta, latency_req, residency_ns;
> +	int i, selected;
> +	unsigned long long disable = 0;
> +	struct cpuidle_gov_data *data;
> +	struct cpuidle_state cs;
> +
> +	data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
> +	if (!data) {
> +		bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__);
> +		return 0;
> +	}
> +
> +	latency_req = bpf_cpuidle_ext_gov_latency_req(dev->cpu);
> +	delta = bpf_tick_nohz_get_sleep_length();
> +	update_predict_duration(data, drv, dev);
> +
> +	for (i = ARRAY_SIZE(drv->states) - 1; i > 0; i--) {
> +		if (i > drv->state_count - 1)
> +			continue;
When is this true?
> +		bpf_core_read(&cs, sizeof(cs), &drv->states[i]);
> +		bpf_core_read(&disable, sizeof(disable), &dev->states_usage[i]);
Is there no nicer way of doing this than reading it every select() call?
> +		if (disable)
> +			continue;
> +		if (latency_req < cs.exit_latency_ns)
> +			continue;
> +		if (delta < cs.target_residency_ns)
> +			continue;
> +		if (data->next_pred / FIT_FACTOR * ALPHA_SCALE < cs.target_residency_ns)
> +			continue;
> +		break;
> +	}
> +
> +	residency_ns = drv->states[i].target_residency_ns;
> +	if (expect_deeper &&
> +	    i < drv->state_count - 1 &&
> +	    data->last_pred >= residency_ns &&
> +	    data->next_pred < residency_ns &&
> +	    data->next_pred / FIT_FACTOR * ALPHA_SCALE >= residency_ns &&
> +	    data->next_pred / FIT_FACTOR * ALPHA_SCALE >= data->last_duration &&
> +	    delta > residency_ns) {
> +		i++;
The state at i + 1 might be disabled, though.
> +	}
> +	selected = i;
Just return i?
> +	return selected;
> +}
> +/* Enable or disable the scheduling tick after selecting a cpuidle state */
> +SEC("struct_ops.s/set_stop_tick")
> +bool BPF_PROG(bpf_cpuidle_set_stop_tick)
> +{
> +	return false;
> +}
> +/* Reflect function called after entering an idle state */
> +SEC("struct_ops.s/reflect")
> +void BPF_PROG(bpf_cpuidle_reflect, struct cpuidle_device *dev, int index)
> +{
> +	u32 key = 0;
> +	struct cpuidle_gov_data *data;
> +
> +	data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
> +	if (!data) {
> +		bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__);
> +		return;
> +	}
> +	data->last_idx = index;
> +}
> +/* Initialize the BPF cpuidle governor */
> +SEC("struct_ops.s/init")
> +int BPF_PROG(bpf_cpuidle_init)
> +{
> +	int ret = bpf_cpuidle_ext_gov_update_rating(60);
> +
> +	return ret;
The ret variable is unnecessary; return the call's result directly.
> +}
> +/* Cleanup after the BPF cpuidle governor */
> +SEC("struct_ops.s/exit")
> +void BPF_PROG(bpf_cpuidle_exit) { }
> +
> +/* Struct_ops linkage for the cpuidle governor */
> +SEC(".struct_ops.link")
> +struct cpuidle_gov_ext_ops ops = {
> +	.enable		= (void *)bpf_cpuidle_enable,
> +	.disable	= (void *)bpf_cpuidle_disable,
> +	.select		= (void *)bpf_cpuidle_select,
> +	.set_stop_tick	= (void *)bpf_cpuidle_set_stop_tick,
> +	.reflect	= (void *)bpf_cpuidle_reflect,
> +	.init		= (void *)bpf_cpuidle_init,
> +	.exit		= (void *)bpf_cpuidle_exit,
> +	.name		= "BPF_cpuidle_gov",
> +};
Hi,
Thanks for the patchset.
Some logistics:
1. Please prefix future patches properly with "bpf" or "bpf-next", for example, [PATCH v2 bpf-next 1/2].
2. Please be specific with the patch title, i.e. "selftests/bpf: Add selftests" should be something like "selftests/bpf: Add selftests for cpu-idle ext".
On Fri, Aug 29, 2025 at 3:11 AM Lin Yikai <yikai.lin@vivo.com> wrote:
> Summary
>
> Hi, everyone, this patch set introduces an extensible cpuidle governor framework using BPF struct_ops, enabling dynamic implementation of idle-state selection policies via BPF programs.
>
> Motivation
>
> As is well known, CPUs support multiple idle states (e.g., C0, C1, C2, ...), where deeper states reduce power consumption but result in longer wakeup latency, potentially affecting performance. Existing generic cpuidle governors operate effectively in common scenarios but exhibit suboptimal behavior in certain Android phone use cases.
>
> Our testing reveals that during low-utilization scenarios (e.g., screen-off background tasks like music playback with CPU utilization <10%), the C0 state occupies ~50% of idle time, causing significant energy inefficiency. Reducing C0 to ≤20% could yield ≥5% power savings on mobile phones.
>
> To address this, we want:
> 1. Dynamic switching to power-saving policies in low-CPU-utilization scenarios (e.g., screen-off mode)
> 2. Dynamic switching to alternate governors in high-performance scenarios (e.g., gaming)
>
> Overview
>
> The BPF cpuidle ext governor registers at postcore_initcall() but remains disabled by default due to its low priority "rating" of 1. Activation requires raising its "rating" above the other governors' from BPF.
> Core Components:
> 1. **struct cpuidle_gov_ext_ops** - BPF-overridable operations:
>    - ops.enable()/ops.disable(): enable and disable callbacks
>    - ops.select(): CPU idle-state selection logic
>    - ops.set_stop_tick(): scheduler tick management after state selection
>    - ops.reflect(): feedback about the previously entered idle state
>    - ops.init()/ops.deinit(): initialization and cleanup
> 2. **Critical kfuncs for kernel state access**:
>    - bpf_cpuidle_ext_gov_update_rating(): activate the ext governor by raising its rating; must be called from ops.init()
>    - bpf_cpuidle_ext_gov_latency_req(): get idle-state latency constraints
>    - bpf_tick_nohz_get_sleep_length(): get the CPU sleep duration in tickless mode
>
> Future work
>
> - Scenario detection: identifying low-utilization states (e.g., screen-off + background music)
> - Policy optimization: optimizing state-selection algorithms for specific scenarios
I am not an expert on cpuidle, so pardon me if the following are rookie questions. But I guess some more detail will help other folks too.
1. It is not clear to me why a BPF based solution is needed here. Can we achieve similar benefits with a knob and some userspace daemon?
2. Is it possible to extend sched_ext to cover cpuidle logic?
Thanks,
Song