From: Joshua Hahn joshua.hahn6@gmail.com
v2 -> v3: Added Signed-off-by & renamed subject for clarity. v1 -> v2: Edited commit messages for clarity.
Niced CPU usage is a metric reported in the host-level /proc/stat, but is not reported in cgroup-level statistics in cpu.stat. When a host runs multiple tasks across different workloads, it becomes difficult to gauge how much time each workload spends in niced processes based on /proc/stat alone, since host-level metrics do not provide this cgroup-level granularity.
Exposing this metric will allow users to accurately probe niced CPU time for each workload, and make more informed decisions when directing higher-priority tasks.
Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)
From: Joshua Hahn joshua.hahn6@gmail.com
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to better understand their hardware limits and facilitating more informed workload distribution.
A new field 'ntime' is added to struct cgroup_base_stat as opposed to struct task_cputime to minimize footprint.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/cgroup-defs.h |  1 +
 kernel/cgroup/rstat.c       | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..a2fcb3db6c52 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -315,6 +315,7 @@ struct cgroup_base_stat {
 #ifdef CONFIG_SCHED_CORE
 	u64 forceidle_sum;
 #endif
+	u64 ntime;
 };
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a06b45272411..a77ba9a83bab 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -444,6 +444,7 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime += src_bstat->ntime;
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -455,6 +456,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime -= src_bstat->ntime;
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -535,7 +537,10 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 
 	switch (index) {
 	case CPUTIME_USER:
+		rstatc->bstat.cputime.utime += delta_exec;
+		break;
 	case CPUTIME_NICE:
+		rstatc->bstat.ntime += delta_exec;
 		rstatc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
@@ -591,6 +596,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 #ifdef CONFIG_SCHED_CORE
 		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
 #endif
+		bstat->ntime += cpustat[CPUTIME_NICE];
 	}
 }
 
@@ -608,13 +614,14 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	u64 usage, utime, stime;
+	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
 		cgroup_rstat_flush_hold(cgrp);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
+		ntime = cgrp->bstat.ntime;
 		cgroup_rstat_flush_release(cgrp);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
@@ -622,16 +629,19 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		utime = cgrp->bstat.cputime.utime;
 		stime = cgrp->bstat.cputime.stime;
+		ntime = cgrp->bstat.ntime;
 	}
 
 	do_div(usage, NSEC_PER_USEC);
 	do_div(utime, NSEC_PER_USEC);
 	do_div(stime, NSEC_PER_USEC);
+	do_div(ntime, NSEC_PER_USEC);
 
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
-		   "system_usec %llu\n",
-		   usage, utime, stime);
+		   "system_usec %llu\n"
+		   "nice_usec %llu\n",
+		   usage, utime, stime, ntime);
 
 	cgroup_force_idle_show(seq, &cgrp->bstat);
 }
On Mon, Sep 23, 2024 at 07:20:05AM GMT, Joshua Hahn joshua.hahnjy@gmail.com wrote:
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -535,7 +537,10 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 	switch (index) {
 	case CPUTIME_USER:
+		rstatc->bstat.cputime.utime += delta_exec;
+		break;
 	case CPUTIME_NICE:
+		rstatc->bstat.ntime += delta_exec;
 		rstatc->bstat.cputime.utime += delta_exec;
 		break;
Nit: slightly better diffstat is possible with fallthrough:
rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
 	switch (index) {
-	case CPUTIME_USER:
 	case CPUTIME_NICE:
+		rstatc->bstat.ntime += delta_exec;
+		fallthrough;
+	case CPUTIME_USER:
 		rstatc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
@@ -622,16 +629,19 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
...
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
-		   "system_usec %llu\n",
-		   usage, utime, stime);
+		   "system_usec %llu\n"
+		   "nice_usec %llu\n",
+		   usage, utime, stime, ntime);
This seems to be different whitespace alignment than user_usec above.
(Implementation looks good, I only have some remarks to the concept, reply to cover letter.)
Michal
From: Joshua Hahn joshua.hahn6@gmail.com
Creates a cgroup with a single nice CPU hog process running. fork() is called to generate the nice process because un-nicing is not possible (see man nice(3)). If fork() was not used to generate the CPU hog, we would run the rest of the cgroup selftest suite as a nice process.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 1 file changed, 72 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index dad2ed82f3ef..cd5550391f49 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -8,6 +8,7 @@
 #include <pthread.h>
 #include <stdio.h>
 #include <time.h>
+#include <unistd.h>
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -229,6 +230,76 @@ static int test_cpucg_stats(const char *root)
 	return ret;
 }
 
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+static int test_cpucg_nice(const char *root)
+{
+	int ret = KSFT_FAIL;
+	int status;
+	long user_usec, nice_usec;
+	long usage_seconds = 2;
+	long expected_nice_usec = usage_seconds * USEC_PER_SEC;
+	char *cpucg;
+	pid_t pid;
+
+	cpucg = cg_name(root, "cpucg_test");
+	if (!cpucg)
+		goto cleanup;
+
+	if (cg_create(cpucg))
+		goto cleanup;
+
+	user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+	nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+	if (user_usec != 0 || nice_usec != 0)
+		goto cleanup;
+
+	/*
+	 * We fork here to create a new process that can be niced without
+	 * polluting the nice value of other selftests
+	 */
+	pid = fork();
+	if (pid < 0) {
+		goto cleanup;
+	} else if (pid == 0) {
+		struct cpu_hog_func_param param = {
+			.nprocs = 1,
+			.ts = {
+				.tv_sec = usage_seconds,
+				.tv_nsec = 0,
+			},
+			.clock_type = CPU_HOG_CLOCK_PROCESS,
+		};
+
+		/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
+		nice(1);
+		cg_run(cpucg, hog_cpus_timed, (void *)&param);
+		exit(0);
+	} else {
+		waitpid(pid, &status, 0);
+		if (!WIFEXITED(status))
+			goto cleanup;
+
+		user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+		nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+		if (nice_usec > user_usec || user_usec <= 0)
+			goto cleanup;
+
+		if (!values_close(nice_usec, expected_nice_usec, 1))
+			goto cleanup;
+
+		ret = KSFT_PASS;
+	}
+
+cleanup:
+	cg_destroy(cpucg);
+	free(cpucg);
+
+	return ret;
+}
+
 static int run_cpucg_weight_test(
 	const char *root,
@@ -686,6 +757,7 @@ struct cpucg_test {
 } tests[] = {
 	T(test_cpucg_subtree_control),
 	T(test_cpucg_stats),
+	T(test_cpucg_nice),
 	T(test_cpucg_weight_overprovisioned),
 	T(test_cpucg_weight_underprovisioned),
 	T(test_cpucg_nested_weight_overprovisioned),
On Mon, Sep 23, 2024 at 07:20:06AM GMT, Joshua Hahn joshua.hahnjy@gmail.com wrote:
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+static int test_cpucg_nice(const char *root)
+{
+	int ret = KSFT_FAIL;
+	int status;
+	long user_usec, nice_usec;
+	long usage_seconds = 2;
+	long expected_nice_usec = usage_seconds * USEC_PER_SEC;
+	char *cpucg;
+	pid_t pid;
+
+	cpucg = cg_name(root, "cpucg_test");
+	if (!cpucg)
+		goto cleanup;
+
+	if (cg_create(cpucg))
+		goto cleanup;
+
+	user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+	nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+	if (user_usec != 0 || nice_usec != 0)
+		goto cleanup;
Can you please distinguish a check between non-zero nice_usec and non-existent nice_usec (KSFT_FAIL vs KSFT_SKIP)? So that the selftest is usable on older kernels too.
+	/*
+	 * We fork here to create a new process that can be niced without
+	 * polluting the nice value of other selftests
+	 */
+	pid = fork();
+	if (pid < 0) {
+		goto cleanup;
+	} else if (pid == 0) {
+		struct cpu_hog_func_param param = {
+			.nprocs = 1,
+			.ts = {
+				.tv_sec = usage_seconds,
+				.tv_nsec = 0,
+			},
+			.clock_type = CPU_HOG_CLOCK_PROCESS,
+		};
+
+		/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
+		nice(1);
+		cg_run(cpucg, hog_cpus_timed, (void *)&param);
Notice that cg_run() does fork itself internally. So you can call hog_cpus_timed(cpucg, (void *)¶m) directly, no need for the fork with cg_run(). (Alternatively substitute fork in this test with the fork in cg_run() but with extension of cpu_hog_func_params with the nice value.)
Thanks, Michal
On Thu, Sep 26, 2024 at 2:10 PM Michal Koutný mkoutny@suse.com wrote:
On Mon, Sep 23, 2024 at 07:20:06AM GMT, Joshua Hahn joshua.hahnjy@gmail.com wrote:
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+	user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+	nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+	if (user_usec != 0 || nice_usec != 0)
+		goto cleanup;
Can you please distinguish a check between non-zero nice_usec and non-existent nice_usec (KSFT_FAIL vs KSFT_SKIP)? So that the selftest is usable on older kernels too.
Yes, this sounds good to me -- I will include it in a v4, which I am hoping to send out soon.
	/*
	 * We fork here to create a new process that can be niced without
	 * polluting the nice value of other selftests
	 */
	pid = fork();
	if (pid < 0) {
		goto cleanup;
	} else if (pid == 0) {
		struct cpu_hog_func_param param = {
			.nprocs = 1,
			.ts = {
				.tv_sec = usage_seconds,
				.tv_nsec = 0,
			},
			.clock_type = CPU_HOG_CLOCK_PROCESS,
		};

		/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
		nice(1);
		cg_run(cpucg, hog_cpus_timed, (void *)&param);
Notice that cg_run() does fork itself internally. So you can call hog_cpus_timed(cpucg, (void *)¶m) directly, no need for the fork with cg_run(). (Alternatively substitute fork in this test with the fork in cg_run() but with extension of cpu_hog_func_params with the nice value.)
Thanks, Michal
Thank you for your feedback, Michal. The reason I used a fork in the testing is so that I could isolate the niced portion of the test to only the CPU hog. If I were to nice(1) --> cg_hog() in a single process without forking, this would mean that the cleanup portion of the test would also be run as a niced process, contributing to the stat and potentially dirtying the value (which is tested for accuracy via `values_close`).
The other thing that I considered when writing this was that while it is possible to make a process nicer, it is impossible to make a process less nice. This would mean that the comparison & cleanup portions would also be run nicely if I do not call fork().
What do you think? Do you think that this increase in granularity / accuracy is worth the increase in code complexity? I do agree that it would be much easier to read if there was no fork.
Alternatively, I can add a new parameter to cpu_hog_func_param that takes in a nice value. For this, however, I am afraid of changing the function signature of existing utility functions, since it would mean breaking older tests or the work of others who are currently using them.
Thank you for your detailed feedback again -- I will also change up the diffstat and indentation issues you brought up from the first part of the patch.
Joshua
On Mon, Sep 30, 2024 at 02:07:22PM GMT, Joshua Hahn joshua.hahnjy@gmail.com wrote:
The reason I used a fork in the testing is so that I could isolate the niced portion of the test to only the CPU hog. If I were to nice(1) --> cg_hog() in a single process without forking, this would mean that the cleanup portion of the test would also be run as a niced process,
The cleanup runs in a parent process and nice is called after fork in a child in those considered cases (at least that's what I meant).
contributing to the stat and potentially dirtying the value (which is tested for accuracy via `values_close`).
Yes, a test that randomly fails (false negative) is a nuisance. One fork is needed, the second doesn't divide different priority tasks.
What do you think?
My motivation comes from debugging cgroup selftests when strace is quite useful and your implementation adds the unnecessary fork which makes the strace (slightly) less readable.
Do you think that this increase in granularity / accuracy is worth the increase in code complexity? I do agree that it would be much easier to read if there was no fork.
I think both changes (no cg_run or cpu_hog_func_param extension) could be reasonably small changes (existing usages of cpu_hog_func_param extension would default to zero nice, so the actual change would only be in hog_cpus_timed()).
Alternatively, I can add a new parameter to cpu_hog_func_param that takes in a nice value. For this however, I am afraid of changing the function signature of existing utility functions, since it would mean breaking support for older functions or others currently working on this.
The function is internal to the cgroup selftests and others can rebase, so it doesn't have to stick to a particular signature.
HTH, Michal
My motivation comes from debugging cgroup selftests when strace is quite useful and your implementation adds the unnecessary fork which makes the strace (slightly) less readable.
This makes sense, thank you for the context. I hadn't considered debugging considerations much, but I can imagine that it becomes harder to read once the code & strace becomes clogged up.
Do you think that this increase in granularity / accuracy is worth the increase in code complexity? I do agree that it would be much easier to read if there was no fork.
I think both changes (no cg_run or cpu_hog_func_param extension) could be reasonably small changes (existing usages of cpu_hog_func_param extension would default to zero nice, so the actual change would only be in hog_cpus_timed()).
I think I will stick with the no cg_run option. Initially, I had wanted to use it to maintain the same style with the other selftests in test_cpu.c, but I think it creates more unnecessary unreadability.
Thank you again, Joshua
On Mon, Sep 23, 2024 at 07:20:04AM GMT, Joshua Hahn joshua.hahnjy@gmail.com wrote:
From: Joshua Hahn joshua.hahn6@gmail.com
v2 -> v3: Signed-off-by & renamed subject for clarity. v1 -> v2: Edited commit messages for clarity.
Thanks for the version changelog, appreciated!
...
Exposing this metric will allow users to accurately probe the niced CPU metric for each workload, and make more informed decisions when directing higher priority tasks.
Possibly an example of how this value (combined with some other?) is used for decisions could shed some light on this and justify adding this attribute.
Thanks, Michal
(I'll respond here to Tejun's message from v2 thread.)
On Tue, Sep 10, 2024 at 11:01:07AM GMT, Tejun Heo tj@kernel.org wrote:
I think it's as useful as system-wide nice metric is.
Exactly -- and I don't understand how that system-wide value (without any cgroups) is useful. If I don't know how many niced and non-niced tasks there are and what their runnable patterns are, the aggregated nice time can have ambiguous interpretations.
I think there are benefits to mirroring system wide metrics, at least ones as widely spread as nice.
I agree with the benefits of mirroring some system-wide metrics when they are useful <del>but not all of them, because it's difficult/impossible to take them away once they're exposed</del>. Actually, readers _should_ handle missing keys gracefully, so this may be just fine.
(Is this nice time widely spread? (I remember the field from `top`, still not sure how to use it.) Are other proc_stat(5) fields different?
I see how this can be the global analog on leaf cgroups but interpretting middle cgroups with children of different cpu.weights?)
Hello, Michal.
On Thu, Sep 26, 2024 at 08:10:35PM +0200, Michal Koutný wrote: ...
On Tue, Sep 10, 2024 at 11:01:07AM GMT, Tejun Heo tj@kernel.org wrote:
I think it's as useful as system-wide nice metric is.
Exactly -- and I don't understand how that system-wide value (without any cgroups) is useful. If I don't know how many niced and non-niced tasks there are and what their runnable patterns are, the aggregated nice time can have ambiguous interpretations.
I think there are benefits to mirroring system wide metrics, at least ones as widely spread as nice.
I agree with the benefits of mirroring some system-wide metrics when they are useful <del>but not all of them, because it's difficult/impossible to take them away once they're exposed</del>. Actually, readers _should_ handle missing keys gracefully, so this may be just fine.
(Is this nice time widely spread? (I remember the field from `top`, still not sure how to use it.) Are other proc_stat(5) fields different?
A personal anecdote: I usually run compile jobs with nice and look at the nice utilization to see what the system is doing. I think it'd be similar for most folks. Because the number has always been there and is ubiquitous across many monitoring tools, people end up using it for something. It's not a great metric, but a long-standing and widely available one, so it ends up with usages.
BTW, there are numbers which are actively silly, e.g. iowait, especially due to how it gets aggregated across multiple CPUs. That one we want to actively drop, especially as the pressure metrics are the better substitute. I don't think nice is in that category. It's not the best metric there is, but it's not useless or misleading.
I see how this can be the global analog on leaf cgroups but interpretting middle cgroups with children of different cpu.weights?)
I think aggregating per-thread numbers is the right thing to do. It's just sum of CPU cycles spent by threads which got niced.
Thanks.