Changes in v1: - added selftests - comments rewording
RFC in https://lore.kernel.org/r/20220623124944.2753-1-mkoutny@suse.com
Michal Koutný (3): cpuset: Allow setscheduler regardless of manipulated task selftests: cgroup: Minor code reorganizations selftests: cgroup: Add cpuset migrations testcase
MAINTAINERS | 2 + kernel/cgroup/cpuset.c | 7 + tools/testing/selftests/cgroup/.gitignore | 1 + tools/testing/selftests/cgroup/Makefile | 2 + tools/testing/selftests/cgroup/cgroup_util.c | 2 + tools/testing/selftests/cgroup/cgroup_util.h | 2 + tools/testing/selftests/cgroup/test_core.c | 2 +- tools/testing/selftests/cgroup/test_cpuset.c | 272 ++++++++++++++++++ .../selftests/cgroup/test_cpuset_prs.sh | 2 +- 9 files changed, 290 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/cgroup/test_cpuset.c
base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
When we migrate a task between two cgroups, one of the checks is a verification whether we can modify task's scheduler settings (cap_task_setscheduler()).
An implicit migration occurs also when enabling a controller on the unified hierarchy (think of parent to child migration). The aforementioned check may be problematic if the caller of the migration (enabling a controller) has no permissions over migrated tasks. For instance, a user's cgroup that ends up running a process of a different user. Although cgroup permissions are configured favorably, the enablement fails due to the foreign process [1].
Change the behavior by relaxing the permissions check on the unified hierarchy (or in v2 mode). This is in accordance with unified hierarchy attachment behavior when permissions of the source to target cgroups are decisive whereas the migrated task is opaque (as opposed to more restrictive check in __cgroup1_procs_write()).
[1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649
Signed-off-by: Michal Koutný mkoutny@suse.com --- kernel/cgroup/cpuset.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e4ca2dd2b764..3b5f87a9a150 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2495,6 +2495,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) ret = task_can_attach(task, cs->effective_cpus); if (ret) goto out_unlock; + + /* + * Skip rights over task check in v2, migration permission derives + * from hierarchy ownership in cgroup_procs_write_permission()). + */ + if (is_in_v2_mode()) + continue; ret = security_task_setscheduler(task); if (ret) goto out_unlock;
On 6/29/23 05:11, Michal Koutný wrote:
When we migrate a task between two cgroups, one of the checks is a verification whether we can modify task's scheduler settings (cap_task_setscheduler()).
An implicit migration occurs also when enabling a controller on the unified hierarchy (think of parent to child migration). The aforementioned check may be problematic if the caller of the migration (enabling a controller) has no permissions over migrated tasks. For instance, a user's cgroup that ends up running a process of a different user. Although cgroup permissions are configured favorably, the enablement fails due to the foreign process [1].
Change the behavior by relaxing the permissions check on the unified hierarchy (or in v2 mode). This is in accordance with unified hierarchy attachment behavior when permissions of the source to target cgroups are decisive whereas the migrated task is opaque (as opposed to more restrictive check in __cgroup1_procs_write()).
The is_in_v2_mode() check is for supporting the v2 mode in cgroup v1. However, there is no controller enabling in v1. So I think you should just use cgroup_subsys_on_dfl(cpuset_cgrp_subsys) as the v2 check if your focus is just to prevent problem when enabling cpuset controller.
[1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649
Signed-off-by: Michal Koutný mkoutny@suse.com
kernel/cgroup/cpuset.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e4ca2dd2b764..3b5f87a9a150 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2495,6 +2495,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) ret = task_can_attach(task, cs->effective_cpus); if (ret) goto out_unlock;
/*
* Skip rights over task check in v2, migration permission derives
* from hierarchy ownership in cgroup_procs_write_permission()).
*/
if (is_in_v2_mode())
ret = security_task_setscheduler(task); if (ret) goto out_unlock;continue;
This change will likely conflict with the latest cpuset change on tracking # of dl tasks in a cpuset. You will have to, at least, move the dl task check before the security_task_setscheduler() check.
Another fact about cpuset controller enabling is that both cpus_allowed and mems_allowed are empty at that point. You may also add these checks as a preconditions for disabling the security_task_setscheduler check.
Cheers, Longman
On Thu, Jun 29, 2023 at 08:11:33AM -0400, Waiman Long longman@redhat.com wrote:
So I think you should just use cgroup_subsys_on_dfl(cpuset_cgrp_subsys) as the v2 check if your focus is just to prevent problem when enabling cpuset controller.
I thought the bare cgroup_subsys_on_dfl(cpuset_cgrp_subsys) is not used in cpuset.c but I was wrong -- yes, I'll change this.
This change will likely conflict with the latest cpuset change on tracking # of dl tasks in a cpuset. You will have to, at least, move the dl task check before the security_task_setscheduler() check.
Another fact about cpuset controller enabling is that both cpus_allowed and mems_allowed are empty at that point. You may also add these checks as a preconditions for disabling the security_task_setscheduler check.
Ah, I will rebase on fresh mainline (or do you mean another reference?).
Thanks for the hints, Michal
On 6/29/23 08:26, Michal Koutný wrote:
On Thu, Jun 29, 2023 at 08:11:33AM -0400, Waiman Long longman@redhat.com wrote:
So I think you should just use cgroup_subsys_on_dfl(cpuset_cgrp_subsys) as the v2 check if your focus is just to prevent problem when enabling cpuset controller.
I thought the bare cgroup_subsys_on_dfl(cpuset_cgrp_subsys) is not used in cpuset.c but I was wrong -- yes, I'll change this.
This change will likely conflict with the latest cpuset change on tracking # of dl tasks in a cpuset. You will have to, at least, move the dl task check before the security_task_setscheduler() check.
Another fact about cpuset controller enabling is that both cpus_allowed and mems_allowed are empty at that point. You may also add these checks as a preconditions for disabling the security_task_setscheduler check.
Ah, I will rebase on fresh mainline (or do you mean another reference?).
Yes, those changes have just been merged into the mainline.
Cheers, Longman
On Thu, Jun 29, 2023 at 08:11:33AM -0400, Waiman Long longman@redhat.com wrote:
Another fact about cpuset controller enabling is that both cpus_allowed and mems_allowed are empty at that point. You may also add these checks as a preconditions for disabling the security_task_setscheduler check.
I considered relying on that, however, there is more generic case when migrating between two sibling that should be allowed in v2 too. See the added test_cpuset_perms_object(). (Admittedly, it doesn't stress the case when the two siblings had different CPUs but it could.)
Anyway, let's move on to v2 (where I addressed remaining comments).
Thanks, Michal
No functional change intended, these small changes are merged into one commit and they serve as a preparation for an upcoming new testcase.
Signed-off-by: Michal Koutný mkoutny@suse.com --- MAINTAINERS | 1 + tools/testing/selftests/cgroup/cgroup_util.c | 2 ++ tools/testing/selftests/cgroup/cgroup_util.h | 2 ++ tools/testing/selftests/cgroup/test_core.c | 2 +- tools/testing/selftests/cgroup/test_cpuset_prs.sh | 2 +- 5 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS index 35e19594640d..50e7056b4383 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5259,6 +5259,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git F: Documentation/admin-guide/cgroup-v1/cpusets.rst F: include/linux/cpuset.h F: kernel/cgroup/cpuset.c +F: tools/testing/selftests/cgroup/test_cpuset_prs.sh
CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG) M: Johannes Weiner hannes@cmpxchg.org diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c index e8bbbdb77e0d..0340d4ca8f51 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.c +++ b/tools/testing/selftests/cgroup/cgroup_util.c @@ -286,6 +286,8 @@ int cg_destroy(const char *cgroup) { int ret;
+ if (!cgroup) + return 0; retry: ret = rmdir(cgroup); if (ret && errno == EBUSY) { diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h index c92df4e5d395..1df7f202214a 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.h +++ b/tools/testing/selftests/cgroup/cgroup_util.h @@ -11,6 +11,8 @@ #define USEC_PER_SEC 1000000L #define NSEC_PER_SEC 1000000000L
+#define TEST_UID 65534 /* usually nobody, any !root is fine */ + /* * Checks if two given values differ by less than err% of their sum. */ diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c index 600123503063..80aa6b2373b9 100644 --- a/tools/testing/selftests/cgroup/test_core.c +++ b/tools/testing/selftests/cgroup/test_core.c @@ -683,7 +683,7 @@ static int test_cgcore_thread_migration(const char *root) */ static int test_cgcore_lesser_euid_open(const char *root) { - const uid_t test_euid = 65534; /* usually nobody, any !root is fine */ + const uid_t test_euid = TEST_UID; int ret = KSFT_FAIL; char *cg_test_a = NULL, *cg_test_b = NULL; char *cg_test_a_procs = NULL, *cg_test_b_procs = NULL; diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh index 2b5215cc599f..4afb132e4e4f 100755 --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -10,7 +10,7 @@ skip_test() { echo "$1" echo "Test SKIPPED" - exit 0 + exit 4 # ksft_skip }
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
Add a separate testfile to verify treating permissions when tasks are migrated on cgroup v2 hierarchy between cpuset cgroups.
In accordance with v2 design, migration should be allowed based on delegation boundaries (i.e. cgroup.procs permissions) and does not depend on the migrated object (i.e. unprivileged process can migrate another process (even privileged) as long as it remains in the original dedicated scope).
Signed-off-by: Michal Koutný mkoutny@suse.com --- MAINTAINERS | 1 + tools/testing/selftests/cgroup/.gitignore | 1 + tools/testing/selftests/cgroup/Makefile | 2 + tools/testing/selftests/cgroup/test_cpuset.c | 272 +++++++++++++++++++ 4 files changed, 276 insertions(+) create mode 100644 tools/testing/selftests/cgroup/test_cpuset.c
diff --git a/MAINTAINERS b/MAINTAINERS index 50e7056b4383..51a31aab077f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5259,6 +5259,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git F: Documentation/admin-guide/cgroup-v1/cpusets.rst F: include/linux/cpuset.h F: kernel/cgroup/cpuset.c +F: tools/testing/selftests/cgroup/test_cpuset.c F: tools/testing/selftests/cgroup/test_cpuset_prs.sh
CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG) diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore index c4a57e69f749..8443a8d46a1c 100644 --- a/tools/testing/selftests/cgroup/.gitignore +++ b/tools/testing/selftests/cgroup/.gitignore @@ -5,4 +5,5 @@ test_freezer test_kmem test_kill test_cpu +test_cpuset wait_inotify diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index 3d263747d2ad..dee0f013c7f4 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -12,6 +12,7 @@ TEST_GEN_PROGS += test_core TEST_GEN_PROGS += test_freezer TEST_GEN_PROGS += test_kill TEST_GEN_PROGS += test_cpu +TEST_GEN_PROGS += test_cpuset
LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
@@ -23,3 +24,4 @@ $(OUTPUT)/test_core: cgroup_util.c $(OUTPUT)/test_freezer: cgroup_util.c $(OUTPUT)/test_kill: cgroup_util.c $(OUTPUT)/test_cpu: cgroup_util.c +$(OUTPUT)/test_cpuset: cgroup_util.c diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c new file mode 100644 index 000000000000..976ec6f014d8 --- /dev/null +++ b/tools/testing/selftests/cgroup/test_cpuset.c @@ -0,0 +1,272 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/limits.h> +#include <signal.h> + +#include "../kselftest.h" +#include "cgroup_util.h" + +static int idle_process_fn(const char *cgroup, void *arg) +{ + (void)pause(); + return 0; +} + +static int do_migration_fn(const char *cgroup, void *arg) +{ + int object_pid = (int)(size_t)arg; + + if (setuid(TEST_UID)) + return EXIT_FAILURE; + + // XXX checking /proc/$pid/cgroup would be quicker than wait + if (cg_enter(cgroup, object_pid) || + cg_wait_for_proc_count(cgroup, 1)) + return EXIT_FAILURE; + + return EXIT_SUCCESS; +} + +static int do_controller_fn(const char *cgroup, void *arg) +{ + const char *child = cgroup; + const char *parent = arg; + + if (setuid(TEST_UID)) + return EXIT_FAILURE; + + if (!cg_read_strstr(child, "cgroup.controllers", "cpuset")) + return EXIT_FAILURE; + + if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) + return EXIT_FAILURE; + + if (cg_read_strstr(child, "cgroup.controllers", "cpuset")) + return EXIT_FAILURE; + + if (cg_write(parent, "cgroup.subtree_control", "-cpuset")) + return EXIT_FAILURE; + + if (!cg_read_strstr(child, "cgroup.controllers", "cpuset")) + return EXIT_FAILURE; + + return EXIT_SUCCESS; +} + +/* + * Migrate a process between two sibling cgroups. + * The success should only depend on the parent cgroup permissions and not the + * migrated process itself (cpuset controller is in place because it uses + * security_task_setscheduler() in cgroup v1). + */ +static int test_cpuset_perms_object(const char *root, bool allow) +{ + char *parent = NULL, *child_src = NULL, *child_dst = NULL; + char *parent_procs = NULL, *child_src_procs = NULL, *child_dst_procs = NULL; + const uid_t test_euid = TEST_UID; + int object_pid = 0; + int ret = KSFT_FAIL; + + parent = cg_name(root, "cpuset_test_0"); + if (!parent) + goto cleanup; + parent_procs = cg_name(parent, "cgroup.procs"); + if (!parent_procs) + goto cleanup; + if (cg_create(parent)) + goto cleanup; + + child_src = cg_name(parent, "cpuset_test_1"); + if (!child_src) + goto cleanup; + child_src_procs = cg_name(child_src, "cgroup.procs"); + if (!child_src_procs) + goto cleanup; + if (cg_create(child_src)) + goto cleanup; + + child_dst = cg_name(parent, "cpuset_test_2"); + if (!child_dst) + goto cleanup; + child_dst_procs = cg_name(child_dst, "cgroup.procs"); + if (!child_dst_procs) + goto cleanup; + if (cg_create(child_dst)) + goto cleanup; + + if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) + goto cleanup; + + if (cg_read_strstr(child_src, "cgroup.controllers", "cpuset") || + cg_read_strstr(child_dst, "cgroup.controllers", "cpuset")) + goto cleanup; + + /* Enable permissions along src->dst tree path */ + if (chown(child_src_procs, test_euid, -1) || + chown(child_dst_procs, test_euid, -1)) + goto cleanup; + + if (allow && chown(parent_procs, test_euid, -1)) + goto cleanup; + + /* Fork a privileged child as a test object */ + object_pid = cg_run_nowait(child_src, idle_process_fn, NULL); + if (object_pid < 0) + goto cleanup; + + /* Carry out migration in a child process that can drop all privileges + * (including capabilities), the main process must remain privileged for + * cleanup. + * Child process's cgroup is irrelevant but we place it into child_dst + * as hacky way to pass information about migration target to the child. + */ + if (allow ^ (cg_run(child_dst, do_migration_fn, (void *)(size_t)object_pid) == EXIT_SUCCESS)) + goto cleanup; + + ret = KSFT_PASS; + +cleanup: + if (object_pid > 0) { + (void)kill(object_pid, SIGTERM); + (void)clone_reap(object_pid, WEXITED); + } + + cg_destroy(child_dst); + free(child_dst_procs); + free(child_dst); + + cg_destroy(child_src); + free(child_src_procs); + free(child_src); + + cg_destroy(parent); + free(parent_procs); + free(parent); + + return ret; +} + +static int test_cpuset_perms_object_allow(const char *root) +{ + return test_cpuset_perms_object(root, true); +} + +static int test_cpuset_perms_object_deny(const char *root) +{ + return test_cpuset_perms_object(root, false); +} + +/* + * Migrate a process between parent and child implicitely + * Implicit migration happens when a controller is enabled/disabled. + * + */ +static int test_cpuset_perms_subtree(const char *root) +{ + char *parent = NULL, *child = NULL; + char *parent_procs = NULL, *parent_subctl = NULL, *child_procs = NULL; + const uid_t test_euid = TEST_UID; + int object_pid = 0; + int ret = KSFT_FAIL; + + parent = cg_name(root, "cpuset_test_0"); + if (!parent) + goto cleanup; + parent_procs = cg_name(parent, "cgroup.procs"); + if (!parent_procs) + goto cleanup; + parent_subctl = cg_name(parent, "cgroup.subtree_control"); + if (!parent_subctl) + goto cleanup; + if (cg_create(parent)) + goto cleanup; + + child = cg_name(parent, "cpuset_test_1"); + if (!child) + goto cleanup; + child_procs = cg_name(child, "cgroup.procs"); + if (!child_procs) + goto cleanup; + if (cg_create(child)) + goto cleanup; + + /* Enable permissions as in a delegated subtree */ + if (chown(parent_procs, test_euid, -1) || + chown(parent_subctl, test_euid, -1) || + chown(child_procs, test_euid, -1)) + goto cleanup; + + /* Put a privileged child in the subtree and modify controller state + * from an unprivileged process, the main process remains privileged + * for cleanup. + * The unprivileged child runs in subtree too to avoid parent and + * internal-node constraing violation. + */ + object_pid = cg_run_nowait(child, idle_process_fn, NULL); + if (object_pid < 0) + goto cleanup; + + if (cg_run(child, do_controller_fn, parent) != EXIT_SUCCESS) + goto cleanup; + + ret = KSFT_PASS; + +cleanup: + if (object_pid > 0) { + (void)kill(object_pid, SIGTERM); + (void)clone_reap(object_pid, WEXITED); + } + + cg_destroy(child); + free(child_procs); + free(child); + + cg_destroy(parent); + free(parent_subctl); + free(parent_procs); + free(parent); + + return ret; +} + + +#define T(x) { x, #x } +struct cpuset_test { + int (*fn)(const char *root); + const char *name; +} tests[] = { + T(test_cpuset_perms_object_allow), + T(test_cpuset_perms_object_deny), + T(test_cpuset_perms_subtree), +}; +#undef T + +int main(int argc, char *argv[]) +{ + char root[PATH_MAX]; + int i, ret = EXIT_SUCCESS; + + if (cg_find_unified_root(root, sizeof(root))) + ksft_exit_skip("cgroup v2 isn't mounted\n"); + + if (cg_read_strstr(root, "cgroup.subtree_control", "cpuset")) + if (cg_write(root, "cgroup.subtree_control", "+cpuset")) + ksft_exit_skip("Failed to set cpuset controller\n"); + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + switch (tests[i].fn(root)) { + case KSFT_PASS: + ksft_test_result_pass("%s\n", tests[i].name); + break; + case KSFT_SKIP: + ksft_test_result_skip("%s\n", tests[i].name); + break; + default: + ret = EXIT_FAILURE; + ksft_test_result_fail("%s\n", tests[i].name); + break; + } + } + + return ret; +}
linux-kselftest-mirror@lists.linaro.org