April 2019 - Linux-kselftest-mirror

[PATCH v4 0/6] lib/string: Add strscpy_pad() function

by Tobin C. Harding

Hi Shuah,

Here is the set with cleanup as suggested by Kees on v3.

Configured, built, and tested all modules loaded by
tools/testing/selftests/lib/*.sh

From previous cover letters ...

While doing the testing for strscpy_pad() it was noticed that there is
duplication in how test modules are being fed to kselftest and also in
the test modules themselves.

This set makes an attempt at adding a framework to kselftest for writing
kernel test modules. It also adds a script for use in creating script
test runners for kselftest. My macro-foo is not great, all criticism and
suggestions very much appreciated. The design is based on test modules
lib/test_printf.c, lib/test_bitmap.c, lib/test_xarray.c.

Changes since last version:

 - Remove dependency on Bash (thanks Kees)
 - Use oneliner to implement kselftest test runners (thanks Kees)
 - Squash patch that adds kselftest script creator script with patch
   that uses it.
 - Fix typos (thanks Randy)
 - Add Kees' Acked-by tags to all patches

thanks,
Tobin.

Tobin C. Harding (6):
  lib/test_printf: Add empty module_exit function
  kselftest: Add test runner creation script
  kselftest: Add test module framework header
  lib: Use new kselftest header
  lib/string: Add strscpy_pad() function
  lib: Add test module for strscpy_pad

 Documentation/dev-tools/kselftest.rst        |  94 +++++++++++-
 include/linux/string.h                       |   4 +
 lib/Kconfig.debug                            |   3 +
 lib/Makefile                                 |   1 +
 lib/string.c                                 |  47 +++++-
 lib/test_bitmap.c                            |  20 +--
 lib/test_printf.c                            |  17 +--
 lib/test_strscpy.c                           | 150 +++++++++++++++++++
 tools/testing/selftests/kselftest_module.h   |  48 ++++++
 tools/testing/selftests/kselftest_module.sh  |  84 +++++++++++
 tools/testing/selftests/lib/Makefile         |   2 +-
 tools/testing/selftests/lib/bitmap.sh        |  18 +--
 tools/testing/selftests/lib/config           |   1 +
 tools/testing/selftests/lib/prime_numbers.sh |  17 +--
 tools/testing/selftests/lib/printf.sh        |  19 +--
 tools/testing/selftests/lib/strscpy.sh       |   3 +
 16 files changed, 440 insertions(+), 88 deletions(-)
 create mode 100644 lib/test_strscpy.c
 create mode 100644 tools/testing/selftests/kselftest_module.h
 create mode 100755 tools/testing/selftests/kselftest_module.sh
 create mode 100755 tools/testing/selftests/lib/strscpy.sh

-- 
2.21.0
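
The cover letter names strscpy_pad() without describing what it does. As a rough userspace model only, assuming the behaviour is "copy like strscpy(), then zero the rest of the destination", something like the following could illustrate it. The name strscpy_pad_sketch() is hypothetical, and this is not the code in lib/string.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Userspace model of copy-then-pad; NOT the kernel implementation.
 * Returns the number of characters copied (excluding the NUL), or
 * -E2BIG if count is 0 or src had to be truncated.
 */
static ssize_t strscpy_pad_sketch(char *dest, const char *src, size_t count)
{
	size_t len;

	if (count == 0)
		return -E2BIG;

	len = strnlen(src, count);
	if (len == count) {
		/* src does not fit: copy what fits and terminate. */
		memcpy(dest, src, count - 1);
		dest[count - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dest, src, len);
	/* Zero the terminator and every remaining byte of the buffer. */
	memset(dest + len, 0, count - len);
	return (ssize_t)len;
}
```

Zeroing the tail matters when the destination is later copied out as a fixed-size object, since no stale bytes from earlier contents remain after the string.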

6 years, 1 month


[RFC v3 00/19] kunit: introduce KUnit, the Linux kernel unit testing framework

by Brendan Higgins

This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.

Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).

KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.

## What's so special about unit testing?

A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.

## Is KUnit trying to replace other testing frameworks for the kernel?

No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.

## More information on KUnit

There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/

Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3

The repo may be cloned with:

    git clone https://kunit.googlesource.com/linux

This patchset is on the kunit/rfc/4.19/v3 branch.

## Changes Since Last Version

 - Changed namespace prefix from `test_*` to `kunit_*` as requested by
   Shuah.
 - Started converting/cleaning up the device tree unittest to use KUnit.
 - Started adding KUnit expectations with custom messages.

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
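
To make "test cases grouped into test suites" concrete, here is a small userspace sketch of that structure. Every name in it (test_ctx, EXPECT_EQ, test_case, test_suite, run_suite, clamp_int) is hypothetical and chosen for illustration; none of it is the KUnit API proposed in this series.

```c
#include <stdio.h>

/* Hypothetical names throughout; this is not the KUnit API. */
struct test_ctx {
	const char *name;
	int failed;
};

/* Record a failure instead of aborting, so later checks still run. */
#define EXPECT_EQ(ctx, a, b)						\
	do {								\
		if ((a) != (b)) {					\
			(ctx)->failed = 1;				\
			fprintf(stderr, "%s: expected %s == %s\n",	\
				(ctx)->name, #a, #b);			\
		}							\
	} while (0)

struct test_case {
	void (*run)(struct test_ctx *ctx);
	const char *name;
};

struct test_suite {
	const char *name;
	const struct test_case *cases;
	int ncases;
};

/* Unit under test: a pure function with no external dependencies. */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

static void test_clamp_inside(struct test_ctx *ctx)
{
	EXPECT_EQ(ctx, clamp_int(5, 0, 10), 5);
}

static void test_clamp_bounds(struct test_ctx *ctx)
{
	EXPECT_EQ(ctx, clamp_int(-3, 0, 10), 0);
	EXPECT_EQ(ctx, clamp_int(42, 0, 10), 10);
}

static const struct test_case clamp_cases[] = {
	{ test_clamp_inside, "test_clamp_inside" },
	{ test_clamp_bounds, "test_clamp_bounds" },
};

static const struct test_suite clamp_suite = {
	.name = "clamp",
	.cases = clamp_cases,
	.ncases = sizeof(clamp_cases) / sizeof(clamp_cases[0]),
};

/* Run every case in the suite; return the number of failed cases. */
static int run_suite(const struct test_suite *suite)
{
	int i, failures = 0;

	for (i = 0; i < suite->ncases; i++) {
		struct test_ctx ctx = { suite->cases[i].name, 0 };

		suite->cases[i].run(&ctx);
		printf("%s %d - %s\n", ctx.failed ? "not ok" : "ok",
		       i + 1, ctx.name);
		failures += ctx.failed;
	}
	return failures;
}
```

Because clamp_int() depends on nothing outside the test, the result of run_suite(&clamp_suite) is the same on every run, which is the determinism argument made above.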

6 years, 3 months


[PATCH v10 6/9] kselftests: cgroup: add freezer controller self-tests

by Roman Gushchin

This patch implements 9 tests for the freezer controller for
cgroup v2:
1) a simple test, which aims to freeze and unfreeze a cgroup with 100
   processes
2) a more complicated tree test, which creates a hierarchy of cgroups,
   puts some processes in some cgroups, and tries to freeze and unfreeze
   different parts of the subtree
3) a forkbomb test: the test aims to freeze a forkbomb running in a
   cgroup, kill all tasks in the cgroup and remove the cgroup without
   the unfreezing.
4) rmdir test: the test creates two nested cgroups, freezes the parent
   one, checks that the child can be successfully removed, and a new
   child can be created
5) migration tests: the test checks migration of a task between
   frozen cgroups: from a frozen to a running, from a running to a
   frozen, and from a frozen to a frozen.
6) ptrace test: the test checks that it's possible to attach to
   a process in a frozen cgroup, get some information and detach, and
   the cgroup will remain frozen.
7) stopped test: the test checks that it's possible to freeze a cgroup
   with a stopped task
8) ptraced test: the test checks that it's possible to freeze a cgroup
   with a ptraced task
9) vfork test: the test checks that it's possible to freeze a cgroup
   with a parent process waiting for the child process in vfork()

Expected output:

  $ ./test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_rmdir
  ok 5 test_cgfreezer_migrate
  ok 6 test_cgfreezer_ptrace
  ok 7 test_cgfreezer_stopped
  ok 8 test_cgfreezer_ptraced
  ok 9 test_cgfreezer_vfork

Signed-off-by: Roman Gushchin <guro(a)fb.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: kernel-team(a)fb.com
Cc: linux-kselftest(a)vger.kernel.org
---
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  54 +-
 tools/testing/selftests/cgroup/cgroup_util.h  |   5 +
 tools/testing/selftests/cgroup/test_freezer.c | 851 ++++++++++++++++++
 5 files changed, 912 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index adacda50a4b2..7f9835624793 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -1,2 +1,3 @@
 test_memcontrol
 test_core
+test_freezer
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 23fbaa4a9630..8d369b6a2069 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -5,8 +5,10 @@ all:
 
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
+TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c
 $(OUTPUT)/test_core: cgroup_util.c
+$(OUTPUT)/test_freezer: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index eba06f94433b..4c223266299a 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -74,6 +74,16 @@ char *cg_name_indexed(const char *root, const char *name, int index)
 	return ret;
 }
 
+char *cg_control(const char *cgroup, const char *control)
+{
+	size_t len = strlen(cgroup) + strlen(control) + 2;
+	char *ret = malloc(len);
+
+	snprintf(ret, len, "%s/%s", cgroup, control);
+
+	return ret;
+}
+
 int cg_read(const char *cgroup, const char *control, char *buf, size_t len)
 {
 	char path[PATH_MAX];
@@ -196,7 +206,32 @@ int cg_create(const char *cgroup)
 	return mkdir(cgroup, 0644);
 }
 
-static int cg_killall(const char *cgroup)
+int cg_wait_for_proc_count(const char *cgroup, int count)
+{
+	char buf[10 * PAGE_SIZE] = {0};
+	int attempts;
+	char *ptr;
+
+	for (attempts = 10; attempts >= 0; attempts--) {
+		int nr = 0;
+
+		if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+			break;
+
+		for (ptr = buf; *ptr; ptr++)
+			if (*ptr == '\n')
+				nr++;
+
+		if (nr >= count)
+			return 0;
+
+		usleep(100000);
+	}
+
+	return -1;
+}
+
+int cg_killall(const char *cgroup)
 {
 	char buf[PAGE_SIZE];
 	char *ptr = buf;
@@ -238,6 +273,14 @@ int cg_destroy(const char *cgroup)
 	return ret;
 }
 
+int cg_enter(const char *cgroup, int pid)
+{
+	char pidbuf[64];
+
+	snprintf(pidbuf, sizeof(pidbuf), "%d", pid);
+	return cg_write(cgroup, "cgroup.procs", pidbuf);
+}
+
 int cg_enter_current(const char *cgroup)
 {
 	char pidbuf[64];
@@ -367,3 +410,12 @@ int set_oom_adj_score(int pid, int score)
 	close(fd);
 	return 0;
 }
+
+char proc_read_text(int pid, const char *item, char *buf, size_t size)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
+
+	return read_text(path, buf, size);
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 9ac8b7958f83..c72f28046bfa 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -18,6 +18,7 @@ static inline int values_close(long a, long b, int err)
 extern int cg_find_unified_root(char *root, size_t len);
 extern char *cg_name(const char *root, const char *name);
 extern char *cg_name_indexed(const char *root, const char *name, int index);
+extern char *cg_control(const char *cgroup, const char *control);
 extern int cg_create(const char *cgroup);
 extern int cg_destroy(const char *cgroup);
 extern int cg_read(const char *cgroup, const char *control,
@@ -32,6 +33,7 @@ extern int cg_write(const char *cgroup, const char *control, char *buf);
 extern int cg_run(const char *cgroup,
 		  int (*fn)(const char *cgroup, void *arg),
 		  void *arg);
+extern int cg_enter(const char *cgroup, int pid);
 extern int cg_enter_current(const char *cgroup);
 extern int cg_run_nowait(const char *cgroup,
 			 int (*fn)(const char *cgroup, void *arg),
@@ -41,3 +43,6 @@ extern int alloc_pagecache(int fd, size_t size);
 extern int alloc_anon(const char *cgroup, void *arg);
 extern int is_swap_enabled(void);
 extern int set_oom_adj_score(int pid, int score);
+extern int cg_wait_for_proc_count(const char *cgroup, int count);
+extern int cg_killall(const char *cgroup);
+extern char proc_read_text(int pid, const char *item, char *buf, size_t size);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
new file mode 100644
index 000000000000..2bfddb6d6d3b
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -0,0 +1,851 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <stdbool.h>
+#include <linux/limits.h>
+#include <sys/ptrace.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <sys/inotify.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+#define DEBUG
+#ifdef DEBUG
+#define debug(args...) fprintf(stderr, args)
+#else
+#define debug(args...)
+#endif
+
+/*
+ * Check if the cgroup is frozen by looking at the cgroup.events::frozen value.
+ */
+static int cg_check_frozen(const char *cgroup, bool frozen)
+{
+	if (frozen) {
+		if (cg_read_strstr(cgroup, "cgroup.events", "frozen 1") != 0) {
+			debug("Cgroup %s isn't frozen\n", cgroup);
+			return -1;
+		}
+	} else {
+		/*
+		 * Check the cgroup.events::frozen value.
+		 */
+		if (cg_read_strstr(cgroup, "cgroup.events", "frozen 0") != 0) {
+			debug("Cgroup %s is frozen\n", cgroup);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Freeze the given cgroup.
+ */
+static int cg_freeze_nowait(const char *cgroup, bool freeze)
+{
+	return cg_write(cgroup, "cgroup.freeze", freeze ? "1" : "0");
+}
+
+/*
+ * Prepare for waiting on cgroup.events file.
+ */
+static int cg_prepare_for_wait(const char *cgroup)
+{
+	int fd, ret = -1;
+
+	fd = inotify_init1(0);
+	if (fd == -1) {
+		debug("Error: inotify_init1() failed\n");
+		return fd;
+	}
+
+	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
+				IN_MODIFY);
+	if (ret == -1) {
+		debug("Error: inotify_add_watch() failed\n");
+		close(fd);
+	}
+
+	return fd;
+}
+
+/*
+ * Wait for an event. If there are no events for 10 seconds,
+ * treat this an error.
+ */
+static int cg_wait_for(int fd)
+{
+	int ret = -1;
+	struct pollfd fds = {
+		.fd = fd,
+		.events = POLLIN,
+	};
+
+	while (true) {
+		ret = poll(&fds, 1, 10000);
+
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
+			debug("Error: poll() failed\n");
+			break;
+		}
+
+		if (ret > 0 && fds.revents & POLLIN) {
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Attach a task to the given cgroup and wait for a cgroup frozen event.
+ * All transient events (e.g. populated) are ignored.
+ */
+static int cg_enter_and_wait_for_frozen(const char *cgroup, int pid,
+					bool frozen)
+{
+	int fd, ret = -1;
+	int attempts;
+
+	fd = cg_prepare_for_wait(cgroup);
+	if (fd < 0)
+		return fd;
+
+	ret = cg_enter(cgroup, pid);
+	if (ret)
+		goto out;
+
+	for (attempts = 0; attempts < 10; attempts++) {
+		ret = cg_wait_for(fd);
+		if (ret)
+			break;
+
+		ret = cg_check_frozen(cgroup, frozen);
+		if (ret)
+			continue;
+	}
+
+out:
+	close(fd);
+	return ret;
+}
+
+/*
+ * Freeze the given cgroup and wait for the inotify signal.
+ * If there are no events in 10 seconds, treat this as an error.
+ * Then check that the cgroup is in the desired state.
+ */
+static int cg_freeze_wait(const char *cgroup, bool freeze)
+{
+	int fd, ret = -1;
+
+	fd = cg_prepare_for_wait(cgroup);
+	if (fd < 0)
+		return fd;
+
+	ret = cg_freeze_nowait(cgroup, freeze);
+	if (ret) {
+		debug("Error: cg_freeze_nowait() failed\n");
+		goto out;
+	}
+
+	ret = cg_wait_for(fd);
+	if (ret)
+		goto out;
+
+	ret = cg_check_frozen(cgroup, freeze);
+out:
+	close(fd);
+	return ret;
+}
+
+/*
+ * A simple process running in a sleep loop until being
+ * re-parented.
+ */
+static int child_fn(const char *cgroup, void *arg)
+{
+	int ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * A simple test for the cgroup freezer: populated the cgroup with 100
+ * running processes and freeze it. Then unfreeze it. Then it kills all
+ * processes and destroys the cgroup.
+ */
+static int test_cgfreezer_simple(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	int i;
+
+	cgroup = cg_name(root, "cg_test_simple");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	for (i = 0; i < 100; i++)
+		cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 100))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, false))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates the following hierarchy:
+ *       A
+ *    / / \ \
+ *   B  E  I K
+ *  /\  |
+ * C  D F
+ *      |
+ *      G
+ *      |
+ *      H
+ *
+ * with a process in C, H and 3 processes in K.
+ * Then it tries to freeze and unfreeze the whole tree.
+ */
+static int test_cgfreezer_tree(const char *root)
+{
+	char *cgroup[10] = {0};
+	int ret = KSFT_FAIL;
+	int i;
+
+	cgroup[0] = cg_name(root, "cg_test_tree_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(cgroup[0], "B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	cgroup[2] = cg_name(cgroup[1], "C");
+	if (!cgroup[2])
+		goto cleanup;
+
+	cgroup[3] = cg_name(cgroup[1], "D");
+	if (!cgroup[3])
+		goto cleanup;
+
+	cgroup[4] = cg_name(cgroup[0], "E");
+	if (!cgroup[4])
+		goto cleanup;
+
+	cgroup[5] = cg_name(cgroup[4], "F");
+	if (!cgroup[5])
+		goto cleanup;
+
+	cgroup[6] = cg_name(cgroup[5], "G");
+	if (!cgroup[6])
+		goto cleanup;
+
+	cgroup[7] = cg_name(cgroup[6], "H");
+	if (!cgroup[7])
+		goto cleanup;
+
+	cgroup[8] = cg_name(cgroup[0], "I");
+	if (!cgroup[8])
+		goto cleanup;
+
+	cgroup[9] = cg_name(cgroup[0], "K");
+	if (!cgroup[9])
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		if (cg_create(cgroup[i]))
+			goto cleanup;
+
+	cg_run_nowait(cgroup[2], child_fn, NULL);
+	cg_run_nowait(cgroup[7], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+
+	/*
+	 * Wait until all child processes will enter
+	 * corresponding cgroups.
+	 */
+
+	if (cg_wait_for_proc_count(cgroup[2], 1) ||
+	    cg_wait_for_proc_count(cgroup[7], 1) ||
+	    cg_wait_for_proc_count(cgroup[9], 3))
+		goto cleanup;
+
+	/*
+	 * Freeze B.
+	 */
+	if (cg_freeze_wait(cgroup[1], true))
+		goto cleanup;
+
+	/*
+	 * Freeze F.
+	 */
+	if (cg_freeze_wait(cgroup[5], true))
+		goto cleanup;
+
+	/*
+	 * Freeze G.
+	 */
+	if (cg_freeze_wait(cgroup[6], true))
+		goto cleanup;
+
+	/*
+	 * Check that A and E are not frozen.
+	 */
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[4], false))
+		goto cleanup;
+
+	/*
+	 * Freeze A. Check that A, B and E are frozen.
+	 */
+	if (cg_freeze_wait(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[4], true))
+		goto cleanup;
+
+	/*
+	 * Unfreeze B, F and G
+	 */
+	if (cg_freeze_nowait(cgroup[1], false))
+		goto cleanup;
+
+	if (cg_freeze_nowait(cgroup[5], false))
+		goto cleanup;
+
+	if (cg_freeze_nowait(cgroup[6], false))
+		goto cleanup;
+
+	/*
+	 * Check that C and H are still frozen.
+	 */
+	if (cg_check_frozen(cgroup[2], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[7], true))
+		goto cleanup;
+
+	/*
+	 * Unfreeze A. Check that A, C and K are not frozen.
+	 */
+	if (cg_freeze_wait(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[2], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[9], false))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (i = 9; i >= 0 && cgroup[i]; i--) {
+		cg_destroy(cgroup[i]);
+		free(cgroup[i]);
+	}
+
+	return ret;
+}
+
+/*
+ * A fork bomb emulator.
+ */
+static int forkbomb_fn(const char *cgroup, void *arg)
+{
+	int ppid;
+
+	fork();
+	fork();
+
+	ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * The test runs a fork bomb in a cgroup and tries to freeze it.
+ * Then it kills all processes and checks that cgroup isn't populated
+ * anymore.
+ */
+static int test_cgfreezer_forkbomb(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+
+	cgroup = cg_name(root, "cg_forkbomb_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	cg_run_nowait(cgroup, forkbomb_fn, NULL);
+
+	usleep(100000);
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_killall(cgroup))
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 0))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates two nested cgroups, freezes the parent
+ * and removes the child. Then it checks that the parent cgroup
+ * remains frozen and it's possible to create a new child
+ * without unfreezing. The new child is frozen too.
+ */
+static int test_cgfreezer_rmdir(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent, *child = NULL;
+
+	parent = cg_name(root, "cg_test_rmdir_A");
+	if (!parent)
+		goto cleanup;
+
+	child = cg_name(parent, "cg_test_rmdir_B");
+	if (!child)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_create(child))
+		goto cleanup;
+
+	if (cg_freeze_wait(parent, true))
+		goto cleanup;
+
+	if (cg_destroy(child))
+		goto cleanup;
+
+	if (cg_check_frozen(parent, true))
+		goto cleanup;
+
+	if (cg_create(child))
+		goto cleanup;
+
+	if (cg_check_frozen(child, true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (child)
+		cg_destroy(child);
+	free(child);
+	if (parent)
+		cg_destroy(parent);
+	free(parent);
+	return ret;
+}
+
+/*
+ * The test creates two cgroups: A and B, runs a process in A
+ * and performs several migrations:
+ * 1) A (running) -> B (frozen)
+ * 2) B (frozen) -> A (running)
+ * 3) A (frozen) -> B (frozen)
+ *
+ * On each step it checks the actual state of both cgroups.
+ */
+static int test_cgfreezer_migrate(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup[2] = {0};
+	int pid;
+
+	cgroup[0] = cg_name(root, "cg_test_migrate_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(root, "cg_test_migrate_B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	if (cg_create(cgroup[0]))
+		goto cleanup;
+
+	if (cg_create(cgroup[1]))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup[0], child_fn, NULL);
+	if (pid < 0)
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup[0], 1))
+		goto cleanup;
+
+	/*
+	 * Migrate from A (running) to B (frozen)
+	 */
+	if (cg_freeze_wait(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_enter_and_wait_for_frozen(cgroup[1], pid, true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	/*
+	 * Migrate from B (frozen) to A (running)
+	 */
+	if (cg_enter_and_wait_for_frozen(cgroup[0], pid, false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	/*
+	 * Migrate from A (frozen) to B (frozen)
+	 */
+	if (cg_freeze_wait(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_enter_and_wait_for_frozen(cgroup[1], pid, true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup[0])
+		cg_destroy(cgroup[0]);
+	free(cgroup[0]);
+	if (cgroup[1])
+		cg_destroy(cgroup[1]);
+	free(cgroup[1]);
+	return ret;
+}
+
+/*
+ * The test checks that ptrace works with a tracing process in a frozen cgroup.
+ */
+static int test_cgfreezer_ptrace(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	siginfo_t siginfo;
+	int pid;
+
+	cgroup = cg_name(root, "cg_test_ptrace");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup, child_fn, NULL);
+	if (pid < 0)
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 1))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (ptrace(PTRACE_SEIZE, pid, NULL, NULL))
+		goto cleanup;
+
+	if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL))
+		goto cleanup;
+
+	waitpid(pid, NULL, 0);
+
+	/*
+	 * Cgroup has to remain frozen, however the test task
+	 * is in traced state.
+	 */
+	if (cg_check_frozen(cgroup, true))
+		goto cleanup;
+
+	if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &siginfo))
+		goto cleanup;
+
+	if (ptrace(PTRACE_DETACH, pid, NULL, NULL))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * Check if the process is stopped.
+ */
+static int proc_check_stopped(int pid)
+{
+	char buf[PAGE_SIZE];
+	int len;
+
+	len = proc_read_text(pid, "stat", buf, sizeof(buf));
+	if (len == -1) {
+		debug("Can't get %d stat\n", pid);
+		return -1;
+	}
+
+	if (strstr(buf, "(test_freezer) T ") == NULL) {
+		debug("Process %d in the unexpected state: %s\n", pid, buf);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Test that it's possible to freeze a cgroup with a stopped process.
+ */
+static int test_cgfreezer_stopped(const char *root)
+{
+	int pid, ret = KSFT_FAIL;
+	char *cgroup = NULL;
+
+	cgroup = cg_name(root, "cg_test_stopped");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 1))
+		goto cleanup;
+
+	if (kill(pid, SIGSTOP))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, false))
+		goto cleanup;
+
+	if (proc_check_stopped(pid))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * Test that it's possible to freeze a cgroup with a ptraced process.
+ */
+static int test_cgfreezer_ptraced(const char *root)
+{
+	int pid, ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	siginfo_t siginfo;
+
+	cgroup = cg_name(root, "cg_test_ptraced");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 1))
+		goto cleanup;
+
+	if (ptrace(PTRACE_SEIZE, pid, NULL, NULL))
+		goto cleanup;
+
+	if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL))
+		goto cleanup;
+
+	waitpid(pid, NULL, 0);
+
+	if (cg_check_frozen(cgroup, false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	/*
+	 * cg_check_frozen(cgroup, true) will fail here,
+	 * because the task in in the TRACEd state.
+	 */
+	if (cg_freeze_wait(cgroup, false))
+		goto cleanup;
+
+	if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &siginfo))
+		goto cleanup;
+
+	if (ptrace(PTRACE_DETACH, pid, NULL, NULL))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+static int vfork_fn(const char *cgroup, void *arg)
+{
+	int pid = vfork();
+
+	if (pid == 0)
+		while (true)
+			sleep(1);
+
+	return pid;
+}
+
+/*
+ * Test that it's possible to freeze a cgroup with a process,
+ * which called vfork() and is waiting for a child.
+ */
+static int test_cgfreezer_vfork(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+
+	cgroup = cg_name(root, "cg_test_vfork");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	cg_run_nowait(cgroup, vfork_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 2))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct cgfreezer_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_cgfreezer_simple),
+	T(test_cgfreezer_tree),
+	T(test_cgfreezer_forkbomb),
+	T(test_cgfreezer_rmdir),
+	T(test_cgfreezer_migrate),
+	T(test_cgfreezer_ptrace),
+	T(test_cgfreezer_stopped),
+	T(test_cgfreezer_ptraced),
+	T(test_cgfreezer_vfork),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.20.1
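
The comment on cg_wait_for() in this patch says that 10 seconds without an event is treated as an error, while the loop as posted appears to call poll() again when it returns 0. As a minimal sketch of the behaviour the comment describes, and not a change proposed in this thread, a helper could return on timeout as follows. The name wait_for_event() and the timeout_ms parameter are assumptions made for the example.

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until fd becomes readable. Returns 0 when an event arrives and
 * -1 on error or when timeout_ms elapses without one.
 * Hypothetical helper; not part of the posted patch.
 */
static int wait_for_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	for (;;) {
		ret = poll(&pfd, 1, timeout_ms);
		if (ret == -1 && errno == EINTR)
			continue;	/* interrupted by a signal: retry */
		if (ret > 0 && (pfd.revents & POLLIN))
			return 0;
		return -1;		/* timeout (ret == 0) or failure */
	}
}
```

The same function works for an inotify descriptor such as the one returned by cg_prepare_for_wait(). Note that a retry after EINTR restarts the full timeout, so the total wait can exceed timeout_ms if signals keep arriving.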

6 years, 5 months


[PATCH for 5.2 08/12] rseq/selftests: arm: use udf instruction for RSEQ_SIG

by Mathieu Desnoyers

Use udf as the guard instruction for the restartable sequence abort
handler.

Previously, the chosen signature was not a valid instruction, based
on the assumption that it could always sit in a literal pool. However,
there are compilation environments in which literal pools are not
available, for instance execute-only code. Therefore, we need to
choose a signature value that is also a valid instruction.

Handle compiling with -mbig-endian on ARMv6+, which generates binaries
with mixed code vs data endianness (little endian code, big endian
data).

Else mismatch between code endianness for the generated signatures and
data endianness for the RSEQ_SIG parameter passed to the rseq
registration will trigger application segmentation faults when the
kernel tries to abort rseq critical sections.

Prior to ARMv6, -mbig-endian generates big-endian code and data, so
endianness should not be reversed in that case.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 tools/testing/selftests/rseq/rseq-arm.h | 52 +++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 5f262c54364f..e8ccfc37d685 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -5,7 +5,54 @@
  * (C) Copyright 2016-2018 - Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
  */
 
-#define RSEQ_SIG	0x53053053
+/*
+ * RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
+ * value 0x5de3. This traps if user-space reaches this instruction by mistake,
+ * and the uncommon operand ensures the kernel does not move the instruction
+ * pointer to attacker-controlled code on rseq abort.
+ *
+ * The instruction pattern in the A32 instruction set is:
+ *
+ * e7f5def3    udf    #24035    ; 0x5de3
+ *
+ * This translates to the following instruction pattern in the T16 instruction
+ * set:
+ *
+ * little endian:
+ * def3        udf    #243      ; 0xf3
+ * e7f5        b.n    <7f5>
+ *
+ * pre-ARMv6 big endian code:
+ * e7f5        b.n    <7f5>
+ * def3        udf    #243      ; 0xf3
+ *
+ * ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
+ * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
+ * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so there is no need to reverse the endianness of the data
+ * representation of the signature. However, the choice between BE32 and BE8
+ * is done by the linker, so we cannot know whether code and data endianness
+ * will be mixed before the linker is invoked.
+ */
+
+#define RSEQ_SIG_CODE	0xe7f5def3
+
+#ifndef __ASSEMBLER__
+
+#define RSEQ_SIG_DATA							\
+	({								\
+		int sig;						\
+		asm volatile ( "b 2f\n\t"				\
+			"1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"	\
+			"2:\n\t"					\
+			"ldr %[sig], 1b\n\t"				\
+			: [sig] "=r" (sig));				\
+		sig;							\
+	})
+
+#define RSEQ_SIG	RSEQ_SIG_DATA
+
+#endif
 
 #define rseq_smp_mb()	__asm__ __volatile__ ("dmb" ::: "memory", "cc")
 #define rseq_smp_rmb()	__asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -78,7 +125,8 @@ do {									\
 		__rseq_str(table_label) ":\n\t"				\
 		".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
 		".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
-		".word " __rseq_str(RSEQ_SIG) "\n\t"			\
+		".arm\n\t"						\
+		".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t"		\
 		__rseq_str(label) ":\n\t"				\
 		teardown						\
 		"b %l[" __rseq_str(abort_label) "]\n\t"
-- 
2.11.0
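
The numbers in the comment above can be checked with a little arithmetic. The sketch below is portable C rather than ARM assembly, and the helper names (swap32, rseq_sig_as_data, udf_imm16) are hypothetical. It models which value a 32-bit data load would see for the signature word when code and data endianness differ, and extracts the udf immediate from the A32 encoding, whose 16-bit operand is split across bits 19:8 and 3:0.

```c
#include <stdint.h>

#define RSEQ_SIG_CODE 0xe7f5def3u

/* Reverse the byte order of a 32-bit word. */
static uint32_t swap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/*
 * Value a 32-bit data load would return for the signature word.
 * mixed_endian is nonzero when code is little-endian but data is
 * big-endian (the ARMv6+ -mbig-endian case described above).
 */
static uint32_t rseq_sig_as_data(int mixed_endian)
{
	return mixed_endian ? swap32(RSEQ_SIG_CODE) : RSEQ_SIG_CODE;
}

/* Extract the 16-bit immediate from an A32 udf encoding. */
static uint32_t udf_imm16(uint32_t insn)
{
	return ((insn >> 4) & 0xfff0u) | (insn & 0xfu);
}
```

With these definitions, udf_imm16(RSEQ_SIG_CODE) yields 0x5de3 (24035 decimal), matching the disassembly in the comment, and rseq_sig_as_data(1) yields 0xf3def5e7. The patch itself avoids hard-coding either value by loading the word back with ldr, since the BE32 versus BE8 choice is only made at link time.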

6 years, 6 months


[PATCH selftests 0/2] Add checkbashisms meta-testcase

by Masami Hiramatsu

Hi,

Here are patches to make sure that ftracetest and its testcases are checkbashisms clean. Completing the work also needs a patch from Juerg, "selftests/ftrace: Make the coloring POSIX compliant":

http://lkml.kernel.org/r/20190220161333.28109-1-juergh@canonical.com

(Note that this is still under development.)

As Juerg pointed out, ftracetest recently stopped being POSIX compliant, and this kind of issue has happened repeatedly. To keep it from happening again, I decided to introduce a testcase which runs checkbashisms on ftracetest and its testcases. I think this can help us find out whether a script was written in a non-POSIX way.

Thank you,

---

Masami Hiramatsu (2):
      selftests/ftrace: Make a script checkbashisms clean
      selftests/ftrace: Add checkbashisms meta-testcase

 tools/testing/selftests/ftrace/ftracetest          |  1 +
 .../ftrace/test.d/kprobe/kprobe_ftrace.tc          |  2 +-
 .../selftests/ftrace/test.d/selftest/bashisms.tc   | 21 ++++++++++++++++++++
 3 files changed, 23 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/selftest/bashisms.tc

--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>

6 years, 7 months


[PATCH v2 0/5] Fix vDSO clock_getres()

by Vincenzo Frascino

clock_getres() in the vDSO library has to preserve the same behaviour as posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:

    sec = 0;
    ns = hrtimer_resolution;

and hrtimer_resolution depends on whether high resolution timers are enabled, which can happen either at compile time or at run time.

A possible fix is to change the vDSO implementation of clock_getres(), keeping a copy of hrtimer_resolution in the vDSO data and using that directly [1].

This patchset implements the proposed fix for arm64, powerpc, s390 and nds32, and adds a test to verify that the syscall and the vDSO library implementation of clock_getres() return the same values.

Although these patches share the same topic, there is no dependency between them, so each arch maintainer can merge them individually.

[1] https://marc.info/?l=linux-arm-kernel&m=155110381930196&w=2

Changes:
--------
v2:
  - Rebased on 5.1-rc5.
  - Addressed review comments.

Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Greentime Hu <green.hu(a)gmail.com>
Cc: Vincent Chen <deanbo422(a)gmail.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino(a)arm.com>

Vincenzo Frascino (5):
  arm64: Fix vDSO clock_getres()
  powerpc: Fix vDSO clock_getres()
  s390: Fix vDSO clock_getres()
  nds32: Fix vDSO clock_getres()
  kselftest: Extend vDSO selftest to clock_getres

 arch/arm64/include/asm/vdso_datapage.h        |   1 +
 arch/arm64/kernel/asm-offsets.c               |   2 +-
 arch/arm64/kernel/vdso.c                      |   2 +
 arch/arm64/kernel/vdso/gettimeofday.S         |  22 ++--
 arch/nds32/include/asm/vdso_datapage.h        |   1 +
 arch/nds32/kernel/vdso.c                      |   1 +
 arch/nds32/kernel/vdso/gettimeofday.c         |   4 +-
 arch/powerpc/include/asm/vdso_datapage.h      |   2 +
 arch/powerpc/kernel/asm-offsets.c             |   2 +-
 arch/powerpc/kernel/time.c                    |   1 +
 arch/powerpc/kernel/vdso32/gettimeofday.S     |   7 +-
 arch/powerpc/kernel/vdso64/gettimeofday.S     |   7 +-
 arch/s390/include/asm/vdso.h                  |   1 +
 arch/s390/kernel/asm-offsets.c                |   2 +-
 arch/s390/kernel/time.c                       |   1 +
 arch/s390/kernel/vdso32/clock_getres.S        |  12 +-
 arch/s390/kernel/vdso64/clock_getres.S        |  10 +-
 tools/testing/selftests/vDSO/Makefile         |   2 +
 .../selftests/vDSO/vdso_clock_getres.c        | 108 ++++++++++++++++++
 19 files changed, 159 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/vDSO/vdso_clock_getres.c

-- 
2.21.0
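
The check described in this cover letter, that the syscall and the library path report the same resolution, can be sketched in userspace as follows. This is not the selftest from the series: the function name is hypothetical, a 64-bit Linux system is assumed, and whether the C library wrapper really goes through the vDSO depends on the libc and the architecture.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Compare the resolution reported by the raw syscall with the one reported
 * by the C library wrapper for the same clock.
 * Returns 1 if both agree, 0 if they differ, -1 if either call fails.
 */
static int clock_getres_agrees(clockid_t clk)
{
    struct timespec via_syscall, via_libc;

    /* Raw syscall: always enters the kernel. */
    if (syscall(SYS_clock_getres, clk, &via_syscall) != 0)
        return -1;

    /* Library wrapper: assumed to use the vDSO when one is available. */
    if (clock_getres(clk, &via_libc) != 0)
        return -1;

    return via_syscall.tv_sec == via_libc.tv_sec &&
           via_syscall.tv_nsec == via_libc.tv_nsec;
}
```

Calling clock_getres_agrees(CLOCK_MONOTONIC) on a kernel with the fix applied should return 1. A return value of 0 would point at the kind of mismatch this series addresses.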

6 years, 7 months


[PATCH v3] selftests/x86: Support Atom for syscall_arg_fault test

by Tong Bo

Atom-based CPUs raise a stack fault when the 32-bit SYSENTER instruction is invoked with invalid register values, so this case needs SIGBUS handling as well.

The following is the disassembly at the point where the fault happens.

(gdb) disassemble $eip
Dump of assembler code for function __kernel_vsyscall:
   0xf7fd8fe0 <+0>:     push   %ecx
   0xf7fd8fe1 <+1>:     push   %edx
   0xf7fd8fe2 <+2>:     push   %ebp
   0xf7fd8fe3 <+3>:     mov    %esp,%ebp
   0xf7fd8fe5 <+5>:     sysenter
   0xf7fd8fe7 <+7>:     int    $0x80
=> 0xf7fd8fe9 <+9>:     pop    %ebp
   0xf7fd8fea <+10>:    pop    %edx
   0xf7fd8feb <+11>:    pop    %ecx
   0xf7fd8fec <+12>:    ret
End of assembler dump.

According to the Intel SDM, this can be a Stack Segment Fault (#SS, 12) instead of a normal Page Fault (#PF, 14). In particular, section 6.9 of Vol. 3A puts both stack and page faults in the 10th (lowest priority) class and states that "exceptions within each class are implementation-dependent and may vary from processor to processor". Processors like Intel Atom are therefore expected to raise a stack fault (SIGBUS), while common Core processors raise a page fault (SIGSEGV).

Signed-off-by: Tong Bo <bo.tong(a)intel.com>
Acked-by: Andy Lutomirski <luto(a)kernel.org>
---
 tools/testing/selftests/x86/syscall_arg_fault.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/x86/syscall_arg_fault.c b/tools/testing/selftests/x86/syscall_arg_fault.c
index 7db4fc9..d2548401 100644
--- a/tools/testing/selftests/x86/syscall_arg_fault.c
+++ b/tools/testing/selftests/x86/syscall_arg_fault.c
@@ -43,7 +43,7 @@ static sigjmp_buf jmpbuf;
 
 static volatile sig_atomic_t n_errs;
 
-static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+static void sigsegv_or_sigbus(int sig, siginfo_t *info, void *ctx_void)
 {
 	ucontext_t *ctx = (ucontext_t*)ctx_void;
 
@@ -73,7 +73,13 @@ int main()
 	if (sigaltstack(&stack, NULL) != 0)
 		err(1, "sigaltstack");
 
-	sethandler(SIGSEGV, sigsegv, SA_ONSTACK);
+	sethandler(SIGSEGV, sigsegv_or_sigbus, SA_ONSTACK);
+	/*
+	 * The actual exception can vary. On Atom CPUs, we get #SS
+	 * instead of #PF when the vDSO fails to access the stack when
+	 * ESP is too close to 2^32, and #SS causes SIGBUS.
+	 */
+	sethandler(SIGBUS, sigsegv_or_sigbus, SA_ONSTACK);
 	sethandler(SIGILL, sigill, SA_ONSTACK);
 
 	/*
-- 
2.7.4
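
The idea of the patch, one handler registered for both SIGSEGV and SIGBUS on an alternate stack, can be tried in isolation with the sketch below. It is not the selftest code: set_handler(), fault_handler(), install_fault_handlers() and the 64 KiB stack size are assumptions made for illustration, and the handler only records the signal number instead of inspecting the register context.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

/* Last signal seen by the shared handler (bookkeeping for this sketch). */
static volatile sig_atomic_t last_fault_sig;

/* One handler for both signals, in the spirit of sigsegv_or_sigbus(). */
static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)info;
    (void)ctx;
    last_fault_sig = sig;
}

/* Register a SA_SIGINFO handler; returns 0 on success, -1 on failure. */
static int set_handler(int sig, void (*handler)(int, siginfo_t *, void *),
                       int flags)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | flags;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Set up the alternate stack, then the same handler for SIGSEGV and SIGBUS. */
static int install_fault_handlers(void)
{
    static char altstack[64 * 1024];   /* fixed size chosen for the sketch */
    stack_t stack;

    memset(&stack, 0, sizeof(stack));
    stack.ss_sp = altstack;
    stack.ss_size = sizeof(altstack);
    if (sigaltstack(&stack, NULL) != 0)
        return -1;

    if (set_handler(SIGSEGV, fault_handler, SA_ONSTACK) != 0)
        return -1;
    return set_handler(SIGBUS, fault_handler, SA_ONSTACK);
}
```

After install_fault_handlers() succeeds, raise(SIGBUS) and raise(SIGSEGV) both reach fault_handler(), so a test no longer depends on which of the two exceptions the CPU happens to deliver.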

6 years, 7 months


[PATCH v2 1/2] Add polling support to pidfd

by Joel Fernandes (Google)

Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid, which is both racy and slow. For example, if a PID is reused between the time LMK sends a kill signal and the time it checks for the existence of the PID, the wrong process may be checked for existence.

This patch adds polling support to pidfd. Using the polling support, LMK will be able to get notified when a process exits in a race-free and fast way, and can do other things (such as polling on other fds) while waiting for the killed process to die.

For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following reasons:
1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be an option in this case because the task can be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling.

Andy had a similar patch [1] in the past which was a good reference; however, this patch tries to handle different situations properly related to thread group existence, and how/where it notifies. It also solves other bugs (waitqueue lifetime). Daniel had a similar patch [2] recently which this patch supersedes.

[1] https://lore.kernel.org/patchwork/patch/345098/
[2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/

Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Christian Brauner <christian(a)brauner.io>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
(Oleg improved the code by showing how to avoid tasklist_lock)
Suggested-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>

---
v1 -> v2:
* Restructure poll code to avoid tasklist_lock (Oleg)
* use task_pid instead of get_pid_task in notify_pidfd (Oleg)
* Added comments to code, commit message nits (Christian)
* Test case nits/improvements (Christian)

RFC -> v1:
* Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
* Updated selftests.
* Renamed poll wake function to do_notify_pidfd.
* Removed depending on EXIT flags
* Removed POLLERR flag since semantics are controversial and
  we don't have usecases for it right now (later we can add if there's
  a need for it).

 include/linux/pid.h |  3 +++
 kernel/fork.c       | 29 +++++++++++++++++++++++++++++
 kernel/pid.c        |  2 ++
 kernel/signal.c     | 11 +++++++++++
 4 files changed, 45 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..1484db6ca8d1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
 #define _LINUX_PID_H
 
 #include <linux/rculist.h>
+#include <linux/wait.h>
 
 enum pid_type
 {
@@ -60,6 +61,8 @@ struct pid
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
+	/* wait queue for pidfd notifications */
+	wait_queue_head_t wait_pidfd;
 	struct rcu_head rcu;
 	struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 5525837ed80e..721f8c9d2921 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,8 +1685,37 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct task_struct *task;
+	struct pid *pid = file->private_data;
+	int poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	WARN_ON_ONCE(task && !thread_group_leader(task));
+
+	/*
+	 * Inform pollers only when the whole thread group exits, if thread
+	 * group leader exits before all other threads in the group, then
+	 * poll(2) should block, similar to the wait(2) family.
+	 */
+	if (!task || (task->exit_state && thread_group_empty(task)))
+		poll_flags = POLLIN | POLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
+
 const struct file_operations pidfd_fops = {
 	.release = pidfd_release,
+	.poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..5c90c239242f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
+	init_waitqueue_head(&pid->wait_pidfd);
+
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index 1581140f2d99..a17fff073c3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1800,6 +1800,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+	struct pid *pid;
+
+	pid = task_pid(task);
+	wake_up_all(&pid->wait_pidfd);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1823,6 +1831,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* Wake up all pidfd waiters */
+	do_notify_pidfd(tsk);
+
 	if (sig != SIGCHLD) {
 		/*
 		 * This is only possible if parent == real_parent.
-- 
2.21.0.593.g511ec345e18-goog
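
From userspace, the polling support added by this patch could be used roughly as sketched below. Several things here are assumptions and not part of the series: the pidfd is obtained with the pidfd_open() syscall, which only exists in later kernels (this series builds on CLONE_PIDFD instead), the fallback syscall number 434 is the generic one, and wait_for_exit() and demo_child_exit() are hypothetical names.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: generic syscall number */
#endif

/*
 * Wait until the process behind `pid` exits or `timeout_ms` elapses.
 * Returns 1 if the pidfd became readable, 0 on timeout, -1 on error.
 */
static int wait_for_exit(pid_t pid, int timeout_ms)
{
    struct pollfd pfd;
    int pidfd, ret;

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    pfd.fd = pidfd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Blocks until the kernel wakes the pidfd waiters or the timeout hits. */
    ret = poll(&pfd, 1, timeout_ms);
    close(pidfd);

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Fork a child that exits at once and wait for it through its pidfd. */
static int demo_child_exit(void)
{
    pid_t child = fork();
    int ret;

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    ret = wait_for_exit(child, 5000);
    waitpid(child, NULL, 0);   /* reap the zombie */
    return ret;
}
```

Because the caller holds a file descriptor instead of a bare PID, PID reuse between the kill and the check cannot redirect the wait to an unrelated process, and the same poll() call can watch other descriptors at the same time.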

6 years, 7 months


[PATCH linux-next v10 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request

by Dmitry V. Levin

[Andrew, could you take this patchset into your tree, please?]

PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain details of the syscall the tracee is blocked in.

There are two reasons for a special syscall-related ptrace request.

Firstly, with the current ptrace API there are cases when ptracer cannot retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details. In short, if a 64-bit task performs a syscall through int 0x80, its tracer has no reliable means to find out that the syscall was, in fact, a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer. Common practice is to keep track of the sequence of ptrace-stops in order not to mix the two syscall-stops up. But it is not as simple as it looks; for example, strace had a (just recently fixed) long-standing bug where attaching strace to a tracee that is performing the execve system call led to the tracer identifying the following syscall-exit-stop as syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 ("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA and process_vm_readv become unavailable when the process dumpable flag is cleared. On such architectures as ia64 this results in all syscall arguments being unavailable for the tracer.

Secondly, ptracers also have to support a lot of arch-specific code for obtaining information about the tracee. For some architectures, this requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall argument and return value.

PTRACE_GET_SYSCALL_INFO returns the following structure:

    struct ptrace_syscall_info {
        __u8 op;    /* PTRACE_SYSCALL_INFO_* */
        __u32 arch __attribute__((__aligned__(sizeof(__u32))));
        __u64 instruction_pointer;
        __u64 stack_pointer;
        union {
            struct {
                __u64 nr;
                __u64 args[6];
            } entry;
            struct {
                __s64 rval;
                __u8 is_error;
            } exit;
            struct {
                __u64 nr;
                __u64 args[6];
                __u32 ret_data;
            } seccomp;
        };
    };

The structure was chosen according to [2], except for the following changes:
* seccomp substructure was added as a superset of entry substructure;
* the type of nr field was changed from int to __u64 because syscall numbers are, as a practical matter, 64 bits;
* stack_pointer field was added along with instruction_pointer field since it is readily available and can save the tracer from extra PTRACE_GETREGS/PTRACE_GETREGSET calls;
* arch is always initialized to aid with tracing system calls such as execve();
* instruction_pointer and stack_pointer are always initialized so they could be easily obtained for non-syscall stops;
* a boolean is_error field was added along with rval field, this way the tracer can more reliably distinguish a return value from an error value.

strace has been ported to PTRACE_GET_SYSCALL_INFO. Starting with release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API as the preferred mechanism of obtaining syscall information.

[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF…
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3…

---

Notes:
    v10: added more Acked-by.

    v9:
    * Rebased to linux-next again due to syscall_get_arguments() signature change.

    v8:
    * Moved syscall_get_arch() specific patches to a separate patchset which is now merged into audit/next tree.
    * Rebased to linux-next.
    * Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK, narrowing down the set of architectures supported by this implementation back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because I failed to get all syscall_get_*(), instruction_pointer(), and user_stack_pointer() functions implemented on some niche architectures. This leaves the following architectures out: alpha, h8300, m68k, microblaze, and unicore32.

    v7:
    * Rebased to v5.0-rc1.
    * 5 arch-specific preparatory patches out of 25 have been merged into v5.0-rc1 via arch trees.

    v6:
    * Add syscall_get_arguments and syscall_set_arguments wrappers to asm-generic/syscall.h, requested by Geert.
    * Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings into account, use the end of the last field of the structure being written.
    * Change struct ptrace_syscall_info:
      * remove .frame_pointer field, it is not needed and not portable;
      * make .arch field explicitly aligned, remove no longer needed padding before .arch field;
      * remove trailing pads, they are no longer needed.

    v5:
    * Merge separate series and patches into the single series.
    * Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
    * Change struct ptrace_syscall_info: generalize instruction_pointer, stack_pointer, and frame_pointer fields by moving them from ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info and initializing them for all stops.
    * Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop, so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain instruction_pointer when the tracee is in a signal stop.
    * Patch all remaining architectures to provide all necessary syscall_get_* functions.
    * Make available for all architectures: do not conditionalize on CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions are implemented on all architectures.
    * Add a test for PTRACE_GET_SYSCALL_INFO to selftests/ptrace.

    v4:
    * Do not introduce task_struct.ptrace_event, use child->last_siginfo->si_code instead.
    * Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and ptrace_syscall_info.{entry,exit}.

    v3:
    * Change struct ptrace_syscall_info.
    * Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
    * Add proper defines for ptrace_syscall_info.op values.
    * Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT and move them to uapi.

    v2:
    * Do not use task->ptrace.
    * Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
    * Use addr argument of sys_ptrace to get expected size of the struct; return full size of the struct.

Dmitry V. Levin (6):
  nds32: fix asm/syscall.h # acked
  hexagon: define syscall_get_error() and syscall_get_return_value() # waiting for ack since November
  mips: define syscall_get_error() # acked
  parisc: define syscall_get_error() # acked
  powerpc: define syscall_get_error() # waiting for ack since early December
  selftests/ptrace: add a test case for PTRACE_GET_SYSCALL_INFO # acked

Elvira Khabirova (1):
  ptrace: add PTRACE_GET_SYSCALL_INFO request # reviewed

 arch/hexagon/include/asm/syscall.h            |  14 +
 arch/mips/include/asm/syscall.h               |   6 +
 arch/nds32/include/asm/syscall.h              |  27 +-
 arch/parisc/include/asm/syscall.h             |   7 +
 arch/powerpc/include/asm/syscall.h            |  10 +
 include/linux/tracehook.h                     |   9 +-
 include/uapi/linux/ptrace.h                   |  35 +++
 kernel/ptrace.c                               | 103 ++++++-
 tools/testing/selftests/ptrace/.gitignore     |   1 +
 tools/testing/selftests/ptrace/Makefile       |   2 +-
 .../selftests/ptrace/get_syscall_info.c       | 271 ++++++++++++++++++
 11 files changed, 470 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/get_syscall_info.c

-- 
ldv
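
The v6 note about the return code ("use the end of the last field of the structure being written") can be made concrete with a small layout check. The sketch below mirrors the structure from the cover letter with fixed-width userspace types. The type name, the END_OF() macro and the choice of which member counts as "last" for each kind of stop are assumptions made for illustration; GCC or Clang is assumed for the attribute syntax.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the structure in the cover letter (hypothetical name). */
struct syscall_info_mirror {
    uint8_t op;
    uint32_t arch __attribute__((__aligned__(sizeof(uint32_t))));
    uint64_t instruction_pointer;
    uint64_t stack_pointer;
    union {
        struct {
            uint64_t nr;
            uint64_t args[6];
        } entry;
        struct {
            int64_t rval;
            uint8_t is_error;
        } exit;
        struct {
            uint64_t nr;
            uint64_t args[6];
            uint32_t ret_data;
        } seccomp;
    };
};

/* Offset one past the end of a member, excluding any padding after it. */
#define END_OF(type, member) \
    (offsetof(type, member) + sizeof(((type *)0)->member))

/* Bytes that carry data for a syscall-enter-stop. */
static size_t entry_size(void)
{
    return END_OF(struct syscall_info_mirror, entry.args);
}

/* Bytes that carry data for a syscall-exit-stop. */
static size_t exit_size(void)
{
    return END_OF(struct syscall_info_mirror, exit.is_error);
}

/* Bytes that carry data for a seccomp stop. */
static size_t seccomp_size(void)
{
    return END_OF(struct syscall_info_mirror, seccomp.ret_data);
}
```

With this layout the three helpers give 80, 33 and 84 bytes, while sizeof() of the whole structure is larger on 64-bit targets because of trailing padding. That difference is why the series reports the end of the last written field instead of the padded size.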

6 years, 7 months


[PATCH net-next] net: sched: Introduce act_ctinfo action

by Kevin 'ldir' Darbyshire-Bryant

ctinfo is a new tc filter action module. It is designed to restore DSCPs stored in conntrack marks into the ipv4/v6 diffserv field.

The feature is intended for, and has been found useful for, restoring ingress classifications based on egress classifications across links that bleach or otherwise change DSCP, typically home ISP Internet links. Restoring DSCP on ingress on the WAN link allows qdiscs such as CAKE to shape inbound packets according to policies that are easier to indicate on egress.

Ingress classification is traditionally a challenging task since iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT lookups, hence are unable to see internal IPv4 addresses as used on the typical home masquerading gateway.

ctinfo understands the following parameters:

dscp dscpmask[/statemask]

dscpmask - a 32 bit mask of at least 6 contiguous bits which indicates where ctinfo will find the DSCP bits stored in the conntrack mark.

statemask - a 32 bit mask of (usually) 1 bit length, outside the area specified by dscpmask. This represents a conditional operation flag whereby the DSCP is only restored if the flag is set. This is useful to implement a 'one shot' iptables based classification where the 'complicated' iptables rules are only run once to classify the connection on the initial (egress) packet and subsequent packets are all marked/restored with the same DSCP. A mask of zero disables the conditional behaviour, i.e. the conntrack mark DSCP bits are always restored to the ip diffserv field (assuming the conntrack entry is found & the skb is an ipv4/ipv6 type).

optional parameters:

zone - conntrack zone

control - action related control (reclassify | pipe | drop | continue | ok | goto chain <CHAIN_INDEX>)

e.g.
dscp 0xfc000000/0x01000000

|----0xFC----conntrack mark----000000---|
| Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
| DSCP       | unused | flag  |unused   |
|-----------------------0x01---000000---|
      |                   |
      |                   |
      ---|             Conditional flag
         v             only restore if set
|-ip diffserv-|
| 6 bits      |
|-------------|

Signed-off-by: Kevin Darbyshire-Bryant <ldir(a)darbyshire-bryant.me.uk>
---
 include/net/tc_act/tc_ctinfo.h            |  24 ++
 include/uapi/linux/pkt_cls.h              |   1 +
 include/uapi/linux/tc_act/tc_ctinfo.h     |  33 ++
 net/sched/Kconfig                         |  13 +
 net/sched/Makefile                        |   1 +
 net/sched/act_ctinfo.c                    | 375 ++++++++++++++++++++++
 tools/testing/selftests/tc-testing/config |   1 +
 7 files changed, 448 insertions(+)
 create mode 100644 include/net/tc_act/tc_ctinfo.h
 create mode 100644 include/uapi/linux/tc_act/tc_ctinfo.h
 create mode 100644 net/sched/act_ctinfo.c

diff --git a/include/net/tc_act/tc_ctinfo.h b/include/net/tc_act/tc_ctinfo.h
new file mode 100644
index 000000000000..bb33e66d3ea5
--- /dev/null
+++ b/include/net/tc_act/tc_ctinfo.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_TC_CTINFO_H
+#define __NET_TC_CTINFO_H
+
+#include <net/act_api.h>
+
+struct tcf_ctinfo_params {
+	struct net *net;
+	u32 dscpmask;
+	u32 dscpstatemask;
+	u16 zone;
+	u8 mode;
+	u8 dscpmaskshift;
+	struct rcu_head rcu;
+};
+
+struct tcf_ctinfo {
+	struct tc_action common;
+	struct tcf_ctinfo_params __rcu *params;
+};
+
+#define to_ctinfo(a) ((struct tcf_ctinfo *)a)
+
+#endif /* __NET_TC_CTINFO_H */
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 51a0496f78ea..a93680fc4bfa 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -105,6 +105,7 @@ enum tca_id {
 	TCA_ID_IFE = TCA_ACT_IFE,
 	TCA_ID_SAMPLE = TCA_ACT_SAMPLE,
 	/* other actions go here */
+	TCA_ID_CTINFO,
 	__TCA_ID_MAX = 255
 };
 
diff --git a/include/uapi/linux/tc_act/tc_ctinfo.h b/include/uapi/linux/tc_act/tc_ctinfo.h
new file mode 100644
index 000000000000..b84902b5e3b1
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_ctinfo.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __UAPI_TC_CTINFO_H
+#define __UAPI_TC_CTINFO_H
+
+#include <linux/types.h>
+#include <linux/pkt_cls.h>
+
+struct tc_ctinfo {
+	tc_gen;
+};
+
+struct tc_ctinfo_dscp {
+	__u32 mask;
+	__u32 statemask;
+};
+
+enum {
+	TCA_CTINFO_UNSPEC,
+	TCA_CTINFO_ACT,
+	TCA_CTINFO_ZONE,
+	TCA_CTINFO_DSCP_PARMS,
+	TCA_CTINFO_MODE_DSCP,
+	TCA_CTINFO_TM,
+	TCA_CTINFO_PAD,
+	__TCA_CTINFO_MAX
+};
+#define TCA_CTINFO_MAX (__TCA_CTINFO_MAX - 1)
+
+enum {
+	CTINFO_MODE_SETDSCP = BIT(0)
+};
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 5c02ad97ef23..5ac01c5ebae9 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -876,6 +876,19 @@ config NET_ACT_CONNMARK
 	  To compile this code as a module, choose M here: the
 	  module will be called act_connmark.
 
+config NET_ACT_CTINFO
+	tristate "Netfilter Connmark to DSCP Retriever"
+	depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
+	depends on NF_CONNTRACK && NF_CONNTRACK_MARK
+	help
+	  Say Y here to allow transfer of a connmark stored DSCP into
+	  ipv4/v6 diffserv
+
+	  If unsure, say N.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_ctinfo.
+ config NET_ACT_SKBMOD tristate "skb data modification action" depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index 8a40431d7b5c..d54bfcbd7981 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -21,6 +21,7 @@ obj-$(CONFIG_NET_ACT_CSUM) += act_csum.o obj-$(CONFIG_NET_ACT_VLAN) += act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o +obj-$(CONFIG_NET_ACT_CTINFO) += act_ctinfo.o obj-$(CONFIG_NET_ACT_SKBMOD) += act_skbmod.o obj-$(CONFIG_NET_ACT_IFE) += act_ife.o obj-$(CONFIG_NET_IFE_SKBMARK) += act_meta_mark.o diff --git a/net/sched/act_ctinfo.c b/net/sched/act_ctinfo.c new file mode 100644 index 000000000000..01a8694651ea --- /dev/null +++ b/net/sched/act_ctinfo.c @@ -0,0 +1,375 @@ +// SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note +/* net/sched/act_ctinfo.c netfilter ctinfo connmark->DSCP action + * + * Copyright (c) 2019 Kevin Darbyshire-Bryant <ldir(a)darbyshire-bryant.me.uk> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/skbuff.h> +#include <linux/rtnetlink.h> +#include <linux/pkt_cls.h> +#include <linux/ip.h> +#include <linux/ipv6.h> +#include <net/netlink.h> +#include <net/pkt_sched.h> +#include <net/act_api.h> +#include <net/pkt_cls.h> +#include <uapi/linux/tc_act/tc_ctinfo.h> +#include <net/tc_act/tc_ctinfo.h> + +#include <net/netfilter/nf_conntrack.h> +#include <net/netfilter/nf_conntrack_core.h> +#include <net/netfilter/nf_conntrack_ecache.h> +#include <net/netfilter/nf_conntrack_zones.h> + +static unsigned int ctinfo_net_id; +static struct tc_action_ops act_ctinfo_ops; + +static void tcf_ctinfo_dscp_set(struct nf_conn *ct, struct tcf_ctinfo *ca, + struct tcf_ctinfo_params *cp, + struct sk_buff *skb, int wlen, int proto) +{ + u8 dscp, newdscp; + + newdscp = (((ct->mark & cp->dscpmask) >> cp->dscpmaskshift) << 2) & + ~INET_ECN_MASK; + + /* mark contains DSCP so restore DSCP bits from c->mark into diffserv */ + /* using overlimits stats to count how many DSCP updates */ + switch (proto) { + case NFPROTO_IPV4: + dscp = ipv4_get_dsfield(ip_hdr(skb)) & ~INET_ECN_MASK; + if (dscp != newdscp) { + if (!skb_try_make_writable(skb, wlen)) { + ipv4_change_dsfield(ip_hdr(skb), + INET_ECN_MASK, + newdscp); + ca->tcf_qstats.overlimits++; + } else { + ca->tcf_qstats.drops++; + } + } + break; + case NFPROTO_IPV6: + dscp = ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK; + if (dscp != newdscp) { + if (!skb_try_make_writable(skb, wlen)) { + ipv6_change_dsfield(ipv6_hdr(skb), + INET_ECN_MASK, + newdscp); + ca->tcf_qstats.overlimits++; + } else { + ca->tcf_qstats.drops++; + } + } + break; + default: + break; + } +} + +static int tcf_ctinfo_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) +{ + const struct nf_conntrack_tuple_hash *thash = NULL; + struct nf_conntrack_tuple tuple; + enum ip_conntrack_info ctinfo; + struct tcf_ctinfo *ca = to_ctinfo(a); + struct 
tcf_ctinfo_params *cp;
+	struct nf_conntrack_zone zone;
+	struct nf_conn *ct;
+	int proto, wlen;
+	int action;
+
+	cp = rcu_dereference_bh(ca->params);
+
+	tcf_lastuse_update(&ca->tcf_tm);
+	bstats_update(&ca->tcf_bstats, skb);
+	action = READ_ONCE(ca->tcf_action);
+
+	/* currently the only mode we know but in future...*/
+	if (unlikely(!(cp->mode & CTINFO_MODE_SETDSCP)))
+		goto out;
+
+	wlen = skb_network_offset(skb);
+	if (tc_skb_protocol(skb) == htons(ETH_P_IP)) {
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen))
+			goto out;
+
+		proto = NFPROTO_IPV4;
+	} else if (tc_skb_protocol(skb) == htons(ETH_P_IPV6)) {
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen))
+			goto out;
+
+		proto = NFPROTO_IPV6;
+	} else {
+		goto out;
+	}
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct) { /* look harder usually ingress */
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       proto, cp->net, &tuple))
+			goto out;
+		zone.id = cp->zone;
+		zone.dir = NF_CT_DEFAULT_ZONE_DIR;
+
+		thash = nf_conntrack_find_get(cp->net, &zone, &tuple);
+		if (!thash)
+			goto out;
+
+		ct = nf_ct_tuplehash_to_ctrack(thash);
+	}
+
+	if (!cp->dscpstatemask || (ct->mark & cp->dscpstatemask))
+		tcf_ctinfo_dscp_set(ct, ca, cp, skb, wlen, proto);
+
+	if (thash)
+		nf_ct_put(ct);
+out:
+	return action;
+}
+
+static const struct nla_policy ctinfo_policy[TCA_CTINFO_MAX + 1] = {
+	[TCA_CTINFO_ACT] = { .len = sizeof(struct tc_ctinfo) },
+	[TCA_CTINFO_ZONE] = { .type = NLA_U16 },
+	[TCA_CTINFO_MODE_DSCP] = { .type = NLA_FLAG },
+	[TCA_CTINFO_DSCP_PARMS] = { .len = sizeof(struct tc_ctinfo_dscp) },
+};
+
+static int tcf_ctinfo_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a,
+			   int ovr, int bind, bool rtnl_held,
+			   struct tcf_proto *tp,
+			   struct netlink_ext_ack *extack)
+{
+	struct tc_action_net *tn = net_generic(net, ctinfo_net_id);
+	struct tcf_ctinfo_params *cp_new;
+	struct nlattr *tb[TCA_CTINFO_MAX + 1];
+	struct tcf_chain *goto_ch = NULL;
+	struct tcf_ctinfo *ci;
+	struct tc_ctinfo *actparm;
+	struct tc_ctinfo_dscp *dscpparm;
+	int ret = 0, err, i;
+
+	if (!nla)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CTINFO_MAX, nla, ctinfo_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_CTINFO_ACT])
+		return -EINVAL;
+
+	if (tb[TCA_CTINFO_MODE_DSCP] && !tb[TCA_CTINFO_DSCP_PARMS])
+		return -EINVAL;
+
+	actparm = nla_data(tb[TCA_CTINFO_ACT]);
+	dscpparm = nla_data(tb[TCA_CTINFO_DSCP_PARMS]);
+
+	if (dscpparm) {
+		/* need at least contiguous 6 bit mask */
+		i = dscpparm->mask ? __ffs(dscpparm->mask) : 0;
+		if ((0x3f & (dscpparm->mask >> i)) != 0x3f)
+			return -EINVAL;
+		/* mask & statemask must not overlap */
+		if (dscpparm->mask & dscpparm->statemask)
+			return -EINVAL;
+	}
+	/* validation done: now on to the actual action allocation */
+	err = tcf_idr_check_alloc(tn, &actparm->index, a, bind);
+	if (!err) {
+		ret = tcf_idr_create(tn, actparm->index, est, a,
+				     &act_ctinfo_ops, bind, false);
+		if (ret) {
+			tcf_idr_cleanup(tn, actparm->index);
+			return ret;
+		}
+	} else if (err > 0) {
+		if (bind) /* don't override defaults */
+			return 0;
+		if (!ovr) {
+			tcf_idr_release(*a, bind);
+			return -EEXIST;
+		}
+	} else {
+		return err;
+	}
+
+	err = tcf_action_check_ctrlact(actparm->action, tp, &goto_ch, extack);
+	if (err < 0)
+		goto release_idr;
+
+	ci = to_ctinfo(*a);
+
+	cp_new = kzalloc(sizeof(*cp_new), GFP_KERNEL);
+	if (unlikely(!cp_new)) {
+		err = -ENOMEM;
+		goto put_chain;
+	}
+
+	cp_new->net = net;
+	cp_new->zone = tb[TCA_CTINFO_ZONE] ?
+			nla_get_u16(tb[TCA_CTINFO_ZONE]) : 0;
+	if (dscpparm) {
+		cp_new->dscpmask = dscpparm->mask;
+		cp_new->dscpmaskshift = cp_new->dscpmask ?
+				__ffs(cp_new->dscpmask) : 0;
+		cp_new->dscpstatemask = dscpparm->statemask;
+	}
+
+	if (tb[TCA_CTINFO_MODE_DSCP])
+		cp_new->mode |= CTINFO_MODE_SETDSCP;
+	else
+		cp_new->mode &= ~CTINFO_MODE_SETDSCP;
+
+	spin_lock_bh(&ci->tcf_lock);
+	goto_ch = tcf_action_set_ctrlact(*a, actparm->action, goto_ch);
+	rcu_swap_protected(ci->params, cp_new,
+			   lockdep_is_held(&ci->tcf_lock));
+	spin_unlock_bh(&ci->tcf_lock);
+
+	if (goto_ch)
+		tcf_chain_put_by_act(goto_ch);
+	if (cp_new)
+		kfree_rcu(cp_new, rcu);
+
+	if (ret == ACT_P_CREATED)
+		tcf_idr_insert(tn, *a);
+
+	return ret;
+
+put_chain:
+	if (goto_ch)
+		tcf_chain_put_by_act(goto_ch);
+release_idr:
+	tcf_idr_release(*a, bind);
+	return err;
+}
+
+static inline int tcf_ctinfo_dump(struct sk_buff *skb, struct tc_action *a,
+				  int bind, int ref)
+{
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_ctinfo *ci = to_ctinfo(a);
+	struct tcf_ctinfo_params *cp;
+	struct tc_ctinfo opt = {
+		.index = ci->tcf_index,
+		.refcnt = refcount_read(&ci->tcf_refcnt) - ref,
+		.bindcnt = atomic_read(&ci->tcf_bindcnt) - bind,
+	};
+	struct tcf_t t;
+	struct tc_ctinfo_dscp dscpparm;
+
+	spin_lock_bh(&ci->tcf_lock);
+	cp = rcu_dereference_protected(ci->params,
+				       lockdep_is_held(&ci->tcf_lock));
+	opt.action = ci->tcf_action;
+
+	if (nla_put(skb, TCA_CTINFO_ACT, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	if (cp->mode & CTINFO_MODE_SETDSCP) {
+		dscpparm.mask = cp->dscpmask;
+		dscpparm.statemask = cp->dscpstatemask;
+		if (nla_put(skb, TCA_CTINFO_DSCP_PARMS, sizeof(dscpparm),
+			    &dscpparm))
+			goto nla_put_failure;
+
+		if (nla_put_flag(skb, TCA_CTINFO_MODE_DSCP))
+			goto nla_put_failure;
+	}
+
+	if (cp->zone) {
+		if (nla_put_u16(skb, TCA_CTINFO_ZONE, cp->zone))
+			goto nla_put_failure;
+	}
+
+	tcf_tm_dump(&t, &ci->tcf_tm);
+	if (nla_put_64bit(skb, TCA_CTINFO_TM, sizeof(t), &t,
+			  TCA_CTINFO_PAD))
+		goto nla_put_failure;
+
+	spin_unlock_bh(&ci->tcf_lock);
+
+	return skb->len;
+
+nla_put_failure:
+	spin_unlock_bh(&ci->tcf_lock);
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_ctinfo_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops,
+			     struct netlink_ext_ack *extack)
+{
+	struct tc_action_net *tn = net_generic(net, ctinfo_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops, extack);
+}
+
+static int tcf_ctinfo_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, ctinfo_net_id);
+
+	return tcf_idr_search(tn, a, index);
+}
+
+static struct tc_action_ops act_ctinfo_ops = {
+	.kind = "ctinfo",
+	.id = TCA_ID_CTINFO,
+	.owner = THIS_MODULE,
+	.act = tcf_ctinfo_act,
+	.dump = tcf_ctinfo_dump,
+	.init = tcf_ctinfo_init,
+	.walk = tcf_ctinfo_walker,
+	.lookup = tcf_ctinfo_search,
+	.size = sizeof(struct tcf_ctinfo),
+};
+
+static __net_init int ctinfo_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, ctinfo_net_id);
+
+	return tc_action_net_init(tn, &act_ctinfo_ops);
+}
+
+static void __net_exit ctinfo_exit_net(struct list_head *net_list)
+{
+	tc_action_net_exit(net_list, ctinfo_net_id);
+}
+
+static struct pernet_operations ctinfo_net_ops = {
+	.init = ctinfo_init_net,
+	.exit_batch = ctinfo_exit_net,
+	.id = &ctinfo_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+static int __init ctinfo_init_module(void)
+{
+	return tcf_register_action(&act_ctinfo_ops, &ctinfo_net_ops);
+}
+
+static void __exit ctinfo_cleanup_module(void)
+{
+	tcf_unregister_action(&act_ctinfo_ops, &ctinfo_net_ops);
+}
+
+module_init(ctinfo_init_module);
+module_exit(ctinfo_cleanup_module);
+MODULE_AUTHOR("Kevin Darbyshire-Bryant <ldir(a)darbyshire-bryant.me.uk>");
+MODULE_DESCRIPTION("Conntrack mark to DSCP restoring");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/tc-testing/config b/tools/testing/selftests/tc-testing/config
index 203302065458..9d1fddcfb887 100644
--- a/tools/testing/selftests/tc-testing/config
+++ b/tools/testing/selftests/tc-testing/config
@@ -37,6 +37,7 @@ CONFIG_NET_ACT_SKBEDIT=m
 CONFIG_NET_ACT_CSUM=m
 CONFIG_NET_ACT_VLAN=m
 CONFIG_NET_ACT_BPF=m
+CONFIG_NET_ACT_CONNDSCP=m
 CONFIG_NET_ACT_CONNMARK=m
 CONFIG_NET_ACT_SKBMOD=m
 CONFIG_NET_ACT_IFE=m
--
2.20.1 (Apple Git-117)
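
For readers following the parameter checks in tcf_ctinfo_init(), here is a minimal userspace sketch of the same two rules: the DSCP mask must have six contiguous bits set starting at its lowest set bit, and it must not overlap the state mask. The helper names (lowest_set_bit, dscp_parms_valid, dscp_from_mark) are hypothetical and are not part of the patch. lowest_set_bit() is only a stand-in for the kernel's __ffs(). dscp_from_mark() is an assumption about how the stored mask and shift might be applied to a conntrack mark, not a copy of tcf_ctinfo_dscp_set().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Index of the lowest set bit. Userspace stand-in for the kernel's
 * __ffs(); like __ffs(), it must not be called with word == 0.
 */
static unsigned int lowest_set_bit(uint32_t word)
{
	unsigned int i = 0;

	assert(word != 0);
	while (!(word & 1u)) {
		word >>= 1;
		i++;
	}
	return i;
}

/*
 * Mirrors the two checks quoted from tcf_ctinfo_init():
 *  - six contiguous bits must be set, counted from the lowest set bit
 *  - mask and statemask must not share any bit
 */
static bool dscp_parms_valid(uint32_t mask, uint32_t statemask)
{
	unsigned int shift = mask ? lowest_set_bit(mask) : 0;

	if ((0x3f & (mask >> shift)) != 0x3f)
		return false;
	if (mask & statemask)
		return false;
	return true;
}

/*
 * Hypothetical helper (assumption, not taken from the patch): pull the
 * 6-bit DSCP value out of a conntrack mark using the mask and its shift.
 */
static uint8_t dscp_from_mark(uint32_t mark, uint32_t mask)
{
	unsigned int shift = mask ? lowest_set_bit(mask) : 0;

	return (uint8_t)(((mark & mask) >> shift) & 0x3f);
}
```

With a mask of 0xfc000000 the shift is 26, so a mark of 0xb8000000 yields DSCP 46. A five-bit mask such as 0x0000001f, or a state mask that shares a bit with the DSCP mask, is rejected, matching the -EINVAL paths in the patch.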
