Add specification for test metadata to the KTAP v2 spec.
KTAP v1 only specifies the output format of very basic test information:
test result and test name. Any additional test information either gets
added to general diagnostic data or is not included in the output at all.
The purpose of KTAP metadata is to create a framework to include and
easily identify additional important test information in KTAP.
KTAP metadata could include any test information that is pertinent for
user interaction before or after the running of the test. For example,
the test file path or the test speed.
Since this includes a large variety of information, this specification
will recognize notable types of KTAP metadata to ensure consistent format
across test frameworks. See the full list of types in the specification.
Example of KTAP Metadata:
KTAP version 2
# ktap_test: main
# ktap_arch: uml
1..1
KTAP version 2
# ktap_test: suite_1
# ktap_subsystem: example
# ktap_test_file: lib/test.c
1..2
ok 1 test_1
# ktap_test: test_2
# ktap_speed: very_slow
# custom_is_flaky: true
ok 2 test_2
ok 1 suite_1
The changes to the KTAP specification outline the format, location, and
different types of metadata.
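The metadata lines in the example above all share one shape: "# <prefix>_<metadata type>: <metadata value>". As a rough illustration of how a consumer might split such a line, here is a minimal C sketch. It is not the KUnit parser linked below; the struct ktap_meta type, the ktap_parse_metadata() name, and the buffer sizes are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one parsed metadata line. */
struct ktap_meta {
	char prefix[32];
	char type[64];
	char value[256];
};

/*
 * Split "# <prefix>_<type>: <value>" into its three parts.
 * Returns false when the line does not have that shape.
 */
bool ktap_parse_metadata(const char *line, struct ktap_meta *out)
{
	const char *p, *us, *colon;
	size_t n;

	assert(line != NULL && out != NULL);

	/* Nested tests are indented, so skip leading blanks. */
	p = line;
	while (*p == ' ' || *p == '\t')
		p++;
	if (p[0] != '#' || p[1] != ' ')
		return false;
	p += 2;

	/* The prefix ends at the first '_' and has no ' ' or ':'. */
	us = strchr(p, '_');
	if (us == NULL || us == p)
		return false;
	n = (size_t)(us - p);
	if (n >= sizeof(out->prefix) || memchr(p, ' ', n) || memchr(p, ':', n))
		return false;
	memcpy(out->prefix, p, n);
	out->prefix[n] = '\0';

	/* The type runs up to the first ": " and has no spaces. */
	p = us + 1;
	colon = strchr(p, ':');
	if (colon == NULL || colon == p || colon[1] != ' ')
		return false;
	n = (size_t)(colon - p);
	if (n >= sizeof(out->type) || memchr(p, ' ', n))
		return false;
	memcpy(out->type, p, n);
	out->type[n] = '\0';

	/* The value is the rest of the line, minus any line ending. */
	p = colon + 2;
	n = strcspn(p, "\r\n");
	if (n >= sizeof(out->value))
		return false;
	memcpy(out->value, p, n);
	out->value[n] = '\0';
	return true;
}
```

Note that an ordinary diagnostic line such as "# test_2: FAIL" also fits this shape, so a caller would still need to check the returned prefix against the ones it recognizes ("ktap", a framework name, or "custom") before treating the line as metadata.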
Here is a link to a version of the KUnit parser that is able to parse test
metadata lines for KTAP version 2. Note this includes test metadata
lines for the main level of KTAP.
Link: https://kunit-review.googlesource.com/c/linux/+/5889
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Documentation/dev-tools/ktap.rst | 163 ++++++++++++++++++++++++++++++-
1 file changed, 159 insertions(+), 4 deletions(-)
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index ff77f4aaa6ef..4480eaf5bbc3 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -17,19 +17,20 @@ KTAP test results describe a series of tests (which may be nested: i.e., test
can have subtests), each of which can contain both diagnostic data -- e.g., log
lines -- and a final result. The test structure and results are
machine-readable, whereas the diagnostic data is unstructured and is there to
-aid human debugging.
+aid human debugging. One exception to this is test metadata lines, a type
+of diagnostic line. Test metadata is used to identify important supplemental
+test information and can be machine-readable.
KTAP output is built from four different types of lines:
- Version lines
- Plan lines
- Test case result lines
-- Diagnostic lines
+- Diagnostic lines (including test metadata)
In general, valid KTAP output should also form valid TAP output, but some
information, in particular nested test results, may be lost. Also note that
there is a stagnant draft specification for TAP14, KTAP diverges from this in
-a couple of places (notably the "Subtest" header), which are described where
-relevant later in this document.
+a couple of places, which are described where relevant later in this document.
Version lines
-------------
@@ -166,6 +167,154 @@ even if they do not start with a "#": this is to capture any other useful
kernel output which may help debug the test. It is nevertheless recommended
that tests always prefix any diagnostic output they have with a "#" character.
+KTAP metadata lines
+-------------------
+
+KTAP metadata lines are a subset of diagnostic lines that are used to include
+and easily identify important supplemental test information in KTAP.
+
+.. code-block:: none
+
+ # <prefix>_<metadata type>: <metadata value>
+
+The <prefix> indicates where to find the specification for the type of
+metadata. The metadata types listed below use the prefix "ktap" (See Types of
+KTAP Metadata).
+
+Types that are instead specified by an individual test framework use the
+framework name as the prefix. For example, a metadata type documented by the
+kselftest specification would use the prefix "kselftest". Any metadata type
+that is not listed in a specification must use the prefix "custom". Note the
+prefix must not include spaces or the characters ":" or "_".
+
+The format of <metadata type> and <value> varies based on the type. See the
+individual specification. For "custom" types the <metadata type> can be any
+string excluding ":", spaces, or newline characters and the <value> can be any
+string.
+
+**Location:**
+
+The first KTAP metadata entry for a test must be "# ktap_test: <test name>",
+which acts as a header to associate metadata with the correct test.
+
+For test cases, the location of the metadata is between the prior test result
+line and the current test result line. For test suites, the location of the
+metadata is between the suite's version line and test plan line. See the
+example below.
+
+KTAP metadata for a test does not need to be contiguous. For example, a kernel
+warning or other diagnostic output could interrupt metadata lines. However, it
+is recommended to keep a test's metadata lines together when possible, as this
+improves readability.
+
+**Here is an example of using KTAP metadata:**
+
+::
+
+ KTAP version 2
+ # ktap_test: main
+ # ktap_arch: uml
+ 1..1
+ KTAP version 2
+ # ktap_test: suite_1
+ # ktap_subsystem: example
+ # ktap_test_file: lib/test.c
+ 1..2
+ ok 1 test_1
+ # ktap_test: test_2
+ # ktap_speed: very_slow
+ # custom_is_flaky: true
+ ok 2 test_2
+ # suite_1 passed
+ ok 1 suite_1
+
+In this example, the tests are running on UML. The test suite "suite_1" is part
+of the subsystem "example" and belongs to the file "lib/test.c". It has
+two subtests, "test_1" and "test_2". The subtest "test_2" has a speed of
+"very_slow" and has been marked with a custom KTAP metadata type called
+"custom_is_flaky" with the value of "true".
+
+**Types of KTAP Metadata:**
+
+This is the current list of KTAP metadata types recognized in this
+specification. Note that all of these metadata types are optional (except for
+ktap_test as the KTAP metadata header).
+
+- ``ktap_test``: Name of test (used as header of KTAP metadata). This should
+ match the test name printed in the test result line: "ok 1 [test_name]".
+
+- ``ktap_module``: Name of the module containing the test
+
+- ``ktap_subsystem``: Name of the subsystem being tested
+
+- ``ktap_start_time``: Time tests started in ISO8601 format
+
+ - Example: "# ktap_start_time: 2024-01-09T13:09:01.990000+00:00"
+
+- ``ktap_duration``: Time taken (in seconds) to execute the test
+
+ - Example: "# ktap_duration: 10.154s"
+
+- ``ktap_speed``: Category of how fast test runs: "normal", "slow", or
+ "very_slow"
+
+- ``ktap_test_file``: Path to source file containing the test. This metadata
+ line can be repeated if the test is spread across multiple files.
+
+ - Example: "# ktap_test_file: lib/test.c"
+
+- ``ktap_generated_file``: Description of and path to file generated during
+ test execution. This could be a core dump, generated filesystem image, some
+ form of visual output (for graphics drivers), etc. This metadata line can be
+ repeated to attach multiple files to the test.
+
+ - Example: "# ktap_generated_file: Core dump: /var/lib/systemd/coredump/hello.core"
+
+- ``ktap_log_file``: Path to file containing kernel log test output
+
+ - Example: "# ktap_log_file: /sys/kernel/debugfs/kunit/example/results"
+
+- ``ktap_error_file``: Path to file containing context for test failure or
+ error. This could include the difference between optimal test output and
+ actual test output.
+
+ - Example: "# ktap_error_file: fs/results/example.out.bad"
+
+- ``ktap_results_url``: Link to webpage describing this test run and its
+ results
+
+ - Example: "# ktap_results_url: https://kcidb.kernelci.org/hello"
+
+- ``ktap_arch``: Architecture used during test run
+
+ - Example: "# ktap_arch: x86_64"
+
+- ``ktap_compiler``: Compiler used during test run
+
+ - Example: "# ktap_compiler: gcc (GCC) 10.1.1 20200507 (Red Hat 10.1.1-1)"
+
+- ``ktap_repository_url``: Link to git repository of the checked out code.
+
+ - Example: "# ktap_repository_url: https://github.com/torvalds/linux.git"
+
+- ``ktap_git_branch``: Name of git branch of checked out code
+
+ - Example: "# ktap_git_branch: kselftest/kunit"
+
+- ``ktap_kernel_version``: Version of Linux Kernel being used during test run
+
+ - Example: "# ktap_kernel_version: 6.7-rc1"
+
+- ``ktap_commit_hash``: The full git commit hash of the checked out base code.
+
+ - Example: "# ktap_commit_hash: 064725faf8ec2e6e36d51e22d3b86d2707f0f47f"
+
+**Other Metadata Types:**
+
+There can also be KTAP metadata that is not included in the recognized list
+above. This metadata must be prefixed with the test framework, e.g. "kselftest",
+or with the prefix "custom". For example, "# custom_batch: 20".
+
Unknown lines
-------------
@@ -206,6 +355,7 @@ An example of a test with two nested subtests:
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: example
1..2
ok 1 test_1
not ok 2 test_2
@@ -219,6 +369,7 @@ An example format with multiple levels of nested testing:
KTAP version 2
1..2
KTAP version 2
+ # ktap_test: example_test_1
1..2
KTAP version 2
1..2
@@ -254,6 +405,7 @@ Example KTAP output
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: main_test
1..3
KTAP version 2
1..1
@@ -261,11 +413,14 @@ Example KTAP output
ok 1 test_1
ok 1 example_test_1
KTAP version 2
+ # ktap_test: example_test_2
+ # ktap_speed: slow
1..2
ok 1 test_1 # SKIP test_1 skipped
ok 2 test_2
ok 2 example_test_2
KTAP version 2
+ # ktap_test: example_test_3
1..3
ok 1 test_1
# test_2: FAIL
base-commit: 906f02e42adfbd5ae70d328ee71656ecb602aaf5
--
2.43.0.429.g432eaa2c6b-goog
The seccomp benchmark test (for validating the benefit of bitmaps) can
be sensitive to scheduling speed, so pin the process to a single CPU,
which appears to significantly improve reliability, and loosen the
"close enough" checking to allow up to 10% variance instead of 1%.
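To make the effect of the looser check concrete, here is a small standalone sketch of a "close enough" comparison with a configurable tolerance. Only the bump computation (a fraction of the value, with a floor of 2.0) is taken from the diff below; the final comparison and the names approx_within() and max_double() are assumptions for illustration, since the rest of approx() is not shown in this patch.

```c
#include <stdbool.h>

/* Local helper so the sketch does not depend on a MAX() macro. */
static double max_double(double a, double b)
{
	return a > b ? a : b;
}

/*
 * Treat two measurements as equal when the larger one does not
 * exceed the smaller one plus its "bump": the given fraction of
 * the value, or 2.0, whichever is larger.
 */
bool approx_within(int i_one, int i_two, double fraction)
{
	double one = i_one, two = i_two;
	double one_bump = one + max_double(one * fraction, 2.0);
	double two_bump = two + max_double(two * fraction, 2.0);

	if (one == two)
		return true;
	if (one > two)
		return one <= two_bump;
	return two <= one_bump;
}
```

Under these assumptions, approx_within(1000, 1050, 0.01) is false while approx_within(1000, 1050, 0.1) is true, which is the kind of 5% jitter the 10% setting is meant to absorb.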
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202402061002.3a8722fd-oliver.sang@intel.com
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Will Drewry <wad(a)chromium.org>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
v2:
- improve comment about selecting CPU (broonie)
- loosen variance check from 1% to 10%
v1: https://lore.kernel.org/all/20240206095642.work.502-kees@kernel.org/
---
.../selftests/seccomp/seccomp_benchmark.c | 38 ++++++++++++++++++-
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 5b5c9d558dee..9d7aa5a730e0 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,7 +4,9 @@
*/
#define _GNU_SOURCE
#include <assert.h>
+#include <err.h>
#include <limits.h>
+#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
@@ -76,8 +78,12 @@ unsigned long long calibrate(void)
bool approx(int i_one, int i_two)
{
- double one = i_one, one_bump = one * 0.01;
- double two = i_two, two_bump = two * 0.01;
+ /*
+ * This continues to be a noisy test. Instead of a 1% comparison
+ * go with 10%.
+ */
+ double one = i_one, one_bump = one * 0.1;
+ double two = i_two, two_bump = two * 0.1;
one_bump = one + MAX(one_bump, 2.0);
two_bump = two + MAX(two_bump, 2.0);
@@ -119,6 +125,32 @@ long compare(const char *name_one, const char *name_eval, const char *name_two,
return good ? 0 : 1;
}
+/* Pin to a single CPU so the benchmark won't bounce around the system. */
+void affinity(void)
+{
+ long cpu;
+ ulong ncores = sysconf(_SC_NPROCESSORS_CONF);
+ cpu_set_t *setp = CPU_ALLOC(ncores);
+ ulong setsz = CPU_ALLOC_SIZE(ncores);
+
+ /*
+ * Totally unscientific way to avoid CPUs that might be busier:
+ * choose the highest CPU instead of the lowest.
+ */
+ for (cpu = ncores - 1; cpu >= 0; cpu--) {
+ CPU_ZERO_S(setsz, setp);
+ CPU_SET_S(cpu, setsz, setp);
+ if (sched_setaffinity(getpid(), setsz, setp) == -1)
+ continue;
+ printf("Pinned to CPU %ld of %lu\n", cpu + 1, ncores);
+ goto out;
+ }
+ fprintf(stderr, "Could not set CPU affinity -- calibration may not work well\n");
+
+out:
+ CPU_FREE(setp);
+}
+
int main(int argc, char *argv[])
{
struct sock_filter bitmap_filter[] = {
@@ -153,6 +185,8 @@ int main(int argc, char *argv[])
system("grep -H . /proc/sys/net/core/bpf_jit_enable");
system("grep -H . /proc/sys/net/core/bpf_jit_harden");
+ affinity();
+
if (argc > 1)
samples = strtoull(argv[1], NULL, 0);
else
--
2.34.1
I have been steadily working but struggled to find a seamlessly
integrated way to implement tty frontend until Guilherme inspired me
that multi-backend and tty frontend are actually two separate entities.
This submission presents the second iteration of my efforts, with these
notable changes from v1:
1. pstore.backend no longer acts as "registered backend", but "backends
eligible for registration".
2. drop subdir since it will break user space
3. drop tty frontend since I haven't yet devised a satisfactory
implementation strategy
A heartfelt thank you to Kees and Guilherme for your suggestions.
I firmly believe that a tty frontend is crucial for kdump debugging,
and I am still dedicating effort to developing one. I hope to accomplish
it in the future with a deeper understanding of the tty driver :)
Yuanhe Shu (3):
pstore: add multi-backend support
Documentation: adjust pstore backend related document
tools/testing: adjust pstore backend related selftest
Documentation/ABI/testing/pstore | 8 +-
.../admin-guide/kernel-parameters.txt | 4 +-
fs/pstore/ftrace.c | 29 ++-
fs/pstore/inode.c | 19 +-
fs/pstore/internal.h | 4 +-
fs/pstore/platform.c | 225 ++++++++++++------
fs/pstore/pmsg.c | 24 +-
include/linux/pstore.h | 29 +++
tools/testing/selftests/pstore/common_tests | 8 +-
.../selftests/pstore/pstore_post_reboot_tests | 65 ++---
tools/testing/selftests/pstore/pstore_tests | 2 +-
11 files changed, 293 insertions(+), 124 deletions(-)
--
2.39.3
From: Willem de Bruijn <willemb(a)google.com>
This test is time sensitive. It may fail on virtual machines and for
debug builds.
Continue to run in these environments to get code coverage. But
optionally suppress failure for timing errors (only). This is
controlled with environment variable KSFT_MACHINE_SLOW.
The test continues to return 0 (KSFT_PASS), rather than the KSFT_XFAIL
discussed previously. Making so_txtime.c return KSFT_XFAIL, and then
making so_txtime.sh tell those runs apart from KSFT_FAIL and pass the
result on, added a bunch of (fragile bash) boilerplate, while the
result is interpreted the same as KSFT_PASS anyway.
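The decision described above can be sketched in isolation as follows. This is an illustrative stand-in, not the code in so_txtime.c: the function name variance_is_fatal() and its parameters are assumptions, and only the KSFT_MACHINE_SLOW variable and the message text come from the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Report whether a timing miss should fail the test.
 * The miss is always logged; it is fatal only when the
 * KSFT_MACHINE_SLOW environment variable is not set.
 */
bool variance_is_fatal(long long tstop_us, long long texpect_us,
		       int variance_us)
{
	if (llabs(tstop_us - texpect_us) <= variance_us)
		return false;

	fprintf(stderr, "exceeds variance (%d us)\n", variance_us);
	return getenv("KSFT_MACHINE_SLOW") == NULL;
}
```

A caller would exit(1) when this returns true and carry on otherwise, so payload mismatches and other non-timing errors remain fatal regardless of the variable.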
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/so_txtime.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 2672ac0b6d1f..8457b7ccbc09 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -134,8 +134,11 @@ static void do_recv_one(int fdr, struct timed_send *ts)
if (rbuf[0] != ts->data)
error(1, 0, "payload mismatch. expected %c", ts->data);
- if (llabs(tstop - texpect) > cfg_variance_us)
- error(1, 0, "exceeds variance (%d us)", cfg_variance_us);
+ if (llabs(tstop - texpect) > cfg_variance_us) {
+ fprintf(stderr, "exceeds variance (%d us)\n", cfg_variance_us);
+ if (!getenv("KSFT_MACHINE_SLOW"))
+ exit(1);
+ }
}
static void do_recv_verify_empty(int fdr)
--
2.43.0.429.g432eaa2c6b-goog
Selftests here check not only that connect()/accept() for
TCP-AO/TCP-MD5/non-signed-TCP combinations do/don't establish
connections, but also counters: those are per-AO-key, per-socket and
per-netns.
The counters are checked on the server's side, as the server listener
has TCP-AO/TCP-MD5/no keys for different peers. All tests run in
the same namespaces with the same veth pair, created in test_init().
After close() on both client and server, the sides go through
the regular FIN/ACK + FIN/ACK sequence, which proceeds in the background.
If the selftest has already started a new testing scenario and read the
per-netns counters, it may fail in the end iff it doesn't expect
the TCPAOGood per-netns counter to go up during the test.
Let's just kill both TCP-AO sides - that will avoid any asynchronous
background TCP-AO segments going to either side.
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/all/20240201132153.4d68f45e@kernel.org/T/#u
Fixes: 6f0c472a6815 ("selftests/net: Add TCP-AO + TCP-MD5 + no sign listen socket tests")
Signed-off-by: Dmitry Safonov <dima(a)arista.com>
---
tools/testing/selftests/net/tcp_ao/unsigned-md5.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
index c5b568cd7d90..6b59a652159f 100644
--- a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
+++ b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
@@ -110,9 +110,9 @@ static void try_accept(const char *tst_name, unsigned int port,
test_tcp_ao_counters_cmp(tst_name, &ao_cnt1, &ao_cnt2, cnt_expected);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
if (sk > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void server_add_routes(void)
@@ -302,10 +302,10 @@ static void try_connect(const char *tst_name, unsigned int port,
test_ok("%s: connected", tst_name);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
#define PREINSTALL_MD5_FIRST BIT(0)
@@ -486,10 +486,10 @@ static void try_to_add(const char *tst_name, unsigned int port,
}
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void client_add_ip(union tcp_addr *client, const char *ip)
---
base-commit: 021533194476035883300d60fbb3136426ac8ea5
change-id: 20240202-unsigned-md5-netns-counters-35134409362a
Best regards,
--
Dmitry Safonov <dima(a)arista.com>
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly.
The selftest needs to verify if writing non-contiguous CBMs to the
schemata file behaves as expected in comparison to the information about
non-contiguous CBMs support.
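As a small illustration of what "non-contiguous" means for a capacity bitmask, the following C sketch checks whether the set bits of a mask form a single unbroken run and shows how clearing a bit in the middle of a full mask produces a non-contiguous one. The helper names cbm_is_contiguous() and cbm_punch_hole() are hypothetical and are not taken from the resctrl selftests.

```c
#include <stdbool.h>

/* True when all set bits of mask form one unbroken run, e.g. 0x0ff0. */
bool cbm_is_contiguous(unsigned long mask)
{
	if (mask == 0)
		return false;

	/* Shift the lowest set bit down to bit 0. */
	while ((mask & 1UL) == 0)
		mask >>= 1;

	/* A run of ones starting at bit 0, plus one, is a power of two. */
	return (mask & (mask + 1UL)) == 0;
}

/*
 * Clear a single bit of a mask. The caller must pass a bit index
 * smaller than the width of unsigned long.
 */
unsigned long cbm_punch_hole(unsigned long full_mask, unsigned int bit)
{
	return full_mask & ~(1UL << bit);
}
```

For example, cbm_punch_hole(0xfff, 5) yields 0xfdf, which is non-contiguous, whereas clearing bit 0 yields 0xffe, which is still a single run. A test along these lines would write such a mask to the schemata file and compare the outcome with what the hardware reports about non-contiguous support.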
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements functionality
similar to that of the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v4:
- Changes to error failure return values in non-contiguous test.
- Some minor text refactoring without functional changes.
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 2 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 10 +-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 94 ++++++++++++++++---
7 files changed, 192 insertions(+), 24 deletions(-)
--
2.43.0
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space that can then
be filled with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
In addition: mmap() has two related changes.
The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing can expect their behavior
to remain unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE to mmap().
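A userspace sketch of the flow described above might look like the following. It is a hedged illustration, not code from this series: the wrapper mseal_raw() and the probe seal_and_probe() are names invented for this example, the syscall is issued through syscall(2) because a libc wrapper is not assumed, and MAP_SEALABLE is only used when the installed headers define it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; reports ENOSYS when the headers lack the syscall number. */
int mseal_raw(void *addr, size_t len, unsigned long flags)
{
#ifdef __NR_mseal
	return (int)syscall(__NR_mseal, addr, len, flags);
#else
	(void)addr;
	(void)len;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Map one page, seal it, then try to change its protection.
 * Returns 1 if the page was sealed and mprotect() was refused,
 * 0 if sealing is unavailable or failed on this system, and
 * -1 if mprotect() unexpectedly succeeded on a sealed page.
 */
int seal_and_probe(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *p;

#ifdef MAP_SEALABLE
	/* Opt in to sealing, as this series requires. */
	flags |= MAP_SEALABLE;
#endif
	p = mmap(NULL, len, PROT_READ, flags, -1, 0);
	if (p == MAP_FAILED)
		return 0;

	if (mseal_raw(p, len, 0) != 0) {
		munmap(p, len);
		return 0;
	}

	/* The V7 changelog says blocked operations return EPERM. */
	if (mprotect(p, len, PROT_READ | PROT_WRITE) == 0)
		return -1;
	return 1;
}
```

On a kernel without mseal() the probe simply returns 0, so it can be run safely anywhere. Once sealed, the page cannot be unmapped, which is why the success path leaves the mapping in place.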
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 215 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 8 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 48 +
mm/madvise.c | 12 +
mm/mmap.c | 35 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 343 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++
33 files changed, 2756 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.429.g432eaa2c6b-goog