From: Roberto Sassu <roberto.sassu(a)huawei.com>
IMA and EVM are not effectively LSMs, mainly because, in the past, they
could not provide a security blob while another LSM was active.
That has changed in recent years: the LSM stacking feature now makes it
possible to stack multiple LSMs together and allows each of them to provide a
security blob for most kernel objects. While the LSM stacking feature still
has some limitations being worked out, it is already suitable for turning IMA
and EVM into LSMs.
The main purpose of this patch set is to remove the IMA and EVM function
calls hardcoded in the LSM infrastructure and in other places in the kernel,
and to register those functions as LSM hook implementations, so that they are
called by the LSM infrastructure like those of any other LSM.
This patch set introduces two new LSMs 'ima' and 'evm', so that functions
can be registered to their respective LSM, and removes the 'integrity' LSM.
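As a rough illustration of what this registration looks like (a minimal
sketch, not code from the patches: the hooks shown are picked for the
example, and LSM_ID_IMA is the identifier this series adds to the uapi):

  /* Minimal sketch of an LSM registration for 'ima'; hook choices and
   * the LSM_ORDER_LAST ordering are illustrative assumptions. */
  static struct security_hook_list ima_hooks[] __ro_after_init = {
          LSM_HOOK_INIT(inode_post_setattr, ima_inode_post_setattr),
          LSM_HOOK_INIT(file_release, ima_file_free),
  };

  static const struct lsm_id ima_lsmid = {
          .name = "ima",
          .id   = LSM_ID_IMA,
  };

  static int __init init_ima_lsm(void)
  {
          security_add_hooks(ima_hooks, ARRAY_SIZE(ima_hooks), &ima_lsmid);
          return 0;
  }

  DEFINE_LSM(ima) = {
          .name  = "ima",
          .init  = init_ima_lsm,
          .order = LSM_ORDER_LAST,
  };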
integrity_kernel_module_request() was moved to IMA, since deadlock could
occur only there. integrity_inode_free() was replaced with ima_inode_free()
(EVM does not need to free memory).
In order to make 'ima' and 'evm' independent LSMs, it was necessary to
split integrity metadata used by both IMA and EVM, and to let them manage
their own. The special case of the IMA_NEW_FILE flag, managed by IMA and
used by EVM, was handled by introducing a new flag in EVM, EVM_NEW_FILE,
managed by two additional LSM hooks, evm_post_path_mknod() and
evm_file_release(), equivalent to their counterparts ima_post_path_mknod()
and ima_file_free().
In addition to splitting the metadata, it was decided to embed the
evm_iint_cache structure directly into the inode security blob, since it is
small and since EVM can no longer rely on IMA to allocate it (IMA now uses a
different structure).
On the other hand, to avoid memory pressure concerns, only a pointer to the
ima_iint_cache structure is stored in the inode security blob, and the
structure is allocated on demand, like before.
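A sketch of the resulting split, using the structure and accessor names from
the changelog below (illustrative only, not copied from the patches):

  /* EVM: the whole structure is embedded in the inode security blob. */
  struct lsm_blob_sizes evm_blob_sizes __ro_after_init = {
          .lbs_inode = sizeof(struct evm_iint_cache),
  };

  static struct evm_iint_cache *evm_iint_inode(const struct inode *inode)
  {
          if (unlikely(!inode->i_security))
                  return NULL;
          return inode->i_security + evm_blob_sizes.lbs_inode;
  }

  /* IMA: only a pointer is reserved; ima_iint_cache itself is still
   * allocated on demand and stored via ima_inode_set_iint(). */
  struct lsm_blob_sizes ima_blob_sizes __ro_after_init = {
          .lbs_inode = sizeof(struct ima_iint_cache *),
  };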
Another follow-up change was removing the iint parameter from
evm_verifyxattr(), which IMA used to pass integrity metadata to EVM. After
splitting the metadata and aligning EVM_NEW_FILE with IMA_NEW_FILE, this
parameter was no longer necessary.
The last part was to ensure that the relative order of the IMA and EVM
functions is preserved after they become LSMs. Since the order of lsm_info
structures in the .lsm_info.init section depends on the order in which the
object files containing those structures are passed to the linker of the
kernel image, and since IMA comes before EVM in the Makefile, that is
sufficient to ensure that IMA functions are executed before the EVM ones.
The patch set is organized as follows.
Patches 1-9 make IMA and EVM functions suitable to be registered to the LSM
infrastructure, by aligning function parameters.
Patches 10-18 add new LSM hooks in the same places where IMA and EVM
functions are called, if there is no LSM hook already.
Patch 19 moves integrity_kernel_module_request() to IMA, as a prerequisite
for removing the 'integrity' LSM.
Patches 20-22 introduce the new standalone LSMs 'ima' and 'evm', and move
hardcoded calls to IMA, EVM and integrity functions to those LSMs.
Patches 23-24 remove the dependency on the 'integrity' LSM by splitting
integrity metadata, so that the 'ima' and 'evm' LSMs can use their own.
They also duplicate iint_lockdep_annotate() in ima_main.c, since the mutex
field was moved from integrity_iint_cache to ima_iint_cache.
Patch 25 finally removes the 'integrity' LSM, since 'ima' and 'evm' are now
self-contained and independent.
The patch set applies on top of lsm/next, commit 97280fa1ed94 ("Automated
merge of 'dev' into 'next'").
Changelog:
v9:
- Add new Reviewed-by/Acked-by
- Rewrite documentation of ima_kernel_module_request() (suggested by
Stefan)
- Move evm_inode_post_setxattr() registration after evm_inode_setxattr()
(suggested by Stefan)
v8:
- Restore dynamic allocation of IMA integrity metadata, and store only the
pointer in the inode security blob
- Select SECURITY_PATH both in IMA and EVM
- Rename evm_file_free() to evm_file_release()
- Unconditionally register evm_post_path_mknod()
- Introduce the new ima_iint.c file for the management of IMA integrity
metadata
- Introduce ima_inode_set_iint()/ima_inode_get_iint() in ima.h to
respectively store/retrieve the IMA integrity metadata pointer
- Replace ima_iint_inode() with ima_inode_get() and ima_iint_find(), with
the same behavior as integrity_inode_get() and integrity_iint_find()
- Initialize the ima_iint_cache in ima_iintcache_init() and call it from
init_ima_lsm()
- Move integrity_kernel_module_request() to IMA in a separate patch
(suggested by Mimi)
- Compile ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS
is enabled
- Remove ima_inode_alloc_security() and ima_inode_free_security(), since
the IMA integrity metadata is not fully embedded in the inode security
blob
- Fix the missed initialization of ima_iint_cache in
process_measurement() and __ima_inode_hash()
- Add a sentence in 'evm: Move to LSM infrastructure' to mention moving
evm_inode_remove_acl(), evm_inode_post_remove_acl() and
evm_inode_post_set_acl() to evm_main.c
- Add a sentence in 'ima: Move IMA-Appraisal to LSM infrastructure' to
mention moving ima_inode_remove_acl() to ima_appraise.c
v7:
- Use return instead of goto in __vfs_removexattr_locked() (suggested by
Casey)
- Clarify in security/integrity/Makefile that the order of 'ima' and 'evm'
LSMs depends on the order in which IMA and EVM are compiled
- Move integrity_iint_cache flags to ima.h and evm.h in security/ and
duplicate IMA_NEW_FILE to EVM_NEW_FILE
- Rename evm_inode_get_iint() to evm_iint_inode() and ima_inode_get_iint()
to ima_iint_inode(), check if inode->i_security is NULL, and just return
the pointer from the inode security blob
- Restore the non-NULL checks after ima_iint_inode() and evm_iint_inode()
(suggested by Casey)
- Introduce evm_file_free() to clear EVM_NEW_FILE
- Remove comment about LSM_ORDER_LAST not guaranteeing the order of 'ima'
and 'evm' LSMs
- Lock iint->mutex before reading the IMA_COLLECTED flag in __ima_inode_hash()
and restore the ima_policy_flag check
- Remove patch about the hardcoded ordering of 'ima' and 'evm' LSMs in
security.c
- Add missing ima_inode_free_security() to free iint->ima_hash
- Add the cases for LSM_ID_IMA and LSM_ID_EVM in lsm_list_modules_test.c
- Mention the change in the IMA and EVM post functions for private
inodes
v6:
- See v7
v5:
- Rename security_file_pre_free() to security_file_release() and the LSM
hook file_pre_free_security to file_release (suggested by Paul)
- Move integrity_kernel_module_request() to ima_main.c (renamed to
ima_kernel_module_request())
- Split the integrity_iint_cache structure into ima_iint_cache and
evm_iint_cache, so that IMA and EVM can use disjoint metadata and
reserve space with the LSM infrastructure
- Reserve space for the entire ima_iint_cache and evm_iint_cache
structures, not just the pointer (suggested by Paul)
- Introduce ima_inode_get_iint() and evm_inode_get_iint() to retrieve
respectively the ima_iint_cache and evm_iint_cache structure from the
security blob
- Remove the various non-NULL checks for the ima_iint_cache and
evm_iint_cache structures, since the LSM infrastructure ensures that they
always exist
- Remove the iint parameter from evm_verifyxattr() since IMA and EVM
use disjoint integrity metadata
- Introduce the evm_post_path_mknod() to set the IMA_NEW_FILE flag
- Register the inode_alloc_security LSM hook in IMA and EVM to
initialize the respective integrity metadata structures
- Remove the 'integrity' LSM completely and instead make 'ima' and 'evm'
proper standalone LSMs
- Add the inode parameter to ima_get_verity_digest(), since the inode
field is not present in ima_iint_cache
- Move iint_lockdep_annotate() to ima_main.c (renamed to
ima_iint_lockdep_annotate())
- Remove ima_get_lsm_id() and evm_get_lsm_id(), since IMA and EVM directly
register the needed LSM hooks
- Enforce 'ima' and 'evm' LSM ordering at LSM infrastructure level
v4:
- Improve short and long description of
security_inode_post_create_tmpfile(), security_inode_post_set_acl(),
security_inode_post_remove_acl() and security_file_post_open()
(suggested by Mimi)
- Improve commit message of 'ima: Move to LSM infrastructure' (suggested
by Mimi)
v3:
- Drop 'ima: Align ima_post_path_mknod() definition with LSM
infrastructure' and 'ima: Align ima_post_create_tmpfile() definition
with LSM infrastructure', define the new LSM hooks with the same
IMA parameters instead (suggested by Mimi)
- Do IS_PRIVATE() check in security_path_post_mknod() and
security_inode_post_create_tmpfile() on the new inode rather than the
parent directory (in the post method it is available)
- Don't export ima_file_check() (suggested by Stefan)
- Remove redundant check of file mode in ima_post_path_mknod() (suggested
by Mimi)
- Mention that ima_post_path_mknod() is now conditionally invoked when
CONFIG_SECURITY_PATH=y (suggested by Mimi)
- Mention when an LSM hook will be introduced in the IMA/EVM alignment
patches (suggested by Mimi)
- Simplify the commit messages when introducing a new LSM hook
- Still keep the 'extern' in the function declaration, until the
declaration is removed (suggested by Mimi)
- Improve documentation of security_file_pre_free()
- Register 'ima' and 'evm' as standalone LSMs (suggested by Paul)
- Initialize the 'ima' and 'evm' LSMs from 'integrity', to keep the
original ordering of IMA and EVM functions as when they were hardcoded
- Return the IMA and EVM LSM IDs to 'integrity' for registration of the
integrity-specific hooks
- Reserve an xattr slot from the 'evm' LSM instead of 'integrity'
- Pass the LSM ID to init_ima_appraise_lsm()
v2:
- Add description for newly introduced LSM hooks (suggested by Casey)
- Clarify in the description of security_file_pre_free() that actions can
be performed while the file is still open
v1:
- Drop 'evm: Complete description of evm_inode_setattr()', 'fs: Fix
description of vfs_tmpfile()' and 'security: Introduce LSM_ORDER_LAST',
they were sent separately (suggested by Christian Brauner)
- Replace dentry with file descriptor parameter for
security_inode_post_create_tmpfile()
- Introduce mode_stripped and pass it as mode argument to
security_path_mknod() and security_path_post_mknod()
- Use goto in do_mknodat() and __vfs_removexattr_locked() (suggested by
Mimi)
- Replace __lsm_ro_after_init with __ro_after_init
- Modify short description of security_inode_post_create_tmpfile() and
security_inode_post_set_acl() (suggested by Stefan)
- Move security_inode_post_setattr() just after security_inode_setattr()
(suggested by Mimi)
- Modify short description of security_key_post_create_or_update()
(suggested by Mimi)
- Add back exported functions ima_file_check() and
evm_inode_init_security() respectively to ima.h and evm.h (reported by
kernel robot)
- Remove extern from prototype declarations and fix style issues
- Remove unnecessary include of linux/lsm_hooks.h in ima_main.c and
ima_appraise.c
Roberto Sassu (25):
ima: Align ima_inode_post_setattr() definition with LSM infrastructure
ima: Align ima_file_mprotect() definition with LSM infrastructure
ima: Align ima_inode_setxattr() definition with LSM infrastructure
ima: Align ima_inode_removexattr() definition with LSM infrastructure
ima: Align ima_post_read_file() definition with LSM infrastructure
evm: Align evm_inode_post_setattr() definition with LSM infrastructure
evm: Align evm_inode_setxattr() definition with LSM infrastructure
evm: Align evm_inode_post_setxattr() definition with LSM
infrastructure
security: Align inode_setattr hook definition with EVM
security: Introduce inode_post_setattr hook
security: Introduce inode_post_removexattr hook
security: Introduce file_post_open hook
security: Introduce file_release hook
security: Introduce path_post_mknod hook
security: Introduce inode_post_create_tmpfile hook
security: Introduce inode_post_set_acl hook
security: Introduce inode_post_remove_acl hook
security: Introduce key_post_create_or_update hook
integrity: Move integrity_kernel_module_request() to IMA
ima: Move to LSM infrastructure
ima: Move IMA-Appraisal to LSM infrastructure
evm: Move to LSM infrastructure
evm: Make it independent from 'integrity' LSM
ima: Make it independent from 'integrity' LSM
integrity: Remove LSM
fs/attr.c | 5 +-
fs/file_table.c | 3 +-
fs/namei.c | 12 +-
fs/nfsd/vfs.c | 3 +-
fs/open.c | 1 -
fs/posix_acl.c | 5 +-
fs/xattr.c | 9 +-
include/linux/evm.h | 117 +-------
include/linux/ima.h | 142 ----------
include/linux/integrity.h | 27 --
include/linux/lsm_hook_defs.h | 20 +-
include/linux/security.h | 59 ++++
include/uapi/linux/lsm.h | 2 +
security/integrity/Makefile | 1 +
security/integrity/digsig_asymmetric.c | 23 --
security/integrity/evm/Kconfig | 1 +
security/integrity/evm/evm.h | 19 ++
security/integrity/evm/evm_crypto.c | 4 +-
security/integrity/evm/evm_main.c | 195 ++++++++++---
security/integrity/iint.c | 197 +------------
security/integrity/ima/Kconfig | 1 +
security/integrity/ima/Makefile | 2 +-
security/integrity/ima/ima.h | 148 ++++++++--
security/integrity/ima/ima_api.c | 23 +-
security/integrity/ima/ima_appraise.c | 66 +++--
security/integrity/ima/ima_iint.c | 142 ++++++++++
security/integrity/ima/ima_init.c | 2 +-
security/integrity/ima/ima_main.c | 148 +++++++---
security/integrity/ima/ima_policy.c | 2 +-
security/integrity/integrity.h | 80 +-----
security/keys/key.c | 10 +-
security/security.c | 263 +++++++++++-------
security/selinux/hooks.c | 3 +-
security/smack/smack_lsm.c | 4 +-
.../selftests/lsm/lsm_list_modules_test.c | 6 +
35 files changed, 906 insertions(+), 839 deletions(-)
create mode 100644 security/integrity/ima/ima_iint.c
--
2.34.1
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folios support any order and multi-size THP is upstreamed[1], so both
file and anonymous folios can have an order greater than 0. Currently,
split_huge_page() only splits a huge page to order-0 pages, but splitting to
orders higher than 0 would better utilize large folios. In addition, Large
Block Sizes (LBS) support in XFS would benefit from it[2]. This patchset adds
support for splitting a large folio into folios of any lower order and uses it
during file folio truncate operations.
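A minimal sketch of the resulting interface, as described under "Details"
below (the signature is an assumption based on this cover letter, not copied
from the patches):

  int split_huge_page_to_list_to_order(struct page *page,
                                       struct list_head *list,
                                       unsigned int new_order);

  /* The existing entry point keeps its historical behaviour by delegating
   * to the new order-aware helper with new_order = 0. */
  static inline int split_huge_page_to_list(struct page *page,
                                            struct list_head *list)
  {
          return split_huge_page_to_list_to_order(page, list, 0);
  }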
For Patch 6, Hugh did not like my approach of minimizing the number of
folios for truncation[3]. I would like to get more feedback on it, especially
from FS people, to decide whether to keep it or not.
The patchset is on top of mm-everything-2024-02-13-01-26.
Changelog
===
Since v3
---
1. Excluded shmem folios and pagecache folios without FS support from
splitting to any order (per Hugh Dickins).
2. Allowed splitting anonymous large folios to any lower order since
multi-size THP is upstreamed.
3. Adapted selftests code to new framework.
Since v2
---
1. Fixed an issue in __split_page_owner() introduced during my rebase
Since v1
---
1. Changed the split_page_memcg() and split_page_owner() parameters to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patches 3 and 4 add a new_order parameter to split_page_memcg() and
split_page_owner() and prepare for the upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
to any lower order. The original split_huge_page_to_list() calls
split_huge_page_to_list_to_order() with new_order = 0.
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
truncation instead of splitting the large folio all the way down to order-0.
* Patch 7 adds a test API to debugfs and test cases in
split_huge_page_test selftests.
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/all/20231207161211.2374093-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/qzbcjn4gcyxla4gwuj6smlnwknz2wvo5wrjctin6ee…
[3] https://lore.kernel.org/linux-mm/9dd96da-efa2-5123-20d4-4992136ef3ad@google…
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages (except order-1).
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 21 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 149 +++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 8 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 223 +++++++++++++++++-
9 files changed, 382 insertions(+), 72 deletions(-)
--
2.43.0
Arch maintainers, please ack/review patches.
This is a resend of a series from Frank last year[1]. I incorporated Rob's
review comments to unconditionally call unflatten_device_tree() and to fix
up/audit calls to of_have_populated_dt() so that behavior doesn't change.
I need this series so I can add DT based tests in the clk framework.
Either I can merge it through the clk tree once everyone is happy, or
Rob can merge it through the DT tree and provide some branch so I can
base clk patches on it.
Changes from Frank's series[1]:
* Add a DTB loaded KUnit test (see the sketch after this list)
* Make of_have_populated_dt() return false if the DTB isn't from the
bootloader
* Architecture calls made unconditional so that a root node is always
made
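A rough sketch of the kind of KUnit test the first item refers to (what the
actual drivers/of/of_test.c checks is an assumption here; this version only
verifies that a root node got populated):

  #include <kunit/test.h>
  #include <linux/of.h>

  /* With unflatten_device_tree() called unconditionally, a root node
   * should exist whether the DTB came from firmware or from the
   * built-in empty DTB. */
  static void dtb_root_node_populated(struct kunit *test)
  {
          KUNIT_EXPECT_NOT_ERR_OR_NULL(test, of_root);
  }

  static struct kunit_case of_dtb_test_cases[] = {
          KUNIT_CASE(dtb_root_node_populated),
          {}
  };

  static struct kunit_suite of_dtb_suite = {
          .name = "of_dtb",
          .test_cases = of_dtb_test_cases,
  };
  kunit_test_suites(&of_dtb_suite);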
Changes from v2 (https://lore.kernel.org/r/20240130004508.1700335-1-sboyd@kernel.org):
* Reorder patches to have OF changes largely first
* No longer modify initial_boot_params if ACPI=y
* Put arm64 patch back to v1
Changes from v1 (https://lore.kernel.org/r/20240112200750.4062441-1-sboyd@kernel.org):
* x86 patch included
* arm64 knocks out initial dtb if acpi is in use
* keep Kconfig hidden but def_bool enabled otherwise
Frank Rowand (2):
of: Create of_root if no dtb provided by firmware
of: unittest: treat missing of_root as error instead of fixing up
Stephen Boyd (5):
of: Always unflatten in unflatten_and_copy_device_tree()
um: Unconditionally call unflatten_device_tree()
x86/of: Unconditionally call unflatten_and_copy_device_tree()
arm64: Unconditionally call unflatten_device_tree()
of: Add KUnit test to confirm DTB is loaded
arch/arm64/kernel/setup.c | 3 +-
arch/um/kernel/dtb.c | 14 ++++----
arch/x86/kernel/devicetree.c | 24 +++++++-------
drivers/of/.kunitconfig | 3 ++
drivers/of/Kconfig | 11 ++++++-
drivers/of/Makefile | 4 ++-
drivers/of/empty_root.dts | 6 ++++
drivers/of/fdt.c | 64 +++++++++++++++++++++++++++---------
drivers/of/of_test.c | 48 +++++++++++++++++++++++++++
drivers/of/platform.c | 3 --
drivers/of/unittest.c | 16 +++------
include/linux/of.h | 25 ++++++++------
12 files changed, 158 insertions(+), 63 deletions(-)
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/empty_root.dts
create mode 100644 drivers/of/of_test.c
Cc: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Frank Rowand <frowand.list(a)gmail.com>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Richard Weinberger <richard(a)nod.at>
Cc: Rob Herring <robh+dt(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <x86(a)kernel.org>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Saurabh Sengar <ssengar(a)linux.microsoft.com>
[1] https://lore.kernel.org/r/20230317053415.2254616-1-frowand.list@gmail.com
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/
https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
Add specification for test metadata to the KTAP v2 spec.
KTAP v1 only specifies the output format of very basic test information:
test result and test name. Any additional test information either gets
added to general diagnostic data or is not included in the output at all.
The purpose of KTAP metadata is to create a framework to include and
easily identify additional important test information in KTAP.
KTAP metadata could include any test information that is pertinent for
user interaction before or after the running of the test. For example,
the test file path or the test speed.
Since this includes a large variety of information, this specification
will recognize notable types of KTAP metadata to ensure a consistent format
across test frameworks. See the full list of types in the specification.
Example of KTAP Metadata:
KTAP version 2
#:ktap_test: main
#:ktap_arch: uml
1..1
KTAP version 2
#:ktap_test: suite_1
#:ktap_subsystem: example
#:ktap_test_file: lib/test.c
1..2
ok 1 test_1
#:ktap_test: test_2
#:ktap_speed: very_slow
# test_2 has begun
#:custom_is_flaky: true
ok 2 test_2
# suite_1 has passed
ok 1 suite_1
The changes to the KTAP specification outline the format, location, and
different types of metadata.
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Changes since v2:
- Change format of metadata line from "# prefix_type: value" to
"#:prefix_type: value".
- Add examples of edge cases, such as diagnostic information interrupting
metadata lines and lines in the wrong location, to the example in the
specification.
- Add current list of accepted prefixes.
- Add notes on indentation and generated files.
- Fix a few typos.
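As an editorial illustration of the "#:<prefix>_<metadata type>: <metadata
value>" format described above, a small stand-alone parser sketch (not part
of this patch):

  #include <stdio.h>
  #include <string.h>

  /* Split "#:<key>: <value>" into key ("<prefix>_<metadata type>") and value. */
  static int parse_ktap_metadata(const char *line, char *key, size_t key_len,
                                 char *value, size_t value_len)
  {
          const char *sep;

          if (strncmp(line, "#:", 2))
                  return -1;              /* not a metadata line */

          sep = strstr(line + 2, ": ");
          if (!sep)
                  return -1;              /* malformed */

          snprintf(key, key_len, "%.*s", (int)(sep - (line + 2)), line + 2);
          snprintf(value, value_len, "%s", sep + 2);
          return 0;
  }

  int main(void)
  {
          char key[64], value[128];

          if (!parse_ktap_metadata("#:ktap_speed: very_slow",
                                   key, sizeof(key), value, sizeof(value)))
                  printf("%s = %s\n", key, value);  /* ktap_speed = very_slow */
          return 0;
  }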
Documentation/dev-tools/ktap.rst | 182 ++++++++++++++++++++++++++++++-
1 file changed, 178 insertions(+), 4 deletions(-)
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index ff77f4aaa6ef..42a7e50b9270 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -17,19 +17,22 @@ KTAP test results describe a series of tests (which may be nested: i.e., test
can have subtests), each of which can contain both diagnostic data -- e.g., log
lines -- and a final result. The test structure and results are
machine-readable, whereas the diagnostic data is unstructured and is there to
-aid human debugging.
+aid human debugging. Since version 2, tests can also contain metadata which
+consists of important supplemental test information and can be
+machine-readable.
+
+KTAP output is built from five different types of lines:
-KTAP output is built from four different types of lines:
- Version lines
- Plan lines
- Test case result lines
- Diagnostic lines
+- Metadata lines
In general, valid KTAP output should also form valid TAP output, but some
information, in particular nested test results, may be lost. Also note that
there is a stagnant draft specification for TAP14, KTAP diverges from this in
-a couple of places (notably the "Subtest" header), which are described where
-relevant later in this document.
+a couple of places, which are described where relevant later in this document.
Version lines
-------------
@@ -166,6 +169,171 @@ even if they do not start with a "#": this is to capture any other useful
kernel output which may help debug the test. It is nevertheless recommended
that tests always prefix any diagnostic output they have with a "#" character.
+KTAP metadata lines
+-------------------
+
+KTAP metadata lines are used to include and easily identify important
+supplemental test information in KTAP. These lines may appear similar to
+diagnostic lines. The format of metadata lines is below:
+
+.. code-block:: none
+
+ #:<prefix>_<metadata type>: <metadata value>
+
+The <prefix> indicates where to find the specification for the type of
+metadata, such as the name of a test framework or "ktap" to indicate this
+specification. The list of currently approved prefixes and where to find the
+documentation of the metadata types is below. Note any metadata type that does
+not use a prefix from the list below must use the prefix "custom".
+
+Current List of Approved Prefixes:
+
+- ``ktap``: See Types of KTAP Metadata below for the list of metadata types.
+
+The format of <metadata type> and <value> varies based on the type. See the
+individual specification. For "custom" types the <metadata type> can be any
+string excluding ":", spaces, or newline characters and the <value> can be any
+string.
+
+**Location:**
+
+The first KTAP metadata line for a test must be "#:ktap_test: <test name>",
+which acts as a header to associate metadata with the correct test. Metadata
+for the main KTAP level uses the test name "main".
+
+For test cases, the location of the metadata is between the prior test result
+line and the current test result line. For test suites, the location of the
+metadata is between the suite's version line and test plan line. For the main
+level, the location of the metadata is between the main version line and main
+test plan line. See the example below.
+
+Note that a test case's metadata is inline with the test's result line. Whereas
+a suite's metadata is inline with the suite's version line and thus will be
+more indented than the suite's result line. Additionally, metadata for the main
+level is inline with the main version line.
+
+KTAP metadata for a test does not need to be contiguous. For example, a kernel
+warning or other diagnostic output could interrupt metadata lines. However, it
+is recommended to keep a test's metadata lines in the correct location and
+together when possible, as this improves readability.
+
+**Example of KTAP metadata:**
+
+::
+
+ KTAP version 2
+ #:ktap_test: main
+ #:ktap_arch: uml
+ 1..1
+ KTAP version 2
+ #:ktap_test: suite_1
+ #:ktap_subsystem: example
+ #:ktap_test_file: lib/test.c
+ 1..2
+ # WARNING: test_1 skipped
+ ok 1 test_1 # SKIP
+ #:ktap_test: test_2
+ #:ktap_speed: very_slow
+ # test_2 has begun
+ #:custom_is_flaky: true
+ ok 2 test_2
+ #:ktap_duration: 1.231s
+ # suite_1 passed
+ ok 1 suite_1
+
+In this example, the tests are running on UML. The test suite "suite_1" is part
+of the subsystem "example" and belongs to the file "lib/test.c". It has
+two subtests, "test_1" and "test_2". The subtest "test_2" has a speed of
+"very_slow" and has been marked with a custom KTAP metadata type called
+"custom_is_flaky" with the value of "true".
+
+Additionally "test_2" has a duration of 1.231s. The ktap_duration line is not
+contiguous with the other metadata and is not in the correct location, which
+is not recommended. However, it is allowed because it is below the
+"#:ktap_test: test_2" line.
+
+**Types of KTAP Metadata:**
+
+This is the current list of KTAP metadata types recognized in this
+specification. Note that all of these metadata types are optional (except for
+ktap_test as the KTAP metadata header).
+
+- ``ktap_test``: Name of test (used as header of KTAP metadata). This should
+ match the test name printed in the test result line: "ok 1 [test_name]".
+
+- ``ktap_module``: Name of the module containing the test
+
+- ``ktap_subsystem``: Name of the subsystem being tested
+
+- ``ktap_start_time``: Time tests started in ISO8601 format
+
+ - Example: "#:ktap_start_time: 2024-01-09T13:09:01.990000+00:00"
+
+- ``ktap_duration``: Time taken (in seconds) to execute the test
+
+ - Example: "#:ktap_duration: 10.154s"
+
+- ``ktap_speed``: Category of how fast test runs: "normal", "slow", or
+ "very_slow"
+
+- ``ktap_test_file``: Path to source file containing the test. This metadata
+ line can be repeated if the test is spread across multiple files.
+
+ - Example: "#:ktap_test_file: lib/test.c"
+
+- ``ktap_generated_file``: Description of and path to file generated during
+ test execution. This could be a core dump, generated filesystem image, some
+ form of visual output (for graphics drivers), etc. This metadata line can be
+ repeated to attach multiple files to the test. Note use ktap_log_file or
+ ktap_error_file instead of this type if more applicable.
+
+ - Example: "#:ktap_generated_file: Core dump: /var/lib/systemd/coredump/hello.core"
+
+- ``ktap_log_file``: Path to file containing kernel log test output
+
+ - Example: "#:ktap_log_file: /sys/kernel/debugfs/kunit/example/results"
+
+- ``ktap_error_file``: Path to file containing context for test failure or
+ error. This could include the difference between optimal test output and
+ actual test output.
+
+ - Example: "#:ktap_error_file: fs/results/example.out.bad"
+
+- ``ktap_results_url``: Link to webpage describing this test run and its
+ results
+
+ - Example: "#:ktap_results_url: https://kcidb.kernelci.org/hello"
+
+- ``ktap_arch``: Architecture used during test run
+
+ - Example: "#:ktap_arch: x86_64"
+
+- ``ktap_compiler``: Compiler used during test run
+
+ - Example: "#:ktap_compiler: gcc (GCC) 10.1.1 20200507 (Red Hat 10.1.1-1)"
+
+- ``ktap_repository_url``: Link to git repository of the checked out code.
+
+ - Example: "#:ktap_respository_url: https://github.com/torvalds/linux.git"
+
+- ``ktap_git_branch``: Name of git branch of checked out code
+
+ - Example: "#:ktap_git_branch: kselftest/kunit"
+
+- ``ktap_kernel_version``: Version of Linux Kernel being used during test run
+
+ - Example: "#:ktap_kernel_version: 6.7-rc1"
+
+- ``ktap_commit_hash``: The full git commit hash of the checked out base code.
+
+ - Example: "#:ktap_commit_hash: 064725faf8ec2e6e36d51e22d3b86d2707f0f47f"
+
+**Other Metadata Types:**
+
+There can also be KTAP metadata that is not included in the recognized list
+above. This metadata must be prefixed with the test framework, e.g. "kselftest",
+or with the prefix "custom". For example, "# custom_batch: 20".
+
Unknown lines
-------------
@@ -206,6 +374,7 @@ An example of a test with two nested subtests:
KTAP version 2
1..1
KTAP version 2
+ #:ktap_test: example
1..2
ok 1 test_1
not ok 2 test_2
@@ -219,6 +388,7 @@ An example format with multiple levels of nested testing:
KTAP version 2
1..2
KTAP version 2
+ #:ktap_test: example_test_1
1..2
KTAP version 2
1..2
@@ -254,6 +424,7 @@ Example KTAP output
KTAP version 2
1..1
KTAP version 2
+ #:ktap_test: main_test
1..3
KTAP version 2
1..1
@@ -261,11 +432,14 @@ Example KTAP output
ok 1 test_1
ok 1 example_test_1
KTAP version 2
+ #:ktap_test: example_test_2
+ #:ktap_speed: slow
1..2
ok 1 test_1 # SKIP test_1 skipped
ok 2 test_2
ok 2 example_test_2
KTAP version 2
+ #:ktap_test: example_test_3
1..3
ok 1 test_1
# test_2: FAIL
base-commit: 906f02e42adfbd5ae70d328ee71656ecb602aaf5
--
2.43.0.687.g38aa6559b0-goog
From: Roberto Sassu <roberto.sassu(a)huawei.com>
IMA and EVM are not effectively LSMs, mainly because, in the past, they
could not provide a security blob while another LSM was active.
That has changed in recent years: the LSM stacking feature now makes it
possible to stack multiple LSMs together and allows each of them to provide a
security blob for most kernel objects. While the LSM stacking feature still
has some limitations being worked out, it is already suitable for turning IMA
and EVM into LSMs.
The main purpose of this patch set is to remove the IMA and EVM function
calls hardcoded in the LSM infrastructure and in other places in the kernel,
and to register those functions as LSM hook implementations, so that they are
called by the LSM infrastructure like those of any other LSM.
This patch set introduces two new LSMs 'ima' and 'evm', so that functions
can be registered to their respective LSM, and removes the 'integrity' LSM.
integrity_kernel_module_request() was moved to IMA, since deadlock could
occur only there. integrity_inode_free() was replaced with ima_inode_free()
(EVM does not need to free memory).
In order to make 'ima' and 'evm' independent LSMs, it was necessary to
split integrity metadata used by both IMA and EVM, and to let them manage
their own. The special case of the IMA_NEW_FILE flag, managed by IMA and
used by EVM, was handled by introducing a new flag in EVM, EVM_NEW_FILE,
managed by two additional LSM hooks, evm_post_path_mknod() and
evm_file_release(), equivalent to their counterparts ima_post_path_mknod()
and ima_file_free().
In addition to splitting the metadata, it was decided to embed the
evm_iint_cache structure directly into the inode security blob, since it is
small and since EVM can no longer rely on IMA to allocate it (IMA now uses a
different structure).
On the other hand, to avoid memory pressure concerns, only a pointer to the
ima_iint_cache structure is stored in the inode security blob, and the
structure is allocated on demand, like before.
Another follow-up change was removing the iint parameter from
evm_verifyxattr(), which IMA used to pass integrity metadata to EVM. After
splitting the metadata and aligning EVM_NEW_FILE with IMA_NEW_FILE, this
parameter was no longer necessary.
The last part was to ensure that the relative order of the IMA and EVM
functions is preserved after they become LSMs. Since the order of lsm_info
structures in the .lsm_info.init section depends on the order in which the
object files containing those structures are passed to the linker of the
kernel image, and since IMA comes before EVM in the Makefile, that is
sufficient to ensure that IMA functions are executed before the EVM ones.
The patch set is organized as follows.
Patches 1-9 make IMA and EVM functions suitable to be registered to the LSM
infrastructure, by aligning function parameters.
Patches 10-18 add new LSM hooks in the same places where IMA and EVM
functions are called, if there is no LSM hook already.
Patch 19 moves integrity_kernel_module_request() to IMA, as a prerequisite
for removing the 'integrity' LSM.
Patches 20-22 introduce the new standalone LSMs 'ima' and 'evm', and move
hardcoded calls to IMA, EVM and integrity functions to those LSMs.
Patches 23-24 remove the dependency on the 'integrity' LSM by splitting
integrity metadata, so that the 'ima' and 'evm' LSMs can use their own.
They also duplicate iint_lockdep_annotate() in ima_main.c, since the mutex
field was moved from integrity_iint_cache to ima_iint_cache.
Patch 25 finally removes the 'integrity' LSM, since 'ima' and 'evm' are now
self-contained and independent.
The patch set applies on top of lsm/dev, commit f1bb47a31dff ("lsm: new
security_file_ioctl_compat() hook"), with linux-integrity/next-integrity at
commit c00f94b3a5be ("overlay: disable EVM") merged in.
Changelog:
v8:
- Restore dynamic allocation of IMA integrity metadata, and store only the
pointer in the inode security blob
- Select SECURITY_PATH both in IMA and EVM
- Rename evm_file_free() to evm_file_release()
- Unconditionally register evm_post_path_mknod()
- Introduce the new ima_iint.c file for the management of IMA integrity
metadata
- Introduce ima_inode_set_iint()/ima_inode_get_iint() in ima.h to
respectively store/retrieve the IMA integrity metadata pointer
- Replace ima_iint_inode() with ima_inode_get() and ima_iint_find(), with
the same behavior as integrity_inode_get() and integrity_iint_find()
- Initialize the ima_iint_cache in ima_iintcache_init() and call it from
init_ima_lsm()
- Move integrity_kernel_module_request() to IMA in a separate patch
(suggested by Mimi)
- Compile ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS
is enabled
- Remove ima_inode_alloc_security() and ima_inode_free_security(), since
the IMA integrity metadata is not fully embedded in the inode security
blob
- Fix the missed initialization of ima_iint_cache in
process_measurement() and __ima_inode_hash()
- Add a sentence in 'evm: Move to LSM infrastructure' to mention moving
evm_inode_remove_acl(), evm_inode_post_remove_acl() and
evm_inode_post_set_acl() to evm_main.c
- Add a sentence in 'ima: Move IMA-Appraisal to LSM infrastructure' to
mention moving ima_inode_remove_acl() to ima_appraise.c
v7:
- Use return instead of goto in __vfs_removexattr_locked() (suggested by
Casey)
- Clarify in security/integrity/Makefile that the order of 'ima' and 'evm'
LSMs depends on the order in which IMA and EVM are compiled
- Move integrity_iint_cache flags to ima.h and evm.h in security/ and
duplicate IMA_NEW_FILE to EVM_NEW_FILE
- Rename evm_inode_get_iint() to evm_iint_inode() and ima_inode_get_iint()
to ima_iint_inode(), check if inode->i_security is NULL, and just return
the pointer from the inode security blob
- Restore the non-NULL checks after ima_iint_inode() and evm_iint_inode()
(suggested by Casey)
- Introduce evm_file_free() to clear EVM_NEW_FILE
- Remove comment about LSM_ORDER_LAST not guaranteeing the order of 'ima'
and 'evm' LSMs
- Lock iint->mutex before reading the IMA_COLLECTED flag in __ima_inode_hash()
and restore the ima_policy_flag check
- Remove patch about the hardcoded ordering of 'ima' and 'evm' LSMs in
security.c
- Add missing ima_inode_free_security() to free iint->ima_hash
- Add the cases for LSM_ID_IMA and LSM_ID_EVM in lsm_list_modules_test.c
- Mention the change in the IMA and EVM post functions for private
inodes
v6:
- See v7
v5:
- Rename security_file_pre_free() to security_file_release() and the LSM
hook file_pre_free_security to file_release (suggested by Paul)
- Move integrity_kernel_module_request() to ima_main.c (renamed to
ima_kernel_module_request())
- Split the integrity_iint_cache structure into ima_iint_cache and
evm_iint_cache, so that IMA and EVM can use disjoint metadata and
reserve space with the LSM infrastructure
- Reserve space for the entire ima_iint_cache and evm_iint_cache
structures, not just the pointer (suggested by Paul)
- Introduce ima_inode_get_iint() and evm_inode_get_iint() to retrieve
respectively the ima_iint_cache and evm_iint_cache structure from the
security blob
- Remove the various non-NULL checks for the ima_iint_cache and
evm_iint_cache structures, since the LSM infrastructure ensures that they
always exist
- Remove the iint parameter from evm_verifyxattr() since IMA and EVM
use disjoint integrity metadata
- Introduce the evm_post_path_mknod() to set the IMA_NEW_FILE flag
- Register the inode_alloc_security LSM hook in IMA and EVM to
initialize the respective integrity metadata structures
- Remove the 'integrity' LSM completely and instead make 'ima' and 'evm'
proper standalone LSMs
- Add the inode parameter to ima_get_verity_digest(), since the inode
field is not present in ima_iint_cache
- Move iint_lockdep_annotate() to ima_main.c (renamed to
ima_iint_lockdep_annotate())
- Remove ima_get_lsm_id() and evm_get_lsm_id(), since IMA and EVM directly
register the needed LSM hooks
- Enforce 'ima' and 'evm' LSM ordering at LSM infrastructure level
v4:
- Improve short and long description of
security_inode_post_create_tmpfile(), security_inode_post_set_acl(),
security_inode_post_remove_acl() and security_file_post_open()
(suggested by Mimi)
- Improve commit message of 'ima: Move to LSM infrastructure' (suggested
by Mimi)
v3:
- Drop 'ima: Align ima_post_path_mknod() definition with LSM
infrastructure' and 'ima: Align ima_post_create_tmpfile() definition
with LSM infrastructure', define the new LSM hooks with the same
IMA parameters instead (suggested by Mimi)
- Do IS_PRIVATE() check in security_path_post_mknod() and
security_inode_post_create_tmpfile() on the new inode rather than the
parent directory (in the post method it is available)
- Don't export ima_file_check() (suggested by Stefan)
- Remove redundant check of file mode in ima_post_path_mknod() (suggested
by Mimi)
- Mention that ima_post_path_mknod() is now conditionally invoked when
CONFIG_SECURITY_PATH=y (suggested by Mimi)
- Mention when an LSM hook will be introduced in the IMA/EVM alignment
patches (suggested by Mimi)
- Simplify the commit messages when introducing a new LSM hook
- Still keep the 'extern' in the function declaration, until the
declaration is removed (suggested by Mimi)
- Improve documentation of security_file_pre_free()
- Register 'ima' and 'evm' as standalone LSMs (suggested by Paul)
- Initialize the 'ima' and 'evm' LSMs from 'integrity', to keep the
original ordering of IMA and EVM functions as when they were hardcoded
- Return the IMA and EVM LSM IDs to 'integrity' for registration of the
integrity-specific hooks
- Reserve an xattr slot from the 'evm' LSM instead of 'integrity'
- Pass the LSM ID to init_ima_appraise_lsm()
v2:
- Add description for newly introduced LSM hooks (suggested by Casey)
- Clarify in the description of security_file_pre_free() that actions can
be performed while the file is still open
v1:
- Drop 'evm: Complete description of evm_inode_setattr()', 'fs: Fix
description of vfs_tmpfile()' and 'security: Introduce LSM_ORDER_LAST',
they were sent separately (suggested by Christian Brauner)
- Replace dentry with file descriptor parameter for
security_inode_post_create_tmpfile()
- Introduce mode_stripped and pass it as mode argument to
security_path_mknod() and security_path_post_mknod()
- Use goto in do_mknodat() and __vfs_removexattr_locked() (suggested by
Mimi)
- Replace __lsm_ro_after_init with __ro_after_init
- Modify short description of security_inode_post_create_tmpfile() and
security_inode_post_set_acl() (suggested by Stefan)
- Move security_inode_post_setattr() just after security_inode_setattr()
(suggested by Mimi)
- Modify short description of security_key_post_create_or_update()
(suggested by Mimi)
- Add back exported functions ima_file_check() and
evm_inode_init_security() respectively to ima.h and evm.h (reported by
kernel robot)
- Remove extern from prototype declarations and fix style issues
- Remove unnecessary include of linux/lsm_hooks.h in ima_main.c and
ima_appraise.c
Roberto Sassu (25):
ima: Align ima_inode_post_setattr() definition with LSM infrastructure
ima: Align ima_file_mprotect() definition with LSM infrastructure
ima: Align ima_inode_setxattr() definition with LSM infrastructure
ima: Align ima_inode_removexattr() definition with LSM infrastructure
ima: Align ima_post_read_file() definition with LSM infrastructure
evm: Align evm_inode_post_setattr() definition with LSM infrastructure
evm: Align evm_inode_setxattr() definition with LSM infrastructure
evm: Align evm_inode_post_setxattr() definition with LSM
infrastructure
security: Align inode_setattr hook definition with EVM
security: Introduce inode_post_setattr hook
security: Introduce inode_post_removexattr hook
security: Introduce file_post_open hook
security: Introduce file_release hook
security: Introduce path_post_mknod hook
security: Introduce inode_post_create_tmpfile hook
security: Introduce inode_post_set_acl hook
security: Introduce inode_post_remove_acl hook
security: Introduce key_post_create_or_update hook
integrity: Move integrity_kernel_module_request() to IMA
ima: Move to LSM infrastructure
ima: Move IMA-Appraisal to LSM infrastructure
evm: Move to LSM infrastructure
evm: Make it independent from 'integrity' LSM
ima: Make it independent from 'integrity' LSM
integrity: Remove LSM
fs/attr.c | 5 +-
fs/file_table.c | 3 +-
fs/namei.c | 12 +-
fs/nfsd/vfs.c | 3 +-
fs/open.c | 1 -
fs/posix_acl.c | 5 +-
fs/xattr.c | 9 +-
include/linux/evm.h | 117 +-------
include/linux/ima.h | 142 ----------
include/linux/integrity.h | 27 --
include/linux/lsm_hook_defs.h | 20 +-
include/linux/security.h | 59 ++++
include/uapi/linux/lsm.h | 2 +
security/integrity/Makefile | 1 +
security/integrity/digsig_asymmetric.c | 23 --
security/integrity/evm/Kconfig | 1 +
security/integrity/evm/evm.h | 19 ++
security/integrity/evm/evm_crypto.c | 4 +-
security/integrity/evm/evm_main.c | 195 ++++++++++---
security/integrity/iint.c | 197 +------------
security/integrity/ima/Kconfig | 1 +
security/integrity/ima/Makefile | 2 +-
security/integrity/ima/ima.h | 148 ++++++++--
security/integrity/ima/ima_api.c | 23 +-
security/integrity/ima/ima_appraise.c | 66 +++--
security/integrity/ima/ima_iint.c | 142 ++++++++++
security/integrity/ima/ima_init.c | 2 +-
security/integrity/ima/ima_main.c | 144 +++++++---
security/integrity/ima/ima_policy.c | 2 +-
security/integrity/integrity.h | 80 +-----
security/keys/key.c | 10 +-
security/security.c | 263 +++++++++++-------
security/selinux/hooks.c | 3 +-
security/smack/smack_lsm.c | 4 +-
.../selftests/lsm/lsm_list_modules_test.c | 6 +
35 files changed, 902 insertions(+), 839 deletions(-)
create mode 100644 security/integrity/ima/ima_iint.c
--
2.34.1
Calling get_system_loc_code() before checking devfd and errno fails the test
when the device is not available, where a SKIP is expected.
Reorder the 'SKIP_IF_MSG' checks so that the test correctly SKIPs when the
/dev/papr-vpd device is not available.
without patch: Test FAILED on line 271
with patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present
Signed-off-by: R Nageswara Sastry <rnsastry(a)linux.ibm.com>
---
tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
index 98cbb9109ee6..505294da1b9f 100644
--- a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
+++ b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
@@ -263,10 +263,10 @@ static int papr_vpd_system_loc_code(void)
off_t size;
int fd;
- SKIP_IF_MSG(get_system_loc_code(&lc),
- "Cannot determine system location code");
SKIP_IF_MSG(devfd < 0 && errno == ENOENT,
DEVPATH " not present");
+ SKIP_IF_MSG(get_system_loc_code(&lc),
+ "Cannot determine system location code");
FAIL_IF(devfd < 0);
--
2.37.1 (Apple Git-137.1)
The kernel has recently added support for shadow stacks, currently on x86
only using its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively); I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and makes it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches had been in development since before
clone3() was implemented.
Since clone3() is readily extensible, let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). The user must provide a shadow stack
address and size; this must point to memory mapped for use as a shadow
stack by map_shadow_stack(), with a shadow stack token at the top of the
stack.
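A rough usage sketch from the userspace side (the clone_args field names
shadow_stack/shadow_stack_size are assumptions based on this cover letter and
may not match the final ABI; __NR_map_shadow_stack and SHADOW_STACK_SET_TOKEN
come from the existing x86 shadow stack support):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/sched.h>       /* struct clone_args, CLONE_* */
  #include <asm/mman.h>          /* SHADOW_STACK_SET_TOKEN (x86) */

  static long clone3_with_shadow_stack(void *stack, size_t stack_size)
  {
          size_t ss_size = 4096;
          /* Map a shadow stack and have the kernel place a token at its
           * top so that clone3() can consume it. */
          uint64_t ss = syscall(__NR_map_shadow_stack, 0, ss_size,
                                SHADOW_STACK_SET_TOKEN);

          struct clone_args args = {
                  .flags             = CLONE_VM,
                  .stack             = (uint64_t)(uintptr_t)stack,
                  .stack_size        = stack_size,
                  .shadow_stack      = ss,        /* assumed field name */
                  .shadow_stack_size = ss_size,   /* assumed field name */
          };

          return syscall(__NR_clone3, &args, sizeof(args));
  }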
Please note that the x86 portions of this code are build tested only; I
don't appear to have a CET-capable system available to me, so I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
Please further note that the token consumption done by clone3() is not
currently implemented in an atomic fashion; Rick indicated that he would
look into fixing this if people are OK with the implementation.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series, and since we need to detect whether shadow stacks are supported
it seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (7):
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 41 +++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 91 +++++++---
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 2 +
include/uapi/linux/sched.h | 13 +-
kernel/fork.c | 61 +++++--
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 211 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 8 +
tools/testing/selftests/ksft_shstk.h | 63 +++++++
15 files changed, 430 insertions(+), 85 deletions(-)
---
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi Linus,
Please pull the following KUnit fixes update for Linux 6.8-rc5.
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister kunit_bus when KUnit module is unloaded. Not doing
so causes an error when KUnit module tries to re-register the bus
when it gets reloaded.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 1a9f2c776d1416c4ea6cb0d0b9917778c41a1a7d:
Documentation: KUnit: Update the instructions on how to test static functions (2024-01-22 07:59:03 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-fixes-6.8-rc5
for you to fetch changes up to 829388b725f8d266ccec32a2f446717d8693eaba:
kunit: device: Unregister the kunit_bus on shutdown (2024-02-06 17:07:37 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-fixes-6.8-rc5
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister kunit_bus when KUnit module is unloaded. Not doing
so causes an error when KUnit module tries to re-register the bus
when it gets reloaded.
----------------------------------------------------------------
David Gow (1):
kunit: device: Unregister the kunit_bus on shutdown
lib/kunit/device-impl.h | 2 ++
lib/kunit/device.c | 14 ++++++++++++++
lib/kunit/test.c | 3 +++
3 files changed, 19 insertions(+)
----------------------------------------------------------------
[Sending this as an RFC because I'm pretty sure I'm not doing things
correctly at the BPF level.]
[Also using bpf-next as the base tree as there will be conflicting
changes otherwise]
Ideally I'd like to have something similar to bpf_timers, but not
in soft IRQ context. So I'm emulating this with a sleepable
bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
workqueue").
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out-of-proximity event, but the device cannot
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event into the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example, a device might send us an event saying that it came back
from a sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raising a bpf event, and letting that userspace program inject the events and
commands.
However, we are keeping that program alive as a daemon just for
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup. So a kernel workqueue seems simpler to handle.
Besides the "this is insane to reimplement the wheel", the other part I'm
not sure is whether we can say that BPF maps of type prog_array and of
type queue/stack can be used in sleepable context.
I don't see any warning when running the test programs, but that's probably
not a guarantee I'm doing the things properly :)
Cheers,
Benjamin
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Benjamin Tissoires (9):
bpf: allow more maps in sleepable bpf programs
HID: bpf/dispatch: regroup kfuncs definitions
HID: bpf: export hid_hw_output_report as a BPF kfunc
selftests/hid: Add test for hid_bpf_hw_output_report
HID: bpf: allow to inject HID event from BPF
selftests/hid: add tests for hid_bpf_input_report
HID: bpf: allow to defer work in a delayed workqueue
selftests/hid: add test for hid_bpf_schedule_delayed_work
selftests/hid: add another set of delayed work tests
Documentation/hid/hid-bpf.rst | 26 +-
drivers/hid/bpf/entrypoints/entrypoints.bpf.c | 8 +
drivers/hid/bpf/entrypoints/entrypoints.lskel.h | 236 +++++++++------
drivers/hid/bpf/hid_bpf_dispatch.c | 315 ++++++++++++++++-----
drivers/hid/bpf/hid_bpf_jmp_table.c | 34 ++-
drivers/hid/hid-core.c | 2 +
include/linux/hid_bpf.h | 10 +
kernel/bpf/verifier.c | 3 +
tools/testing/selftests/hid/hid_bpf.c | 286 ++++++++++++++++++-
tools/testing/selftests/hid/progs/hid.c | 205 ++++++++++++++
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 8 +
11 files changed, 962 insertions(+), 171 deletions(-)
---
base-commit: 4f7a05917237b006ceae760507b3d15305769ade
change-id: 20240205-hid-bpf-sleepable-c01260fd91c4
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
From: Paul Durrant <pdurrant(a)amazon.com>
This series has one small fix to what was in v11 [1]:
* KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
The v11 patch failed to set the return code of the ioctl if the mode
was not actually changed, leading to a spurious failure.
This version of the series also contains a new bug-fix to the pfncache
code from David Woodhouse.
[1] https://lore.kernel.org/kvm/20231219161109.1318-1-paul@xen.org/
David Woodhouse (1):
KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
Paul Durrant (19):
KVM: pfncache: Add a map helper function
KVM: pfncache: remove unnecessary exports
KVM: xen: mark guest pages dirty with the pfncache lock held
KVM: pfncache: add a mark-dirty helper
KVM: pfncache: remove KVM_GUEST_USES_PFN usage
KVM: pfncache: stop open-coding offset_in_page()
KVM: pfncache: include page offset in uhva and use it consistently
KVM: pfncache: allow a cache to be activated with a fixed (userspace)
HVA
KVM: xen: separate initialization of shared_info cache and content
KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
KVM: xen: allow shared_info to be mapped by fixed HVA
KVM: xen: allow vcpu_info to be mapped by fixed HVA
KVM: selftests / xen: map shared_info using HVA rather than GFN
KVM: selftests / xen: re-map vcpu_info using HVA rather than GPA
KVM: xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
KVM: xen: split up kvm_xen_set_evtchn_fast()
KVM: xen: don't block on pfncache locks in kvm_xen_set_evtchn_fast()
KVM: pfncache: check the need for invalidation under read lock first
KVM: xen: allow vcpu_info content to be 'safely' copied
Documentation/virt/kvm/api.rst | 53 ++-
arch/x86/kvm/x86.c | 7 +-
arch/x86/kvm/xen.c | 356 +++++++++++-------
include/linux/kvm_host.h | 40 +-
include/linux/kvm_types.h | 8 -
include/uapi/linux/kvm.h | 9 +-
.../selftests/kvm/x86_64/xen_shinfo_test.c | 59 ++-
virt/kvm/pfncache.c | 340 +++++++++--------
8 files changed, 535 insertions(+), 337 deletions(-)
base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.39.2
Major changes in RFC v5:
------------------------
1. Rebased on top of 'Abstract page from net stack' series and used the
new netmem type to refer to LSB set pointers instead of re-using
struct page.
2. Downgraded this series back to RFC and called it RFC v5. This is
because this series is now dependent on 'Abstract page from net
stack'[1] and the queue API. Both have been removed from this series
and turned into prerequisite work.
3. Reworked the page_pool devmem support to use netmem and for somewhat
more unified handling.
4. Reworked the reference counting of net_iov (renamed from
page_pool_iov) to use pp_ref_count for refcounting.
The full changes including the dependent series and GVE page pool
support is here:
https://github.com/mina/linux/commits/tcpdevmem-rfcv5/
[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774
Major changes in v1:
--------------------
1. Implemented MVP queue API ndos to remove the userspace-visible
driver reset.
2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
3. Removed RFC tag.
Many smaller addressed comments across all the patches (patches have
individual change log).
Full tree including the rest of the GVE driver changes:
https://github.com/mina/linux/commits/tcpdevmem-v1
Changes in RFC v3:
------------------
1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the
series reviewable and mergeable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2[2] was the device reset required to refill
the device rx-queues after the dmabuf bind/unbind. The suggested
solution, as I understand it, is a subset of the per-queue management
ops Jakub proposed, or something similar:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to
using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my
test bed with GVE. My prototype to test this ran into issues with the
rx-queues not coming back up properly if reset individually. At the
moment I'm unsure if it's a mistake in the POC or a genuine issue in
the virtualization stack behind GVE, which currently doesn't test
individual rx-queue restart.
3. Our use cases are not bothered by requiring a device reset to refill
the buffer queues, and we'd like to support NICs that run into this
limitation when resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can
use the support in this series, while drivers that support new netdev
ops to reset individual queues can automatically reset the queue as
part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration
with other sticking points addressed.
This proposal includes only the rx devmem path, proposed for merge. For a
snapshot of my entire tree which includes the GVE POC page pool support &
device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.…
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4…
Changes in RFC v2:
------------------
The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replaced the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c…
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX…
----------------------
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such
transfers. Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time; improving data transfer throughput & efficiency can help
improve GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data to
and from remote SSDs; much of this data does not require host processing.
Today, the majority of Device-to-Device data transfers over the network are
implemented as the following low-level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.
This implementation is suboptimal, especially for bulk data transfers, and can
put significant strain on system resources, such as host memory bandwidth and
PCIe bandwidth. One important reason behind the current state is the
kernel's lack of semantics to express device-to-network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to the traditional path which sends data through the
root complex.
* Patch overview:
** Part 1: netlink API
Gives the user the ability to bind a dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and page pool) operates on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
** part 3: page pool support
We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
** part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non-devmem TCP traffic elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this series and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: David Wei <dw(a)davidwei.uk>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Yunsheng Lin <linyunsheng(a)huawei.com>
Cc: Shailend Chand <shailend(a)google.com>
Cc: Harshitha Ramamurthy <hramamurthy(a)google.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Jeroen de Borst <jeroendb(a)google.com>
Cc: Praveen Kaligineedi <pkaligineedi(a)google.com>
Jakub Kicinski (1):
net: page_pool: create hooks for custom page providers
Mina Almasry (13):
net: page_pool: factor out page_pool recycle check
net: netdev netlink api to bind dma-buf to a net device
netdev: support binding dma-buf to netdevice
netdev: netdevice devmem allocator
page_pool: convert to use netmem
page_pool: devmem support
memory-provider: dmabuf devmem memory provider
net: support non paged skb frags
net: add support for skbs with unreadable frags
tcp: RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
net: add devmem TCP documentation
selftests: add ncdevmem, netcat for devmem TCP
Documentation/netlink/specs/netdev.yaml | 52 +++
Documentation/networking/devmem.rst | 271 +++++++++++++
Documentation/networking/index.rst | 1 +
arch/alpha/include/uapi/asm/socket.h | 8 +-
arch/mips/include/uapi/asm/socket.h | 6 +
arch/parisc/include/uapi/asm/socket.h | 6 +
arch/sparc/include/uapi/asm/socket.h | 6 +
include/linux/skbuff.h | 68 +++-
include/linux/socket.h | 1 +
include/net/devmem.h | 127 ++++++
include/net/netdev_rx_queue.h | 1 +
include/net/netmem.h | 223 ++++++++++-
include/net/page_pool/helpers.h | 117 ++++--
include/net/page_pool/types.h | 27 +-
include/net/sock.h | 2 +
include/net/tcp.h | 5 +-
include/trace/events/page_pool.h | 28 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/netdev.h | 19 +
include/uapi/linux/uio.h | 14 +
net/bpf/test_run.c | 5 +-
net/core/datagram.c | 6 +
net/core/dev.c | 314 ++++++++++++++-
net/core/gro.c | 7 +-
net/core/netdev-genl-gen.c | 19 +
net/core/netdev-genl-gen.h | 2 +
net/core/netdev-genl.c | 123 ++++++
net/core/page_pool.c | 419 +++++++++++++-------
net/core/skbuff.c | 92 ++++-
net/core/sock.c | 45 +++
net/ipv4/tcp.c | 196 +++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 9 +
net/ipv4/tcp_output.c | 5 +-
net/packet/af_packet.c | 4 +-
tools/include/uapi/linux/netdev.h | 19 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 5 +
tools/testing/selftests/net/ncdevmem.c | 489 ++++++++++++++++++++++++
39 files changed, 2527 insertions(+), 234 deletions(-)
create mode 100644 Documentation/networking/devmem.rst
create mode 100644 include/net/devmem.h
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.43.0.472.g3155946c3a-goog
From: Jeff Xu <jeffxu(a)chromium.org>
This is the V9 version. It removes MAP_SEALABLE and PROT_SEAL
from mmap(), adds a performance benchmark, and adds a test
to demo the sealing of the read-only memory segments of an ELF mapping.
------------------------------------------------------------------
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(); these can leave an empty space, which can
therefore be filled by a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
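As a rough usage sketch (not part of the patch set itself; it assumes a
kernel with this series applied and headers that already define
__NR_mseal, since there is no libc wrapper):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        /* flags is reserved and must be 0 */
        if (syscall(__NR_mseal, p, len, 0))
                return 1;
        /* Both calls below are expected to fail with EPERM now. */
        if (mprotect(p, len, PROT_READ | PROT_WRITE))
                printf("mprotect blocked: %s\n", strerror(errno));
        if (munmap(p, len))
                printf("munmap blocked: %s\n", strerror(errno));
        return 0;
}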
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) paths to
check the VMAs' sealing flag, so that no partial update can be made
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests were developed
[8].
The first measures the time taken by a particular system call,
using clock_gettime(CLOCK_MONOTONIC). The second uses
PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space). Both tests show
similar results.
The tests roughly follow the sequence below:
for (i = 0; i < 1000; i++)
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
calculate over all samples.
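For illustration, the time-based sampling boils down to something like
the snippet below (simplified; the real harness lives in [8] and takes
care to keep the 1000 mappings as separate VMAs, which this sketch
glosses over, and error handling is omitted):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_VMAS 1000

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        void *vmas[NR_VMAS];
        struct timespec t0, t1;
        long delta_ns;
        int i;

        for (i = 0; i < NR_VMAS; i++)
                vmas[i] = mmap(NULL, page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NR_VMAS; i++)
                mprotect(vmas[i], page, PROT_READ);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        delta_ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                   (t1.tv_nsec - t0.tv_nsec);
        printf("%ld ns per mprotect()\n", delta_ns / NR_VMAS);

        for (i = 0; i < NR_VMAS; i++)
                munmap(vmas[i], page);
        return 0;
}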
The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the results, for the 6.8 kernel, the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to the 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles; there is even a decrease in some cases.
It might be interesting to compare the 5.10 and 6.8 kernels.
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 cpu cycles, per vma.
mprotect: added 200-750 ns in time, or 500-1200 cpu cycles, per vma.
madvise: added 33-50 ns in time, or 70-110 cpu cycles, per vma.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten
times greater for munmap and mprotect.
When I discussed the mm performance with Brian Makin, an engineer who works
on performance, it was brought to my attention that such performance
benchmarks, which measure millions of mm syscalls in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also, this was tested on a single piece of hardware with ChromeOS;
the data from other hardware or distributions might be different. It might
be best to take this data with a grain of salt.
Change history:
===============
V9:
- remove mmap(PROT_SEAL) and mmap(MAP_SEALABLE) (Linus, Theo de Raadt)
- Update mseal_test to check for prot bit (Liam R. Howlett)
- Update documentation to give more detail on sealing check (Liam R. Howlett)
- Add seal_elf test.
- Add performance measure data.
- mseal_test: fix arm build.
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
https://lore.kernel.org/lkml/20240131175027.3287009-1-jeffxu@chromium.org/
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
[8] https://github.com/peaktocreek/mmperf
Jeff Xu (5):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
selftest mm/mseal read-only elf memory segment
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 199 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 37 +
mm/madvise.c | 12 +
mm/mmap.c | 31 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 307 ++++
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 2 +
tools/testing/selftests/mm/mseal_test.c | 1836 +++++++++++++++++++
tools/testing/selftests/mm/seal_elf.c | 183 ++
33 files changed, 2678 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
create mode 100644 tools/testing/selftests/mm/seal_elf.c
--
2.43.0.687.g38aa6559b0-goog
When opening yama/ptrace_scope we unconditionally use sudo to ensure we
are running as root, resulting in failures if running in a minimal root
filesystem where sudo is not installed. Since automated test systems will
typically just run all of kselftest as root (and many kselftests rely on
this for full functionality), add a check to see if we're already root and
only invoke sudo if not.
Since I am unclear on what the intended effect of the command being run is,
I have not added any error handling for the case where we fail to obtain
root.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index fe140a9f4f9d..c8ca830dba93 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,17 @@ run_test() {
echo "TAP version 13" | tap_output
+HAVE_ROOT=0
+if [ "$(id -u)" = "0" ]; then
+ AS_ROOT=
+ HAVE_ROOT=1
+elif [ "$(command -v sudo)" != "" ]; then
+ AS_ROOT=sudo
+ HAVE_ROOT=1
+else
+ echo "# WARNING: Unable to run as root"
+fi
+
CATEGORY="hugetlb" run_test ./hugepage-mmap
shmmax=$(cat /proc/sys/kernel/shmmax)
@@ -363,7 +374,8 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
-(echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
+# FIXME: What if we can't get root?
+(echo 0 | ${AS_ROOT} tee /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
CATEGORY="memfd_secret" run_test ./memfd_secret
# KSM KSM_MERGE_TIME_HUGE_PAGES test with size of 100
---
base-commit: 445a555e0623387fa9b94e68e61681717e70200a
change-id: 20240209-kselftest-mm-check-deps-01a825e5fed4
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
The new version includes a missing entry for the .gitignore in damon,
which was reported by Bernd Edlinger <bernd.edlinger(a)hotmail.de>, who
also proposed a patch for it as well as for other missing .gitignore
files covered by v1. Bernd has been added to the corresponding patch as
the reporter. If a different tag is desired, I am fine with it.
To: Shuah Khan <shuah(a)kernel.org>
To: SeongJae Park <sj(a)kernel.org>
To: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: damon(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Changes in v3:
- General: base on mm-unstable to avoid conflicts.
- damon: add missing Closes tag.
- Link to v2: https://lore.kernel.org/r/20240212-selftest_gitignore-v2-0-75f0de50a178@gma…
Changes in v2:
- Remove patch for netfilter (not relevant anymore).
- Add patch for damon (missing binary in .gitignore).
- Link to v1: https://lore.kernel.org/r/20240101-selftest_gitignore-v1-0-eb61b09adb05@gma…
---
Javier Carrasco (4):
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
selftests: damon: add access_memory to .gitignore
tools/testing/selftests/damon/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
The mentioned test is still flaky, unusually enough in 'fast'
environments.
Patch 2/2 [tries to] address the existing issues, while patch 1/2
introduces more strict checks for the existing net helpers, to hopefully
prevent future pain.
Paolo Abeni (2):
selftests: net: more strict check in net_helper
selftests: net: more pmtu.sh fixes
tools/testing/selftests/net/net_helper.sh | 11 +++++++----
tools/testing/selftests/net/pmtu.sh | 4 ++--
2 files changed, 9 insertions(+), 6 deletions(-)
--
2.43.0
The gro self-test sends the packets to be aggregated with
multiple write operations.
When running in a slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue by explicitly ignoring failures for such a case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
v2 -> v3:
- dump the error suppression message only on actual errors (Jakub)
v1 -> v2:
- replace the '-a' operator with '&&' - Mattbe
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
---
tools/testing/selftests/net/gro.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..02c21ff4ca81 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,11 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [[ ${test} == "large" && -n "${KSFT_MACHINE_SLOW}" && \
+ ${exit_code} -ne 0 ]]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
Hi Christian, Janosch, Heiko,
Here is a pair of patches to replace the RFC I recently sent [1].
Patch 1 performs the host/guest access register swap that Christian
suggested (instead of a full sync_reg/store_reg process).
Patch 2 provides a selftest patch that exercises this scenario.
Applying patch 2 without patch 1 fails in the following way:
[eric@host linux]# ./tools/testing/selftests/kvm/s390x/memop
TAP version 13
1..16
ok 1 simple copy
ok 2 generic error checks
ok 3 copy with storage keys
ok 4 cmpxchg with storage keys
ok 5 concurrently cmpxchg with storage keys
ok 6 copy with key storage protection override
ok 7 copy with key fetch protection
ok 8 copy with key fetch protection override
==== Test Assertion Failure ====
s390x/memop.c:186: !r
pid=5720 tid=5720 errno=4 - Interrupted system call
1 0x00000000010042af: memop_ioctl at memop.c:186 (discriminator 3)
2 0x0000000001006697: test_copy_access_register at memop.c:400 (discriminator 2)
3 0x0000000001002aaf: main at memop.c:1181
4 0x000003ffaec33a5b: ?? ??:0
5 0x000003ffaec33b5d: ?? ??:0
6 0x0000000001002ba9: _start at ??:?
KVM_S390_MEM_OP failed, rc: 40 errno: 4 (Interrupted system call)
Thoughts on this approach?
Thanks,
Eric
[1] https://lore.kernel.org/r/20240131205832.2179029-1-farman@linux.ibm.com/
Eric Farman (2):
KVM: s390: load guest access registers in MEM_OP ioctl
KVM: s390: selftests: memop: add a simple AR test
arch/s390/kvm/kvm-s390.c | 6 +++++
tools/testing/selftests/kvm/s390x/memop.c | 28 +++++++++++++++++++++++
2 files changed, 34 insertions(+)
--
2.40.1
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folios support any order and people would like to support flexible orders
for anonymous folios[1] too. Currently, split_huge_page() only splits a huge
page to order-0 pages, but splitting to orders higher than 0 is also useful.
This patchset adds support for splitting a huge page to any lower-order pages
and uses it during file folio truncate operations.
The patchset is on top of mm-everything-2023-03-27-21-20.
Changelog
===
Since v2
---
1. Fixed an issue in __split_page_owner() introduced during my rebase
Since v1
---
1. Changed split_page_memcg() and split_page_owner() parameter to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patches 3 and 4 add a new_order parameter to split_page_memcg() and
split_page_owner() and prepare for upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
to any lower order. The original split_huge_page_to_list() calls
split_huge_page_to_list_to_order() with new_order = 0 (see the sketch
after this list).
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
truncation instead of splitting the large folio all the way down to order-0.
* Patch 7 adds a test API to debugfs and test cases in
split_huge_page_test selftests.
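To make the relationship between the old and new entry points concrete,
the result looks roughly like the sketch below (signatures are
paraphrased, not copied from the patch; see patch 5 for the real
declarations in include/linux/huge_mm.h):

int split_huge_page_to_list_to_order(struct page *page,
                                     struct list_head *list,
                                     unsigned int new_order);

/* The existing helpers keep today's split-to-order-0 behaviour. */
static inline int split_huge_page_to_list(struct page *page,
                                          struct list_head *list)
{
        return split_huge_page_to_list_to_order(page, list, 0);
}

static inline int split_huge_page(struct page *page)
{
        return split_huge_page_to_list(page, NULL);
}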
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages.
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 10 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 137 ++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 8 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 225 +++++++++++++++++-
9 files changed, 365 insertions(+), 68 deletions(-)
--
2.39.2
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
The new version includes a missing entry for the .gitignore in damon,
which was reported by Bernd Edlinger <bernd.edlinger(a)hotmail.de>, who
also proposed a patch for it as well as for other missing .gitignore
files covered by v1. Bernd has been added to the corresponding patch as
the reporter. If a different tag is desired, I am fine with it.
To: Shuah Khan <shuah(a)kernel.org>
To: SeongJae Park <sj(a)kernel.org>
To: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: damon(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Changes in v2:
- Remove patch for netfilter (not relevant anymore).
- Add patch for damon (missing binary in .gitignore).
- Link to v1: https://lore.kernel.org/r/20240101-selftest_gitignore-v1-0-eb61b09adb05@gma…
---
Javier Carrasco (4):
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
selftests: damon: add access_memory to .gitignore
tools/testing/selftests/damon/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 716f4aaa7b48a55c73d632d0657b35342b1fefd7
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that validates whether this feature
works correctly. The selftest needs to verify whether writing non-contiguous
CBMs to the schemata file behaves as expected, compared against the
reported information about non-contiguous CBM support.
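For readers unfamiliar with the terminology: a capacity bitmask (CBM) is
contiguous when its set bits form a single run (e.g. 0x0ff0) and
non-contiguous when the run has holes (e.g. 0x0f0f). A standalone
illustration of such a check (not code taken from this series):

#include <stdbool.h>
#include <stdio.h>

static bool cbm_is_contiguous(unsigned long cbm)
{
        if (!cbm)
                return false;
        /*
         * Adding the lowest set bit to a single run of ones clears the
         * whole run; any bit left over means there was a hole.
         */
        return (cbm & (cbm + (cbm & -cbm))) == 0;
}

int main(void)
{
        printf("0x0ff0 contiguous: %d\n", cbm_is_contiguous(0x0ff0));
        printf("0x0f0f contiguous: %d\n", cbm_is_contiguous(0x0f0f));
        return 0;
}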
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements functionality similar
to that presented in the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v5:
- Add a few reviewed-by tags.
- Make some minor text changes in patch messages and comments.
- Redo resctrl_mon_feature_exists() to be more generic and fix some of
my errors in refactoring feature checking.
Changelog v4:
- Changes to error failure return values in non-contiguous test.
- Some minor text refactoring without functional changes.
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1706180726.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1707130307.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add a helper for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 2 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 10 +-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 94 +++++++++++++++++--
7 files changed, 194 insertions(+), 22 deletions(-)
--
2.43.0
The title underline is too short.
Signed-off-by: Meng Li <li.meng(a)amd.com>
---
Documentation/admin-guide/pm/amd-pstate.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index 0a3aa6b8ffd5..1e0d101b020a 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -381,7 +381,7 @@ driver receives a message with the highest performance change, it will
update the core ranking and set the cpu's priority.
``amd-pstate`` Preferred Core Switch
-=================================
+=====================================
Kernel Parameters
-----------------
--
2.34.1
The gro self-test sends the packets to be aggregated with
multiple write operations.
When running in a slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue by explicitly ignoring failures for such a case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
v1 -> v2:
- replace the '-a' operator with '&&' - Mattbe
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
I experimented with a largish refactor replacing the multiple writes
with a single GSO packet, but I exhausted my time budget before reaching
any good result.
---
tools/testing/selftests/net/gro.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..3190b41e8bfc 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,10 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [[ ${test} == "large" && -n "${KSFT_MACHINE_SLOW}" ]]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
The mentioned test is failing in slow environments:
# SO_TXTIME ipv4 clock monotonic
# ./so_txtime: recv: timeout: Resource temporarily unavailable
not ok 1 selftests: net: so_txtime.sh # exit=1
The receiver is started in the background and the sender could end up
transmitting the packet before the receiver is ready, so that the
later recv times out.
Address the issue by explicitly waiting for the socket to be bound to
the relevant port.
Fixes: af5136f95045 ("selftests/net: SO_TXTIME with ETF and FQ")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that to really cope with slow environments the mentioned self-tests also
need net-next commit c41dfb0dfbec ("selftests/net: ignore timing
errors in so_txtime if KSFT_MACHINE_SLOW"), so this could be applied to
net-next, too
---
tools/testing/selftests/net/so_txtime.sh | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/net/so_txtime.sh b/tools/testing/selftests/net/so_txtime.sh
index 3f06f4d286a9..ade0e5755099 100755
--- a/tools/testing/selftests/net/so_txtime.sh
+++ b/tools/testing/selftests/net/so_txtime.sh
@@ -5,6 +5,8 @@
set -e
+source net_helper.sh
+
readonly DEV="veth0"
readonly BIN="./so_txtime"
@@ -51,13 +53,16 @@ do_test() {
local readonly CLOCK="$2"
local readonly TXARGS="$3"
local readonly RXARGS="$4"
+ local PROTO
if [[ "${IP}" == "4" ]]; then
local readonly SADDR="${SADDR4}"
local readonly DADDR="${DADDR4}"
+ PROTO=udp
elif [[ "${IP}" == "6" ]]; then
local readonly SADDR="${SADDR6}"
local readonly DADDR="${DADDR6}"
+ PROTO=udp6
else
echo "Invalid IP version ${IP}"
exit 1
@@ -65,6 +70,7 @@ do_test() {
local readonly START="$(date +%s%N --date="+ 0.1 seconds")"
ip netns exec "${NS2}" "${BIN}" -"${IP}" -c "${CLOCK}" -t "${START}" -S "${SADDR}" -D "${DADDR}" "${RXARGS}" -r &
+ wait_local_port_listen "${NS2}" 8000 "${PROTO}"
ip netns exec "${NS1}" "${BIN}" -"${IP}" -c "${CLOCK}" -t "${START}" -S "${SADDR}" -D "${DADDR}" "${TXARGS}"
wait "$!"
}
--
2.43.0
This patch series is based on the RFC patch from Frederic [1]. Instead
of offering RCU_NOCB as a separate option, it is now lumped into a
root-only cpuset.cpus.isolation_full flag that will enable all the
additional CPU isolation capabilities available for isolated partitions
if set. RCU_NOCB is just the first one to this party. Additional dynamic
CPU isolation capabilities will be added in the future.
The first 2 patches are adopted from Frederic with minor twists to fix
merge conflicts and a compilation issue. The rest are for implementing
the new cpuset.cpus.isolation_full interface, which is essentially a flag
to globally enable or disable full CPU isolation on isolated partitions.
On read, it also shows the CPU isolation capabilities that are currently
enabled. RCU_NOCB requires that the rcu_nocbs option be present in
the kernel boot command line. Without that, the rcu_nocb functionality
cannot be enabled even if the isolation_full flag is set. So we allow
users to check the isolation_full file to verify that if the desired
CPU isolation capability is enabled or not.
Only sanity checking has been done so far. More testing, especially on
the RCU side, will be needed.
[1] https://lore.kernel.org/lkml/20220525221055.1152307-1-frederic@kernel.org/
Frederic Weisbecker (2):
rcu/nocb: Pass a cpumask instead of a single CPU to offload/deoffload
rcu/nocb: Prepare to change nocb cpumask from CPU-hotplug protected
cpuset caller
Waiman Long (6):
rcu/no_cb: Add rcu_nocb_enabled() to expose the rcu_nocb state
cgroup/cpuset: Better tracking of addition/deletion of isolated CPUs
cgroup/cpuset: Add cpuset.cpus.isolation_full
cgroup/cpuset: Enable dynamic rcu_nocb mode on isolated CPUs
cgroup/cpuset: Document the new cpuset.cpus.isolation_full control
file
cgroup/cpuset: Update test_cpuset_prs.sh to handle
cpuset.cpus.isolation_full
Documentation/admin-guide/cgroup-v2.rst | 24 ++
include/linux/rcupdate.h | 15 +-
kernel/cgroup/cpuset.c | 237 ++++++++++++++----
kernel/rcu/rcutorture.c | 6 +-
kernel/rcu/tree_nocb.h | 118 ++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 23 +-
6 files changed, 337 insertions(+), 86 deletions(-)
--
2.39.3
The Open vSwitch module accepts actions as a list from the netlink socket
and then creates a copy which it uses in the action set processing.
During processing of the action list on a packet, the module keeps a
count of the execution depth and exits processing if the action depth
goes too high.
However, during netlink processing the recursion depth isn't checked
anywhere, and the copy trusts that the kernel has a large enough stack to
accommodate it. The OVS sample action was the original action which
could perform this kind of recursion, and it originally checked that
it didn't exceed the sample depth limit. However, when sample became
optimized to provide the clone() semantics, the recursion limit was
dropped.
This series adds a depth limit during the __ovs_nla_copy_actions() call
that will ensure we don't exceed the max that the OVS userspace could
generate for a clone().
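As a standalone illustration of the pattern (this is not the kernel
change itself; the real check is added to __ovs_nla_copy_actions() in
net/openvswitch/flow_netlink.c, and the limit and error code used there
may differ), the copy routine threads the current nesting depth through
and bails out once it descends too far:

#include <errno.h>

#define MAX_COPY_DEPTH 16               /* illustrative limit */

struct action {
        struct action *next;            /* next action in this list */
        struct action *nested;          /* set for sample()/clone()-like actions */
};

static int copy_actions(const struct action *a, int depth)
{
        if (depth > MAX_COPY_DEPTH)
                return -EMSGSIZE;       /* refuse overly deep action lists */

        for (; a; a = a->next) {
                if (a->nested) {
                        int err = copy_actions(a->nested, depth + 1);

                        if (err)
                                return err;
                }
                /* ... copy/validate the action itself ... */
        }
        return 0;
}

int main(void)
{
        struct action leaf = { 0 };
        struct action top = { .nested = &leaf };

        return copy_actions(&top, 0) ? 1 : 0;
}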
Additionally, this series provides a selftest in 2/2 that can be used to
determine if the OVS module is allowing unbounded access. It can be
safely omitted where the ovs selftest framework isn't available.
Aaron Conole (2):
net: openvswitch: limit the number of recursions from action sets
selftests: openvswitch: Add validation for the recursion test
net/openvswitch/flow_netlink.c | 49 ++++++++-----
.../selftests/net/openvswitch/openvswitch.sh | 13 ++++
.../selftests/net/openvswitch/ovs-dpctl.py | 71 +++++++++++++++----
3 files changed, 102 insertions(+), 31 deletions(-)
--
2.41.0
The altnames test uses the forwarding/lib.sh and that dependency
currently causes failures when running the test after install:
make -C tools/testing/selftests/ TARGETS=net install
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t net:altnames.sh
# ...
# ./altnames.sh: line 8: ./forwarding/lib.sh: No such file or directory
# RTNETLINK answers: Operation not permitted
# ./altnames.sh: line 73: tests_run: command not found
# ./altnames.sh: line 65: pre_cleanup: command not found
Address the issue by leveraging the TEST_INCLUDES infrastructure
provided by commit 2a0683be5b4c ("selftests: Introduce Makefile variable
to list shared bash scripts")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 211753756bde..7b6918d5f4af 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -97,6 +97,8 @@ TEST_PROGS += vlan_hw_filter.sh
TEST_FILES := settings
TEST_FILES += in_netns.sh lib.sh net_helper.sh setup_loopback.sh setup_veth.sh
+TEST_INCLUDES := forwarding/lib.sh
+
include ../lib.mk
$(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
--
2.43.0
Fix various problems in the forwarding selftests so that they will pass
in the netdev CI instead of being ignored. See commit messages for
details.
Ido Schimmel (4):
selftests: forwarding: Fix layer 2 miss test flakiness
selftests: forwarding: Fix bridge MDB test flakiness
selftests: forwarding: Suppress grep warnings
selftests: forwarding: Fix bridge locked port test flakiness
.../selftests/net/forwarding/bridge_locked_port.sh | 4 ++--
.../testing/selftests/net/forwarding/bridge_mdb.sh | 14 +++++++++-----
.../selftests/net/forwarding/tc_flower_l2_miss.sh | 8 ++++++--
3 files changed, 17 insertions(+), 9 deletions(-)
--
2.43.0
The two tests that make use of multicast routing (router.sh and
router_multicast.sh) are currently failing in the netdev CI because the
kernel is missing multicast routing support.
Fix by adding the required config entries.
Fixes: 6d4efada3b82 ("selftests: forwarding: Add multicast routing test")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
---
Targeting net-next because this is where 4acf4e62cd57 ("selftests:
forwarding: Add missing config entries") was applied, but you can
apply to net as well.
---
tools/testing/selftests/net/forwarding/config | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/config b/tools/testing/selftests/net/forwarding/config
index ba2343514582..f59083d8c59d 100644
--- a/tools/testing/selftests/net/forwarding/config
+++ b/tools/testing/selftests/net/forwarding/config
@@ -2,7 +2,14 @@ CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_NET_L3_MASTER_DEV=y
+CONFIG_IP_MROUTE=y
+CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
+CONFIG_IP_PIMSM_V1=y
+CONFIG_IP_PIMSM_V2=y
+CONFIG_IPV6_MROUTE=y
+CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
+CONFIG_IPV6_PIMSM_V2=y
CONFIG_NET_VRF=m
CONFIG_BPF_SYSCALL=y
CONFIG_CGROUP_BPF=y
--
2.43.0
A couple of small updates for the check_compaction selftest which make
it play more nicely with test automation systems.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
selftests/mm: Log skipped compaction test as a skip
selftests/mm: Log a consistent test name for check_compaction
tools/testing/selftests/mm/compaction_test.c | 37 +++++++++++++++-------------
1 file changed, 20 insertions(+), 17 deletions(-)
---
base-commit: b1d3a0e70c3881d2f8cf6692ccf7c2a4fb2d030d
change-id: 20240208-kselftest-mm-cleanup-30edd2e567cb
Best regards,
--
Mark Brown <broonie(a)kernel.org>
I encountered the following build errors while compiling the selftests net
test cases on the Linux next-20240208 tag with the clang toolchain.
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
selftests/net/ip_local_port_range
ip_local_port_range.c:152:17: error: use of undeclared identifier
'IPPROTO_MPTCP'
152 | .so_protocol = IPPROTO_MPTCP,
| ^
ip_local_port_range.c:176:17: error: use of undeclared identifier
'IPPROTO_MPTCP'
176 | .so_protocol = IPPROTO_MPTCP,
| ^
2 errors generated.
Build link,
- https://storage.tuxsuite.com/public/linaro/lkft/builds/2c4LtUoRSYhdGbErOY8h…
--
Linaro LKFT
https://lkft.linaro.org
The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
CSR to switch the memory consistency model of user mode at run-time from
RVWMO to TSO. The active consistency model can therefore be switched on a
per-hart basis and managed by the kernel on a per-process basis.
This patchset implements basic Ssdtso support and adds a prctl API on top
so that user-space processes can switch to a stronger memory consistency
model (than the kernel was written for) at run-time.
The patchset also comes with a short documentation of the prctl API.
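As a rough illustration of how a process might use such an API (the prctl
option and argument names below are placeholders for illustration only, not
the constants defined by this series):

#include <stdio.h>
#include <sys/prctl.h>

/* Placeholder names and values; the real definitions come from this
 * patchset's uapi changes.
 */
#ifndef PR_SET_MEMORY_CONSISTENCY_MODEL
#define PR_SET_MEMORY_CONSISTENCY_MODEL 0x504d434d
#define PR_MEMORY_CONSISTENCY_MODEL_TSO 1
#endif

int main(void)
{
        /* Ask the kernel to switch this process from RVWMO to TSO. */
        if (prctl(PR_SET_MEMORY_CONSISTENCY_MODEL,
                  PR_MEMORY_CONSISTENCY_MODEL_TSO, 0, 0, 0))
                perror("prctl");
        else
                puts("running under TSO from now on");
        return 0;
}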
This series is based on the third draft of the Ssdtso specification
which can be found here:
https://github.com/riscv/riscv-ssdtso/releases/tag/v1.0-draft3
Note that the Ssdtso specification is still in development (i.e., not
frozen or even ratified), which is also the reason why this series is
marked as RFC.
This series saw the following changes since v1:
* Reordered/restructured patches
* Fixed build issues
* Addressed typos
* Removed ability to switch TSO->WMO
* Moved the state from per-thread to per-process
* Reschedule all CPUs after switching
* Some cleanups in the documentation
* Adding compatibility with Ztso (spec change in draft 3)
This patchset can also be found in this GitHub branch:
https://github.com/cmuellner/linux/tree/ssdtso-v2
A QEMU implementation of DTSO can be found in this GitHub branch:
https://github.com/cmuellner/qemu/tree/ssdtso-v2
Christoph Müllner (6):
mm: Add dynamic memory consistency model switching
uapi: prctl: Add new prctl call to set/get the memory consistency
model
RISC-V: Enable dynamic memory consistency model support with Ssdtso
RISC-V: Implement prctl call to set/get the memory consistency model
RISC-V: Expose Ssdtso via hwprobe API
RISC-V: selftests: Add DTSO tests
Documentation/arch/riscv/hwprobe.rst | 3 +
.../mm/dynamic-memory-consistency-model.rst | 86 ++++++++++++++++
Documentation/mm/index.rst | 1 +
arch/Kconfig | 14 +++
arch/riscv/Kconfig | 11 +++
arch/riscv/include/asm/csr.h | 1 +
arch/riscv/include/asm/dtso.h | 97 +++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/processor.h | 7 ++
arch/riscv/include/asm/switch_to.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 3 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/dtso.c | 67 +++++++++++++
arch/riscv/kernel/sys_hwprobe.c | 2 +
include/linux/sched.h | 5 +
include/uapi/linux/prctl.h | 5 +
kernel/sys.c | 12 +++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/dtso/.gitignore | 1 +
tools/testing/selftests/riscv/dtso/Makefile | 11 +++
tools/testing/selftests/riscv/dtso/dtso.c | 82 ++++++++++++++++
23 files changed, 416 insertions(+), 1 deletion(-)
create mode 100644 Documentation/mm/dynamic-memory-consistency-model.rst
create mode 100644 arch/riscv/include/asm/dtso.h
create mode 100644 arch/riscv/kernel/dtso.c
create mode 100644 tools/testing/selftests/riscv/dtso/.gitignore
create mode 100644 tools/testing/selftests/riscv/dtso/Makefile
create mode 100644 tools/testing/selftests/riscv/dtso/dtso.c
--
2.43.0
The reuseport_addr_any.sh is currently skipping DCCP tests and
pmtu.sh is skipping all the FOU/GUE related cases: add the missing
options.
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that this does not include the - still missing - OVS-related
option, and pmtu.sh will keep skipping such cases. Such tests
will still fail in the virtme environment even with the relevant
kernel options enabled, as they have a hard-to-solve dependency
on systemd/dbus.
The longer term plan is to move such test cases into the openvswitch
directory. One short-term option to avoid skips in selftests results
while retaining the potential code coverage would be to make the ovs
tests disabled by default but reachable via pmtu.sh command line
arguments.
---
tools/testing/selftests/net/config | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 3b749addd364..54d21e2911a9 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -13,6 +13,10 @@ CONFIG_IPV6=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_VETH=y
CONFIG_NET_IPVTI=y
+CONFIG_NET_FOU=y
+CONFIG_NET_FOU_IP_TUNNELS=y
+CONFIG_NET_IPIP=y
+CONFIG_IPV6_SIT=y
CONFIG_IPV6_VTI=y
CONFIG_DUMMY=y
CONFIG_BRIDGE_VLAN_FILTERING=y
@@ -24,6 +28,7 @@ CONFIG_IFB=y
CONFIG_INET_DIAG=y
CONFIG_INET_ESP=y
CONFIG_INET_ESP_OFFLOAD=y
+CONFIG_IP_DCCP=m
CONFIG_IP_GRE=m
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
--
2.43.0
Here's a follow-up from my RFC series last year:
https://lore.kernel.org/lkml/20221004093131.40392-1-thuth@redhat.com/T/
and from v1 earlier this year:
https://lore.kernel.org/kvm/20230712075910.22480-1-thuth@redhat.com/
The basic idea of this series is now to use the kselftest_harness.h
framework to get TAP output in the tests, so that it is easier
for the user to see what is going on, and e.g. to be able to
detect whether a certain test is part of the test binary or not
(which is useful when tests get extended in the course of time).
v2:
- Dropped the "Rename the ASSERT_EQ macro" patch (already merged)
- Split the fixes in the sync_regs_test into separate patches
(see the first two patches)
- Introduce the KVM_ONE_VCPU_TEST_SUITE() macro as suggested
by Sean (see third patch) and use it in the following patches
- Add a new patch to convert vmx_pmu_caps_test.c, too
Thomas Huth (7):
KVM: selftests: x86: sync_regs_test: Use vcpu_run() where appropriate
KVM: selftests: x86: sync_regs_test: Get regs structure before
modifying it
KVM: selftests: Add a macro to define a test with one vcpu
KVM: selftests: x86: Use TAP interface in the sync_regs test
KVM: selftests: x86: Use TAP interface in the fix_hypercall test
KVM: selftests: x86: Use TAP interface in the vmx_pmu_caps test
KVM: selftests: x86: Use TAP interface in the userspace_msr_exit test
.../selftests/kvm/include/kvm_test_harness.h | 35 +++++
.../selftests/kvm/x86_64/fix_hypercall_test.c | 27 ++--
.../selftests/kvm/x86_64/sync_regs_test.c | 121 +++++++++++++-----
.../kvm/x86_64/userspace_msr_exit_test.c | 19 +--
.../selftests/kvm/x86_64/vmx_pmu_caps_test.c | 50 ++------
5 files changed, 160 insertions(+), 92 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/kvm_test_harness.h
--
2.41.0
On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> instructions.
> What they have is the *semantics* of the instructions, but no actual opcodes
> to encode them.
> I argued for them in the RISC-V memory group, but it was considered to be
> outside the scope of that group.
(Sorry for the late, late reply; just recalled this thread...)
That's right. AFAICT, the discussion about the native load-acquire
and store-release instructions was revived sometime last year within
the RVI community, culminating in the so-called Zalasr proposal [1];
Brendan, Hans and Andrew (+ Cc) might be able to provide more up-to-
date information about the status/plans for that proposal.
(Remark that RISC-V did introduce LR/SC and AMO instructions with
acquire/release semantics separately, cf. the so-called A-extension.)
Andrea
[1] https://github.com/mehnadnerd/riscv-zalasr
The gro self-test sends the packets to be aggregated with
multiple write operations.
When running in a slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue by explicitly ignoring failures for such a case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
I experimented with a largish refactor replacing the multiple writes
with a single GSO packet, but exhausted my time budget before reaching
any good result.
---
tools/testing/selftests/net/gro.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..114b5281a3f5 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,10 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [ ${test} == "large" -a -n "${KSFT_MACHINE_SLOW}" ]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
If KUnit is built as a module, and it's unloaded, the kunit_bus is not
unregistered. This causes an error if it's then re-loaded later, as we
try to re-register the bus.
Unregister the bus and root_device on shutdown, if it looks valid.
In addition, be more specific about the value of kunit_bus_device. It
is:
- a valid struct device* if the kunit_bus initialised correctly.
- an ERR_PTR if it failed to initialise.
- NULL before initialisation and after shutdown.
Fixes: d03c720e03bd ("kunit: Add APIs for managing devices")
Signed-off-by: David Gow <davidgow(a)google.com>
---
This will hopefully resolve some of the issues linked to from:
https://lore.kernel.org/intel-gfx/DM4PR11MB614179CB9C387842D8E8BB40B97C2@DM…
---
lib/kunit/device-impl.h | 2 ++
lib/kunit/device.c | 14 ++++++++++++++
lib/kunit/test.c | 3 +++
3 files changed, 19 insertions(+)
diff --git a/lib/kunit/device-impl.h b/lib/kunit/device-impl.h
index 54bd55836405..5fcd48ff0f36 100644
--- a/lib/kunit/device-impl.h
+++ b/lib/kunit/device-impl.h
@@ -13,5 +13,7 @@
// For internal use only -- registers the kunit_bus.
int kunit_bus_init(void);
+// For internal use only -- unregisters the kunit_bus.
+void kunit_bus_shutdown(void);
#endif //_KUNIT_DEVICE_IMPL_H
diff --git a/lib/kunit/device.c b/lib/kunit/device.c
index 074c6dd2e36a..644a38a1f5b1 100644
--- a/lib/kunit/device.c
+++ b/lib/kunit/device.c
@@ -54,6 +54,20 @@ int kunit_bus_init(void)
return error;
}
+/* Unregister the 'kunit_bus' in case the KUnit module is unloaded. */
+void kunit_bus_shutdown(void)
+{
+ /* Make sure the bus exists before we unregister it. */
+ if (IS_ERR_OR_NULL(kunit_bus_device))
+ return;
+
+ bus_unregister(&kunit_bus_type);
+
+ root_device_unregister(kunit_bus_device);
+
+ kunit_bus_device = NULL;
+}
+
/* Release a 'fake' KUnit device. */
static void kunit_device_release(struct device *d)
{
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 31a5a992e646..1d1475578515 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -928,6 +928,9 @@ static void __exit kunit_exit(void)
#ifdef CONFIG_MODULES
unregister_module_notifier(&kunit_mod_nb);
#endif
+
+ kunit_bus_shutdown();
+
kunit_debugfs_cleanup();
}
module_exit(kunit_exit);
--
2.43.0.429.g432eaa2c6b-goog
Other mechanisms for querying the peak memory usage of either a process
or v1 memory cgroup allow for resetting the high watermark. Restore
parity with those mechanisms.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- writing "5" to the clear_refs pseudo-file in a processes's proc
directory resets the peak RSS.
This change copies the cgroup v1 behavior so any write to the
memory.peak and memory.swap.peak pseudo-files reset the high watermark
to the current usage.
This behavior is particularly useful for work scheduling systems that
need to track memory usage of worker processes/cgroups per-work-item.
Since memory can't be squeezed like CPU can (the OOM-killer has
opinions), these systems need to track the peak memory usage to compute
system/container fullness when binpacking workitems.
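For illustration, a minimal sketch of the reset-and-measure pattern such a
scheduler might use (the cgroup path is only an example, and it assumes the
writer has permission on that cgroup):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Example path; a real scheduler would use its worker's cgroup. */
        const char *peak = "/sys/fs/cgroup/worker/memory.peak";
        char buf[64];
        ssize_t n;
        int fd;

        /* Any non-empty write resets the watermark to the current usage. */
        fd = open(peak, O_WRONLY);
        if (fd < 0 || write(fd, "reset\n", 6) < 0)
                perror("reset memory.peak");
        if (fd >= 0)
                close(fd);

        /* ... run one work item here, then read its peak usage ... */

        fd = open(peak, O_RDONLY);
        if (fd >= 0) {
                n = read(fd, buf, sizeof(buf) - 1);
                if (n > 0) {
                        buf[n] = '\0';
                        printf("peak since reset: %s", buf);
                }
                close(fd);
        }
        return 0;
}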
Signed-off-by: David Finkel <davidf(a)vimeo.com>
---
Documentation/admin-guide/cgroup-v2.rst | 20 +++---
mm/memcontrol.c | 23 ++++++
.../selftests/cgroup/test_memcontrol.c | 72 ++++++++++++++++---
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..95af0628dc44 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1305,11 +1305,13 @@ PAGE_SIZE multiple when read back.
reclaim induced by memory.reclaim.
memory.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
+
+ The max memory usage recorded for the cgroup and its descendants since
+ either the creation of the cgroup or the most recent reset.
- The max memory usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ Any non-empty write to this file resets it to the current memory usage.
+ All content written is completely ignored.
memory.oom.group
A read-write single value file which exists on non-root
@@ -1626,11 +1628,13 @@ PAGE_SIZE multiple when read back.
Healthy workloads are not expected to reach this limit.
memory.swap.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
+
+ The max swap usage recorded for the cgroup and its descendants since
+ the creation of the cgroup or the most recent reset.
- The max swap usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ Any non-empty write to this file resets it to the current swap usage.
+ All content written is completely ignored.
memory.swap.max
A read-write single value file which exists on non-root
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c1061df9cd1..b04af158922d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,6 +25,7 @@
* Copyright (C) 2020 Alibaba, Inc, Alex Shi
*/
+#include <linux/cgroup-defs.h>
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
@@ -6635,6 +6636,16 @@ static u64 memory_peak_read(struct cgroup_subsys_state *css,
return (u64)memcg->memory.watermark * PAGE_SIZE;
}
+static ssize_t memory_peak_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ page_counter_reset_watermark(&memcg->memory);
+
+ return nbytes;
+}
+
static int memory_min_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -6947,6 +6958,7 @@ static struct cftype memory_files[] = {
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_peak_read,
+ .write = memory_peak_write,
},
{
.name = "min",
@@ -7917,6 +7929,16 @@ static u64 swap_peak_read(struct cgroup_subsys_state *css,
return (u64)memcg->swap.watermark * PAGE_SIZE;
}
+static ssize_t swap_peak_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ page_counter_reset_watermark(&memcg->swap);
+
+ return nbytes;
+}
+
static int swap_high_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -7999,6 +8021,7 @@ static struct cftype swap_files[] = {
.name = "swap.peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = swap_peak_read,
+ .write = swap_peak_write,
},
{
.name = "swap.events",
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c7c9572003a8..0326c317f1f2 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -161,12 +161,12 @@ static int alloc_pagecache_50M_check(const char *cgroup, void *arg)
/*
* This test create a memory cgroup, allocates
* some anonymous memory and some pagecache
- * and check memory.current and some memory.stat values.
+ * and checks memory.current, memory.peak, and some memory.stat values.
*/
-static int test_memcg_current(const char *root)
+static int test_memcg_current_peak(const char *root)
{
int ret = KSFT_FAIL;
- long current;
+ long current, peak, peak_reset;
char *memcg;
memcg = cg_name(root, "memcg_test");
@@ -180,12 +180,32 @@ static int test_memcg_current(const char *root)
if (current != 0)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak != 0)
+ goto cleanup;
+
if (cg_run(memcg, alloc_anon_50M_check, NULL))
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(50))
+ goto cleanup;
+
+ peak_reset = cg_write(memcg, "memory.peak", "\n");
+ if (peak_reset != 0)
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak > MB(30))
+ goto cleanup;
+
if (cg_run(memcg, alloc_pagecache_50M_check, NULL))
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(50))
+ goto cleanup;
+
ret = KSFT_PASS;
cleanup:
@@ -815,13 +835,14 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
/*
* This test checks that memory.swap.max limits the amount of
- * anonymous memory which can be swapped out.
+ * anonymous memory which can be swapped out. Additionally, it verifies that
+ * memory.swap.peak reflects the high watermark and can be reset.
*/
-static int test_memcg_swap_max(const char *root)
+static int test_memcg_swap_max_peak(const char *root)
{
int ret = KSFT_FAIL;
char *memcg;
- long max;
+ long max, peak;
if (!is_swap_enabled())
return KSFT_SKIP;
@@ -838,6 +859,12 @@ static int test_memcg_swap_max(const char *root)
goto cleanup;
}
+ if (cg_read_long(memcg, "memory.swap.peak"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.peak"))
+ goto cleanup;
+
if (cg_read_strcmp(memcg, "memory.max", "max\n"))
goto cleanup;
@@ -860,6 +887,27 @@ static int test_memcg_swap_max(const char *root)
if (cg_read_key_long(memcg, "memory.events", "oom_kill ") != 1)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.swap.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ if (cg_write(memcg, "memory.swap.peak", "\n"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.swap.peak") > MB(10))
+ goto cleanup;
+
+
+ if (cg_write(memcg, "memory.peak", "\n"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.peak"))
+ goto cleanup;
+
if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
goto cleanup;
@@ -867,6 +915,14 @@ static int test_memcg_swap_max(const char *root)
if (max <= 0)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.swap.peak");
+ if (peak < MB(19))
+ goto cleanup;
+
ret = KSFT_PASS;
cleanup:
@@ -1293,7 +1349,7 @@ struct memcg_test {
const char *name;
} tests[] = {
T(test_memcg_subtree_control),
- T(test_memcg_current),
+ T(test_memcg_current_peak),
T(test_memcg_min),
T(test_memcg_low),
T(test_memcg_high),
@@ -1301,7 +1357,7 @@ struct memcg_test {
T(test_memcg_max),
T(test_memcg_reclaim),
T(test_memcg_oom_events),
- T(test_memcg_swap_max),
+ T(test_memcg_swap_max_peak),
T(test_memcg_sock),
T(test_memcg_oom_group_leaf_events),
T(test_memcg_oom_group_parent_events),
--
2.39.2
Continue the DAMON selftests' test coverage improvement work with a trivial
improvement of the test code itself. The sequence of the patches in the
patchset is as follows.
The first five patches add tests for two DAMON core functionalities. Those
begin with three patches (patches 1-3) that update the test-purpose
DAMON sysfs interface wrapper to support DAMOS quota, stats, and apply
interval features, respectively. The fourth patch implements and adds a
selftest for the DAMOS quota feature, using the DAMON sysfs interface
wrapper's newly added support of the quota and the stats features. The
fifth patch further implements and adds a selftest for DAMOS apply
intervals using the DAMON sysfs interface wrapper's newly added support
of the apply interval and the stats features.
Two patches (patches 6 and 7) that implement and add two corner-case
handling selftests follow. Those try to keep two previously fixed bugs
from recurring.
Finally, a patch follows that makes the DAMON debugfs selftests'
dependency checker use /proc/mounts instead of a hard-coded mount point
assumption.
SeongJae Park (8):
selftests/damon/_damon_sysfs: support DAMOS quota
selftests/damon/_damon_sysfs: support DAMOS stats
selftests/damon/_damon_sysfs: support DAMOS apply interval
selftests/damon: add a test for DAMOS quota
selftests/damon: add a test for DAMOS apply intervals
selftests/damon: add a test for a race between target_ids_read() and
dbgfs_before_terminate()
selftests/damon: add a test for the pid leak of
dbgfs_target_ids_write()
selftests/damon/_chk_dependency: get debugfs mount point from
/proc/mounts
tools/testing/selftests/damon/.gitignore | 2 +
tools/testing/selftests/damon/Makefile | 5 ++
.../selftests/damon/_chk_dependency.sh | 9 ++-
tools/testing/selftests/damon/_damon_sysfs.py | 77 ++++++++++++++++--
.../selftests/damon/damos_apply_interval.py | 67 ++++++++++++++++
tools/testing/selftests/damon/damos_quota.py | 67 ++++++++++++++++
.../damon/debugfs_target_ids_pid_leak.c | 68 ++++++++++++++++
.../damon/debugfs_target_ids_pid_leak.sh | 22 +++++
...fs_target_ids_read_before_terminate_race.c | 80 +++++++++++++++++++
...s_target_ids_read_before_terminate_race.sh | 14 ++++
10 files changed, 403 insertions(+), 8 deletions(-)
create mode 100755 tools/testing/selftests/damon/damos_apply_interval.py
create mode 100755 tools/testing/selftests/damon/damos_quota.py
create mode 100644 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.c
create mode 100755 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh
create mode 100644 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.c
create mode 100755 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh
base-commit: f51e629727d8cc526a3156a2c80489b8f050410f
--
2.39.2
The cmsg_ipv6 test requests tcpdump to capture 4 packets,
and keeps sending until tcpdump quits. Only the first packet
is "real", however, and the rest are basic UDP packets.
So if tcpdump doesn't start in time it will miss
the real packet and only capture the UDP ones.
This makes the test fail on slow machines (no KVM or with
debug enabled) 100% of the time, while it passes in fast
environments.
Repeat the "real" / expected packet.
Fixes: 9657ad09e1fa ("selftests: net: test IPV6_TCLASS")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/cmsg_ipv6.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/cmsg_ipv6.sh b/tools/testing/selftests/net/cmsg_ipv6.sh
index f30bd57d5e38..8bc23fb4c82b 100755
--- a/tools/testing/selftests/net/cmsg_ipv6.sh
+++ b/tools/testing/selftests/net/cmsg_ipv6.sh
@@ -89,7 +89,7 @@ for ovr in setsock cmsg both diff; do
check_result $? 0 "TCLASS $prot $ovr - pass"
while [ -d /proc/$BG ]; do
- $NSEXE ./cmsg_sender -6 -p u $TGT6 1234
+ $NSEXE ./cmsg_sender -6 -p $p $m $((TOS2)) $TGT6 1234
done
tcpdump -r $TMPF -v 2>&1 | grep "class $TOS2" >> /dev/null
@@ -126,7 +126,7 @@ for ovr in setsock cmsg both diff; do
check_result $? 0 "HOPLIMIT $prot $ovr - pass"
while [ -d /proc/$BG ]; do
- $NSEXE ./cmsg_sender -6 -p u $TGT6 1234
+ $NSEXE ./cmsg_sender -6 -p $p $m $LIM $TGT6 1234
done
tcpdump -r $TMPF -v 2>&1 | grep "hlim $LIM[^0-9]" >> /dev/null
--
2.43.0
Add specification for test metadata to the KTAP v2 spec.
KTAP v1 only specifies the output format of very basic test information:
test result and test name. Any additional test information either gets
added to general diagnostic data or is not included in the output at all.
The purpose of KTAP metadata is to create a framework to include and
easily identify additional important test information in KTAP.
KTAP metadata could include any test information that is pertinent for
user interaction before or after the running of the test. For example,
the test file path or the test speed.
Since this includes a large variety of information, this specification
will recognize notable types of KTAP metadata to ensure a consistent format
across test frameworks. See the full list of types in the specification.
Example of KTAP Metadata:
KTAP version 2
# ktap_test: main
# ktap_arch: uml
1..1
KTAP version 2
# ktap_test: suite_1
# ktap_subsystem: example
# ktap_test_file: lib/test.c
1..2
ok 1 test_1
# ktap_test: test_2
# ktap_speed: very_slow
# custom_is_flaky: true
ok 2 test_2
ok 1 test_suite
The changes to the KTAP specification outline the format, location, and
different types of metadata.
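As an illustration of how simple the line format is to consume (this is only
a sketch, not the KUnit parser itself), splitting a "# <prefix>_<type>: <value>"
line could look like:

#include <stdio.h>
#include <string.h>

/* Split "# <prefix>_<type>: <value>" into its three parts; returns 0 on
 * success, -1 if the line is not a metadata line.
 */
static int parse_metadata(const char *line, char *prefix, char *type,
                          char *value)
{
        const char *p, *us, *colon;

        if (strncmp(line, "# ", 2) != 0)
                return -1;
        p = line + 2;
        us = strchr(p, '_');
        colon = strchr(p, ':');
        if (!us || !colon || us > colon)
                return -1;
        snprintf(prefix, 32, "%.*s", (int)(us - p), p);
        snprintf(type, 64, "%.*s", (int)(colon - us - 1), us + 1);
        snprintf(value, 128, "%s", colon + 1 + (colon[1] == ' '));
        return 0;
}

int main(void)
{
        char prefix[32], type[64], value[128];

        if (!parse_metadata("# ktap_speed: very_slow", prefix, type, value))
                printf("prefix=%s type=%s value=%s\n", prefix, type, value);
        return 0;
}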
Here is a link to a version of the KUnit parser that is able to parse test
metadata lines for KTAP version 2. Note this includes test metadata
lines for the main level of KTAP.
Link: https://kunit-review.googlesource.com/c/linux/+/5889
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Documentation/dev-tools/ktap.rst | 163 ++++++++++++++++++++++++++++++-
1 file changed, 159 insertions(+), 4 deletions(-)
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index ff77f4aaa6ef..4480eaf5bbc3 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -17,19 +17,20 @@ KTAP test results describe a series of tests (which may be nested: i.e., test
can have subtests), each of which can contain both diagnostic data -- e.g., log
lines -- and a final result. The test structure and results are
machine-readable, whereas the diagnostic data is unstructured and is there to
-aid human debugging.
+aid human debugging. One exception to this is test metadata lines - a type
+of diagnostic lines. Test metadata is used to identify important supplemental
+test information and can be machine-readable.
KTAP output is built from four different types of lines:
- Version lines
- Plan lines
- Test case result lines
-- Diagnostic lines
+- Diagnostic lines (including test metadata)
In general, valid KTAP output should also form valid TAP output, but some
information, in particular nested test results, may be lost. Also note that
there is a stagnant draft specification for TAP14, KTAP diverges from this in
-a couple of places (notably the "Subtest" header), which are described where
-relevant later in this document.
+a couple of places, which are described where relevant later in this document.
Version lines
-------------
@@ -166,6 +167,154 @@ even if they do not start with a "#": this is to capture any other useful
kernel output which may help debug the test. It is nevertheless recommended
that tests always prefix any diagnostic output they have with a "#" character.
+KTAP metadata lines
+-------------------
+
+KTAP metadata lines are a subset of diagnostic lines that are used to include
+and easily identify important supplemental test information in KTAP.
+
+.. code-block:: none
+
+ # <prefix>_<metadata type>: <metadata value>
+
+The <prefix> indicates where to find the specification for the type of
+metadata. The metadata types listed below use the prefix "ktap" (See Types of
+KTAP Metadata).
+
+Types that are instead specified by an individual test framework use the
+framework name as the prefix. For example, a metadata type documented by the
+kselftest specification would use the prefix "kselftest". Any metadata type
+that is not listed in a specification must use the prefix "custom". Note the
+prefix must not include spaces or the characters ":" or "_".
+
+The format of <metadata type> and <value> varies based on the type. See the
+individual specification. For "custom" types the <metadata type> can be any
+string excluding ":", spaces, or newline characters and the <value> can be any
+string.
+
+**Location:**
+
+The first KTAP metadata entry for a test must be "# ktap_test: <test name>",
+which acts as a header to associate metadata with the correct test.
+
+For test cases, the location of the metadata is between the prior test result
+line and the current test result line. For test suites, the location of the
+metadata is between the suite's version line and test plan line. See the
+example below.
+
+KTAP metadata for a test does not need to be contiguous. For example, a kernel
+warning or other diagnostic output could interrupt metadata lines. However, it
+is recommended to keep a test's metadata lines together when possible, as this
+improves readability.
+
+**Here is an example of using KTAP metadata:**
+
+::
+
+ KTAP version 2
+ # ktap_test: main
+ # ktap_arch: uml
+ 1..1
+ KTAP version 2
+ # ktap_test: suite_1
+ # ktap_subsystem: example
+ # ktap_test_file: lib/test.c
+ 1..2
+ ok 1 test_1
+ # ktap_test: test_2
+ # ktap_speed: very_slow
+ # custom_is_flaky: true
+ ok 2 test_2
+ # suite_1 passed
+ ok 1 suite_1
+
+In this example, the tests are running on UML. The test suite "suite_1" is part
+of the subsystem "example" and belongs to the file "lib/example_test.c". It has
+two subtests, "test_1" and "test_2". The subtest "test_2" has a speed of
+"very_slow" and has been marked with a custom KTAP metadata type called
+"custom_is_flaky" with the value of "true".
+
+**Types of KTAP Metadata:**
+
+This is the current list of KTAP metadata types recognized in this
+specification. Note that all of these metadata types are optional (except for
+ktap_test as the KTAP metadata header).
+
+- ``ktap_test``: Name of test (used as header of KTAP metadata). This should
+ match the test name printed in the test result line: "ok 1 [test_name]".
+
+- ``ktap_module``: Name of the module containing the test
+
+- ``ktap_subsystem``: Name of the subsystem being tested
+
+- ``ktap_start_time``: Time tests started in ISO8601 format
+
+ - Example: "# ktap_start_time: 2024-01-09T13:09:01.990000+00:00"
+
+- ``ktap_duration``: Time taken (in seconds) to execute the test
+
+ - Example: "ktap_duration: 10.154s"
+
+- ``ktap_speed``: Category of how fast test runs: "normal", "slow", or
+ "very_slow"
+
+- ``ktap_test_file``: Path to source file containing the test. This metadata
+ line can be repeated if the test is spread across multiple files.
+
+ - Example: "# ktap_test_file: lib/test.c"
+
+- ``ktap_generated_file``: Description of and path to file generated during
+ test execution. This could be a core dump, generated filesystem image, some
+ form of visual output (for graphics drivers), etc. This metadata line can be
+ repeated to attach multiple files to the test.
+
+ - Example: "# ktap_generated_file: Core dump: /var/lib/systemd/coredump/hello.core"
+
+- ``ktap_log_file``: Path to file containing kernel log test output
+
+ - Example: "# ktap_log_file: /sys/kernel/debugfs/kunit/example/results"
+
+- ``ktap_error_file``: Path to file containing context for test failure or
+ error. This could include the difference between optimal test output and
+ actual test output.
+
+ - Example: "# ktap_error_file: fs/results/example.out.bad"
+
+- ``ktap_results_url``: Link to webpage describing this test run and its
+ results
+
+ - Example: "# ktap_results_url: https://kcidb.kernelci.org/hello"
+
+- ``ktap_arch``: Architecture used during test run
+
+ - Example: "# ktap_arch: x86_64"
+
+- ``ktap_compiler``: Compiler used during test run
+
+ - Example: "# ktap_compiler: gcc (GCC) 10.1.1 20200507 (Red Hat 10.1.1-1)"
+
+- ``ktap_respository_url``: Link to git repository of the checked out code.
+
+ - Example: "# ktap_respository_url: https://github.com/torvalds/linux.git"
+
+- ``ktap_git_branch``: Name of git branch of checked out code
+
+ - Example: "# ktap_git_branch: kselftest/kunit"
+
+- ``ktap_kernel_version``: Version of Linux Kernel being used during test run
+
+ - Example: "# ktap_kernel_version: 6.7-rc1"
+
+- ``ktap_commit_hash``: The full git commit hash of the checked out base code.
+
+ - Example: "# ktap_commit_hash: 064725faf8ec2e6e36d51e22d3b86d2707f0f47f"
+
+**Other Metadata Types:**
+
+There can also be KTAP metadata that is not included in the recognized list
+above. This metadata must be prefixed with the test framework, ie. "kselftest",
+or with the prefix "custom". For example, "# custom_batch: 20".
+
Unknown lines
-------------
@@ -206,6 +355,7 @@ An example of a test with two nested subtests:
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: example
1..2
ok 1 test_1
not ok 2 test_2
@@ -219,6 +369,7 @@ An example format with multiple levels of nested testing:
KTAP version 2
1..2
KTAP version 2
+ # ktap_test: example_test_1
1..2
KTAP version 2
1..2
@@ -254,6 +405,7 @@ Example KTAP output
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: main_test
1..3
KTAP version 2
1..1
@@ -261,11 +413,14 @@ Example KTAP output
ok 1 test_1
ok 1 example_test_1
KTAP version 2
+ # ktap_test: example_test_2
+ # ktap_speed: slow
1..2
ok 1 test_1 # SKIP test_1 skipped
ok 2 test_2
ok 2 example_test_2
KTAP version 2
+ # ktap_test: example_test_3
1..3
ok 1 test_1
# test_2: FAIL
base-commit: 906f02e42adfbd5ae70d328ee71656ecb602aaf5
--
2.43.0.429.g432eaa2c6b-goog
The seccomp benchmark test (for validating the benefit of bitmaps) can
be sensitive to scheduling speed, so pin the process to a single CPU,
which appears to significantly improve reliability, and loosen the
"close enough" checking to allow up to 10% variance instead of 1%.
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202402061002.3a8722fd-oliver.sang@intel.com
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Will Drewry <wad(a)chromium.org>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
v2:
- improve comment about selecting CPU (broonie)
- loosen variance check from 1% to 10%
v1: https://lore.kernel.org/all/20240206095642.work.502-kees@kernel.org/
---
.../selftests/seccomp/seccomp_benchmark.c | 38 ++++++++++++++++++-
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 5b5c9d558dee..9d7aa5a730e0 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,7 +4,9 @@
*/
#define _GNU_SOURCE
#include <assert.h>
+#include <err.h>
#include <limits.h>
+#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
@@ -76,8 +78,12 @@ unsigned long long calibrate(void)
bool approx(int i_one, int i_two)
{
- double one = i_one, one_bump = one * 0.01;
- double two = i_two, two_bump = two * 0.01;
+ /*
+ * This continues to be a noisy test. Instead of a 1% comparison
+ * go with 10%.
+ */
+ double one = i_one, one_bump = one * 0.1;
+ double two = i_two, two_bump = two * 0.1;
one_bump = one + MAX(one_bump, 2.0);
two_bump = two + MAX(two_bump, 2.0);
@@ -119,6 +125,32 @@ long compare(const char *name_one, const char *name_eval, const char *name_two,
return good ? 0 : 1;
}
+/* Pin to a single CPU so the benchmark won't bounce around the system. */
+void affinity(void)
+{
+ long cpu;
+ ulong ncores = sysconf(_SC_NPROCESSORS_CONF);
+ cpu_set_t *setp = CPU_ALLOC(ncores);
+ ulong setsz = CPU_ALLOC_SIZE(ncores);
+
+ /*
+ * Totally unscientific way to avoid CPUs that might be busier:
+ * choose the highest CPU instead of the lowest.
+ */
+ for (cpu = ncores - 1; cpu >= 0; cpu--) {
+ CPU_ZERO_S(setsz, setp);
+ CPU_SET_S(cpu, setsz, setp);
+ if (sched_setaffinity(getpid(), setsz, setp) == -1)
+ continue;
+ printf("Pinned to CPU %lu of %lu\n", cpu + 1, ncores);
+ goto out;
+ }
+ fprintf(stderr, "Could not set CPU affinity -- calibration may not work well");
+
+out:
+ CPU_FREE(setp);
+}
+
int main(int argc, char *argv[])
{
struct sock_filter bitmap_filter[] = {
@@ -153,6 +185,8 @@ int main(int argc, char *argv[])
system("grep -H . /proc/sys/net/core/bpf_jit_enable");
system("grep -H . /proc/sys/net/core/bpf_jit_harden");
+ affinity();
+
if (argc > 1)
samples = strtoull(argv[1], NULL, 0);
else
--
2.34.1
I have been steadily working on this, but struggled to find a seamlessly
integrated way to implement a tty frontend until Guilherme pointed out to
me that multi-backend support and the tty frontend are actually two
separate entities.
This submission presents the second iteration of my efforts, listing the
notable changes from v1:
1. pstore.backend no longer acts as "registered backend", but "backends
eligible for registration".
2. drop subdir since it will break user space
3. drop tty frontend since I haven't yet devised a satisfactory
implementation strategy
A heartfelt thank you to Kees and Guilherme for your suggestions.
I firmly believe that a tty frontend is crucial for kdump debugging,
and I am still dedicating effort to developing one. I hope that in the
future I can accomplish it with a deeper comprehension of the tty driver :)
Yuanhe Shu (3):
pstore: add multi-backend support
Documentation: adjust pstore backend related document
tools/testing: adjust pstore backend related selftest
Documentation/ABI/testing/pstore | 8 +-
.../admin-guide/kernel-parameters.txt | 4 +-
fs/pstore/ftrace.c | 29 ++-
fs/pstore/inode.c | 19 +-
fs/pstore/internal.h | 4 +-
fs/pstore/platform.c | 225 ++++++++++++------
fs/pstore/pmsg.c | 24 +-
include/linux/pstore.h | 29 +++
tools/testing/selftests/pstore/common_tests | 8 +-
.../selftests/pstore/pstore_post_reboot_tests | 65 ++---
tools/testing/selftests/pstore/pstore_tests | 2 +-
11 files changed, 293 insertions(+), 124 deletions(-)
--
2.39.3
From: Willem de Bruijn <willemb(a)google.com>
This test is time sensitive. It may fail on virtual machines and for
debug builds.
Continue to run in these environments to get code coverage. But
optionally suppress failure for timing errors (only). This is
controlled with environment variable KSFT_MACHINE_SLOW.
The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
as previously discussed. Making so_txtime.c return KSFT_XFAIL, and
then making so_txtime.sh distinguish runs that pass with that value
from KSFT_FAIL and pass it on, added a bunch of (fragile bash)
boilerplate, while the result is interpreted the same as KSFT_PASS anyway.
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/so_txtime.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 2672ac0b6d1f..8457b7ccbc09 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -134,8 +134,11 @@ static void do_recv_one(int fdr, struct timed_send *ts)
if (rbuf[0] != ts->data)
error(1, 0, "payload mismatch. expected %c", ts->data);
- if (llabs(tstop - texpect) > cfg_variance_us)
- error(1, 0, "exceeds variance (%d us)", cfg_variance_us);
+ if (llabs(tstop - texpect) > cfg_variance_us) {
+ fprintf(stderr, "exceeds variance (%d us)\n", cfg_variance_us);
+ if (!getenv("KSFT_MACHINE_SLOW"))
+ exit(1);
+ }
}
static void do_recv_verify_empty(int fdr)
--
2.43.0.429.g432eaa2c6b-goog
Selftests here check not only that connect()/accept() for
TCP-AO/TCP-MD5/non-signed-TCP combinations do/don't establish
connections, but also counters: those are per-AO-key, per-socket and
per-netns.
The counters are checked on the server's side, as the server listener
has TCP-AO/TCP-MD5/no keys for different peers. All tests run in
the same namespaces with the same veth pair, created in test_init().
After close() on both client and server, the sides go through
the regular FIN/ACK + FIN/ACK sequence, which proceeds in the background.
If the selftest has already started a new testing scenario and reads the
per-netns counters, it may fail in the end if it doesn't expect
the TCPAOGood per-netns counters to go up during the test.
Let's just kill both TCP-AO sides - that will avoid any asynchronous
background TCP-AO segments going to either side.
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/all/20240201132153.4d68f45e@kernel.org/T/#u
Fixes: 6f0c472a6815 ("selftests/net: Add TCP-AO + TCP-MD5 + no sign listen socket tests")
Signed-off-by: Dmitry Safonov <dima(a)arista.com>
---
tools/testing/selftests/net/tcp_ao/unsigned-md5.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
index c5b568cd7d90..6b59a652159f 100644
--- a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
+++ b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
@@ -110,9 +110,9 @@ static void try_accept(const char *tst_name, unsigned int port,
test_tcp_ao_counters_cmp(tst_name, &ao_cnt1, &ao_cnt2, cnt_expected);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
if (sk > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void server_add_routes(void)
@@ -302,10 +302,10 @@ static void try_connect(const char *tst_name, unsigned int port,
test_ok("%s: connected", tst_name);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
#define PREINSTALL_MD5_FIRST BIT(0)
@@ -486,10 +486,10 @@ static void try_to_add(const char *tst_name, unsigned int port,
}
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void client_add_ip(union tcp_addr *client, const char *ip)
---
base-commit: 021533194476035883300d60fbb3136426ac8ea5
change-id: 20240202-unsigned-md5-netns-counters-35134409362a
Best regards,
--
Dmitry Safonov <dima(a)arista.com>
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT"), but there is no selftest that validates whether this feature
works correctly.
The selftest needs to verify whether writing non-contiguous CBMs to the
schemata file behaves as expected, given the reported information about
non-contiguous CBM support.
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. It also implements functionality similar to the
bash script included in the cover letter of the original non-contiguous
CBMs in Intel CAT series [3].
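Roughly, the core check boils down to something like the sketch below
(assuming resctrl is mounted at /sys/fs/resctrl and using an example L3 mask;
the real test derives its masks and the support information from the system
at runtime):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *noncont = "L3:0=f00f\n";    /* mask with a hole in it */
        int fd = open("/sys/fs/resctrl/schemata", O_WRONLY);
        ssize_t ret;

        if (fd < 0) {
                perror("open schemata");
                return 1;
        }
        ret = write(fd, noncont, strlen(noncont));
        close(fd);

        /* With non-contiguous CBM support the write should succeed;
         * without it the kernel is expected to reject the mask.
         */
        printf("non-contiguous CBM write %s\n",
               ret < 0 ? "rejected" : "accepted");
        return 0;
}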
Changelog v4:
- Changes to error failure return values in non-contiguous test.
- Some minor text refactoring without functional changes.
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 2 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 10 +-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 94 ++++++++++++++++---
7 files changed, 192 insertions(+), 24 deletions(-)
--
2.43.0
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space which can therefore
be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
In addition, mmap() has two related changes.
The PROT_SEAL bit in the prot field of mmap(): when present, it marks
the map as sealed since creation.
The MAP_SEALABLE bit in the flags field of mmap(): when present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing will expect their behavior
unchanged. For those that need sealing support, opt-in by adding
MAP_SEALABLE in mmap().
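A minimal usage sketch of the flow described above (it only builds against
the uapi headers from this series, since PROT_SEAL, MAP_SEALABLE and
__NR_mseal are introduced here; treat it as illustrative, not as the
selftest):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        void *p;

        /* Opt in to sealing at map time. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Seal the range: mprotect()/munmap()/mremap() should now fail. */
        if (syscall(__NR_mseal, p, len, 0))
                perror("mseal");

        if (mprotect(p, len, PROT_READ | PROT_WRITE | PROT_EXEC))
                printf("mprotect blocked as expected: %s\n", strerror(errno));
        return 0;
}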
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 215 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 8 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 48 +
mm/madvise.c | 12 +
mm/mmap.c | 35 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 343 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++
33 files changed, 2756 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.429.g432eaa2c6b-goog
The hugetlb_madv_vs_map selftest was not part of the mm test suite since
we didn't have a fix for the problem it found.
Now that the problem is fixed (see the previous commit), let's
enable this selftest in the default test suite.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..50e2094ed761 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -253,6 +253,7 @@ nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
echo 1 > /proc/sys/vm/nr_hugepages
CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+CATEGORY="hugetlb" run_test ./hugetlb_madv_vs_map
# Restore the previous number of huge pages, since further tests rely on it
echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages
--
2.34.1
=== Description ===
This is a bpf-treewide change that annotates all kfuncs as such inside
.BTF_ids. This annotation eventually allows us to automatically generate
kfunc prototypes from bpftool.
We store this metadata inside a yet-unused flags field inside struct
btf_id_set8 (thanks Kumar!). pahole will be taught where to look.
More details about the full chain of events are available in commit 3's
description.
The accompanying pahole and bpftool changes can be viewed
here on these "frozen" branches [0][1].
[0]: https://github.com/danobi/pahole/tree/kfunc_btf-v3-mailed
[1]: https://github.com/danobi/linux/tree/kfunc_bpftool-mailed
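For orientation, the annotation ends up looking roughly like the sketch below
(the set and kfunc names here are made up; BTF_KFUNCS_START/END are the
macros introduced by this series and wrap the existing BTF_ID_FLAGS()
entries):

/* Made-up kfunc set for illustration; registered as before via
 * register_btf_kfunc_id_set().
 */
BTF_KFUNCS_START(example_kfunc_ids)
BTF_ID_FLAGS(func, bpf_example_kfunc, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(example_kfunc_ids)

static const struct btf_kfunc_id_set example_kfunc_set = {
        .owner = THIS_MODULE,
        .set   = &example_kfunc_ids,
};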
=== Changelog ===
Changes from v3:
* Rebase to bpf-next and add missing annotation on new kfunc
Changes from v2:
* Only WARN() for vmlinux kfuncs
Changes from v1:
* Move WARN_ON() up a call level
* Also return error when kfunc set is not properly tagged
* Use BTF_KFUNCS_START/END instead of flags
* Rename BTF_SET8_KFUNC to BTF_SET8_KFUNCS
Daniel Xu (3):
bpf: btf: Support flags for BTF_SET8 sets
bpf: btf: Add BTF_KFUNCS_START/END macro pair
bpf: treewide: Annotate BPF kfuncs in BTF
Documentation/bpf/kfuncs.rst | 8 +++----
drivers/hid/bpf/hid_bpf_dispatch.c | 8 +++----
fs/verity/measure.c | 4 ++--
include/linux/btf_ids.h | 21 +++++++++++++++----
kernel/bpf/btf.c | 8 +++++++
kernel/bpf/cpumask.c | 4 ++--
kernel/bpf/helpers.c | 8 +++----
kernel/bpf/map_iter.c | 4 ++--
kernel/cgroup/rstat.c | 4 ++--
kernel/trace/bpf_trace.c | 8 +++----
net/bpf/test_run.c | 8 +++----
net/core/filter.c | 20 +++++++++---------
net/core/xdp.c | 4 ++--
net/ipv4/bpf_tcp_ca.c | 4 ++--
net/ipv4/fou_bpf.c | 4 ++--
net/ipv4/tcp_bbr.c | 4 ++--
net/ipv4/tcp_cubic.c | 4 ++--
net/ipv4/tcp_dctcp.c | 4 ++--
net/netfilter/nf_conntrack_bpf.c | 4 ++--
net/netfilter/nf_nat_bpf.c | 4 ++--
net/xfrm/xfrm_interface_bpf.c | 4 ++--
net/xfrm/xfrm_state_bpf.c | 4 ++--
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 8 +++----
23 files changed, 87 insertions(+), 66 deletions(-)
--
2.42.1
When executing dirty_log_test on some aarch64 machines, it sometimes
triggers the ASSERT:
==== Test Assertion Failure ====
dirty_log_test.c:384: dirty_ring_vcpu_ring_full
pid=14854 tid=14854 errno=22 - Invalid argument
1 0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
2 0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
3 (inlined by) run_test at dirty_log_test.c:802
4 0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
5 0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
6 0x0000ffff9be173c7: ?? ??:0
7 0x0000ffff9be1749f: ?? ??:0
8 0x000000000040206f: _start at ??:?
Didn't continue vcpu even without ring full
The dirty_log_test fails when executing the dirty-ring test because
sem_vcpu_cont and sem_vcpu_stop have non-zero values when the
dirty_ring_collect_dirty_pages() function is executed. When those two
sem_t variables are non-zero, the dirty_ring_wait_vcpu() at the
beginning of dirty_ring_collect_dirty_pages() will not wait for the
vcpu to stop, but continue to execute the following code. In this case,
if dirty_ring_vcpu_ring_full is true before the vcpu stops, and
dirty_ring_collect_dirty_pages() has passed the check for
dirty_ring_vcpu_ring_full but hasn't yet executed the check for
continued_vcpu, the vcpu stops and sets dirty_ring_vcpu_ring_full to
false. Then dirty_ring_collect_dirty_pages() will trigger the ASSERT.
Why can sem_vcpu_cont and sem_vcpu_stop be non-zero? It's because
dirty_ring_before_vcpu_join() executes sem_post(&sem_vcpu_cont)
at the end of each dirty-ring test. That can cause two cases:
1. sem_vcpu_cont becomes non-zero. When we set host_quit to true,
the vcpu_worker directly sees that host_quit is true and quits. So
the log_mode_before_vcpu_join() function sets sem_vcpu_cont
to 1, and since the vcpu_worker has quit, nothing consumes it.
2. sem_vcpu_stop becomes non-zero. When we set host_quit to true,
the vcpu_worker has already entered guest state; the next time it exits
from guest state, it sets sem_vcpu_stop to 1 and only then sees
host_quit, so nothing consumes sem_vcpu_stop.
As more and more dirty-ring tests are executed, sem_vcpu_cont and
sem_vcpu_stop can grow larger and larger, which makes many code paths
skip waiting on the sem_t values and finally causes the problem.
To fix this problem, we can wait a while before setting host_quit to
true, which gives the vcpu time to enter guest state, so it will
exit again. Then we can wait for the vcpu to exit and let it continue
again, so the vcpu will see host_quit. Thus sem_vcpu_cont and
sem_vcpu_stop will both be zero when the test finishes.
Signed-off-by: Shaoqin Huang <shahuang(a)redhat.com>
---
v2->v3:
- Rebase to v6.8-rc2.
- Use TEST_ASSERT().
v1->v2:
- Fix the real logic bug, not just fresh the context.
v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
v2: https://lore.kernel.org/all/20231117052210.26396-1-shahuang@redhat.com/
tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6cbecf499767..dd2d8be390a5 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
static void dirty_ring_before_vcpu_join(void)
{
- /* Kick another round of vcpu just to make sure it will quit */
+ /* Wait vcpu exit, and let it continue to see the host_quit. */
+ dirty_ring_wait_vcpu();
sem_post(&sem_vcpu_cont);
}
@@ -719,6 +720,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
unsigned long *bmap;
uint32_t ring_buf_idx = 0;
+ int sem_val;
if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
return;
}
+ sem_getvalue(&sem_vcpu_stop, &sem_val);
+ assert(sem_val == 0);
+ sem_getvalue(&sem_vcpu_cont, &sem_val);
+ assert(sem_val == 0);
+
/*
* We reserve page table for 2 times of extra dirty mem which
* will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, iteration);
}
+ /*
+ *
+ * Before we set the host_quit, let the vcpu has time to run, to make
+ * sure we consume the sem_vcpu_stop and the vcpu consume the
+ * sem_vcpu_cont, to keep the semaphore balance.
+ */
+ usleep(p->interval * 1000);
/* Tell the vcpu thread to quit */
host_quit = true;
log_mode_before_vcpu_join();
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
--
2.40.1
If HUGETLBFS is not enabled then the default_huge_page_size function will
return 0 and cause a divide by 0 error. Add a check to see if the huge page
size is 0 and skip the hugetlb tests if it is.
Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index cce90a10515a..2b9f8cc52639 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -1517,6 +1517,12 @@ int main(int argc, char *argv[])
continue;
uffd_test_start("%s on %s", test->name, mem_type->name);
+ if ((mem_type->mem_flag == MEM_HUGETLB ||
+ mem_type->mem_flag == MEM_HUGETLB_PRIVATE) &&
+ (default_huge_page_size() == 0)) {
+ uffd_test_skip("huge page size is 0, feature missing?");
+ continue;
+ }
if (!uffd_feature_supported(test)) {
uffd_test_skip("feature missing");
continue;
--
2.43.0.594.gd9cf4e227d-goog
In very slow environments, most big TCP cases, including
segmentation and reassembly of big TCP packets, have a good
chance to fail: by default the TCP client uses a write size
well below 64K. If the host is slow enough, autocorking is
unable to build real big TCP packets.
Address the issue by using much larger write operations.
Note that it is hard to observe the issue without an extremely
slow and/or overloaded environment; reduce the TCP transfer
time to allow for much easier/faster reproducibility.
Fixes: 6bb382bcf742 ("selftests: add a selftest for big tcp")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/big_tcp.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/big_tcp.sh b/tools/testing/selftests/net/big_tcp.sh
index cde9a91c4797..2db9d15cd45f 100755
--- a/tools/testing/selftests/net/big_tcp.sh
+++ b/tools/testing/selftests/net/big_tcp.sh
@@ -122,7 +122,9 @@ do_netperf() {
local netns=$1
[ "$NF" = "6" ] && serip=$SERVER_IP6
- ip net exec $netns netperf -$NF -t TCP_STREAM -H $serip 2>&1 >/dev/null
+
+ # use large write to be sure to generate big tcp packets
+ ip net exec $netns netperf -$NF -t TCP_STREAM -l 1 -H $serip -- -m 262144 2>&1 >/dev/null
}
do_test() {
--
2.43.0
On Mon, Nov 27, 2023 at 11:49:16AM +0000, Felix Huettner wrote:
> conntrack zones are heavily used by tools like openvswitch to run
> multiple virtual "routers" on a single machine. In this context each
> conntrack zone matches to a single router, thereby preventing
> overlapping IPs from becoming issues.
> In these systems it is common to operate on all conntrack entries of a
> given zone, e.g. to delete them when a router is deleted. Previously this
> required these tools to dump the full conntrack table and filter out the
> relevant entries in userspace potentially causing performance issues.
>
> To do this we reuse the existing CTA_ZONE attribute. This was previously
> parsed but not used during dump and flush requests. Now if CTA_ZONE is
> set we filter these operations based on the provided zone.
> However this means that users that previously passed CTA_ZONE will
> experience a difference in functionality.
>
> Alternatively CTA_FILTER could have been used for the same
> functionality. However it is not yet supported during flush requests and
> is only available when using AF_INET or AF_INET6.
For the record, this is applied to nf-next.
Paolo points out that ifconfig is legacy and we should not use it.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
.../drivers/net/netdevsim/udp_tunnel_nic.sh | 40 +++++++++----------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index f98435c502f6..384cfa3d38a6 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -270,7 +270,7 @@ for port in 0 1; do
echo 1 > $NSIM_DEV_SYS/new_port
fi
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="new NIC device created"
exp0=( 0 0 0 0 )
@@ -284,8 +284,8 @@ for port in 0 1; do
msg="VxLAN v4 devices go down"
exp0=( 0 0 0 0 )
- ifconfig vxlan1 down
- ifconfig vxlan0 down
+ ip link set dev vxlan1 down
+ ip link set dev vxlan0 down
check_tables
msg="VxLAN v6 devices"
@@ -293,7 +293,7 @@ for port in 0 1; do
new_vxlan vxlanA 4789 $NSIM_NETDEV 6
for ifc in vxlan0 vxlan1; do
- ifconfig $ifc up
+ ip link set dev $ifc up
done
new_vxlan vxlanB 4789 $NSIM_NETDEV 6
@@ -307,14 +307,14 @@ for port in 0 1; do
new_geneve gnv0 6081
msg="NIC device goes down"
- ifconfig $NSIM_NETDEV down
+ ip link set dev $NSIM_NETDEV down
if [ $port -eq 1 ]; then
exp0=( 0 0 0 0 )
exp1=( 0 0 0 0 )
fi
check_tables
msg="NIC device goes up again"
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
exp0=( `mke 4789 1` `mke 4790 1` 0 0 )
exp1=( `mke 6081 2` 0 0 0 )
check_tables
@@ -433,7 +433,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
overflow_table1 "overflow NIC table"
@@ -491,7 +491,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
overflow_table1 "overflow NIC table"
@@ -548,7 +548,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "destroy NIC"
overflow_table1 "destroy NIC"
@@ -578,7 +578,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
new_vxlan vxlanA0 10000 $NSIM_NETDEV 6
@@ -639,7 +639,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
echo 110 > $NSIM_DEV_DFS/ports/$port/udp_ports_inject_error
@@ -695,7 +695,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
exp0=( `mke 10000 1` 0 0 0 )
@@ -755,7 +755,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
exp0=( `mke 10000 1` 0 0 0 )
@@ -768,7 +768,7 @@ for port in 0 1; do
check_tables
msg="NIC device goes down"
- ifconfig $NSIM_NETDEV down
+ ip link set dev $NSIM_NETDEV down
if [ $port -eq 1 ]; then
exp0=( 0 0 0 0 )
exp1=( 0 0 0 0 )
@@ -779,7 +779,7 @@ for port in 0 1; do
check_tables
msg="NIC device goes up again"
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
exp0=( `mke 10000 1` 0 0 0 )
check_tables
@@ -827,12 +827,12 @@ new_vxlan vxlan1 4789 $NSIM_NETDEV2
msg="VxLAN v4 devices go down"
exp0=( 0 0 0 0 )
-ifconfig vxlan1 down
-ifconfig vxlan0 down
+ip link set dev vxlan1 down
+ip link set dev vxlan0 down
check_tables
for ifc in vxlan0 vxlan1; do
- ifconfig $ifc up
+ ip link set dev $ifc up
done
msg="VxLAN v6 device"
@@ -844,11 +844,11 @@ exp1=( `mke 6081 2` 0 0 0 )
new_geneve gnv0 6081
msg="NIC device goes down"
-ifconfig $NSIM_NETDEV down
+ip link set dev $NSIM_NETDEV down
check_tables
msg="NIC device goes up again"
-ifconfig $NSIM_NETDEV up
+ip link set dev $NSIM_NETDEV up
check_tables
for i in `seq 2`; do
--
2.43.0
The kernel has recently added support for shadow stacks, currently
x86 only using its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively); I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions, which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and makes it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). Unlike normal stacks, only the shadow stack
size is specified; similar issues to those that led to the creation of
map_shadow_stack() apply.
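For illustration, creating a thread with an explicitly sized shadow stack
via the extended clone3() would look roughly like the sketch below; the
shadow_stack_size field name is my reading of this series and, together
with the chosen size, should be treated as an assumption rather than
final ABI:
#define _GNU_SOURCE
#include <linux/sched.h>        /* struct clone_args, with the new field */
#include <linux/types.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static pid_t clone3_with_shadow_stack(void *stack, size_t stack_size)
{
        struct clone_args args;
        memset(&args, 0, sizeof(args));
        args.flags             = CLONE_VM;
        args.stack             = (__u64)(uintptr_t)stack;
        args.stack_size        = stack_size;
        /* assumed new field: size of the shadow stack for the child;
         * leaving it at 0 keeps the current implicit allocation */
        args.shadow_stack_size = 8 * 4096;
        return syscall(__NR_clone3, &args, sizeof(args));
}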
Please note that the x86 portions of this code are build tested only; I
don't appear to have a system that can run CET available to me, so I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series, and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (5):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
kselftest/clone3: Test shadow stack support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 59 +++++--
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 1 +
include/uapi/linux/sched.h | 4 +
kernel/fork.c | 22 ++-
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 200 +++++++++++++++++-----
tools/testing/selftests/clone3/clone3_selftests.h | 7 +
12 files changed, 250 insertions(+), 67 deletions(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Another small bunch of fixes, addressing issues outlined by the
netdev CI.
The first 2 patches are just rebased.
The following 2 are new fixes, for even more problems that surfaced
meanwhile.
Paolo Abeni (4):
selftests: net: cut more slack for gro fwd tests.
selftests: net: fix setup_ns usage in rtnetlink.sh
selftests: net: fix tcp listener handling in pmtu.sh
selftests: net: avoid just another constant wait
tools/testing/selftests/net/pmtu.sh | 23 ++++++++++++++-----
tools/testing/selftests/net/rtnetlink.sh | 6 ++---
tools/testing/selftests/net/udpgro_fwd.sh | 14 +++++++++--
tools/testing/selftests/net/udpgso_bench_rx.c | 2 +-
4 files changed, 32 insertions(+), 13 deletions(-)
--
2.43.0
This patch series introduces a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.
== Background ==
The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, has historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.
The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.
This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:
https://www.youtube.com/watch?v=NjU4nyWyhU8
== Performance ==
The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:
Game Upstream ntsync improvement
===========================================================================
Anger Foot 69 99 43%
Call of Juarez 99.8 224.1 125%
Dirt 3 110.6 860.7 678%
Forza Horizon 5 108 160 48%
Lara Croft: Temple of Osiris 141 326 131%
Metro 2033 164.4 199.2 21%
Resident Evil 2 26 77 196%
The Crew 26 51 96%
Tiny Tina's Wonderlands 130 360 177%
Total War Saga: Troy 109 146 34%
===========================================================================
== Patches ==
The semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 29/29 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.
The patches making use of this driver in Wine can be retrieved or browsed here:
https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5
== Implementation ==
Some aspects of the implementation may deserve particular comment:
* In the interest of performance, each object is governed only by a single
spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
objects be changed as a single atomic operation. In order to achieve this, we
first take a device-wide lock ("wait_all_lock") any time we are going to lock
more than one object at a time.
The maximum number of objects that can be used in a vectored wait, and
therefore the maximum that can be locked simultaneously, is 64. This number is
NT's own limit.
The acquisition of multiple spinlocks will degrade performance. This is a
conscious choice, however. Wait-for-all is known to be a very rare operation
in practice, especially with counts that approach the maximum, and it is the
intent of the ntsync driver to optimize wait-for-any at the expense of
wait-for-all as much as possible.
* NT mutexes are tied to their threads on an OS level, and the kernel includes
builtin support for "robust" mutexes. In order to keep the ntsync driver
self-contained and avoid touching more code than necessary, it does not hook
into task exit nor use pids.
Instead, the user space emulator is expected to manage thread IDs and pass
them as an argument to any relevant functions; this is the "owner" field of
ntsync_wait_args and ntsync_mutex_args.
When the emulator detects that a thread dies, it should therefore call
NTSYNC_IOC_KILL_OWNER, which will mark mutexes owned by that thread (if any)
as abandoned (a brief sketch of this flow follows after this list).
* This implementation uses a misc device mostly because it seemed like the
simplest and least obtrusive option.
Besides simplicity of implementation, the only particularly interesting
advantage is the ability to create an arbitrary number of "contexts"
(corresponding to Windows virtual machines) which are self-contained and
shareable across multiple processes; this maps nicely to file descriptions
(i.e. struct file). This is not impossible with syscalls of course but would
require an extra argument.
On the other hand, there is no reason to forbid using ntsync by default from
user-mode processes, and (as far as I understand) to do so with a char device
requires explicit configuration by e.g. udev or init. Since this is done with
e.g. fuse, I assume this is the model to follow, but I may have chosen
something deprecated.
* ntsync is module-capable mostly because there was nothing preventing it, and
because it aided development. It is not a hard requirement, though.
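To make the owner handling described above concrete, the emulator side is
expected to look roughly like the sketch below. The field names, the
assumption that the create ioctl returns an fd for the new object, and the
use of NTSYNC_IOC_MUTEX_KILL (per the patch subjects) are best guesses from
this cover letter; the authoritative definitions are in the new
include/uapi/linux/ntsync.h:
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ntsync.h>       /* added by this series */
/* tid is the emulator-managed Windows thread ID used as the owner. */
static int create_owned_mutex(int dev, __u32 tid)
{
        struct ntsync_mutex_args args = {
                .owner = tid,   /* initial owner */
                .count = 1,     /* initial recursion count */
        };
        /* assumed to return an fd for the new mutex object */
        return ioctl(dev, NTSYNC_IOC_CREATE_MUTEX, &args);
}
/* Called when the emulator detects that thread 'tid' has died. */
static void abandon_mutex(int mutex_fd, __u32 tid)
{
        ioctl(mutex_fd, NTSYNC_IOC_MUTEX_KILL, &tid);
}
/* The device itself: int dev = open("/dev/ntsync", O_RDONLY | O_CLOEXEC); */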
== Previous versions ==
Changes in v2:
* Send the whole series instead of just the first few patches.
* Try to add more description to each patch, as a short documentation of the
functions to be implemented. A more complete documentation of all aspects of
the driver is provided in the contents of the last patch.
* Objects are now files rather than indices into a table. This prevents a
process from changing the state of an object which it should not have access
to. Suggested by Andy Lutomirski.
* Because the device no longer inherently has a table of all objects, marking a
thread's owned mutexes as abandoned is now done through an ioctl on the mutex.
* Change the names of a couple ioctls to be a bit less odd (PUT_SEM -> SEM_POST,
PUT_MUTEX -> MUTEX_UNLOCK), and to reflect that they are ioctls on an object
rather than on the device.
* Pass the timeout for wait functions as a bare u64 (in ns), per Arnd Bergmann,
with U64_MAX used to indicate no timeout. I originally indicated that I would
change the timeout to be relative, but on reflection ended up keeping it as
absolute, as this results in the least number of calls to get the current time
(i.e. one).
* Use compat_ptr_ioctl(), per Arnd Bergmann.
* Remove the fixed minor number and module alias, per Greg Kroah-Hartman.
* Allocate the fds array on stack in setup_wait(). This array takes up 260
bytes.
* Link to v1: https://lore.kernel.org/lkml/20240124004028.16826-1-zfigura@codeweavers.com/
Elizabeth Figura (29):
ntsync: Introduce the ntsync driver and character device.
ntsync: Introduce NTSYNC_IOC_CREATE_SEM.
ntsync: Introduce NTSYNC_IOC_SEM_POST.
ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
ntsync: Introduce NTSYNC_IOC_EVENT_SET.
ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
ntsync: Introduce NTSYNC_IOC_SEM_READ.
ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
ntsync: Introduce NTSYNC_IOC_EVENT_READ.
ntsync: Introduce alertable waits.
selftests: ntsync: Add some tests for semaphore state.
selftests: ntsync: Add some tests for mutex state.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for manual-reset event state.
selftests: ntsync: Add some tests for auto-reset event state.
selftests: ntsync: Add some tests for wakeup signaling with events.
selftests: ntsync: Add tests for alertable waits.
selftests: ntsync: Add some tests for wakeup signaling via alerts.
maintainers: Add an entry for ntsync.
docs: ntsync: Add documentation for the ntsync uAPI.
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 2 +
Documentation/userspace-api/ntsync.rst | 390 +++++
MAINTAINERS | 9 +
drivers/misc/Kconfig | 9 +
drivers/misc/Makefile | 1 +
drivers/misc/ntsync.c | 1132 ++++++++++++++
include/uapi/linux/ntsync.h | 58 +
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 8 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 1300 +++++++++++++++++
12 files changed, 2912 insertions(+)
create mode 100644 Documentation/userspace-api/ntsync.rst
create mode 100644 drivers/misc/ntsync.c
create mode 100644 include/uapi/linux/ntsync.h
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c
--
2.43.0
Changelog:
v2:
* Make the swapin test also check for zswap usage (patch 3)
(suggested by Yosry Ahmed)
* Some test simplifications/cleanups (patch 3)
(suggested by Yosry Ahmed).
Fix a broken zswap kselftest due to cgroup zswap writeback counter
renaming, and add 2 zswap kselftests, one to cover the (z)swapin case,
and another to check that no zswapping happens when the cgroup limit is
0.
Also, add the zswap kselftest file to zswap maintainer entry so that
get_maintainers script can find zswap maintainers.
Nhat Pham (3):
selftests: zswap: add zswap selftest file to zswap maintainer entry
selftests: fix the zswap invasive shrink test
selftests: add zswapin and no zswap tests
MAINTAINERS | 1 +
tools/testing/selftests/cgroup/test_zswap.c | 99 ++++++++++++++++++++-
2 files changed, 99 insertions(+), 1 deletion(-)
base-commit: 3a92c45e4ba694381c46994f3fde0d8544a2088b
--
2.39.3
From: Maxim Mikityanskiy <maxim(a)isovalent.com>
The goal of this series is to extend the verifier's capabilities of
tracking scalars when they are spilled to stack, especially when the
spill or fill is narrowing. It also contains a fix by Eduard for
infinite loop detection and a state pruning optimization by Eduard that
compensates for a verification complexity regression introduced by
tracking unbounded scalars. These improvements reduce the surface of
false rejections that I saw while working on the Cilium codebase.
Patches 1-9 of the original series were previously applied in v2.
Patches 1-2 (Maxim): Support the case when boundary checks are first
performed after the register was spilled to the stack.
Patches 3-4 (Maxim): Support narrowing fills.
Patches 5-6 (Eduard): Optimization for state pruning in stacksafe() to
mitigate the verification complexity regression.
veristat -e file,prog,states -f '!states_diff<50' -f '!states_pct<10' -f '!states_a<10' -f '!states_b<10' -C ...
* Without patch 5:
File Program States (A) States (B) States (DIFF)
-------------------- -------- ---------- ---------- ----------------
pyperf100.bpf.o on_event 4878 6528 +1650 (+33.83%)
pyperf180.bpf.o on_event 6936 11032 +4096 (+59.05%)
pyperf600.bpf.o on_event 22271 39455 +17184 (+77.16%)
pyperf600_iter.bpf.o on_event 400 490 +90 (+22.50%)
strobemeta.bpf.o on_event 4895 14028 +9133 (+186.58%)
* With patch 5:
File Program States (A) States (B) States (DIFF)
----------------------- ------------- ---------- ---------- ---------------
bpf_xdp.o tail_lb_ipv4 2770 2224 -546 (-19.71%)
pyperf100.bpf.o on_event 4878 5848 +970 (+19.89%)
pyperf180.bpf.o on_event 6936 8868 +1932 (+27.85%)
pyperf600.bpf.o on_event 22271 29656 +7385 (+33.16%)
pyperf600_iter.bpf.o on_event 400 450 +50 (+12.50%)
xdp_synproxy_kern.bpf.o syncookie_tc 280 226 -54 (-19.29%)
xdp_synproxy_kern.bpf.o syncookie_xdp 302 228 -74 (-24.50%)
v2 changes:
Fixed comments in patch 1, moved endianness checks to header files in
patch 12 where possible, added Eduard's ACKs.
v3 changes:
Maxim: Removed __is_scalar_unbounded altogether, addressed Andrii's
comments.
Eduard: Patch #5 (#14 in v2) changed significantly:
- Logical changes:
- Handling of a STACK_{MISC,ZERO} mix turned out to be incorrect:
a mix of MISC and ZERO in the old state is not equivalent to e.g.
just MISC in the current state, because the verifier could have deduced
zero scalars from ZERO slots in the old state for some loads.
- There is no reason to limit the change only to cases when the
old or current stack is a spill of an unbounded scalar;
it is valid to compare any 64-bit scalar spill with a fake
register impersonating MISC.
- The STACK_ZERO vs spilled-zero case was dropped:
after recent changes to zero handling by Andrii and Yonghong
it is hard (impossible?) to conjure all ZERO slots for a spi.
=> the case does not make any difference in veristat results.
- Use global static variable for unbound_reg (Andrii)
- Code shuffling to remove duplication in stacksafe() (Andrii)
Eduard Zingerman (2):
bpf: handle scalar spill vs all MISC in stacksafe()
selftests/bpf: states pruning checks for scalar vs STACK_MISC
Maxim Mikityanskiy (4):
bpf: Track spilled unbounded scalars
selftests/bpf: Test tracking spilled unbounded scalars
bpf: Preserve boundaries and track scalars on narrowing fill
selftests/bpf: Add test cases for narrowing fill
include/linux/bpf_verifier.h | 9 +
kernel/bpf/verifier.c | 103 ++++--
.../selftests/bpf/progs/verifier_spill_fill.c | 324 +++++++++++++++++-
3 files changed, 404 insertions(+), 32 deletions(-)
--
2.43.0
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly.
The selftest needs to verify if writing non-contiguous CBMs to the
schemata file behaves as expected in comparison to the information about
non-contiguous CBMs support.
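In essence, the new test automates a check along the lines of the sketch
below (the resctrl mount point, the L3 resource and domain 0 are
assumptions made for the example):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
        char buf[4] = {};
        int fd, ret, sparse;
        /* does the kernel report non-contiguous (sparse) CBM support? */
        fd = open("/sys/fs/resctrl/info/L3/sparse_masks", O_RDONLY);
        if (fd < 0 || read(fd, buf, sizeof(buf) - 1) <= 0)
                return 1;
        sparse = buf[0] == '1';
        close(fd);
        /* try to write a non-contiguous CBM to the default group */
        fd = open("/sys/fs/resctrl/schemata", O_WRONLY);
        if (fd < 0)
                return 1;
        ret = write(fd, "L3:0=f00f\n", strlen("L3:0=f00f\n"));
        close(fd);
        /* the write is expected to succeed iff sparse_masks reports 1 */
        printf("%s\n", (ret > 0) == sparse ? "ok" : "mismatch");
        return 0;
}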
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements functionality
similar to that presented in the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 +++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 4 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 11 ++-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 98 ++++++++++++++++---
7 files changed, 199 insertions(+), 26 deletions(-)
--
2.43.0
Arch maintainers, please ack/review patches.
This is a resend of a series from Frank last year[1]. I worked in Rob's
review comments to unconditionally call unflatten_device_tree() and
fixup/audit calls to of_have_populated_dt() so that behavior doesn't
change.
I need this series so I can add DT based tests in the clk framework.
Either I can merge it through the clk tree once everyone is happy, or
Rob can merge it through the DT tree and provide some branch so I can
base clk patches on it.
Changes from Frank's series[1]:
* Add a DTB loaded kunit test
* Make of_have_populated_dt() return false if the DTB isn't from the
bootloader
* Architecture calls made unconditional so that a root node is always
made
Changes from v1 (https://lore.kernel.org/r/20240112200750.4062441-1-sboyd@kernel.org):
* x86 patch included
* arm64 knocks out initial dtb if acpi is in use
* keep Kconfig hidden but def_bool enabled otherwise
Frank Rowand (2):
of: Create of_root if no dtb provided by firmware
of: unittest: treat missing of_root as error instead of fixing up
Stephen Boyd (5):
arm64: Unconditionally call unflatten_device_tree()
um: Unconditionally call unflatten_device_tree()
x86/of: Unconditionally call unflatten_and_copy_device_tree()
of: Always unflatten in unflatten_and_copy_device_tree()
of: Add KUnit test to confirm DTB is loaded
arch/arm64/kernel/setup.c | 7 +++--
arch/um/kernel/dtb.c | 14 +++++-----
arch/x86/kernel/devicetree.c | 24 +++++++++--------
drivers/of/.kunitconfig | 3 +++
drivers/of/Kconfig | 11 +++++++-
drivers/of/Makefile | 4 ++-
drivers/of/empty_root.dts | 6 +++++
drivers/of/fdt.c | 52 +++++++++++++++++++++++++-----------
drivers/of/of_test.c | 48 +++++++++++++++++++++++++++++++++
drivers/of/platform.c | 3 ---
drivers/of/unittest.c | 16 +++--------
include/linux/of.h | 25 ++++++++++-------
12 files changed, 151 insertions(+), 62 deletions(-)
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/empty_root.dts
create mode 100644 drivers/of/of_test.c
Cc: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Frank Rowand <frowand.list(a)gmail.com>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Richard Weinberger <richard(a)nod.at>
Cc: Rob Herring <robh+dt(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <x86(a)kernel.org>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Saurabh Sengar <ssengar(a)linux.microsoft.com>
[1] https://lore.kernel.org/r/20230317053415.2254616-1-frowand.list@gmail.com
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/
https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
Add a test case for regression in openvswitch nat that was fixed by
commit e6345d2824a3 ("netfilter: nf_nat: fix action not being set for
all ct states").
Link: https://lore.kernel.org/netdev/20231221224311.130319-1-brad@faucet.nz/
Link: https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/410476.html
Suggested-by: Aaron Conole <aconole(a)redhat.com>
Signed-off-by: Brad Cowie <brad(a)faucet.nz>
---
.../selftests/net/openvswitch/openvswitch.sh | 62 +++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index f8499d4c87f3..87b80bee6df4 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -17,6 +17,7 @@ tests="
ct_connect_v4 ip4-ct-xon: Basic ipv4 tcp connection using ct
connect_v4 ip4-xon: Basic ipv4 ping between two NS
nat_connect_v4 ip4-nat-xon: Basic ipv4 tcp connection via NAT
+ nat_related_v4 ip4-nat-related: ICMP related matches work with SNAT
netlink_checks ovsnl: validate netlink attrs and settings
upcall_interfaces ovs: test the upcall interfaces
drop_reason drop: test drop reasons are emitted"
@@ -473,6 +474,67 @@ test_nat_connect_v4 () {
return 0
}
+# nat_related_v4 test
+# - client->server ip packets go via SNAT
+# - client solicits ICMP destination unreachable packet from server
+# - undo NAT for ICMP reply and test dst ip has been updated
+test_nat_related_v4 () {
+ which nc >/dev/null 2>/dev/null || return $ksft_skip
+
+ sbx_add "test_nat_related_v4" || return $?
+
+ ovs_add_dp "test_nat_related_v4" natrelated4 || return 1
+ info "create namespaces"
+ for ns in client server; do
+ ovs_add_netns_and_veths "test_nat_related_v4" "natrelated4" "$ns" \
+ "${ns:0:1}0" "${ns:0:1}1" || return 1
+ done
+
+ ip netns exec client ip addr add 172.31.110.10/24 dev c1
+ ip netns exec client ip link set c1 up
+ ip netns exec server ip addr add 172.31.110.20/24 dev s1
+ ip netns exec server ip link set s1 up
+
+ ip netns exec server ip route add 192.168.0.20/32 via 172.31.110.10
+
+ # Allow ARP
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "in_port(1),eth(),eth_type(0x0806),arp()" "2" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "in_port(2),eth(),eth_type(0x0806),arp()" "1" || return 1
+
+ # Allow IP traffic from client->server, rewrite source IP with SNAT to 192.168.0.20
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "ct_state(-trk),in_port(1),eth(),eth_type(0x0800),ipv4(dst=172.31.110.20)" \
+ "ct(commit,nat(src=192.168.0.20)),recirc(0x1)" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x1),ct_state(+trk-inv),in_port(1),eth(),eth_type(0x0800),ipv4()" \
+ "2" || return 1
+
+ # Allow related ICMP responses back from server and undo NAT to restore original IP
+ # Drop any ICMP related packets where dst ip hasn't been restored back to original IP
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "ct_state(-trk),in_port(2),eth(),eth_type(0x0800),ipv4()" \
+ "ct(commit,nat),recirc(0x2)" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x2),ct_state(+rel+trk),in_port(2),eth(),eth_type(0x0800),ipv4(src=172.31.110.20,dst=172.31.110.10,proto=1),icmp()" \
+ "1" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x2),ct_state(+rel+trk),in_port(2),eth(),eth_type(0x0800),ipv4(dst=192.168.0.20,proto=1),icmp()" \
+ "drop" || return 1
+
+ # Solicit destination unreachable response from server
+ ovs_sbx "test_nat_related_v4" ip netns exec client \
+ bash -c "echo a | nc -u -w 1 172.31.110.20 10000"
+
+ # Check to make sure no packets matched the drop rule with incorrect dst ip
+ python3 "$ovs_base/ovs-dpctl.py" dump-flows natrelated4 \
+ | grep "drop" | grep "packets:0" >/dev/null || return 1
+
+ info "done..."
+ return 0
+}
+
# netlink_validation
# - Create a dp
# - check no warning with "old version" simulation
--
2.34.1
When executing dirty_log_test on some aarch64 machines, it sometimes
triggers the ASSERT:
==== Test Assertion Failure ====
dirty_log_test.c:384: dirty_ring_vcpu_ring_full
pid=14854 tid=14854 errno=22 - Invalid argument
1 0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
2 0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
3 (inlined by) run_test at dirty_log_test.c:802
4 0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
5 0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
6 0x0000ffff9be173c7: ?? ??:0
7 0x0000ffff9be1749f: ?? ??:0
8 0x000000000040206f: _start at ??:?
Didn't continue vcpu even without ring full
The dirty_log_test fails when executing the dirty-ring test because
sem_vcpu_cont and sem_vcpu_stop have non-zero values when the
dirty_ring_collect_dirty_pages() function is executed. When those two
sem_t variables are non-zero, the dirty_ring_wait_vcpu() at the
beginning of dirty_ring_collect_dirty_pages() will not wait for the
vcpu to stop, but continue to execute the following code. In this case,
if dirty_ring_vcpu_ring_full is true before the vcpu stops, and
dirty_ring_collect_dirty_pages() has passed the check for
dirty_ring_vcpu_ring_full but hasn't yet executed the check for
continued_vcpu, the vcpu stops and sets dirty_ring_vcpu_ring_full to
false. Then dirty_ring_collect_dirty_pages() will trigger the ASSERT.
Why can sem_vcpu_cont and sem_vcpu_stop be non-zero? It's because
dirty_ring_before_vcpu_join() executes sem_post(&sem_vcpu_cont)
at the end of each dirty-ring test. That can cause two cases:
1. sem_vcpu_cont becomes non-zero. When we set host_quit to true,
the vcpu_worker directly sees that host_quit is true and quits. So
the log_mode_before_vcpu_join() function sets sem_vcpu_cont
to 1, and since the vcpu_worker has quit, nothing consumes it.
2. sem_vcpu_stop becomes non-zero. When we set host_quit to true,
the vcpu_worker has already entered guest state; the next time it exits
from guest state, it sets sem_vcpu_stop to 1 and only then sees
host_quit, so nothing consumes sem_vcpu_stop.
As more and more dirty-ring tests are executed, sem_vcpu_cont and
sem_vcpu_stop can grow larger and larger, which makes many code paths
skip waiting on the sem_t values and finally causes the problem.
To fix this problem, we can wait a while before setting host_quit to
true, which gives the vcpu time to enter guest state, so it will
exit again. Then we can wait for the vcpu to exit and let it continue
again, so the vcpu will see host_quit. Thus sem_vcpu_cont and
sem_vcpu_stop will both be zero when the test finishes.
Signed-off-by: Shaoqin Huang <shahuang(a)redhat.com>
---
v1->v2:
- Fix the real logic bug, not just fresh the context.
v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
---
tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..a6e0ff46a07c 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
static void dirty_ring_before_vcpu_join(void)
{
- /* Kick another round of vcpu just to make sure it will quit */
+ /* Wait vcpu exit, and let it continue to see the host_quit. */
+ dirty_ring_wait_vcpu();
sem_post(&sem_vcpu_cont);
}
@@ -719,6 +720,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
unsigned long *bmap;
uint32_t ring_buf_idx = 0;
+ int sem_val;
if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
return;
}
+ sem_getvalue(&sem_vcpu_stop, &sem_val);
+ assert(sem_val == 0);
+ sem_getvalue(&sem_vcpu_cont, &sem_val);
+ assert(sem_val == 0);
+
/*
* We reserve page table for 2 times of extra dirty mem which
* will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, iteration);
}
+ /*
+ *
+ * Before we set the host_quit, let the vcpu has time to run, to make
+ * sure we consume the sem_vcpu_stop and the vcpu consume the
+ * sem_vcpu_cont, to keep the semaphore balance.
+ */
+ usleep(p->interval * 1000);
/* Tell the vcpu thread to quit */
host_quit = true;
log_mode_before_vcpu_join();
--
2.40.1
This series of 9 patches fixes issues mostly identified by CIs not
managed by the MPTCP maintainers. Thank you Linaro (LKFT) and the netdev
maintainers (NIPA) for running our kunit and selftest tests!
For the first patch, it took a bit of time to identify the root cause.
Some MPTCP Join selftest subtests have been "flaky", mostly in slow
environments. It appears to be due to the use of a TCP-specific helper
on an MPTCP socket. A fix for kernels >= v5.15.
Patches 2 to 4 add missing kernel config to support NetFilter tables
needed for IPTables commands. These kconfigs are usually enabled in
default configurations, but apparently not for all architectures.
Patches 2 and 3 can be backported up to v5.11 and the 4th one up to
v5.19.
Patch 5 increases the time limit for MPTCP selftests. It appears that
many CIs execute tests in a VM without acceleration support, e.g. QEMU
without KVM. As a result, the tests take longer. Plus, there are more
and more tests. This patch modifies the timeout added in v5.18.
Patch 6 reduces the maximum rate and delay of the different links in
some Simult Flows selftest subtests. The goal is to let slow VMs reach
the maximum speed. The original rate was introduced in v5.11.
Patch 7 lets CIs change the prefix of the subtest titles, to be able
to run the same selftest multiple times with different parameters. With
different titles, tests will be considered different and will not override
previous results, as is the case with some CI envs. Subtests have been
introduced in v6.6.
Patch 8 and 9 make some MPTCP Join selftest subtests quicker by stopping
the transfer when the expected events have been seen. Patch 8 can be
backported up to v6.5.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (8):
selftests: mptcp: add missing kconfig for NF Filter
selftests: mptcp: add missing kconfig for NF Filter in v6
selftests: mptcp: add missing kconfig for NF Mangle
selftests: mptcp: increase timeout to 30 min
selftests: mptcp: decrease BW in simult flows
selftests: mptcp: allow changing subtests prefix
selftests: mptcp: join: stop transfer when check is done (part 1)
selftests: mptcp: join: stop transfer when check is done (part 2)
Paolo Abeni (1):
mptcp: fix data re-injection from stale subflow
net/mptcp/protocol.c | 3 ---
tools/testing/selftests/net/mptcp/config | 3 +++
tools/testing/selftests/net/mptcp/mptcp_join.sh | 27 +++++++++--------------
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 2 +-
tools/testing/selftests/net/mptcp/settings | 2 +-
tools/testing/selftests/net/mptcp/simult_flows.sh | 8 +++----
6 files changed, 20 insertions(+), 25 deletions(-)
---
base-commit: c9ec85153fea6873c52ed4f5055c87263f1b54f9
change-id: 20240131-upstream-net-20240131-mptcp-ci-issues-9d68b5601e74
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
mmap() respects the rlimit only for normal users. This test should be
run as a normal user, without root privileges. Also add back the sudo -u
nobody, as run_vmtests.sh is run as root most of the time. Skip the test
instead if sudo isn't present to lower the privileges.
Fixes: b6221771d468 ("selftests/mm: run_vmtests: remove sudo and conform to tap")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Please fold this patch in the Fixes patch if needed.
---
tools/testing/selftests/mm/on-fault-limit.c | 6 +++---
tools/testing/selftests/mm/run_vmtests.sh | 7 ++++++-
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/mm/on-fault-limit.c b/tools/testing/selftests/mm/on-fault-limit.c
index 0ea98ffab3589..431c1277d83a1 100644
--- a/tools/testing/selftests/mm/on-fault-limit.c
+++ b/tools/testing/selftests/mm/on-fault-limit.c
@@ -21,7 +21,7 @@ static void test_limit(void)
map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
- ksft_test_result(map == MAP_FAILED, "Failed mmap\n");
+ ksft_test_result(map == MAP_FAILED, "The map failed respecting mlock limits\n");
if (map != MAP_FAILED)
munmap(map, 2 * lims.rlim_max);
@@ -33,8 +33,8 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(1);
- if (getuid())
- ksft_test_result_skip("Require root privileges to run\n");
+ if (!getuid())
+ ksft_test_result_skip("The test must be run from a normal user\n");
else
test_limit();
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 55898d64e2ebf..edd73f871c79a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -303,7 +303,12 @@ echo "$nr_hugepgs" > /proc/sys/vm/nr_hugepages
CATEGORY="compaction" run_test ./compaction_test
-CATEGORY="mlock" run_test ./on-fault-limit
+if command -v sudo &> /dev/null;
+then
+ CATEGORY="mlock" run_test sudo -u nobody ./on-fault-limit
+else
+ echo "# SKIP ./on-fault-limit"
+fi
CATEGORY="mmap" run_test ./map_populate
--
2.42.0
mmap() respects the rlimit only for normal users. This test should be
run as a normal user, without root privileges.
Fixes: b6221771d468 ("selftests/mm: run_vmtests: remove sudo and conform to tap")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/on-fault-limit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/on-fault-limit.c b/tools/testing/selftests/mm/on-fault-limit.c
index 0ea98ffab3589..431c1277d83a1 100644
--- a/tools/testing/selftests/mm/on-fault-limit.c
+++ b/tools/testing/selftests/mm/on-fault-limit.c
@@ -21,7 +21,7 @@ static void test_limit(void)
map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
- ksft_test_result(map == MAP_FAILED, "Failed mmap\n");
+ ksft_test_result(map == MAP_FAILED, "The map failed respecting mlock limits\n");
if (map != MAP_FAILED)
munmap(map, 2 * lims.rlim_max);
@@ -33,8 +33,8 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(1);
- if (getuid())
- ksft_test_result_skip("Require root privileges to run\n");
+ if (!getuid())
+ ksft_test_result_skip("The test must be run from a normal user\n");
else
test_limit();
--
2.42.0
This series tries to address CI failures for the pmtu.sh tests. It
does _not_ attempt to enable all the currently skipped cases, to
avoid adding more entropy.
Tested with:
make -C tools/testing/selftests/ TARGETS=net install
vng --build --config tools/testing/selftests/net/config
vng --run . --user root -- \
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t net:pmtu.sh
Paolo Abeni (3):
selftests: net: add missing config for pmtu.sh tests
selftests: net: fix available tunnels detection
selftests: net: don't access /dev/stdout in pmtu.sh
tools/testing/selftests/net/config | 3 +++
tools/testing/selftests/net/pmtu.sh | 18 +++++++++---------
2 files changed, 12 insertions(+), 9 deletions(-)
--
2.43.0
This patch series gives a proposal to support guest VMs running
in user mode, with a canonical linear address organization as
well.
First, partition the 64-bit canonical linear address space
into two halves belonging to user mode and supervisor mode
respectively, similar to the organization of linear addresses used
in the Linux OS. Currently the linear addresses use the 48-bit canonical
format, in which bits 63:47 of the address are identical.
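For clarity, the 48-bit canonical property mentioned above amounts to the
check in the small sketch below (illustrative only, not code from the
series):
#include <stdbool.h>
#include <stdint.h>
static inline bool is_canonical_48(uint64_t va)
{
        /* bits 63:47 must all equal bit 47, i.e. the address is the
         * sign-extension of its low 48 bits */
        return ((int64_t)(va << 16) >> 16) == (int64_t)va;
}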
Second, set up page tables mapping the same guest physical addresses
of the test code and data segments into both the user-mode and
supervisor-mode address spaces. This allows a guest in either runtime
mode, i.e. user or supervisor, to run one code base in the corresponding
linear address space.
Also provide the runtime environment setup API for switching to
user-mode execution.
Zeng Guang (8):
KVM: selftests: x86: Fix bug in addr_arch_gva2gpa()
KVM: selftests: x86: Support guest running on canonical linear-address
organization
KVM: selftests: Add virt_arch_ucall_prealloc() arch specific
implementation
KVM : selftests : Adapt selftest cases to kernel canonical linear
address
KVM: selftests: x86: Prepare setup for user mode support
KVM: selftests: x86: Allow user to access user-mode address and I/O
address space
KVM: selftests: x86: Support vcpu run in user mode
KVM: selftests: x86: Add KVM forced emulation prefix capability
.../selftests/kvm/include/kvm_util_base.h | 20 ++-
.../selftests/kvm/include/x86_64/processor.h | 48 ++++++-
.../selftests/kvm/lib/aarch64/processor.c | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 6 +-
.../selftests/kvm/lib/riscv/processor.c | 5 +
.../selftests/kvm/lib/s390x/processor.c | 5 +
.../testing/selftests/kvm/lib/ucall_common.c | 2 +
.../selftests/kvm/lib/x86_64/processor.c | 117 ++++++++++++++----
.../selftests/kvm/set_memory_region_test.c | 13 +-
.../testing/selftests/kvm/x86_64/debug_regs.c | 2 +-
.../kvm/x86_64/userspace_msr_exit_test.c | 9 +-
11 files changed, 195 insertions(+), 37 deletions(-)
--
2.21.3
l2_tos_ttl_inherit.sh verifies the inheritance of tos and ttl
for GRETAP, VXLAN and GENEVE.
Before testing it checks if the required module is available
and if not skips the tests accordingly.
Currently only GRETAP and VXLAN are tested because the GENEVE
module is missing.
Signed-off-by: Matthias May <matthias.may(a)westermo.com>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 19ff75051660..8d79c024bebf 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -76,6 +76,7 @@ CONFIG_CRYPTO_SM4_GENERIC=y
CONFIG_AMT=m
CONFIG_TUN=y
CONFIG_VXLAN=m
+CONFIG_GENEVE=m
CONFIG_IP_SCTP=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_CRYPTO_ARIA=y
--
2.39.2
From: Willem de Bruijn <willemb(a)google.com>
The test sends packets and compares enqueue, transmit and Ack
timestamps with expected values. It installs netem delays to increase
latency between these points.
The test proves flaky in virtual environment (vng). Increase the
delays to reduce variance. Scale measurement tolerance accordingly.
Time sensitive tests are difficult to calibrate. Increasing delays 10x
also increases runtime 10x, for one. And it may still prove flaky at
some rate.
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/txtimestamp.sh | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/txtimestamp.sh b/tools/testing/selftests/net/txtimestamp.sh
index 31637769f59f..25baca4b148e 100755
--- a/tools/testing/selftests/net/txtimestamp.sh
+++ b/tools/testing/selftests/net/txtimestamp.sh
@@ -8,13 +8,13 @@ set -e
setup() {
# set 1ms delay on lo egress
- tc qdisc add dev lo root netem delay 1ms
+ tc qdisc add dev lo root netem delay 10ms
# set 2ms delay on ifb0 egress
modprobe ifb
ip link add ifb_netem0 type ifb
ip link set dev ifb_netem0 up
- tc qdisc add dev ifb_netem0 root netem delay 2ms
+ tc qdisc add dev ifb_netem0 root netem delay 20ms
# redirect lo ingress through ifb0 egress
tc qdisc add dev lo handle ffff: ingress
@@ -24,9 +24,11 @@ setup() {
}
run_test_v4v6() {
- # SND will be delayed 1000us
- # ACK will be delayed 6000us: 1 + 2 ms round-trip
- local -r args="$@ -v 1000 -V 6000"
+ # SND will be delayed 10ms
+ # ACK will be delayed 60ms: 10 + 20 ms round-trip
+ # allow +/- tolerance of 8ms
+ # wait for ACK to be queued
+ local -r args="$@ -v 10000 -V 60000 -t 8000 -S 80000"
./txtimestamp ${args} -4 -L 127.0.0.1
./txtimestamp ${args} -6 -L ::1
--
2.43.0.429.g432eaa2c6b-goog
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the following
signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(); these can leave an empty space that can
then be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
In addition, mmap() has two related changes.
The PROT_SEAL bit in the prot field of mmap(). When present, it marks
the map as sealed from creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing can expect their behavior to
remain unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE to their mmap() calls.
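As a rough userspace sketch of this opt-in flow (not part of the series;
the MAP_SEALABLE value and the mseal() syscall number below are
placeholders that would really come from the uapi headers added by these
patches):

/*
 * Sketch only: flag values and the syscall number are placeholders,
 * the real definitions come from the headers added by this series.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000		/* placeholder value */
#endif
#ifndef __NR_mseal
#define __NR_mseal 462			/* placeholder syscall number */
#endif

int main(void)
{
	size_t len = 4096;

	/* Opt in to sealing at creation time with MAP_SEALABLE. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, len);

	/* Seal the range; flags are reserved and must be zero. */
	if (syscall(__NR_mseal, p, len, 0)) {
		perror("mseal");
		return 1;
	}

	/* On a sealed VMA this is expected to fail with EPERM. */
	if (mprotect(p, len, PROT_READ | PROT_WRITE | PROT_EXEC))
		perror("mprotect on sealed VMA (expected)");

	return 0;
}

Passing PROT_SEAL to mmap() instead would mark the mapping as sealed from
creation, without the separate mseal() call.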
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, and the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
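To make the allocator scenario above more concrete, here is a hedged,
userspace-only sketch (not from this series; it assumes pkey support in
the CPU and kernel, and the MAP_SEALABLE value and mseal() syscall number
are placeholders). Under the proposed semantics, the discard is expected
to be refused while the calling thread has write access disabled via the
pkey:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000		/* placeholder value */
#endif
#ifndef __NR_mseal
#define __NR_mseal 462			/* placeholder syscall number */
#endif

int main(void)
{
	size_t len = 2 * 1024 * 1024;

	int pkey = pkey_alloc(0, 0);
	if (pkey < 0)
		return 1;		/* no pkey support, skip */

	/* Sealable RW pool, standing in for an allocator reservation. */
	void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
	if (pool == MAP_FAILED)
		return 1;

	/* Tag the pool with the pkey first, then seal the VMA. */
	if (pkey_mprotect(pool, len, PROT_READ | PROT_WRITE, pkey))
		return 1;
	if (syscall(__NR_mseal, pool, len, 0))
		return 1;

	/* Write access revoked: the destructive discard should be refused. */
	pkey_set(pkey, PKEY_DISABLE_WRITE);
	if (madvise(pool, len, MADV_DONTNEED))
		perror("madvise while write-disabled (expected)");

	/* Write access restored: the allocator may now discard unused pages. */
	pkey_set(pkey, 0);
	if (madvise(pool, len, MADV_DONTNEED))
		perror("madvise while write-enabled");

	return 0;
}

Note that pkey_mprotect() is done before sealing, since pkey_mprotect() on
an already sealed VMA is itself one of the blocked operations.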
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insights gained from
implementing mimmutable() in OpenBSD.
Change history:
===============
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 183 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/mm.h | 48 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 8 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/madvise.c | 12 +
mm/mmap.c | 27 +
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 343 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1997 +++++++++++++++++++
33 files changed, 2690 insertions(+), 2 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.429.g432eaa2c6b-goog
The idea of this RFC is to introduce a way to catalogue and document any tests
that should be executed for changes to a subsystem, as well as to make
checkpatch.pl require a tag in commit messages certifying they were run, and
hopefully to make it easier to discover and run them.
This is following a discussion Veronika Kabatova started with a few
(addressed) people at the LPC last year (IIRC), where there was a good deal of
interest for something like this.
Apart from implementing basic support (surely to be improved), two sample
changes are added on top, adding a few test suites (roughly) based on what the
maintainers described earlier. I'm definitely not qualified for describing
them adequately, and don't have the time to dig deeper, but hopefully they
could serve as illustrations, and shouldn't be merged as is.
I would defer to maintainers of the corresponding subsystems and tests to
describe their tests and requirements better. Although I would accept
amendments too, if they prefer it that way.
One bug I know that's definitely there is the handling of removed files.
scripts/get_maintainer.pl chokes on non-existent files, failing to output the
required test suites (I'm sure there's a good reason, but I couldn't see it).
My first idea is to only check for required tests upon encountering the '+++
<file>' line, and ignore the '/dev/null' file, but I hope the checkpatch.pl
maintainers could recommend a better way.
Anyway, tell me what you think, and I'll work on polishing this.
Thank you!
Nick
---
Nikolai Kondrashov (3):
MAINTAINERS: Introduce V: field for required tests
MAINTAINERS: Require kvm-xfstests smoke for ext4
MAINTAINERS: Require kunit core tests for framework changes
Documentation/process/submitting-patches.rst | 19 +++++
Documentation/process/tests.rst | 80 ++++++++++++++++++
MAINTAINERS | 8 ++
scripts/checkpatch.pl | 118 ++++++++++++++++++++++++++-
scripts/get_maintainer.pl | 17 +++-
scripts/parse-maintainers.pl | 3 +-
6 files changed, 241 insertions(+), 4 deletions(-)
---
From: Willem de Bruijn <willemb(a)google.com>
This test validates per-band packet limits in FQ. Packets are dropped
rather than enqueued if the limit for their band is reached.
This test is timing-sensitive. It queues packets in FQ with a future
delivery time to fill the qdisc.
The test failed in a virtual environment (vng). Increase the delays
to make it more tolerant of environments with timing variance.
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/fq_band_pktlimit.sh | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/fq_band_pktlimit.sh b/tools/testing/selftests/net/fq_band_pktlimit.sh
index 24b77bdf41ff..977070ed42b3 100755
--- a/tools/testing/selftests/net/fq_band_pktlimit.sh
+++ b/tools/testing/selftests/net/fq_band_pktlimit.sh
@@ -8,7 +8,7 @@
# 3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
# 4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
#
-# Send packets with a 100ms delay to ensure that previously sent
+# Send packets with a delay to ensure that previously sent
# packets are still queued when later ones are sent.
# Use SO_TXTIME for this.
@@ -29,19 +29,21 @@ ip -6 addr add fdaa::1/128 dev dummy0
ip -6 route add fdaa::/64 dev dummy0
tc qdisc replace dev dummy0 root handle 1: fq quantum 1514 initial_quantum 1514 limit 10
-./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+DELAY=400000
+
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 fdaa::2 8000
OUT1="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
-./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 fdaa::2 8000
OUT2="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
-./cmsg_sender -6 -p u -d 100000 -n 20 -P 7 fdaa::2 8000
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 -P 7 fdaa::2 8000
OUT3="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
# Initial stats will report zero sent, as all packets are still
-# queued in FQ. Sleep for the delay period (100ms) and see that
+# queued in FQ. Sleep for at least the delay period and see that
# twenty are now sent.
-sleep 0.1
+sleep 0.6
OUT4="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
# Log the output after the test
--
2.43.0.429.g432eaa2c6b-goog
After commit 25ae948b4478 ("selftests/net: add lib.sh") but before commit
2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh"), some net selftests encountered errors when they were being
exported and run. This was because the new net/lib.sh was not exported
along with the tests. The errors were crudely avoided by duplicating some
content between net/lib.sh and net/forwarding/lib.sh in 2114e83381d3.
In order to restore the sourcing of net/lib.sh from net/forwarding/lib.sh
and remove the duplicated content, this series introduces a new selftests
Makefile variable to list extra files to export from other directories and
makes use of it to avoid reintroducing the errors mentioned above.
v2:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Fix rst syntax in Documentation/dev-tools/kselftest.rst (Jakub Kicinski)
v1:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Changed TEST_INCLUDES to take relative paths, like other TEST_* variables.
Paths are adjusted accordingly in the subsequent patches. (Vladimir Oltean)
* selftests: bonding: Change script interpreter
selftests: forwarding: Remove executable bits from lib.sh
Removed from this series, submitted separately.
Since commit 2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh") resolved the test errors, this version of the series is
focused on removing the duplication that was added in that commit. Directly
rebasing the series would reintroduce the problems that 2114e83381d3
avoided before fixing them again. In order to prevent such breakage partway
through the series, patches are reordered and content changed slightly but
there is no diff at the end compared with the simple rebasing approach. I
have dropped most review tags on account of this reordering.
RFC:
https://lore.kernel.org/netdev/20231222135836.992841-1-bpoirier@nvidia.com/
Link: https://lore.kernel.org/netdev/ZXu7dGj7F9Ng8iIX@Laptop-X1/
Benjamin Poirier (5):
selftests: Introduce Makefile variable to list shared bash scripts
selftests: bonding: Add net/forwarding/lib.sh to TEST_INCLUDES
selftests: team: Add shared library scripts to TEST_INCLUDES
selftests: dsa: Replace test symlinks by wrapper script
selftests: forwarding: Redefine relative_path variable
Petr Machata (1):
selftests: forwarding: Remove duplicated lib.sh content
Documentation/dev-tools/kselftest.rst | 12 ++++++
tools/testing/selftests/Makefile | 7 +++-
.../selftests/drivers/net/bonding/Makefile | 7 +++-
.../net/bonding/bond-eth-type-change.sh | 2 +-
.../drivers/net/bonding/bond_topo_2d1c.sh | 2 +-
.../drivers/net/bonding/dev_addr_lists.sh | 2 +-
.../net/bonding/mode-1-recovery-updelay.sh | 2 +-
.../net/bonding/mode-2-recovery-updelay.sh | 2 +-
.../drivers/net/bonding/net_forwarding_lib.sh | 1 -
.../selftests/drivers/net/dsa/Makefile | 18 ++++++++-
.../drivers/net/dsa/bridge_locked_port.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mdb.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mld.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_aware.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_mcast.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_unaware.sh | 2 +-
.../testing/selftests/drivers/net/dsa/lib.sh | 1 -
.../drivers/net/dsa/local_termination.sh | 2 +-
.../drivers/net/dsa/no_forwarding.sh | 2 +-
.../net/dsa/run_net_forwarding_test.sh | 9 +++++
.../selftests/drivers/net/dsa/tc_actions.sh | 2 +-
.../selftests/drivers/net/dsa/tc_common.sh | 1 -
.../drivers/net/dsa/test_bridge_fdb_stress.sh | 2 +-
.../selftests/drivers/net/team/Makefile | 7 ++--
.../drivers/net/team/dev_addr_lists.sh | 4 +-
.../selftests/drivers/net/team/lag_lib.sh | 1 -
.../drivers/net/team/net_forwarding_lib.sh | 1 -
tools/testing/selftests/lib.mk | 19 ++++++++++
.../testing/selftests/net/forwarding/Makefile | 3 ++
tools/testing/selftests/net/forwarding/lib.sh | 37 +++----------------
.../net/forwarding/mirror_gre_lib.sh | 2 +-
.../net/forwarding/mirror_gre_topo_lib.sh | 2 +-
32 files changed, 98 insertions(+), 64 deletions(-)
delete mode 120000 tools/testing/selftests/drivers/net/bonding/net_forwarding_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/lib.sh
create mode 100755 tools/testing/selftests/drivers/net/dsa/run_net_forwarding_test.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/tc_common.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/lag_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/net_forwarding_lib.sh
--
2.43.0
The test is inspired by the pmu_event_filter_test implemented for x86. The
arm64 platform has the same ability to set the PMU event filter, through the
KVM_ARM_VCPU_PMU_V3_FILTER attribute, so add the test for arm64.
The series first moves some common PMU code from vpmu_counter_access to
lib/aarch64/vpmu.c and include/aarch64/vpmu.h, so that it can be used by
pmu_event_filter_test. It then fixes a bug related to [enable|disable]_counter,
and finally implements the test itself.
Changelog:
----------
v2->v3:
- Check the pmceid in guest code instead of the PMU event count, since
different hardware may produce different event counts; checking pmceid makes
the test stable across platforms. [Eric]
- Some typos fixed and commit messages improved.
v1->v2:
- Improve the commit message. [Eric]
- Fix the bug in [enable|disable]_counter. [Raghavendra & Marc]
- Add a check whether KVM has the KVM_ARM_VCPU_PMU_V3_FILTER attribute.
- Add a check whether the host PMU supports the test event, through pmceid0.
- Split the test_invalid_filter() to another patch. [Eric]
v1: https://lore.kernel.org/all/20231123063750.2176250-1-shahuang@redhat.com/
v2: https://lore.kernel.org/all/20231129072712.2667337-1-shahuang@redhat.com/
Shaoqin Huang (5):
KVM: selftests: aarch64: Make the [create|destroy]_vpmu_vm() public
KVM: selftests: aarch64: Move pmu helper functions into vpmu.h
KVM: selftests: aarch64: Fix the buggy [enable|disable]_counter
KVM: selftests: aarch64: Introduce pmu_event_filter_test
KVM: selftests: aarch64: Add invalid filter test in
pmu_event_filter_test
tools/testing/selftests/kvm/Makefile | 2 +
.../kvm/aarch64/pmu_event_filter_test.c | 255 ++++++++++++++++++
.../kvm/aarch64/vpmu_counter_access.c | 218 ++-------------
.../selftests/kvm/include/aarch64/vpmu.h | 135 ++++++++++
.../testing/selftests/kvm/lib/aarch64/vpmu.c | 74 +++++
5 files changed, 490 insertions(+), 194 deletions(-)
create mode 100644 tools/testing/selftests/kvm/aarch64/pmu_event_filter_test.c
create mode 100644 tools/testing/selftests/kvm/include/aarch64/vpmu.h
create mode 100644 tools/testing/selftests/kvm/lib/aarch64/vpmu.c
--
2.40.1
One of the test cases in the test_bridge_backup_port.sh selftest relies
on a matchall classifier to drop unrelated traffic so that the Tx drop
counter on the VXLAN device will only be incremented as a result of
traffic generated by the test.
However, the configuration option for the matchall classifier is
missing from the configuration file, which might explain the failures we
see in the netdev CI [1].
Fix by adding CONFIG_NET_CLS_MATCHALL to the configuration file.
[1]
# Backup nexthop ID - invalid IDs
# -------------------------------
[...]
# TEST: Forwarding out of vx0 [ OK ]
# TEST: No forwarding using backup nexthop ID [ OK ]
# TEST: Tx drop increased [FAIL]
# TEST: IPv6 address family nexthop as backup nexthop [ OK ]
# TEST: No forwarding out of swp1 [ OK ]
# TEST: Forwarding out of vx0 [ OK ]
# TEST: No forwarding using backup nexthop ID [ OK ]
# TEST: Tx drop increased [FAIL]
[...]
Fixes: b408453053fb ("selftests: net: Add bridge backup port and backup nexthop ID test")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
---
Jakub, you can apply to net if you want to, but I'm sending this to
net-next since I want to see if it helps the CI. I'm unable to reproduce
this locally.
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 19ff75051660..2bd5f9033ade 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -68,6 +68,7 @@ CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_FLOWER=m
+CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_ACT_TUNNEL_KEY=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_BAREUDP=m
--
2.43.0
Hi Linus,
Please pull the following KUnit fixes update for Linux 6.8-rc3.
This kunit fixes update for Linux 6.8-rc3 consists of NULL vs IS_ERR()
bug fixes, documentation update, MAINTAINERS file update to add
Rae Moar as a reviewer, and a fix to run test suites only after module
initialization completes.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:
Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-fixes-6.8-rc3
for you to fetch changes up to 1a9f2c776d1416c4ea6cb0d0b9917778c41a1a7d:
Documentation: KUnit: Update the instructions on how to test static functions (2024-01-22 07:59:03 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-fixes-6.8-rc3
This kunit fixes update for Linux 6.8-rc3 consists of NULL vs IS_ERR()
bug fixes, documentation update, MAINTAINERS file update to add
Rae Moar as a reviewer, and a fix to run test suites only after module
initialization completes.
----------------------------------------------------------------
Arthur Grillo (1):
Documentation: KUnit: Update the instructions on how to test static functions
Dan Carpenter (2):
kunit: Fix a NULL vs IS_ERR() bug
kunit: device: Fix a NULL vs IS_ERR() check in init()
David Gow (1):
MAINTAINERS: kunit: Add Rae Moar as a reviewer
Marco Pagani (1):
kunit: run test suites only after module initialization completes
Documentation/dev-tools/kunit/usage.rst | 19 +++++++++++++++++--
MAINTAINERS | 1 +
lib/kunit/device.c | 4 ++--
lib/kunit/executor.c | 4 ++++
lib/kunit/kunit-test.c | 2 +-
lib/kunit/test.c | 14 +++++++++++---
6 files changed, 36 insertions(+), 8 deletions(-)
----------------------------------------------------------------
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.8-rc3.
This kselftest fixes update for Linux 6.8-rc3 consists of three
fixes to livepatch, rseq, and seccomp tests.
diff is attached
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:
Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.8-rc3
for you to fetch changes up to b54761f6e9773350c0d1fb8e1e5aacaba7769d0f:
kselftest/seccomp: Report each expectation we assert as a KTAP test (2024-01-30 08:55:42 -0700)
----------------------------------------------------------------
linux_kselftest-fixes-6.8-rc3
This kselftest fixes update for Linux 6.8-rc3 consists of three
fixes to livepatch, rseq, and seccomp tests.
----------------------------------------------------------------
Joe Lawrence (1):
selftests/livepatch: fix and refactor new dmesg message code
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
Mathieu Desnoyers (1):
selftests/rseq: Do not skip !allowed_cpus for mm_cid
tools/testing/selftests/livepatch/functions.sh | 37 ++++----
.../testing/selftests/rseq/basic_percpu_ops_test.c | 14 ++-
tools/testing/selftests/rseq/param_test.c | 22 +++--
.../testing/selftests/seccomp/seccomp_benchmark.c | 104 +++++++++++++--------
4 files changed, 109 insertions(+), 68 deletions(-)
----------------------------------------------------------------
On riscv, mmap currently returns an address from the largest address
space that can fit entirely inside of the hint address. This makes it
such that the hint address is almost never returned. This patch raises
the mappable area up to and including the hint address. This allows mmap
to often return the hint address, which both improves performance over
searching for a valid address and makes the behavior more similar to other
architectures.
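As a small illustration of the behavior being changed (not from this
series; the hint value is an arbitrary assumption), one can observe
whether mmap() honors a hint by comparing the returned address with it:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	/* Arbitrary, page-aligned hint well inside the usual user VA range. */
	void *hint = (void *)(1UL << 37);

	void *addr = mmap(hint, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return 1;

	/* With this series, riscv is expected to return the hint more often. */
	printf("hint %p -> mapped %p (%s)\n", hint, addr,
	       addr == hint ? "hint honored" : "hint ignored");

	munmap(addr, len);
	return 0;
}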
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
---
Charlie Jenkins (3):
riscv: mm: Use hint address in mmap if available
selftests: riscv: Generalize mm selftests
docs: riscv: Define behavior of mmap
Documentation/arch/riscv/vm-layout.rst | 16 ++--
arch/riscv/include/asm/processor.h | 21 ++----
tools/testing/selftests/riscv/mm/mmap_bottomup.c | 20 +----
tools/testing/selftests/riscv/mm/mmap_default.c | 20 +----
tools/testing/selftests/riscv/mm/mmap_test.h | 93 +++++++++++++-----------
5 files changed, 66 insertions(+), 104 deletions(-)
---
base-commit: 556e2d17cae620d549c5474b1ece053430cd50bc
change-id: 20240119-use_mmap_hint_address-f9f4b1b6f5f1
--
- Charlie
Fix a broken zswap kselftest due to cgroup zswap writeback counter
renaming, and add a kselftest to cover the (z)swapin case.
Also, add the zswap kselftest file to the zswap maintainer entry so that
the get_maintainer script can find zswap maintainers.
Nhat Pham (3):
selftests: zswap: add zswap selftest file to zswap maintainer entry
selftests: fix the zswap invasive shrink test
selftests: add test for zswapin
MAINTAINERS | 1 +
tools/testing/selftests/cgroup/test_zswap.c | 69 ++++++++++++++++++++-
2 files changed, 67 insertions(+), 3 deletions(-)
base-commit: d162e170f1181b4305494843e1976584ddf2b72e
--
2.39.3
From: Björn Töpel <bjorn(a)rivosinc.com>
Here's the fourth try. The "make install" target for the BPF selftests
is missing a bunch of files, which makes the BPF machine flavor fail
(e.g. cpuv4).
This series aims to fix that, by explicitly installing bpftool, all
the BPF programs, and the utilities defined by TRUNNER_EXTRA_PROGS
for test_progs.
The fact that this series even has a changelog says a lot, but for
those who care:
v4: Added bpftool
v3: Do not use hardcoded file names (Andrii)
v2: Added btf_dump_test_case files
Björn Töpel (3):
selftests/bpf: Remove incorrect object path
selftests/bpf: Make install target copy test_progs extra files
selftests/bpf: Make install target copy bpftool
tools/testing/selftests/bpf/Makefile | 31 +++++++++++++++++-----------
1 file changed, 19 insertions(+), 12 deletions(-)
base-commit: beb53f32698ff9cd0ca442c1f856ea0ecfb82be3
--
2.40.1
Modern OSes use an iptables implementation with nf_tables as the backend,
e.g.:
$ iptables -V
iptables v1.8.8 (nf_tables)
Pablo points out that we need CONFIG_NFT_COMPAT to make that work,
otherwise we see a lot of:
Warning: Extension DNAT revision 0 not supported, missing kernel module?
with DNAT being just an example here; other modules we need
include udp, TTL, length, etc.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Location for new entry chosen based on `sort --version-sort`.
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 413ab9abcf1b..ba56f231e109 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -59,6 +59,7 @@ CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_FQ=m
CONFIG_NET_SCH_ETF=m
CONFIG_NET_SCH_NETEM=y
+CONFIG_NFT_COMPAT=m
CONFIG_NF_FLOW_TABLE=m
CONFIG_PSAMPLE=m
CONFIG_TCP_MD5SIG=y
--
2.43.0
The DAMON debugfs interface was deprecated in February 2023, by commit
5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice"). Make the fact harder to
ignore by removing an example usage from the document (patch 1),
renaming the config (patch 2), adding a deprecation notice file to the
debugfs directory (patches 3-5), and renaming the debugfs file that is
essential for real use of DAMON (patches 6-9).
SeongJae Park (9):
Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints
example
mm/damon: rename CONFIG_DAMON_DBGFS to DAMON_DBGFS_DEPRECATED
mm/damon/dbgfs: implement deprecation notice file
mm/damon/dbgfs: make debugfs interface deprecation message a macro
Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON
debugfs interface
selftets/damon: prepare for monitor_on file renaming
mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATED
Docs/admin-guide/mm/damon/usage: update for monitor_on renaming
Docs/translations/damon/usage: update for monitor_on renaming
Documentation/admin-guide/mm/damon/usage.rst | 42 +++++++++++--------
.../zh_CN/admin-guide/mm/damon/usage.rst | 20 ++++-----
.../zh_TW/admin-guide/mm/damon/usage.rst | 20 ++++-----
mm/damon/Kconfig | 7 +++-
mm/damon/dbgfs.c | 27 +++++++++---
.../selftests/damon/_chk_dependency.sh | 11 ++++-
.../selftests/damon/_debugfs_common.sh | 7 ++++
.../selftests/damon/debugfs_empty_targets.sh | 12 +++++-
8 files changed, 98 insertions(+), 48 deletions(-)
base-commit: f1ab2f51e99ffb94ce127d132b24be00dc130e6c
--
2.39.2
When walking directory trees, instead of looking for specific files and
running dirname to get the parent folder, traverse all folders and
ignore the ones not containing the desired files. This avoids the need
to call dirname inside the loop, which drastically decreases run time:
Running locally on a mt8192-asurada-spherion, which reports 160 test
cases, has gone from 5.5s to 2.9s, while running remotely with an
nfsroot has gone from 13.5s to 5.5s.
This change has a side-effect, which is that the root DT node now
also shows in the output, even though it isn't expected to bind to a
driver. However there shouldn't be a matching driver for the board
compatible, so the end result will be just an extra skipped test:
ok 1 / # SKIP
Reported-by: Mark Brown <broonie(a)kernel.org>
Closes: https://lore.kernel.org/all/310391e8-fdf2-4c2f-a680-7744eb685177@sirena.org…
Fixes: 14571ab1ad21 ("kselftest: Add new test for detecting unprobed Devicetree devices")
Tested-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Changes in v2:
- Tweaked commit message
- Added trailer tags
- Rebased on 6.8-rc1
---
tools/testing/selftests/dt/test_unprobed_devices.sh | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/dt/test_unprobed_devices.sh b/tools/testing/selftests/dt/test_unprobed_devices.sh
index b07af2a4c4de..7fae90293a9d 100755
--- a/tools/testing/selftests/dt/test_unprobed_devices.sh
+++ b/tools/testing/selftests/dt/test_unprobed_devices.sh
@@ -33,8 +33,8 @@ if [[ ! -d "${PDT}" ]]; then
fi
nodes_compatible=$(
- for node_compat in $(find ${PDT} -name compatible); do
- node=$(dirname "${node_compat}")
+ for node in $(find ${PDT} -type d); do
+ [ ! -f "${node}"/compatible ] && continue
# Check if node is available
if [[ -e "${node}"/status ]]; then
status=$(tr -d '\000' < "${node}"/status)
@@ -46,10 +46,11 @@ nodes_compatible=$(
nodes_dev_bound=$(
IFS=$'\n'
- for uevent in $(find /sys/devices -name uevent); do
- if [[ -d "$(dirname "${uevent}")"/driver ]]; then
- grep '^OF_FULLNAME=' "${uevent}" | sed -e 's|OF_FULLNAME=||'
- fi
+ for dev_dir in $(find /sys/devices -type d); do
+ [ ! -f "${dev_dir}"/uevent ] && continue
+ [ ! -d "${dev_dir}"/driver ] && continue
+
+ grep '^OF_FULLNAME=' "${dev_dir}"/uevent | sed -e 's|OF_FULLNAME=||'
done
)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240122-dt-kselftest-dirname-perf-fix-7dc421e6dfb0
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
The usage text of run_vmtests.sh does not include hugetlb, which is a valid
test category.
Add 'hugetlb' to the usage text of run_vmtests.sh.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 55898d64e2eb..2ee0a1c4740f 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -65,6 +65,8 @@ separated by spaces:
test copy-on-write semantics
- thp
test transparent huge pages
+- hugetlb
+ test hugetlbfs huge pages
- migration
invoke move_pages(2) to exercise the migration entry code
paths in the kernel
--
2.39.3
From: Christoph Müllner <christoph.muellner(a)vrull.eu>
[ Upstream commit 12c16919652b5873f524c8b361336ecfa5ce5e6b ]
When building the mm tests with a riscv32 compiler, we see a range
of shift-count-overflow errors from shifting 1UL by more than 32 bits
in do_mmaps(). Since the relevant code is only called from code that
is gated by `__riscv_xlen == 64`, we can just apply the same gating
to do_mmaps().
Signed-off-by: Christoph Müllner <christoph.muellner(a)vrull.eu>
Reviewed-by: Alexandre Ghiti <alexghiti(a)rivosinc.com>
Reviewed-by: Andrew Jones <ajones(a)ventanamicro.com>
Link: https://lore.kernel.org/r/20231123185821.2272504-6-christoph.muellner@vrull…
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/riscv/mm/mmap_test.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/riscv/mm/mmap_test.h b/tools/testing/selftests/riscv/mm/mmap_test.h
index 9b8434f62f57..2e0db9c5be6c 100644
--- a/tools/testing/selftests/riscv/mm/mmap_test.h
+++ b/tools/testing/selftests/riscv/mm/mmap_test.h
@@ -18,6 +18,8 @@ struct addresses {
int *on_56_addr;
};
+// Only works on 64 bit
+#if __riscv_xlen == 64
static inline void do_mmaps(struct addresses *mmap_addresses)
{
/*
@@ -50,6 +52,7 @@ static inline void do_mmaps(struct addresses *mmap_addresses)
mmap_addresses->on_56_addr =
mmap(on_56_bits, 5 * sizeof(int), prot, flags, 0, 0);
}
+#endif /* __riscv_xlen == 64 */
static inline int memory_layout(void)
{
--
2.43.0
This patchset adds KVM selftests for the LoongArch system. Currently only
some common test cases are supported, and they pass when run. These test
cases are listed as follows:
demand_paging_test
dirty_log_perf_test
dirty_log_test
guest_print_test
hardware_disable_test
kvm_binary_stats_test
kvm_create_max_vcpus
kvm_page_table_test
memslot_modification_stress_test
memslot_perf_test
set_memory_region_test
This patchset was originally posted by zhaotianrui; I am continuing his
work.
---
Changes in v6:
1. Refresh the patch based on latest kernel 6.8-rc1, add LoongArch
support about testcase set_memory_region_test.
2. Add hardware_disable_test test case.
3. Drop the modification of the macro DEFAULT_GUEST_TEST_MEM; it is a
problem with LoongArch binutils, and the issue has been raised with the
LoongArch binutils owners.
Changes in v5:
1. In the LoongArch KVM selftests, DEFAULT_GUEST_TEST_MEM could be
0x130000000, which is different from the default value in memstress.h.
So we move the definition of DEFAULT_GUEST_TEST_MEM into the LoongArch
ucall.h, and add an 'ifndef' condition for DEFAULT_GUEST_TEST_MEM
in memstress.h.
Changes in v4:
1. Remove the based-on flag, as the LoongArch KVM patch series
has been accepted into the Linux kernel, so this can be applied directly
to the kernel.
Changes in v3:
1. Improve implementation of LoongArch VM page walk.
2. Add exception handler for LoongArch.
3. Add dirty_log_test, dirty_log_perf_test, guest_print_test
test cases for LoongArch.
4. Add __ASSEMBLER__ macro to distinguish asm file and c file.
5. Move ucall_arch_do_ucall to the header file and make it as
static inline to avoid function calls.
6. Change the DEFAULT_GUEST_TEST_MEM base addr for LoongArch.
Changes in v2:
1. We should use ".balign 4096" to 4K-align the assembly code in
exception.S instead of "align 12".
2. LoongArch only supports 3- or 4-level page tables, so we remove the
handlers for 2-level page tables.
3. Remove the DEFAULT_LOONGARCH_GUEST_STACK_VADDR_MIN and use the common
DEFAULT_GUEST_STACK_VADDR_MIN to allocate stack memory in guest.
4. Reorganize the test cases supported by LoongArch.
5. Fix some code comments.
6. Add kvm_binary_stats_test test case into LoongArch KVM selftests.
---
Tianrui Zhao (4):
KVM: selftests: Add KVM selftests header files for LoongArch
KVM: selftests: Add core KVM selftests support for LoongArch
KVM: selftests: Add ucall test support for LoongArch
KVM: selftests: Add test cases for LoongArch
tools/testing/selftests/kvm/Makefile | 16 +
.../selftests/kvm/include/kvm_util_base.h | 5 +
.../kvm/include/loongarch/processor.h | 133 +++++++
.../selftests/kvm/include/loongarch/ucall.h | 20 ++
.../selftests/kvm/lib/loongarch/exception.S | 59 ++++
.../selftests/kvm/lib/loongarch/processor.c | 332 ++++++++++++++++++
.../selftests/kvm/lib/loongarch/ucall.c | 38 ++
.../selftests/kvm/set_memory_region_test.c | 2 +-
8 files changed, 604 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/include/loongarch/processor.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/ucall.h
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/exception.S
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/processor.c
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/ucall.c
base-commit: 7ed2632ec7d72e926b9e8bcc9ad1bb0cd37274bf
--
2.33.0
From: Jo Van Bulck <jo.vanbulck(a)cs.kuleuven.be>
[ Upstream commit 9fd552ee32c6c1e27c125016b87d295bea6faea7 ]
DEFINED only considers symbols, not section names. Hence, replace the
check for .got.plt with the _GLOBAL_OFFSET_TABLE_ symbol and remove other
(non-essential) asserts.
Signed-off-by: Jo Van Bulck <jo.vanbulck(a)cs.kuleuven.be>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Link: https://lore.kernel.org/all/20231005153854.25566-10-jo.vanbulck%40cs.kuleuv…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/sgx/test_encl.lds | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/sgx/test_encl.lds b/tools/testing/selftests/sgx/test_encl.lds
index a1ec64f7d91f..108bc11d1d8c 100644
--- a/tools/testing/selftests/sgx/test_encl.lds
+++ b/tools/testing/selftests/sgx/test_encl.lds
@@ -34,8 +34,4 @@ SECTIONS
}
}
-ASSERT(!DEFINED(.altinstructions), "ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.altinstr_replacement), "ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.discard.retpoline_safe), "RETPOLINE ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.discard.nospec), "RETPOLINE ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.got.plt), "Libcalls are not supported in enclaves")
+ASSERT(!DEFINED(_GLOBAL_OFFSET_TABLE_), "Libcalls through GOT are not supported in enclaves")
--
2.43.0
The gro.sh test case relies on gro_flush_timeout to ensure
that all the segments belonging to any given batch are properly
aggregated.
On the other end, the sender is a user-space program transmitting
each packet with a separate write syscall. A busy host and/or
stracing the sender program can make the relevant segments reach
the GRO engine after the flush timeout triggers.
Give the GRO flush timeout more slack, to avoid sporadic self-test
failures.
Fixes: 9af771d2ec04 ("selftests/net: allow GRO coalesce test on veth")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/setup_veth.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/setup_veth.sh b/tools/testing/selftests/net/setup_veth.sh
index a9a1759e035c..1f78a87f6f37 100644
--- a/tools/testing/selftests/net/setup_veth.sh
+++ b/tools/testing/selftests/net/setup_veth.sh
@@ -11,7 +11,7 @@ setup_veth_ns() {
local -r ns_mac="$4"
[[ -e /var/run/netns/"${ns_name}" ]] || ip netns add "${ns_name}"
- echo 100000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
+ echo 1000000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
ip link set dev "${ns_dev}" netns "${ns_name}" mtu 65535
ip -netns "${ns_name}" link set dev "${ns_dev}" up
--
2.43.0
The udpgro_fraglist self-test uses the BPF classifier, but the
current net self-test configuration does not include it, causing
CI failures:
# selftests: net: udpgro_frglist.sh
# ipv6
# tcp - over veth touching data
# -l 4 -6 -D 2001:db8::1 -t rx -4 -t
# Error: TC classifier not found.
# We have an error talking to the kernel
# Error: TC classifier not found.
# We have an error talking to the kernel
Add the missing knob.
Fixes: edae34a3ed92 ("selftests net: add UDP GRO fraglist + bpf self-tests")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 8da562a9ae87..ca4423ee6dc9 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -42,6 +42,7 @@ CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_FLOWER=m
+CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_TUNNEL_KEY=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_BAREUDP=m
--
2.43.0
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed a wild-memory-access bug that could have
happened during the loading phase of test suites built and executed as
loadable modules. However, it also introduced a problematic side effect
that causes test suites modules to crash when they attempt to register
fake devices.
When a module is loaded, it traverses the MODULE_STATE_UNFORMED and
MODULE_STATE_COMING states before reaching the normal operating state
MODULE_STATE_LIVE. Finally, when the module is removed, it moves to
MODULE_STATE_GOING before being released. However, if the loading
function load_module() fails between complete_formation() and
do_init_module(), the module goes directly from MODULE_STATE_COMING to
MODULE_STATE_GOING without passing through MODULE_STATE_LIVE.
This behavior was causing kunit_module_exit() to be called without
having first executed kunit_module_init(). Since kunit_module_exit() is
responsible for freeing the memory allocated by kunit_module_init()
through kunit_filter_suites(), this behavior was resulting in a
wild-memory-access bug.
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed this issue by running the tests when the
module is still in MODULE_STATE_COMING. However, modules in that state
are not fully initialized, lacking sysfs kobjects. Therefore, if a test
module attempts to register a fake device, it will inevitably crash.
This patch proposes a different approach to fix the original
wild-memory-access bug while restoring the normal module execution flow
by making kunit_module_exit() able to detect if kunit_module_init() has
previously initialized the tests suite set. In this way, test modules
can once again register fake devices without crashing.
This behavior is achieved by checking whether mod->kunit_suites is a
virtual or direct mapping address. If it is a virtual address, then
kunit_module_init() has allocated the suite_set in kunit_filter_suites()
using kmalloc_array(). On the contrary, if mod->kunit_suites is still
pointing to the original address that was set when looking up the
.kunit_test_suites section of the module, then the loading phase has
failed and there's no memory to be freed.
v4:
- rebased on 6.8
- noted that kunit_filter_suites() must return a virtual address
v3:
- add a comment to clarify why the start address is checked
v2:
- add include <linux/mm.h>
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Reviewed-by: David Gow <davidgow(a)google.com>
Tested-by: Rae Moar <rmoar(a)google.com>
Tested-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reviewed-by: Javier Martinez Canillas <javierm(a)redhat.com>
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/executor.c | 4 ++++
lib/kunit/test.c | 14 +++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 717b9599036b..689fff2b2b10 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -146,6 +146,10 @@ void kunit_free_suite_set(struct kunit_suite_set suite_set)
kfree(suite_set.start);
}
+/*
+ * Filter and reallocate test suites. Must return the filtered test suites set
+ * allocated at a valid virtual address or NULL in case of error.
+ */
struct kunit_suite_set
kunit_filter_suites(const struct kunit_suite_set *suite_set,
const char *filter_glob,
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index f95d2093a0aa..31a5a992e646 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -17,6 +17,7 @@
#include <linux/panic.h>
#include <linux/sched/debug.h>
#include <linux/sched.h>
+#include <linux/mm.h>
#include "debugfs.h"
#include "device-impl.h"
@@ -801,12 +802,19 @@ static void kunit_module_exit(struct module *mod)
};
const char *action = kunit_action();
+ /*
+ * Check if the start address is a valid virtual address to detect
+ * if the module load sequence has failed and the suite set has not
+ * been initialized and filtered.
+ */
+ if (!suite_set.start || !virt_addr_valid(suite_set.start))
+ return;
+
if (!action)
__kunit_test_suites_exit(mod->kunit_suites,
mod->num_kunit_suites);
- if (suite_set.start)
- kunit_free_suite_set(suite_set);
+ kunit_free_suite_set(suite_set);
}
static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
@@ -816,12 +824,12 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
break;
case MODULE_STATE_UNFORMED:
break;
base-commit: 539e582a375dedee95a4fa9ca3f37cdb25c441ec
--
2.43.0
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Javier Carrasco (4):
selftests: netfilter: add sctp_collision to gitignore
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
tools/testing/selftests/netfilter/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 610a9b8f49fbcf1100716370d3b5f6f884a2835a
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
In order for page table level 5 to be in use, the CPU must have the la57
feature enabled in addition to the CONFIG option. Check for the flag to be
set to avoid false test failures on systems that do not have this CPU flag
set.
Signed-off-by: Audra Mitchell <audra(a)redhat.com>
---
tools/testing/selftests/mm/va_high_addr_switch.sh | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh b/tools/testing/selftests/mm/va_high_addr_switch.sh
index 45cae7cab27e..a0a75f302904 100755
--- a/tools/testing/selftests/mm/va_high_addr_switch.sh
+++ b/tools/testing/selftests/mm/va_high_addr_switch.sh
@@ -29,9 +29,15 @@ check_supported_x86_64()
# See man 1 gzip under '-f'.
local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | cut -d'=' -f 2)
+ local cpu_supports_pl5=$(awk '/^flags/ {if (/la57/) {print 0;}
+ else {print 1}; exit}' /proc/cpuinfo 2>/dev/null)
+
if [[ "${pg_table_levels}" -lt 5 ]]; then
echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to run this test"
exit $ksft_skip
+ elif [[ "${cpu_supports_pl5}" -ne 0 ]]; then
+ echo "$0: CPU does not have the necessary la57 flag to support page table level 5"
+ exit $ksft_skip
fi
}
--
2.43.0
The busywait timeout value is in milliseconds, not seconds, so the
current setting of 2 is too small. On a slow/busy host (or VM) the
current timeout can expire even on "correct" execution, causing random
failures. Let's copy WAIT_TIMEOUT from forwarding/lib.sh and set
BUSYWAIT_TIMEOUT here.
Fixes: 25ae948b4478 ("selftests/net: add lib.sh")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: add fixes flag. update possible failures.
---
tools/testing/selftests/net/lib.sh | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index dca549443801..f9fe182dfbd4 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -4,6 +4,9 @@
##############################################################################
# Defines
+WAIT_TIMEOUT=${WAIT_TIMEOUT:=20}
+BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
+
# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4
# namespace list created by setup_ns
@@ -48,7 +51,7 @@ cleanup_ns()
for ns in "$@"; do
ip netns delete "${ns}" &> /dev/null
- if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
+ if ! busywait $BUSYWAIT_TIMEOUT ip netns list \| grep -vq "^$ns$" &> /dev/null; then
echo "Warn: Failed to remove namespace $ns"
ret=1
fi
--
2.43.0
Patches 1 and 3 are fixes for tdc that were discovered when running it
using defconfig + tc-testing config and against the latest iproute2.
Patch 2 improves the taprio tests.
Patch 4 enables all tdc tests.
Patch 5 fixes the return code of tdc for when a test fails
setup/teardown.
v1->v2: Suggestions by Davide
Pedro Tammela (5):
selftests: tc-testing: add missing netfilter config
selftests: tc-testing: check if 'jq' is available in taprio tests
selftests: tc-testing: adjust fq test to latest iproute2
selftests: tc-testing: enable all tdc tests
selftests: tc-testing: return fail if a test fails in setup/teardown
tools/testing/selftests/tc-testing/config | 1 +
tools/testing/selftests/tc-testing/tc-tests/qdiscs/fq.json | 2 +-
tools/testing/selftests/tc-testing/tc-tests/qdiscs/taprio.json | 2 ++
tools/testing/selftests/tc-testing/tdc.py | 2 +-
tools/testing/selftests/tc-testing/tdc.sh | 3 +--
5 files changed, 6 insertions(+), 4 deletions(-)
--
2.40.1
The default timeout for tests is 45sec, bench-lookups_ipv6
seems to take around 50sec when running in a VM without
HW acceleration. Give it a 2x margin and set the timeout
to 120sec.
Fixes: d1066c9c58d4 ("selftests/net: Add test/benchmark for removing MKTs")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Long story short I looked at the output for bench-lookups_ipv6
and it seems to be a trivial timeout problem. With this we're at
22/24 passing for TCP AO, the reset case failures aren't as obvious...
CC: shuah(a)kernel.org
CC: 0x7f454c46(a)gmail.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/tcp_ao/settings | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/net/tcp_ao/settings
diff --git a/tools/testing/selftests/net/tcp_ao/settings b/tools/testing/selftests/net/tcp_ao/settings
new file mode 100644
index 000000000000..6091b45d226b
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_ao/settings
@@ -0,0 +1 @@
+timeout=120
--
2.43.0
This series addresses self-test failures for the UDP GRO-related tests.
The first patch addresses the main problem I observe locally - the XDP
program required by such tests, xdp_dummy, is currently built in the
ebpf self-tests directory and is not available if/when the user targets
net only. Arguably this is more of a refactor than a fix, but it still
targets net so that the failing tests can hopefully be addressed there
directly.
The second patch fixes the integration of such tests with the build
system.
Patch 3/3 fixes sporadic failures due to races.
Tested with:
make -C tools/testing/selftests/ TARGETS=net install
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t "net:udpgro_bench.sh net:udpgro.sh net:udpgro_fwd.sh \
net:udpgro_frglist.sh net:veth.sh"
no failures.
Paolo Abeni (3):
selftests: net: remove dependency on ebpf tests
selftests: net: included needed helper in the install targets
selftests: net: explicitly wait for listener ready
tools/testing/selftests/net/Makefile | 6 ++++--
tools/testing/selftests/net/udpgro.sh | 4 ++--
tools/testing/selftests/net/udpgro_bench.sh | 4 ++--
tools/testing/selftests/net/udpgro_frglist.sh | 6 +++---
tools/testing/selftests/net/udpgro_fwd.sh | 8 +++++---
tools/testing/selftests/net/veth.sh | 4 ++--
tools/testing/selftests/net/xdp_dummy.c | 13 +++++++++++++
7 files changed, 31 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/net/xdp_dummy.c
--
2.43.0
Still a bit unclear whether each directory should have its own
config file, but assuming they should, let's add one for tcp_ao.
The following tests still fail with this config in place:
- rst_ipv4,
- rst_ipv6,
- bench-lookups_ipv6.
The other 21 pass.
Fixes: d11301f65977 ("selftests/net: Add TCP-AO ICMPs accept test")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: 0x7f454c46(a)gmail.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/tcp_ao/config | 10 ++++++++++
1 file changed, 10 insertions(+)
create mode 100644 tools/testing/selftests/net/tcp_ao/config
diff --git a/tools/testing/selftests/net/tcp_ao/config b/tools/testing/selftests/net/tcp_ao/config
new file mode 100644
index 000000000000..d3277a9de987
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_ao/config
@@ -0,0 +1,10 @@
+CONFIG_CRYPTO_HMAC=y
+CONFIG_CRYPTO_RMD160=y
+CONFIG_CRYPTO_SHA1=y
+CONFIG_IPV6_MULTIPLE_TABLES=y
+CONFIG_IPV6=y
+CONFIG_NET_L3_MASTER_DEV=y
+CONFIG_NET_VRF=y
+CONFIG_TCP_AO=y
+CONFIG_TCP_MD5SIG=y
+CONFIG_VETH=m
--
2.43.0
Currently the seccomp benchmark selftest produces non-standard output,
meaning that while it makes a number of checks on the performance it
observes, the results have to be parsed by humans. This means that
automated systems running this suite of tests are almost certainly
ignoring the results, which isn't ideal for spotting problems. Let's
rework things so that each check the program does is reported as a test
result to the framework.
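For context, a minimal sketch of the kind of KTAP reporting this
conversion moves to, using the helpers from
tools/testing/selftests/kselftest.h (the check names and timing numbers
below are invented for illustration, not taken from the benchmark):

#include "../kselftest.h"  /* assumes placement under tools/testing/selftests/<dir>/ */

int main(void)
{
        unsigned long long native_ns = 100, filtered_ns = 160;  /* placeholder timings */

        ksft_print_header();
        ksft_set_plan(2);

        /* Each expectation becomes one TAP line instead of free-form text. */
        ksft_test_result(filtered_ns >= native_ns,
                         "filtered call is not faster than native\n");
        ksft_test_result(filtered_ns < 10 * native_ns,
                         "per-filter overhead is within the expected bound\n");

        ksft_finished();
}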
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Silence checkpatch noise.
- Link to v3: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v3-0…
Changes in v3:
- Re-add signoff.
- Link to v2: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v2-0…
Changes in v2:
- Rebase onto v6.8-rc1.
- Link to v1: https://lore.kernel.org/r/20231219-b4-kselftest-seccomp-benchmark-ktap-v1-0…
---
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
.../testing/selftests/seccomp/seccomp_benchmark.c | 104 +++++++++++++--------
1 file changed, 64 insertions(+), 40 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20231219-b4-kselftest-seccomp-benchmark-ktap-357603823708
Best regards,
--
Mark Brown <broonie(a)kernel.org>
After commit 25ae948b4478 ("selftests/net: add lib.sh") but before commit
2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh"), some net selftests encountered errors when they were being
exported and run. This was because the new net/lib.sh was not exported
along with the tests. The errors were crudely avoided by duplicating some
content between net/lib.sh and net/forwarding/lib.sh in 2114e83381d3.
In order to restore the sourcing of net/lib.sh from net/forwarding/lib.sh
and remove the duplicated content, this series introduces a new selftests
Makefile variable to list extra files to export from other directories and
makes use of it to avoid reintroducing the errors mentioned above.
v1:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Changed TEST_INCLUDES to take relative paths, like other TEST_* variables.
Paths are adjusted accordingly in the subsequent patches. (Vladimir Oltean)
* selftests: bonding: Change script interpreter
selftests: forwarding: Remove executable bits from lib.sh
Removed from this series, submitted separately.
Since commit 2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh") resolved the test errors, this version of the series is
focused on removing the duplication that was added in that commit. Directly
rebasing the series would reintroduce the problems that 2114e83381d3
avoided before fixing them again. In order to prevent such breakage partway
through the series, patches are reordered and content changed slightly but
there is no diff at the end compared with the simple rebasing approach. I
have dropped most review tags on account of this reordering.
RFC:
https://lore.kernel.org/netdev/20231222135836.992841-1-bpoirier@nvidia.com/
Link: https://lore.kernel.org/netdev/ZXu7dGj7F9Ng8iIX@Laptop-X1/
Benjamin Poirier (5):
selftests: Introduce Makefile variable to list shared bash scripts
selftests: bonding: Add net/forwarding/lib.sh to TEST_INCLUDES
selftests: team: Add shared library scripts to TEST_INCLUDES
selftests: dsa: Replace test symlinks by wrapper script
selftests: forwarding: Redefine relative_path variable
Petr Machata (1):
selftests: forwarding: Remove duplicated lib.sh content
Documentation/dev-tools/kselftest.rst | 10 +++++
tools/testing/selftests/Makefile | 7 +++-
.../selftests/drivers/net/bonding/Makefile | 7 +++-
.../net/bonding/bond-eth-type-change.sh | 2 +-
.../drivers/net/bonding/bond_topo_2d1c.sh | 2 +-
.../drivers/net/bonding/dev_addr_lists.sh | 2 +-
.../net/bonding/mode-1-recovery-updelay.sh | 2 +-
.../net/bonding/mode-2-recovery-updelay.sh | 2 +-
.../drivers/net/bonding/net_forwarding_lib.sh | 1 -
.../selftests/drivers/net/dsa/Makefile | 18 ++++++++-
.../drivers/net/dsa/bridge_locked_port.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mdb.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mld.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_aware.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_mcast.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_unaware.sh | 2 +-
.../testing/selftests/drivers/net/dsa/lib.sh | 1 -
.../drivers/net/dsa/local_termination.sh | 2 +-
.../drivers/net/dsa/no_forwarding.sh | 2 +-
.../net/dsa/run_net_forwarding_test.sh | 9 +++++
.../selftests/drivers/net/dsa/tc_actions.sh | 2 +-
.../selftests/drivers/net/dsa/tc_common.sh | 1 -
.../drivers/net/dsa/test_bridge_fdb_stress.sh | 2 +-
.../selftests/drivers/net/team/Makefile | 7 ++--
.../drivers/net/team/dev_addr_lists.sh | 4 +-
.../selftests/drivers/net/team/lag_lib.sh | 1 -
.../drivers/net/team/net_forwarding_lib.sh | 1 -
tools/testing/selftests/lib.mk | 19 ++++++++++
.../testing/selftests/net/forwarding/Makefile | 3 ++
tools/testing/selftests/net/forwarding/lib.sh | 37 +++----------------
.../net/forwarding/mirror_gre_lib.sh | 2 +-
.../net/forwarding/mirror_gre_topo_lib.sh | 2 +-
32 files changed, 96 insertions(+), 64 deletions(-)
delete mode 120000 tools/testing/selftests/drivers/net/bonding/net_forwarding_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/lib.sh
create mode 100755 tools/testing/selftests/drivers/net/dsa/run_net_forwarding_test.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/tc_common.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/lag_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/net_forwarding_lib.sh
--
2.43.0
Hi,
These two patches fix an issue when the user running net_test is not
root. The second patch simplifies the test error logs.
Regards,
Mickaël Salaün (2):
selftests/landlock: Fix capability for net_test
selftests/landlock: Clean up error logs related to capabilities
tools/testing/selftests/landlock/common.h | 88 ++++++++++++---------
tools/testing/selftests/landlock/net_test.c | 5 +-
2 files changed, 55 insertions(+), 38 deletions(-)
--
2.43.0
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that validates whether this feature
works correctly.
The selftest needs to verify that writing non-contiguous CBMs to the
schemata file behaves as expected, i.e. consistently with the reported
support for non-contiguous CBMs.
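In rough terms the check boils down to a probe like the sketch below
(illustrative only: the bitmask, helper name and error handling are
mine; the real test derives the CBM from the cache mask reported by
resctrl and compares the outcome against the kernel's advertised
support, e.g. the info/L3/sparse_masks file):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to program a non-contiguous CBM (bits 0-3 and 8-11) into the root
 * group.  Assumes /sys/fs/resctrl is already mounted.  Returns 0 if the
 * kernel accepted the schemata write, -1 otherwise.
 */
static int try_noncontiguous_cbm(void)
{
        const char *line = "L3:0=f0f\n";
        int fd = open("/sys/fs/resctrl/schemata", O_WRONLY);
        ssize_t ret;

        if (fd < 0)
                return -1;

        ret = write(fd, line, strlen(line));
        close(fd);

        return ret < 0 ? -1 : 0;
}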
The patch series is based on a rework of resctrl selftests that's currently in
review [1]. The series also implements functionality similar to the bash
script included in the cover letter of the original non-contiguous CBMs
in Intel CAT series [2].
Changelog v2:
- Rebase onto v3 of [1] series.
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (3):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 80 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 5 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 12 ++-
.../testing/selftests/resctrl/resctrl_tests.c | 18 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 86 ++++++++++++++++---
7 files changed, 185 insertions(+), 26 deletions(-)
--
2.43.0
On Tue, 2024-01-23 at 08:45 -0800, Jakub Kicinski wrote:
> On Mon, 22 Jan 2024 15:34:34 +0000 (UTC) Kalle Valo wrote:
> > Lukas Bulwahn (1):
> > wifi: cfg80211/mac80211: remove dependency on non-existing option
>
> BTW we run all kernel's kunit tests on netdev periodically by doing:
>
> ./tools/testing/kunit/kunit.py run --all
>
> and AFAICT the WiFi tests don't pop up there :(
>
> https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-01-23--15-0…
>
> Is that on purpose?
No, but honestly, I didn't even really know about it, which is mostly
for lack of bothering to look - because we run them in a different way
now both in upstream hostap and internally (internally we also have
another file with metadata to tie them to other bits of the whole
project that isn't interesting upstream).
Looks like that needs adjustments to the config file there, mostly? I
can see about adding that, probably not that hard, at least for
mac80211/cfg80211.
++kunit folks:
We're also adding unit tests to iwlwifi (slowly), any idea if we should
enable that here also? It _is_ now possible to build PCI stuff on kunit,
but it requires some additional config options (virt-pci etc.), not sure
that's desirable here? It doesn't need it at runtime for the tests, of
course.
johannes
This patch abstracts the envcfg CSR in the kernel (as is done for other
homonymous CSRs). CSR_ENVCFG is used as an alias for CSR_SENVCFG or
CSR_MENVCFG depending on how the kernel is compiled.
Additionally it changes CBZE enabling to use CSR_ENVCFG instead of
CSR_SENVCFG.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 2 ++
arch/riscv/kernel/cpufeature.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 306a19a5509c..b3400517b0a9 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -415,6 +415,7 @@
# define CSR_STATUS CSR_MSTATUS
# define CSR_IE CSR_MIE
# define CSR_TVEC CSR_MTVEC
+# define CSR_ENVCFG CSR_MENVCFG
# define CSR_SCRATCH CSR_MSCRATCH
# define CSR_EPC CSR_MEPC
# define CSR_CAUSE CSR_MCAUSE
@@ -439,6 +440,7 @@
# define CSR_STATUS CSR_SSTATUS
# define CSR_IE CSR_SIE
# define CSR_TVEC CSR_STVEC
+# define CSR_ENVCFG CSR_SENVCFG
# define CSR_SCRATCH CSR_SSCRATCH
# define CSR_EPC CSR_SEPC
# define CSR_CAUSE CSR_SCAUSE
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index b3785ffc1570..98623393fd1f 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -725,7 +725,7 @@ arch_initcall(check_unaligned_access_all_cpus);
void riscv_user_isa_enable(void)
{
if (riscv_cpu_has_extension_unlikely(smp_processor_id(), RISCV_ISA_EXT_ZICBOZ))
- csr_set(CSR_SENVCFG, ENVCFG_CBZE);
+ csr_set(CSR_ENVCFG, ENVCFG_CBZE);
}
#ifdef CONFIG_RISCV_ALTERNATIVE
--
2.43.0
From e097eed364b4ef5f9d5e6c7ef22685bf34021555 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 12 Dec 2023 14:28:59 -0800
Subject: [PATCH 02/28] riscv: envcfg save and restore on trap entry/exit
The envcfg CSR defines enabling bits for cache management instructions and
will soon control enabling of the control flow integrity and pointer masking
features. Control flow integrity enabling for forward cfi and backward cfi is
controlled via envcfg and thus needs to be enabled on a per-thread basis.
This patch creates a placeholder for the envcfg CSR in `thread_info` and adds
logic to save and restore it on trap entry and exit.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/thread_info.h | 1 +
arch/riscv/kernel/asm-offsets.c | 1 +
arch/riscv/kernel/entry.S | 4 ++++
3 files changed, 6 insertions(+)
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 574779900bfb..320bc899a63b 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -57,6 +57,7 @@ struct thread_info {
long user_sp; /* User stack pointer */
int cpu;
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
+ unsigned long envcfg;
#ifdef CONFIG_SHADOW_CALL_STACK
void *scs_base;
void *scs_sp;
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..cdd8f095c30c 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -39,6 +39,7 @@ void asm_offsets(void)
OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
+ OFFSET(TASK_TI_ENVCFG, task_struct, thread_info.envcfg);
#ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
#endif
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 54ca4564a926..63c3855ba80d 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -129,6 +129,10 @@ SYM_CODE_START_NOALIGN(ret_from_exception)
addi s0, sp, PT_SIZE_ON_STACK
REG_S s0, TASK_TI_KERNEL_SP(tp)
+ /* restore envcfg bits for current thread */
+ REG_L s0, TASK_TI_ENVCFG(tp)
+ csrw CSR_ENVCFG, s0
+
/* Save the kernel shadow call stack pointer */
scs_save_current
--
2.43.0
From 00561993452d050e19bd386d6d03a2a8aeb92ea2 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 29 Dec 2023 18:22:25 -0800
Subject: [PATCH 03/28] riscv: define default value for envcfg
Defines a base default value for envcfg per task. By default all tasks
should have cache zeroing capability. Any future capabilities can be
turned on.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 2 ++
arch/riscv/kernel/process.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index b3400517b0a9..01ba87954da2 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -202,6 +202,8 @@
#define ENVCFG_CBIE_FLUSH _AC(0x1, UL)
#define ENVCFG_CBIE_INV _AC(0x3, UL)
#define ENVCFG_FIOM _AC(0x1, UL)
+/* by default all threads should be able to zero cache */
+#define ENVCFG_BASE ENVCFG_CBZE
/* Smstateen bits */
#define SMSTATEEN0_AIA_IMSIC_SHIFT 58
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 4f21d970a129..2420123444c4 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -152,6 +152,7 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
else
regs->status |= SR_UXL_64;
#endif
+ current->thread_info.envcfg = ENVCFG_BASE;
}
void flush_thread(void)
--
2.43.0
From 87902dd95726e86bcd9d791cc73ca0a88f66d24a Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 16:58:07 -0800
Subject: [PATCH 04/28] riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv
riscv will need an implementation of exit_thread to clean up the shadow stack
when a thread exits. If the current thread has shadow stack enabled, a shadow
stack is allocated by default for any new thread.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/Kconfig | 1 +
arch/riscv/kernel/process.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a..9d386e9edc45 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -142,6 +142,7 @@ config RISCV
select HAVE_RSEQ
select HAVE_STACKPROTECTOR
select HAVE_SYSCALL_TRACEPOINTS
+ select HAVE_EXIT_THREAD
select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 2420123444c4..c249cf3d8083 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -192,6 +192,11 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
}
+void exit_thread(struct task_struct *tsk)
+{
+ return;
+}
+
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
--
2.43.0
From bc80126c6b94228de149d767eb21d2e8d98f08df Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Sun, 15 Jan 2023 23:28:54 -0800
Subject: [PATCH 05/28] riscv: zicfiss/zicfilp enumeration
This patch adds support for detecting zicfiss and zicfilp. zicfiss and zicfilp
are the unprivileged integer spec extensions for shadow stack and for branch
tracking on indirect branches, respectively.
This patch looks for zicfiss and zicfilp in the device tree and accordingly
lights up the corresponding bits in the CPU feature bitmap. Furthermore, it
adds detection utility functions that return whether shadow stack or landing
pads are supported by the CPU.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/cpufeature.h | 18 ++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 2 ++
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/kernel/cpufeature.c | 2 ++
4 files changed, 23 insertions(+)
diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index a418c3112cd6..216190731c55 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -133,4 +133,22 @@ static __always_inline bool riscv_cpu_has_extension_unlikely(int cpu, const unsi
return __riscv_isa_extension_available(hart_isa[cpu].isa, ext);
}
+static inline bool cpu_supports_shadow_stack(void)
+{
+#ifdef CONFIG_RISCV_USER_CFI
+ return riscv_isa_extension_available(NULL, ZICFISS);
+#else
+ return false;
+#endif
+}
+
+static inline bool cpu_supports_indirect_br_lp_instr(void)
+{
+#ifdef CONFIG_RISCV_USER_CFI
+ return riscv_isa_extension_available(NULL, ZICFILP);
+#else
+ return false;
+#endif
+}
+
#endif
diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 06d30526ef3b..918165cfb4fa 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -57,6 +57,8 @@
#define RISCV_ISA_EXT_ZIHPM 42
#define RISCV_ISA_EXT_SMSTATEEN 43
#define RISCV_ISA_EXT_ZICOND 44
+#define RISCV_ISA_EXT_ZICFISS 45
+#define RISCV_ISA_EXT_ZICFILP 46
#define RISCV_ISA_EXT_MAX 64
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index f19f861cda54..ee2f51787ff8 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -13,6 +13,7 @@
#include <vdso/processor.h>
#include <asm/ptrace.h>
+#include <asm/hwcap.h>
#ifdef CONFIG_64BIT
#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 98623393fd1f..16624bc9a46b 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -185,6 +185,8 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
__RISCV_ISA_EXT_DATA(svinval, RISCV_ISA_EXT_SVINVAL),
__RISCV_ISA_EXT_DATA(svnapot, RISCV_ISA_EXT_SVNAPOT),
__RISCV_ISA_EXT_DATA(svpbmt, RISCV_ISA_EXT_SVPBMT),
+ __RISCV_ISA_EXT_DATA(zicfiss, RISCV_ISA_EXT_ZICFISS),
+ __RISCV_ISA_EXT_DATA(zicfilp, RISCV_ISA_EXT_ZICFILP),
};
const size_t riscv_isa_ext_count = ARRAY_SIZE(riscv_isa_ext);
--
2.43.0
From f5c0949f18a5458498486b332d97121d4f3def27 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 16 Jan 2023 00:08:32 -0800
Subject: [PATCH 06/28] riscv: zicfiss/zicfilp extension csr and bit
definitions
The zicfiss and zicfilp extensions get enabled via bits 3 and 2 of the xenvcfg
CSR. menvcfg controls enabling for S/HS mode, henvcfg controls enabling for VS
mode, while senvcfg controls enabling for U/VU mode.
The zicfilp extension extends the xstatus CSR to hold an `expected landing pad`
bit. A trap or interrupt can occur between an indirect jmp/call and the target
instruction. The `expected landing pad` bit from the CPU is recorded into the
xstatus CSR so that when the supervisor performs xret, the `expected landing
pad` state of the CPU can be restored.
zicfiss adds one new CSR:
- CSR_SSP: contains the current shadow stack pointer.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 01ba87954da2..80fe38d5de4a 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -18,6 +18,15 @@
#define SR_MPP _AC(0x00001800, UL) /* Previously Machine */
#define SR_SUM _AC(0x00040000, UL) /* Supervisor User Memory Access */
+/* zicfilp landing pad status bit */
+#define SR_SPELP _AC(0x00800000, UL)
+#define SR_MPELP _AC(0x020000000000, UL)
+#ifdef CONFIG_RISCV_M_MODE
+#define SR_ELP SR_MPELP
+#else
+#define SR_ELP SR_SPELP
+#endif
+
#define SR_FS _AC(0x00006000, UL) /* Floating-point Status */
#define SR_FS_OFF _AC(0x00000000, UL)
#define SR_FS_INITIAL _AC(0x00002000, UL)
@@ -196,6 +205,8 @@
#define ENVCFG_PBMTE (_AC(1, ULL) << 62)
#define ENVCFG_CBZE (_AC(1, UL) << 7)
#define ENVCFG_CBCFE (_AC(1, UL) << 6)
+#define ENVCFG_LPE (_AC(1, UL) << 2)
+#define ENVCFG_SSE (_AC(1, UL) << 3)
#define ENVCFG_CBIE_SHIFT 4
#define ENVCFG_CBIE (_AC(0x3, UL) << ENVCFG_CBIE_SHIFT)
#define ENVCFG_CBIE_ILL _AC(0x0, UL)
@@ -216,6 +227,11 @@
#define SMSTATEEN0_HSENVCFG (_ULL(1) << SMSTATEEN0_HSENVCFG_SHIFT)
#define SMSTATEEN0_SSTATEEN0_SHIFT 63
#define SMSTATEEN0_SSTATEEN0 (_ULL(1) << SMSTATEEN0_SSTATEEN0_SHIFT)
+/*
+ * zicfiss user mode csr
+ * CSR_SSP holds current shadow stack pointer.
+ */
+#define CSR_SSP 0x011
/* symbolic CSR names: */
#define CSR_CYCLE 0xc00
--
2.43.0
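As a rough illustration of how these bits tie into the per-thread envcfg
image introduced earlier in the series, a later patch could enable the
features along the lines of the sketch below (not part of this patch;
the function name is made up and the actual series may gate this behind
a prctl instead):

#include <linux/sched.h>
#include <asm/cpufeature.h>
#include <asm/csr.h>

/* Illustrative only: light up landing-pad and shadow-stack enables for the
 * current task.  The per-thread envcfg value is written to CSR_ENVCFG on
 * trap exit (see the earlier envcfg save/restore patch). */
static void user_cfi_enable_current(void)
{
        if (cpu_supports_indirect_br_lp_instr())
                current->thread_info.envcfg |= ENVCFG_LPE;
        if (cpu_supports_shadow_stack())
                current->thread_info.envcfg |= ENVCFG_SSE;
}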
From 0551775c478b9643a7d801e9ed56fe47554e1ba9 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 16 Jan 2023 03:34:04 -0800
Subject: [PATCH 07/28] riscv: kernel handling on trap entry/exit for user cfi
Carves out space in the arch-specific thread struct for CFI status and shadow
stack in user mode on riscv.
This patch does the following:
- defines a new structure cfi_status with a status bit for the CFI feature
- defines the shadow stack pointer, base and size in the cfi_status structure
- defines offsets to the new member fields in thread_info in asm-offsets.c
- saves and restores the shadow stack pointer on trap entry (U --> S) and exit
(S --> U)
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/include/asm/thread_info.h | 3 +++
arch/riscv/include/asm/usercfi.h | 24 ++++++++++++++++++++++++
arch/riscv/kernel/asm-offsets.c | 5 ++++-
arch/riscv/kernel/entry.S | 25 +++++++++++++++++++++++++
5 files changed, 57 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/include/asm/usercfi.h
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index ee2f51787ff8..d4dc298880fc 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,6 +14,7 @@
#include <asm/ptrace.h>
#include <asm/hwcap.h>
+#include <asm/usercfi.h>
#ifdef CONFIG_64BIT
#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 320bc899a63b..6a2acecec546 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -58,6 +58,9 @@ struct thread_info {
int cpu;
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
unsigned long envcfg;
+#ifdef CONFIG_RISCV_USER_CFI
+ struct cfi_status user_cfi_state;
+#endif
#ifdef CONFIG_SHADOW_CALL_STACK
void *scs_base;
void *scs_sp;
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
new file mode 100644
index 000000000000..080d7077d12c
--- /dev/null
+++ b/arch/riscv/include/asm/usercfi.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (C) 2023 Rivos, Inc.
+ * Deepak Gupta <debug(a)rivosinc.com>
+ */
+#ifndef _ASM_RISCV_USERCFI_H
+#define _ASM_RISCV_USERCFI_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+#ifdef CONFIG_RISCV_USER_CFI
+struct cfi_status {
+ unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+ unsigned long user_shdw_stk; /* Current user shadow stack pointer */
+ unsigned long shdw_stk_base; /* Base address of shadow stack */
+ unsigned long shdw_stk_size; /* size of shadow stack */
+};
+
+#endif /* CONFIG_RISCV_USER_CFI */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_RISCV_USERCFI_H */
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index cdd8f095c30c..5e1f412e96ba 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -43,8 +43,11 @@ void asm_offsets(void)
#ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
#endif
-
OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
+#ifdef CONFIG_RISCV_USER_CFI
+ OFFSET(TASK_TI_CFI_STATUS, task_struct, thread_info.user_cfi_state);
+ OFFSET(TASK_TI_USER_SSP, task_struct, thread_info.user_cfi_state.user_shdw_stk);
+#endif
OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
OFFSET(TASK_THREAD_F1, task_struct, thread.fstate.f[1]);
OFFSET(TASK_THREAD_F2, task_struct, thread.fstate.f[2]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 63c3855ba80d..410659e2eadb 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -49,6 +49,21 @@ SYM_CODE_START(handle_exception)
REG_S x5, PT_T0(sp)
save_from_x6_to_x31
+#ifdef CONFIG_RISCV_USER_CFI
+ /*
+ * we need to save cfi status only when previous mode was U
+ */
+ csrr s2, CSR_STATUS
+ andi s2, s2, SR_SPP
+ bnez s2, skip_bcfi_save
+ /* load cfi status word */
+ lw s3, TASK_TI_CFI_STATUS(tp)
+ andi s3, s3, 1
+ beqz s3, skip_bcfi_save
+ csrr s3, CSR_SSP
+ REG_S s3, TASK_TI_USER_SSP(tp) /* save user ssp in thread_info */
+skip_bcfi_save:
+#endif
/*
* Disable user-mode memory access as it should only be set in the
* actual user copy routines.
@@ -141,6 +156,16 @@ SYM_CODE_START_NOALIGN(ret_from_exception)
* structures again.
*/
csrw CSR_SCRATCH, tp
+
+#ifdef CONFIG_RISCV_USER_CFI
+ lw s3, TASK_TI_CFI_STATUS(tp)
+ andi s3, s3, 1
+ beqz s3, skip_bcfi_resume
+ REG_L s3, TASK_TI_USER_SSP(tp) /* restore user ssp from thread struct */
+ csrw CSR_SSP, s3
+skip_bcfi_resume:
+#endif
+
1:
REG_L a0, PT_STATUS(sp)
/*
--
2.43.0
From 78a8bd18df45b83011353c24a75bd6bc00bf84c7 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 6 Dec 2023 17:37:49 -0800
Subject: [PATCH 08/28] mm: Define VM_SHADOW_STACK for RISC-V
VM_SHADOW_STACK is defined by x86 as a vm flag to mark a shadow stack vma.
x86 uses the VM_HIGH_ARCH_5 bit, but that limits shadow stack vmas to 64-bit
only. arm64 follows the same path:
https://lore.kernel.org/lkml/20231009-arm64-gcs-v6-12-78e55deaa4dd@kernel.o…
On RISC-V, write-only page table encodings are shadow stack pages. This patch
redefines a VM_WRITE-only vma to be VM_SHADOW_STACK.
The next set of patches will add guard rails so that no other mm flow can set
VM_WRITE alone in a vma except when specifically creating a shadow stack.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..dfe0e8118669 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -352,7 +352,19 @@ extern unsigned int kobjsize(const void *objp);
* for more details on the guard size.
*/
# define VM_SHADOW_STACK VM_HIGH_ARCH_5
-#else
+#endif
+
+#ifdef CONFIG_RISCV_USER_CFI
+/*
+ * On RISC-V pte encodings for shadow stack is R=0, W=1, X=0 and thus RISCV
+ * choosing to use similar mechanism on vm_flags where VM_WRITE only means
+ * VM_SHADOW_STACK. RISCV as well doesn't support VM_SHADOW_STACK to be set
+ * with VM_SHARED.
+ */
+#define VM_SHADOW_STACK VM_WRITE
+#endif
+
+#ifndef VM_SHADOW_STACK
# define VM_SHADOW_STACK VM_NONE
#endif
--
2.43.0
From 0d6de4bf12ec3349f8b3f96760575f8f660f0ae9 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 29 Dec 2023 18:32:53 -0800
Subject: [PATCH 09/28] mm: abstract shadow stack vma behind
`arch_is_shadow_stack`
x86 has used VM_SHADOW_STACK (an alias of VM_HIGH_ARCH_5) to encode a shadow
stack VMA. VM_SHADOW_STACK is thus not possible on 32-bit. Some arches may
need a way to encode shadow stack on both 32-bit and 64-bit, and they may
encode this information differently in VMAs.
This patch changes checks of the VM_SHADOW_STACK flag in generic code into
calls to a function `arch_is_shadow_stack`, which returns true if the arch
supports shadow stack and the vma is a shadow stack, while the stub returns
false otherwise.
There was a suggestion to name it `vma_is_shadow_stack`. I preferred to keep
the `arch` prefix because the encoding is arch-specific.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 18 +++++++++++++++++-
mm/gup.c | 5 +++--
mm/internal.h | 2 +-
3 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dfe0e8118669..15c70fc677a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -352,6 +352,10 @@ extern unsigned int kobjsize(const void *objp);
* for more details on the guard size.
*/
# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return (vm_flags & VM_SHADOW_STACK);
+}
#endif
#ifdef CONFIG_RISCV_USER_CFI
@@ -362,10 +366,22 @@ extern unsigned int kobjsize(const void *objp);
* with VM_SHARED.
*/
#define VM_SHADOW_STACK VM_WRITE
+
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return ((vm_flags & (VM_WRITE | VM_READ | VM_EXEC)) == VM_WRITE);
+}
+
#endif
#ifndef VM_SHADOW_STACK
# define VM_SHADOW_STACK VM_NONE
+
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return false;
+}
+
#endif
#if defined(CONFIG_X86)
@@ -3464,7 +3480,7 @@ static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
return stack_guard_gap;
/* See reasoning around the VM_SHADOW_STACK definition */
- if (vma->vm_flags & VM_SHADOW_STACK)
+ if (vma->vm_flags && arch_is_shadow_stack(vma->vm_flags))
return PAGE_SIZE;
return 0;
diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390..45798782ed2c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1051,7 +1051,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
!writable_file_mapping_allowed(vma, gup_flags))
return -EFAULT;
- if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
+ if (!(vm_flags & VM_WRITE) || arch_is_shadow_stack(vm_flags)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
@@ -1069,7 +1069,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (!is_cow_mapping(vm_flags))
return -EFAULT;
}
- } else if (!(vm_flags & VM_READ)) {
+ } else if (!(vm_flags & VM_READ) && !arch_is_shadow_stack(vm_flags)) {
+ /* reads allowed if its shadow stack vma */
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/*
diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..0abf00c93fe1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -572,7 +572,7 @@ static inline bool is_exec_mapping(vm_flags_t flags)
*/
static inline bool is_stack_mapping(vm_flags_t flags)
{
- return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
+ return ((flags & VM_STACK) == VM_STACK) || arch_is_shadow_stack(flags);
}
/*
--
2.43.0
From 5511cd8dc11f00e3ec604b8fc5afdb0f632bda9e Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 06:13:49 -0800
Subject: [PATCH 10/28] riscv/mm : Introducing new protection flag
"PROT_SHADOWSTACK"
x86 and arm64 use the VM_SHADOW_STACK (which actually is VM_HIGH_ARCH_5)
vma flag and thus restrict it to 64-bit implementations only. RISC-V chooses
to encode the presence of only VM_WRITE in the vma flags as a shadow stack
vma. This allows the 32-bit RISC-V ecosystem to leverage shadow stack as well.
This means that existing users of `do_mmap` who had been passing only
`VM_WRITE` and expecting read and write permissions would break.
Thus `PROT_SHADOWSTACK` is introduced to allow `do_mmap` to disambiguate
between read-write and shadow stack mappings. Any kernel driver/module using
`do_mmap` and passing only `VM_WRITE` would still get read-write mappings,
while any user of `do_mmap` intending to map a shadow stack should pass
`PROT_SHADOWSTACK` to get a shadow stack mapping.
For userspace we still want to rely on `map_shadow_stack` and not expose
`PROT_SHADOWSTACK`, which is why this prot flag is not exposed in uapi
headers.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/mman.h | 25 +++++++++++++++++++++++++
mm/mmap.c | 1 +
2 files changed, 26 insertions(+)
create mode 100644 arch/riscv/include/asm/mman.h
diff --git a/arch/riscv/include/asm/mman.h b/arch/riscv/include/asm/mman.h
new file mode 100644
index 000000000000..4902d837e93c
--- /dev/null
+++ b/arch/riscv/include/asm/mman.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MMAN_H__
+#define __ASM_MMAN_H__
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <uapi/asm/mman.h>
+
+/*
+ * Major architectures (x86, aarch64, riscv) have shadow stack now. x86 and
+ * arm64 choose to use VM_SHADOW_STACK (which actually is VM_HIGH_ARCH_5) vma
+ * flag, however that restrict it to 64bit implementation only. risc-v shadow
+ * stack encodings in page tables is PTE.R=0, PTE.W=1, PTE.D=1 which used to be
+ * reserved until now. risc-v is choosing to encode presence of only VM_WRITE in
+ * vma flags as shadow stack vma. However this means that existing users of mmap
+ * (and do_mmap) who were relying on passing only PROT_WRITE (or VM_WRITE from
+ * kernel driver) but still getting read and write mappings, should still work.
+ * x86 and arm64 followed the direction of a new system call `map_shadow_stack`.
+ * risc-v would like to converge on that so that shadow stacks flows are as much
+ * arch agnostic. Thus a conscious decision to define PROT_XXX definition for
+ * shadow stack here (and not exposed to uapi)
+ */
+#define PROT_SHADOWSTACK 0x40
+
+#endif /* ! __ASM_MMAN_H__ */
diff --git a/mm/mmap.c b/mm/mmap.c
index 1971bfffcc03..fab2acf21ce9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
#include <linux/oom.h>
#include <linux/sched/mm.h>
#include <linux/ksm.h>
+#include <linux/processor.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
--
2.43.0
From a3d7e7e08612681c4980091bfebba930b733eef0 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 06:28:08 -0800
Subject: [PATCH 11/28] riscv: Implementing "PROT_SHADOWSTACK" on riscv
This patch implements the new risc-v specific protection flag
`PROT_SHADOWSTACK` (kernel-only) on riscv.
The `PROT_SHADOWSTACK` protection flag is limited to the kernel and not
exposed to userspace. Shadow stack is a security construct to protect against
ROP attacks. `map_shadow_stack` is a new syscall to manufacture a shadow stack.
In order to avoid multiple methods of creating a shadow stack, `PROT_SHADOWSTACK`
is not allowed for the user space `mmap` call. `mprotect` wouldn't allow it
either, because `arch_validate_prot` already takes care of this for risc-v.
`arch_calc_vm_prot_bits` is implemented on risc-v to return VM_SHADOW_STACK
(an alias for VM_WRITE) if PROT_SHADOWSTACK is supplied (as a call to `do_mmap`
will) and the underlying CPU supports shadow stack. `PROT_WRITE` is converted
to `VM_READ | VM_WRITE` so that the existing case where `PROT_WRITE` is
specified keeps working but doesn't collide with the `VM_WRITE`-only encoding
which now denotes a shadow stack.
The risc-v `mmap` wrapper enforces that if PROT_WRITE is specified and
PROT_READ is left out, then PROT_READ is added.
Earlier, `protection_map[VM_WRITE]` used to pick read-write (and copy-on-write)
PTE encodings. Now all non-shadow-stack writeable mappings pick
`protection_map[VM_WRITE | VM_READ]` PTE encodings, while
`protection_map[VM_WRITE]` is programmed to pick PAGE_SHADOWSTACK PTE
encodings.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/mman.h | 17 +++++++++++++++++
arch/riscv/include/asm/pgtable.h | 1 +
arch/riscv/kernel/sys_riscv.c | 19 +++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
4 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/mman.h b/arch/riscv/include/asm/mman.h
index 4902d837e93c..bc09a9c0e81f 100644
--- a/arch/riscv/include/asm/mman.h
+++ b/arch/riscv/include/asm/mman.h
@@ -22,4 +22,21 @@
*/
#define PROT_SHADOWSTACK 0x40
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+ unsigned long pkey __always_unused)
+{
+ unsigned long ret = 0;
+
+ if (cpu_supports_shadow_stack())
+ ret = (prot & PROT_SHADOWSTACK) ? VM_SHADOW_STACK : 0;
+ /*
+ * If PROT_WRITE was specified, force it to VM_READ | VM_WRITE.
+ * Only VM_WRITE means shadow stack.
+ */
+ if (prot & PROT_WRITE)
+ ret = (VM_READ | VM_WRITE);
+ return ret;
+}
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+
#endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 294044429e8e..54a8dde29504 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -184,6 +184,7 @@ extern struct pt_alloc_ops pt_ops __initdata;
#define PAGE_READ_EXEC __pgprot(_PAGE_BASE | _PAGE_READ | _PAGE_EXEC)
#define PAGE_WRITE_EXEC __pgprot(_PAGE_BASE | _PAGE_READ | \
_PAGE_EXEC | _PAGE_WRITE)
+#define PAGE_SHADOWSTACK __pgprot(_PAGE_BASE | _PAGE_WRITE)
#define PAGE_COPY PAGE_READ
#define PAGE_COPY_EXEC PAGE_READ_EXEC
diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
index a2ca5b7756a5..2a7cf28a6fe0 100644
--- a/arch/riscv/kernel/sys_riscv.c
+++ b/arch/riscv/kernel/sys_riscv.c
@@ -16,6 +16,7 @@
#include <asm/unistd.h>
#include <asm-generic/mman-common.h>
#include <vdso/vsyscall.h>
+#include <asm/mman.h>
static long riscv_sys_mmap(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
@@ -25,6 +26,24 @@ static long riscv_sys_mmap(unsigned long addr, unsigned long len,
if (unlikely(offset & (~PAGE_MASK >> page_shift_offset)))
return -EINVAL;
+ /*
+ * If only PROT_WRITE is specified then extend that to PROT_READ
+ * protection_map[VM_WRITE] is now going to select shadow stack encodings.
+ * So specifying PROT_WRITE actually should select protection_map [VM_WRITE | VM_READ]
+ * If user wants to create shadow stack then they should use `map_shadow_stack` syscall.
+ */
+ if (unlikely((prot & PROT_WRITE) && !(prot & PROT_READ)))
+ prot |= PROT_READ;
+
+ /*
+ * PROT_SHADOWSTACK is a kernel only protection flag on risc-v.
+ * mmap doesn't expect PROT_SHADOWSTACK to be set by user space.
+ * User space can rely on `map_shadow_stack` syscall to create
+ * shadow stack pages.
+ */
+ if (unlikely(prot & PROT_SHADOWSTACK))
+ return -EINVAL;
+
return ksys_mmap_pgoff(addr, len, prot, flags, fd,
offset >> (PAGE_SHIFT - page_shift_offset));
}
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2e011cbddf3a..f71c2d2c6cbf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -296,7 +296,7 @@ pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
static const pgprot_t protection_map[16] = {
[VM_NONE] = PAGE_NONE,
[VM_READ] = PAGE_READ,
- [VM_WRITE] = PAGE_COPY,
+ [VM_WRITE] = PAGE_SHADOWSTACK,
[VM_WRITE | VM_READ] = PAGE_COPY,
[VM_EXEC] = PAGE_EXEC,
[VM_EXEC | VM_READ] = PAGE_READ_EXEC,
--
2.43.0
From e7bf650a2d0bb03e189d0077be51515c4ecaf50d Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 09:38:32 -0800
Subject: [PATCH 12/28] riscv mm: manufacture shadow stack pte
This patch implements creating a shadow stack pte (on riscv). Creating a
shadow stack PTE on riscv means clearing RWX and then setting W=1.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 54a8dde29504..7ed00b4cc73d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline pte_t pte_mkwrite_novma(pte_t pte)
return __pte(pte_val(pte) | _PAGE_WRITE);
}
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+ return __pte((pte_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
/* static inline pte_t pte_mkexec(pte_t pte) */
static inline pte_t pte_mkdirty(pte_t pte)
@@ -705,6 +711,12 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
}
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pte)
+{
+ /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+ return __pmd((pmd_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pte_pmd(pte_wrprotect(pmd_pte(pmd)));
--
2.43.0
From e5d2f0f05004e8f14cde361a5649177ce8a082f0 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 09:17:54 -0800
Subject: [PATCH 13/28] riscv mmu: teach pte_mkwrite to manufacture shadow
stack PTEs
pte_mkwrite creates PTEs with WRITE encodings for the underlying arch. The
underlying arch can have two types of writeable mappings: one that can be
written using regular store instructions, and another that can only be written
using specialized store instructions (like shadow stack stores). pte_mkwrite
can select the write PTE encoding based on the VMA.
On riscv, the presence of only VM_WRITE in vma->vm_flags means it's a shadow
stack.
This implements pte_mkwrite and pmd_mkwrite on riscv.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 7 +++++++
arch/riscv/mm/pgtable.c | 21 +++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 7ed00b4cc73d..9477108e727d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -403,6 +403,10 @@ static inline pte_t pte_wrprotect(pte_t pte)
/* static inline pte_t pte_mkread(pte_t pte) */
+struct vm_area_struct;
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
+#define pte_mkwrite pte_mkwrite
+
static inline pte_t pte_mkwrite_novma(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_WRITE);
@@ -706,6 +710,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
}
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#define pmd_mkwrite pmd_mkwrite
+
static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
{
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index fef4e7328e49..9b1845f93ea1 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -101,3 +101,24 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
return pmd;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (arch_is_shadow_stack(vma->vm_flags))
+ return pte_mkwrite_shstk(pte);
+
+ pte = pte_mkwrite_novma(pte);
+
+ return pte;
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (arch_is_shadow_stack(vma->vm_flags))
+ return pmd_mkwrite_shstk(pmd);
+
+ pmd = pmd_mkwrite_novma(pmd);
+
+ return pmd;
+}
+
--
2.43.0
From 217876b4c2848c78595d9e870a41ca0f2e1c2ea2 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 17 Jan 2023 09:35:13 -0800
Subject: [PATCH 14/28] riscv mmu: write protect and shadow stack
`fork` implements copy-on-write (COW) by making pages read-only in both child
and parent.
ptep_set_wrprotect and pte_wrprotect clear _PAGE_WRITE in the PTE. The
assumption is that the page is readable and that copy-on-write happens on
fault.
For shadow stack pages, clearing the W bit makes them XWR = 000. This results
in a wrong PTE setting which says no permissions but has V=1 and the PFN
field pointing to the final page. Instead, the desired behavior is to turn it
into a readable page, take an access (load/store) fault on sspush/sspop
(shadow stack), and then perform COW on such pages. This way regular reads
are still allowed and do not lead to COW, maintaining the current COW
behavior for non-shadow-stack writeable memory.
On the other hand it doesn't interfere with existing COW for read-write
memory: the assumption there is that _PAGE_READ must already be set, and thus
setting _PAGE_READ is harmless.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9477108e727d..9802e8d48616 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -398,7 +398,7 @@ static inline int pte_special(pte_t pte)
static inline pte_t pte_wrprotect(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_WRITE));
+ return __pte((pte_val(pte) & ~(_PAGE_WRITE)) | (_PAGE_READ));
}
/* static inline pte_t pte_mkread(pte_t pte) */
@@ -594,7 +594,15 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
- atomic_long_and(~(unsigned long)_PAGE_WRITE, (atomic_long_t *)ptep);
+ volatile pte_t read_pte = *ptep;
+ /*
+ * ptep_set_wrprotect can be called for shadow stack ranges too.
+ * shadow stack memory is XWR = 010 and thus clearing _PAGE_WRITE will lead to
+ * encoding 000b which is wrong encoding with V = 1. This should lead to page fault
+ * but we dont want this wrong configuration to be set in page tables.
+ */
+ atomic_long_set((atomic_long_t *)ptep,
+ ((pte_val(read_pte) & ~(unsigned long)_PAGE_WRITE) | _PAGE_READ));
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
--
2.43.0
From 1527cdb4739ae3884160eb8824b187717f9d0960 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 11:39:14 -0800
Subject: [PATCH 15/28] riscv/mm: Implement map_shadow_stack() syscall
As discussed extensively in the changelog for the addition of this
syscall on x86 ("x86/shstk: Introduce map_shadow_stack syscall") the
existing mmap() and madvise() syscalls do not map entirely well onto the
security requirements for guarded control stacks since they lead to
windows where memory is allocated but not yet protected or stacks which
are not properly and safely initialised. Instead a new syscall
map_shadow_stack() has been defined which allocates and initialises a
shadow stack page.
This patch implements this syscall for riscv. riscv doesn't require the token
to be set up by the kernel because user mode can do that by itself. However,
to provide compatibility and portability with other architectures, user mode
can specify the token set flag.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/usercfi.c | 150 ++++++++++++++++++++++++++++++++
include/uapi/asm-generic/mman.h | 1 +
3 files changed, 153 insertions(+)
create mode 100644 arch/riscv/kernel/usercfi.c
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index fee22a3d1b53..8c668269e886 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -102,3 +102,5 @@ obj-$(CONFIG_COMPAT) += compat_vdso/
obj-$(CONFIG_64BIT) += pi/
obj-$(CONFIG_ACPI) += acpi.o
+
+obj-$(CONFIG_RISCV_USER_CFI) += usercfi.o
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
new file mode 100644
index 000000000000..35ede2cbc05b
--- /dev/null
+++ b/arch/riscv/kernel/usercfi.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Rivos, Inc.
+ * Deepak Gupta <debug(a)rivosinc.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <linux/syscalls.h>
+#include <linux/prctl.h>
+#include <asm/csr.h>
+#include <asm/usercfi.h>
+
+#define SHSTK_ENTRY_SIZE sizeof(void *)
+
+/*
+ * Writes on shadow stack can either be `sspush` or `ssamoswap`. `sspush` can happen
+ * implicitly on current shadow stack pointed to by CSR_SSP. `ssamoswap` takes pointer to
+ * shadow stack. To keep it simple, we plan to use `ssamoswap` to perform writes on shadow
+ * stack.
+ */
+static noinline unsigned long amo_user_shstk(unsigned long *addr, unsigned long val)
+{
+ /*
+ * In case ssamoswap faults, return -1.
+ * Never expect -1 on shadow stack. Expect return addresses and zero
+ */
+ unsigned long swap = -1;
+
+ __enable_user_access();
+ asm_volatile_goto(
+ ".option push\n"
+ ".option arch, +zicfiss\n"
+#ifdef CONFIG_64BIT
+ "1: ssamoswap.d %0, %2, %1\n"
+#else
+ "1: ssamoswap.w %0, %2, %1\n"
+#endif
+ _ASM_EXTABLE(1b, %l[fault])
+ RISCV_ACQUIRE_BARRIER
+ ".option pop\n"
+ : "=r" (swap), "+A" (*addr)
+ : "r" (val)
+ : "memory"
+ : fault
+ );
+ __disable_user_access();
+ return swap;
+fault:
+ __disable_user_access();
+ return -1;
+}
+
+/*
+ * Create a restore token on the shadow stack. A token is always XLEN wide
+ * and aligned to XLEN.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, SHSTK_ENTRY_SIZE))
+ return -EINVAL;
+
+ /* On RISC-V we're constructing token to be function of address itself */
+ addr = ssp - SHSTK_ENTRY_SIZE;
+
+ if (amo_user_shstk((unsigned long __user *)addr, (unsigned long) ssp) == -1)
+ return -EFAULT;
+
+ if (token_addr)
+ *token_addr = addr;
+
+ return 0;
+}
+
+static unsigned long allocate_shadow_stack(unsigned long addr, unsigned long size,
+ unsigned long token_offset,
+ bool set_tok)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ struct mm_struct *mm = current->mm;
+ unsigned long populate, tok_loc = 0;
+
+ if (addr)
+ flags |= MAP_FIXED_NOREPLACE;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, addr, size, PROT_SHADOWSTACK, flags,
+ VM_SHADOW_STACK, 0, &populate, NULL);
+ mmap_write_unlock(mm);
+
+ if (!set_tok || IS_ERR_VALUE(addr))
+ goto out;
+
+ if (create_rstor_token(addr + token_offset, &tok_loc)) {
+ vm_munmap(addr, size);
+ return -EINVAL;
+ }
+
+ addr = tok_loc;
+
+out:
+ return addr;
+}
+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+ unsigned long aligned_size = 0;
+
+ if (!cpu_supports_shadow_stack())
+ return -EOPNOTSUPP;
+
+ /* Anything other than set token should result in invalid param */
+ if (flags & ~SHADOW_STACK_SET_TOKEN)
+ return -EINVAL;
+
+ /*
+ * Unlike other architectures, on RISC-V, SSP pointer is held in CSR_SSP and is available
+ * CSR in all modes. CSR accesses are performed using 12bit index programmed in instruction
+ * itself. This provides static property on register programming and writes to CSR can't
+ * be unintentional from programmer's perspective. As long as programmer has guarded areas
+ * which perform writes to CSR_SSP properly, shadow stack pivoting is not possible. Since
+ * CSR_SSP is writeable by user mode, it itself can setup a shadow stack token subsequent
+ * to allocation. Although in order to provide portablity with other architecture (because
+ * `map_shadow_stack` is arch agnostic syscall), RISC-V will follow expectation of a token
+ * flag in flags and if provided in flags, setup a token at the base.
+ */
+
+ /* If there isn't space for a token */
+ if (set_tok && size < SHSTK_ENTRY_SIZE)
+ return -ENOSPC;
+
+ if (addr && (addr % PAGE_SIZE))
+ return -EINVAL;
+
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return allocate_shadow_stack(addr, aligned_size, size, set_tok);
+}
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 57e8195d0b53..0c0ac6214de6 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -19,4 +19,5 @@
#define MCL_FUTURE 2 /* lock all future mappings */
#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */
#endif /* __ASM_GENERIC_MMAN_H */
--
2.43.0
From 2f73f707afb839a7146833ca8bf13e8a11b26eca Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 13:14:46 -0800
Subject: [PATCH 16/28] riscv/shstk: If needed allocate a new shadow stack on
clone
Userspace specifies CLONE_VM to share the address space and spawn a new thread.
`clone` allows userspace to specify a new stack for the new thread. However,
there is no way to specify a new shadow stack base address without changing the
API. This patch allocates a new shadow stack whenever CLONE_VM is given.
In case of CLONE_VFORK, the parent is suspended until the child finishes, so the
child can use the parent's shadow stack. In case of !CLONE_VM, COW kicks in
because the entire address space is copied from parent to child.
`clone3` is extensible and could provide a mechanism by which a shadow stack can
be passed as an input parameter. This is not settled yet and is being extensively
discussed on the mailing list. Once that's settled, this commit will adapt to it.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 39 ++++++++++
arch/riscv/kernel/process.c | 10 +++
arch/riscv/kernel/usercfi.c | 121 +++++++++++++++++++++++++++++++
3 files changed, 170 insertions(+)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 080d7077d12c..eb9a0905e72b 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -8,6 +8,9 @@
#ifndef __ASSEMBLY__
#include <linux/types.h>
+struct task_struct;
+struct kernel_clone_args;
+
#ifdef CONFIG_RISCV_USER_CFI
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
@@ -17,6 +20,42 @@ struct cfi_status {
unsigned long shdw_stk_size; /* size of shadow stack */
};
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args);
+void shstk_release(struct task_struct *tsk);
+void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
+void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
+bool is_shstk_enabled(struct task_struct *task);
+
+#else
+
+static inline unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
+{
+ return 0;
+}
+
+static inline void shstk_release(struct task_struct *tsk)
+{
+
+}
+
+static inline void set_shstk_base(struct task_struct *task, unsigned long shstk_addr,
+ unsigned long size)
+{
+
+}
+
+static inline void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
+{
+
+}
+
+static inline bool is_shstk_enabled(struct task_struct *task)
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index c249cf3d8083..a2b2a686a545 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -26,6 +26,7 @@
#include <asm/cpuidle.h>
#include <asm/vector.h>
#include <asm/cpufeature.h>
+#include <asm/usercfi.h>
register unsigned long gp_in_global __asm__("gp");
@@ -194,6 +195,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
void exit_thread(struct task_struct *tsk)
{
+ shstk_release(tsk);
return;
}
@@ -202,6 +204,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
unsigned long tls = args->tls;
+ unsigned long ssp = 0;
struct pt_regs *childregs = task_pt_regs(p);
memset(&p->thread.s, 0, sizeof(p->thread.s));
@@ -217,11 +220,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
p->thread.s[0] = (unsigned long)args->fn;
p->thread.s[1] = (unsigned long)args->fn_arg;
} else {
+		/* Allocate a new shadow stack if needed (a CLONE_VM thread gets its own) */
+ ssp = shstk_alloc_thread_stack(p, args);
+ if (IS_ERR_VALUE(ssp))
+ return PTR_ERR((void *)ssp);
+
*childregs = *(current_pt_regs());
/* Turn off status.VS */
riscv_v_vstate_off(childregs);
if (usp) /* User fork */
childregs->sp = usp;
+ if (ssp) /* if needed, set new ssp */
+ set_active_shstk(p, ssp);
if (clone_flags & CLONE_SETTLS)
childregs->tp = tls;
childregs->a0 = 0; /* Return value of fork() */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index 35ede2cbc05b..36cac0d653f5 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -19,6 +19,41 @@
#define SHSTK_ENTRY_SIZE sizeof(void *)
+bool is_shstk_enabled(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
+}
+
+void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size)
+{
+ task->thread_info.user_cfi_state.shdw_stk_base = shstk_addr;
+ task->thread_info.user_cfi_state.shdw_stk_size = size;
+}
+
+unsigned long get_shstk_base(struct task_struct *task, unsigned long *size)
+{
+ if (size)
+ *size = task->thread_info.user_cfi_state.shdw_stk_size;
+ return task->thread_info.user_cfi_state.shdw_stk_base;
+}
+
+void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
+{
+ task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
+}
+
+/*
+ * If size is 0, then to be compatible with the regular stack we make it as big as
+ * the regular stack would be. Else PAGE_ALIGN it and return.
+ */
+static unsigned long calc_shstk_size(unsigned long size)
+{
+ if (size)
+ return PAGE_ALIGN(size);
+
+ return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
/*
* Writes on shadow stack can either be `sspush` or `ssamoswap`. `sspush` can happen
* implicitly on current shadow stack pointed to by CSR_SSP. `ssamoswap` takes pointer to
@@ -148,3 +183,89 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
return allocate_shadow_stack(addr, aligned_size, size, set_tok);
}
+
+/*
+ * This gets called during clone/clone3/fork and is needed to allocate a shadow stack for
+ * cases where CLONE_VM is specified and thus a different stack is specified by the user.
+ * We thus need a separate shadow stack too. How a separate shadow stack is specified by
+ * the user is still being debated. Once that's settled, remove this part of the comment.
+ * This function simply returns 0 if shadow stacks are not supported or if a separate
+ * shadow stack allocation is not needed (as in the case of !CLONE_VM).
+ */
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
+{
+ unsigned long addr, size;
+
+ /* If shadow stack is not supported, return 0 */
+ if (!cpu_supports_shadow_stack())
+ return 0;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+	if (!is_shstk_enabled(tsk))
+ return 0;
+
+ /*
+	 * For CLONE_VFORK the child will share the parent's shadow stack.
+	 * Set base = 0 and size = 0; this is a special marker for this state
+	 * so the freeing logic run for the child knows to leave it alone.
+ */
+ if (args->flags & CLONE_VFORK) {
+ set_shstk_base(tsk, 0, 0);
+ return 0;
+ }
+
+ /*
+	 * For !CLONE_VM the child will use a copy of the parent's shadow
+ * stack.
+ */
+ if (!(args->flags & CLONE_VM))
+ return 0;
+
+ /*
+	 * Reaching here means CLONE_VM was specified and thus a separate shadow
+	 * stack is needed for the newly cloned thread. Note: the allocation below
+	 * happens using the current mm.
+ */
+ size = calc_shstk_size(args->stack_size);
+ addr = allocate_shadow_stack(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
+ set_shstk_base(tsk, addr, size);
+
+ return addr + size;
+}
+
+void shstk_release(struct task_struct *tsk)
+{
+ unsigned long base = 0, size = 0;
+ /* If shadow stack is not supported or not enabled, nothing to release */
+ if (!cpu_supports_shadow_stack() ||
+ !is_shstk_enabled(tsk))
+ return;
+
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct. Move forward only when they're same.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
+ return;
+
+ /*
+ * We know shadow stack is enabled but if base is NULL, then
+ * this task is not managing its own shadow stack (CLONE_VFORK). So
+ * skip freeing it.
+ */
+ base = get_shstk_base(tsk, &size);
+ if (!base)
+ return;
+
+ vm_munmap(base, size);
+ set_shstk_base(tsk, 0, 0);
+}
--
2.43.0
From 5d9631f58cb41a3a682d1d646b24c04f512c06eb Mon Sep 17 00:00:00 2001
From: Mark Brown <broonie(a)kernel.org>
Date: Mon, 9 Oct 2023 13:08:36 +0100
Subject: [PATCH 17/28] prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for
shadow stacks with fairly similar functionality. While x86 is using
arch_prctl() to control the functionality, neither arm64 nor riscv uses
that interface, so this patch adds arch-agnostic prctl() support to
get and set the status of shadow stacks and lock the current configuration
to prevent further changes, with support for turning individual
subfeatures on and off so applications can limit their exposure to
features that they do not need. The features are:
- PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks,
including allocation of a shadow stack if one is not already
allocated.
- PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow
stack.
- PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.
- PR_SHADOW_STACK_DISABLE: Allow the shadow stack to be disabled.
Note that once locked, disabling must fail.
These features are expected to be inherited by new threads and cleared
on exec(); unknown features should be rejected for enable but accepted
for locking (in order to allow for future proofing).
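For illustration only (not part of this patch), a minimal userspace sketch using
glibc's prctl() wrapper; the constants mirror the uapi values added below and are
defined locally only in case the system headers predate them:

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_GET_SHADOW_STACK_STATUS
  #define PR_GET_SHADOW_STACK_STATUS  71
  #define PR_SET_SHADOW_STACK_STATUS  72
  #define PR_LOCK_SHADOW_STACK_STATUS 73
  #define PR_SHADOW_STACK_ENABLE      (1UL << 0)
  #endif

  int main(void)
  {
      unsigned long status = 0;

      /* Unused prctl() arguments must be zero or the kernel returns -EINVAL */
      if (prctl(PR_GET_SHADOW_STACK_STATUS, &status, 0, 0, 0))
          perror("PR_GET_SHADOW_STACK_STATUS");

      if (!(status & PR_SHADOW_STACK_ENABLE) &&
          prctl(PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0))
          perror("PR_SET_SHADOW_STACK_STATUS");

      /* Lock the configuration so a later attempt to disable it fails */
      if (prctl(PR_LOCK_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0))
          perror("PR_LOCK_SHADOW_STACK_STATUS");

      return 0;
  }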
This is based on a patch originally written by Deepak Gupta but later
modified by Mark Brown for arm's GCS patch series.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Co-developed-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 3 +++
include/uapi/linux/prctl.h | 22 ++++++++++++++++++++++
kernel/sys.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15c70fc677a3..df248764bcec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4170,5 +4170,8 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
}
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
+int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
+int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
#endif /* _LINUX_MM_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..3c66ed8f46d8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,26 @@ struct prctl_mm_map {
# define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK 0xc
# define PR_RISCV_V_VSTATE_CTRL_MASK 0x1f
+/*
+ * Get the current shadow stack configuration for the current thread,
+ * this will be the value configured via PR_SET_SHADOW_STACK_STATUS.
+ */
+#define PR_GET_SHADOW_STACK_STATUS 71
+
+/*
+ * Set the current shadow stack configuration. Enabling the shadow
+ * stack will cause a shadow stack to be allocated for the thread.
+ */
+#define PR_SET_SHADOW_STACK_STATUS 72
+# define PR_SHADOW_STACK_ENABLE (1UL << 0)
+# define PR_SHADOW_STACK_WRITE (1UL << 1)
+# define PR_SHADOW_STACK_PUSH (1UL << 2)
+
+/*
+ * Prevent further changes to the specified shadow stack
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_SHADOW_STACK_STATUS 73
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index e219fcfa112d..96e8a6b5993a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2301,6 +2301,21 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}
+int __weak arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_set_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
#ifdef CONFIG_ANON_VMA_NAME
@@ -2743,6 +2758,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_V_GET_CONTROL:
error = RISCV_V_GET_CONTROL();
break;
+ case PR_GET_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
+ break;
+ case PR_SET_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_set_shadow_stack_status(me, arg2);
+ break;
+ case PR_LOCK_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_lock_shadow_stack_status(me, arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.43.0
From a4fbd367bc8d3ce589cbda16534d274a109b3666 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 12 Dec 2023 11:59:55 -0800
Subject: [PATCH 18/28] prctl: arch-agnostic prctl for indirect branch tracking
Three architectures (x86, aarch64, riscv) support an indirect branch
tracking feature in a very similar fashion. At a very high level, indirect
branch tracking is a CPU feature where the CPU tracks branches which use a
memory operand to perform control transfer in a program. As part of this
tracking, on an indirect branch the CPU enters a state where it expects a
landing pad instruction at the target, and if one is not found the CPU raises
a fault (architecture dependent).
x86 landing pad instr - `ENDBRANCH`
aarch64 landing pad instr - `BTI`
riscv landing pad instr - `lpad`
Given that three major arches have support for indirect branch tracking,
this patch makes the `prctl` for indirect branch tracking arch agnostic.
To allow userspace to enable this feature for itself, the following prctls
are defined:
- PR_GET_INDIR_BR_LP_STATUS: Gets the currently configured status for indirect
branch tracking.
- PR_SET_INDIR_BR_LP_STATUS: Sets a configuration for indirect branch tracking.
The following status options are allowed:
- PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on the user
thread.
- PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on the user
thread.
- PR_LOCK_INDIR_BR_LP_STATUS: Locks the configured status for indirect branch
tracking for the user thread.
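For illustration only (not part of this patch), a hedged userspace sketch; the
constants mirror the uapi values added below, defined locally only for older headers:

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_GET_INDIR_BR_LP_STATUS
  #define PR_GET_INDIR_BR_LP_STATUS  74
  #define PR_SET_INDIR_BR_LP_STATUS  75
  #define PR_LOCK_INDIR_BR_LP_STATUS 76
  #define PR_INDIR_BR_LP_ENABLE      (1UL << 0)
  #endif

  int main(void)
  {
      unsigned long status = 0;

      /* Typically issued by the dynamic loader once all objects are verified */
      if (prctl(PR_SET_INDIR_BR_LP_STATUS, PR_INDIR_BR_LP_ENABLE, 0, 0, 0))
          perror("PR_SET_INDIR_BR_LP_STATUS");

      if (prctl(PR_GET_INDIR_BR_LP_STATUS, &status, 0, 0, 0) == 0)
          printf("indirect branch tracking %s\n",
                 (status & PR_INDIR_BR_LP_ENABLE) ? "enabled" : "disabled");

      return 0;
  }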
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/uapi/linux/prctl.h | 27 +++++++++++++++++++++++++++
kernel/sys.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3c66ed8f46d8..b7a8212a068e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,31 @@ struct prctl_mm_map {
*/
#define PR_LOCK_SHADOW_STACK_STATUS 73
+/*
+ * Get the current indirect branch tracking configuration for the current
+ * thread, this will be the value configured via PR_SET_INDIR_BR_LP_STATUS.
+ */
+#define PR_GET_INDIR_BR_LP_STATUS 74
+
+/*
+ * Set the indirect branch tracking configuration. PR_INDIR_BR_LP_ENABLE will
+ * enable the cpu feature for the user thread, tracking all indirect branches
+ * and ensuring they land on the arch defined landing pad instruction.
+ * x86 - If enabled, an indirect branch must land on an `ENDBRANCH` instruction.
+ * aarch64 - If enabled, an indirect branch must land on a `BTI` instruction.
+ * riscv - If enabled, an indirect branch must land on an `lpad` instruction.
+ * Clearing PR_INDIR_BR_LP_ENABLE disables the feature for the user thread and
+ * indirect branches will no longer be tracked by the cpu to land on the arch
+ * defined landing pad instruction.
+ */
+#define PR_SET_INDIR_BR_LP_STATUS 75
+# define PR_INDIR_BR_LP_ENABLE (1UL << 0)
+
+/*
+ * Prevent further changes to the specified indirect branch tracking
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_INDIR_BR_LP_STATUS 76
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 96e8a6b5993a..9e2ebf9d9859 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2316,6 +2316,21 @@ int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long st
return -EINVAL;
}
+int __weak arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_set_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_lock_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
#ifdef CONFIG_ANON_VMA_NAME
@@ -2773,6 +2788,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_GET_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_get_indir_br_lp_status(me, (unsigned long __user *) arg2);
+ break;
+ case PR_SET_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+		error = arch_set_indir_br_lp_status(me, arg2);
+ break;
+ case PR_LOCK_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+		error = arch_lock_indir_br_lp_status(me, arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.43.0
From f2375f9a41e02e9ad8fa6910e12dfae84acbc9ae Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 10:28:35 -0800
Subject: [PATCH 19/28] riscv: Implements arch agnostic shadow stack prctls
Implement architecture agnostic prctls() interface for setting and getting
shadow stack status.
prctls implemented are PR_GET_SHADOW_STACK_STATUS, PR_SET_SHADOW_STACK_STATUS
and PR_LOCK_SHADOW_STACK_STATUS.
As part of PR_SET_SHADOW_STACK_STATUS/PR_GET_SHADOW_STACK_STATUS, only
PR_SHADOW_STACK_ENABLE is implemented because RISCV allows each mode to write
to their own shadow stack using `sspush` or `ssamoswap`.
PR_LOCK_SHADOW_STACK_STATUS locks the current configuration of shadow stack enabling.
The following is not supported:
"Enable shadow stack, then disable and enable again."
It is not clear whether such semantics would be useful, so it is better to return
an error code when that situation arises.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 12 +++-
arch/riscv/kernel/usercfi.c | 105 +++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index eb9a0905e72b..72bcfa773752 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -7,6 +7,7 @@
#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <linux/prctl.h>
struct task_struct;
struct kernel_clone_args;
@@ -14,7 +15,8 @@ struct kernel_clone_args;
#ifdef CONFIG_RISCV_USER_CFI
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
- unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+ unsigned long ubcfi_locked : 1;
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -26,6 +28,9 @@ void shstk_release(struct task_struct *tsk);
void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
bool is_shstk_enabled(struct task_struct *task);
+bool is_shstk_locked(struct task_struct *task);
+
+#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
#else
@@ -56,6 +61,11 @@ static inline bool is_shstk_enabled(struct task_struct *task)
return false;
}
+static inline bool is_shstk_locked(struct task_struct *task)
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index 36cac0d653f5..be3a071272d8 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -24,6 +24,16 @@ bool is_shstk_enabled(struct task_struct *task)
return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
}
+bool is_shstk_allocated(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.shdw_stk_base ? true : false;
+}
+
+bool is_shstk_locked(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ubcfi_locked ? true : false;
+}
+
void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size)
{
task->thread_info.user_cfi_state.shdw_stk_base = shstk_addr;
@@ -42,6 +52,21 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
}
+void set_shstk_status(struct task_struct *task, bool enable)
+{
+ task->thread_info.user_cfi_state.ubcfi_en = enable ? 1 : 0;
+
+ if (enable)
+ task->thread_info.envcfg |= ENVCFG_SSE;
+ else
+ task->thread_info.envcfg &= ~ENVCFG_SSE;
+}
+
+void set_shstk_lock(struct task_struct *task)
+{
+ task->thread_info.user_cfi_state.ubcfi_locked = 1;
+}
+
/*
* If size is 0, then to be compatible with regular stack we want it to be as big as
* regular stack. Else PAGE_ALIGN it and return back
@@ -269,3 +294,83 @@ void shstk_release(struct task_struct *tsk)
vm_munmap(base, size);
set_shstk_base(tsk, 0, 0);
}
+
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+ unsigned long bcfi_status = 0;
+
+ if (!cpu_supports_shadow_stack())
+ return -EINVAL;
+
+ /* this means shadow stack is enabled on the task */
+ bcfi_status |= (is_shstk_enabled(t) ? PR_SHADOW_STACK_ENABLE : 0);
+
+ return copy_to_user(status, &bcfi_status, sizeof(bcfi_status)) ? -EFAULT : 0;
+}
+
+int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ unsigned long size = 0, addr = 0;
+ bool enable_shstk = false;
+
+ if (!cpu_supports_shadow_stack())
+ return -EINVAL;
+
+ /* Reject unknown flags */
+ if (status & ~PR_SHADOW_STACK_SUPPORTED_STATUS_MASK)
+ return -EINVAL;
+
+ /* bcfi status is locked and further can't be modified by user */
+ if (is_shstk_locked(t))
+ return -EINVAL;
+
+ enable_shstk = status & PR_SHADOW_STACK_ENABLE;
+ /* Request is to enable shadow stack and shadow stack is not enabled already */
+ if (enable_shstk && !is_shstk_enabled(t)) {
+		/* Shadow stack was already allocated and this is another enable
+		 * request; no need to support such a use case, return -EINVAL.
+		 */
+ if (is_shstk_allocated(t))
+ return -EINVAL;
+
+ size = calc_shstk_size(0);
+ addr = allocate_shadow_stack(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return -ENOMEM;
+ set_shstk_base(t, addr, size);
+ set_active_shstk(t, addr + size);
+ }
+
+ /*
+	 * If a request to disable the shadow stack comes in, go ahead and release it.
+	 * If a CLONE_VFORKed child did this, we will end up not releasing the shadow
+	 * stack (because it might still be needed by the parent), although we will
+	 * still disable it for the VFORKed child. If the VFORKed child then tries to
+	 * enable it again, it will get an entirely new shadow stack because the
+	 * following conditions are true:
+	 * - shadow stack was not enabled for the vforked child
+	 * - shadow stack base was anyway pointing to 0
+	 * This shouldn't be a big issue: we want the parent to still have its shadow
+	 * stack when the VFORKed child releases resources via exit or exec, but at the
+	 * same time we want the VFORKed child to break away and establish a new shadow
+	 * stack if it so desires.
+ */
+ if (!enable_shstk)
+ shstk_release(t);
+
+ set_shstk_status(t, enable_shstk);
+ return 0;
+}
+
+int arch_lock_shadow_stack_status(struct task_struct *task,
+ unsigned long arg)
+{
+	/* If shadow stack is not supported or not enabled on the task, nothing to lock here */
+ if (!cpu_supports_shadow_stack() ||
+ !is_shstk_enabled(task))
+ return -EINVAL;
+
+ set_shstk_lock(task);
+
+ return 0;
+}
--
2.43.0
From 8805582b1d18fb5488c9c8ed650f957c08697cfa Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 19 Dec 2023 07:58:08 -0800
Subject: [PATCH 20/28] riscv: Implements arch agnostic indirect branch
 tracking prctls
prctls implemented are PR_SET_INDIR_BR_LP_STATUS / PR_GET_INDIR_BR_LP_STATUS
and PR_LOCK_INDIR_BR_LP_STATUS.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 17 +++++++-
arch/riscv/kernel/usercfi.c | 74 ++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 72bcfa773752..4bd10dcd48aa 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -16,7 +16,9 @@ struct kernel_clone_args;
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
unsigned long ubcfi_locked : 1;
- unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
+ unsigned long ufcfi_en : 1; /* Enable for forward cfi. Note that ELP goes in sstatus */
+ unsigned long ufcfi_locked : 1;
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 4);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -29,6 +31,8 @@ void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned
void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
bool is_shstk_enabled(struct task_struct *task);
bool is_shstk_locked(struct task_struct *task);
+bool is_indir_lp_enabled(struct task_struct *task);
+bool is_indir_lp_locked(struct task_struct *task);
#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
@@ -66,6 +70,17 @@ static inline bool is_shstk_locked(struct task_struct *task)
return false;
}
+static inline bool is_indir_lp_enabled(struct task_struct *task)
+{
+ return false;
+}
+
+static inline bool is_indir_lp_locked(struct task_struct *task)
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index be3a071272d8..af8cc8f4616c 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -67,6 +67,30 @@ void set_shstk_lock(struct task_struct *task)
task->thread_info.user_cfi_state.ubcfi_locked = 1;
}
+bool is_indir_lp_enabled(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ufcfi_en ? true : false;
+}
+
+bool is_indir_lp_locked(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ufcfi_locked ? true : false;
+}
+
+void set_indir_lp_status(struct task_struct *task, bool enable)
+{
+ task->thread_info.user_cfi_state.ufcfi_en = enable ? 1 : 0;
+
+ if (enable)
+ task->thread_info.envcfg |= ENVCFG_LPE;
+ else
+ task->thread_info.envcfg &= ~ENVCFG_LPE;
+}
+
+void set_indir_lp_lock(struct task_struct *task)
+{
+ task->thread_info.user_cfi_state.ufcfi_locked = 1;
+}
/*
* If size is 0, then to be compatible with regular stack we want it to be as big as
* regular stack. Else PAGE_ALIGN it and return back
@@ -374,3 +398,53 @@ int arch_lock_shadow_stack_status(struct task_struct *task,
return 0;
}
+
+int arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+ unsigned long fcfi_status = 0;
+
+ if (!cpu_supports_indirect_br_lp_instr())
+ return -EINVAL;
+
+ /* indirect branch tracking is enabled on the task or not */
+ fcfi_status |= (is_indir_lp_enabled(t) ? PR_INDIR_BR_LP_ENABLE : 0);
+
+ return copy_to_user(status, &fcfi_status, sizeof(fcfi_status)) ? -EFAULT : 0;
+}
+
+int arch_set_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ bool enable_indir_lp = false;
+
+ if (!cpu_supports_indirect_br_lp_instr())
+ return -EINVAL;
+
+ /* indirect branch tracking is locked and further can't be modified by user */
+ if (is_indir_lp_locked(t))
+ return -EINVAL;
+
+ /* Reject unknown flags */
+ if (status & ~PR_INDIR_BR_LP_ENABLE)
+ return -EINVAL;
+
+ enable_indir_lp = (status & PR_INDIR_BR_LP_ENABLE) ? true : false;
+ set_indir_lp_status(t, enable_indir_lp);
+
+ return 0;
+}
+
+int arch_lock_indir_br_lp_status(struct task_struct *task,
+ unsigned long arg)
+{
+ /*
+ * If indirect branch tracking is not supported or not enabled on task,
+ * nothing to lock here
+ */
+ if (!cpu_supports_indirect_br_lp_instr() ||
+ !is_indir_lp_enabled(task))
+ return -EINVAL;
+
+ set_indir_lp_lock(task);
+
+ return 0;
+}
--
2.43.0
From 5b7ab396ec70d3e830acc7c9229c040f87db1ae8 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 11 Dec 2023 20:41:17 -0800
Subject: [PATCH 21/28] riscv/traps: Introduce software check exception
zicfiss / zicfilp introduce a new exception in the priv isa, the `software check
exception`, with cause code = 18. This patch implements the software check exception.
Additionally it implements a cfi violation handler which checks the code in xtval.
If xtval=2, it means that the sw check exception happened because of an indirect
branch not landing on a 4 byte aligned PC, or not landing on an `lpad` instruction,
or the label value embedded in `lpad` not matching the label value set up in `x7`.
If xtval=3, it means that the sw check exception happened because of a mismatch
between the link register (x1 or x5) and the top of the shadow stack (on execution
of `sspopchk`).
In case of a cfi violation, SIGSEGV is raised with code=SEGV_CPERR. SEGV_CPERR was
introduced by the x86 shadow stack patches.
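For illustration only (not part of this patch), a hedged userspace sketch showing
how such a CFI violation surfaces; SEGV_CPERR comes from the generic siginfo uapi
added by the x86 shadow stack work, guarded here only for older headers:

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>

  #ifndef SEGV_CPERR
  #define SEGV_CPERR 10
  #endif

  static void segv_handler(int sig, siginfo_t *info, void *ucontext)
  {
      (void)sig;
      (void)ucontext;
      if (info->si_code == SEGV_CPERR)
          write(2, "control flow violation\n", 23);
      _exit(1);  /* never return into the faulting context */
  }

  int main(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_sigaction = segv_handler;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      /* ... a zicfilp/zicfiss violation in this process now reports SEGV_CPERR ... */
      return 0;
  }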
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/kernel/entry.S | 3 ++
arch/riscv/kernel/traps.c | 38 +++++++++++++++++++++++++
3 files changed, 42 insertions(+)
diff --git a/arch/riscv/include/asm/asm-prototypes.h b/arch/riscv/include/asm/asm-prototypes.h
index 36b955c762ba..4ba8aea58dd0 100644
--- a/arch/riscv/include/asm/asm-prototypes.h
+++ b/arch/riscv/include/asm/asm-prototypes.h
@@ -24,6 +24,7 @@ DECLARE_DO_ERROR_INFO(do_trap_ecall_u);
DECLARE_DO_ERROR_INFO(do_trap_ecall_s);
DECLARE_DO_ERROR_INFO(do_trap_ecall_m);
DECLARE_DO_ERROR_INFO(do_trap_break);
+DECLARE_DO_ERROR_INFO(do_trap_software_check);
asmlinkage void handle_bad_stack(struct pt_regs *regs);
asmlinkage void do_page_fault(struct pt_regs *regs);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 410659e2eadb..56dfe04094c1 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -369,6 +369,9 @@ SYM_DATA_START_LOCAL(excp_vect_table)
RISCV_PTR do_page_fault /* load page fault */
RISCV_PTR do_trap_unknown
RISCV_PTR do_page_fault /* store page fault */
+ RISCV_PTR do_trap_unknown /* cause=16 */
+ RISCV_PTR do_trap_unknown /* cause=17 */
+ RISCV_PTR do_trap_software_check /* cause=18 is sw check exception */
SYM_DATA_END_LABEL(excp_vect_table, SYM_L_LOCAL, excp_vect_table_end)
#ifndef CONFIG_MMU
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index a1b9be3c4332..9fba263428a1 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -339,6 +339,44 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
}
+#define CFI_TVAL_FCFI_CODE 2
+#define CFI_TVAL_BCFI_CODE 3
+/* handle cfi violations */
+static bool handle_user_cfi_violation(struct pt_regs *regs)
+{
+ bool ret = false;
+ unsigned long tval = csr_read(CSR_TVAL);
+
+ if (((tval == CFI_TVAL_FCFI_CODE) && cpu_supports_indirect_br_lp_instr()) ||
+ ((tval == CFI_TVAL_BCFI_CODE) && cpu_supports_shadow_stack())) {
+ do_trap_error(regs, SIGSEGV, SEGV_CPERR, regs->epc,
+ "Oops - control flow violation");
+ ret = true;
+ }
+
+ return ret;
+}
+/*
+ * software check exception is defined with risc-v cfi spec. Software check
+ * exception is raised when:-
+ * a) An indirect branch doesn't land on 4 byte aligned PC or `lpad`
+ * instruction or `label` value programmed in `lpad` instr doesn't
+ * match with value setup in `x7`. reported code in `xtval` is 2.
+ * b) `sspopchk` instruction finds a mismatch between top of shadow stack (ssp)
+ * and x1/x5. reported code in `xtval` is 3.
+ */
+asmlinkage __visible __trap_section void do_trap_software_check(struct pt_regs *regs)
+{
+ if (user_mode(regs)) {
+ /* not a cfi violation, then merge into flow of unknown trap handler */
+ if (!handle_user_cfi_violation(regs))
+ do_trap_unknown(regs);
+ } else {
+ /* sw check exception coming from kernel is a bug in kernel */
+ die(regs, "Kernel BUG");
+ }
+}
+
#ifdef CONFIG_MMU
asmlinkage __visible noinstr void do_page_fault(struct pt_regs *regs)
{
--
2.43.0
From a825cd99de13621b7e6e45cafa202f11cb3c3596 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 4 Jan 2023 19:20:09 -0800
Subject: [PATCH 22/28] riscv sigcontext: adding cfi state field in sigcontext
Shadow stack needs to be saved and restored on signal delivery and signal
return.
The sigcontext embedded in ucontext is extensible. Add cfi state in there,
which can be used to save the cfi state before signal delivery and restore
the cfi state on sigreturn.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/uapi/asm/sigcontext.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/riscv/include/uapi/asm/sigcontext.h b/arch/riscv/include/uapi/asm/sigcontext.h
index cd4f175dc837..5ccdd94a0855 100644
--- a/arch/riscv/include/uapi/asm/sigcontext.h
+++ b/arch/riscv/include/uapi/asm/sigcontext.h
@@ -21,6 +21,10 @@ struct __sc_riscv_v_state {
struct __riscv_v_ext_state v_state;
} __attribute__((aligned(16)));
+struct __sc_riscv_cfi_state {
+ unsigned long ss_ptr; /* shadow stack pointer */
+ unsigned long rsvd; /* keeping another word reserved in case we need it */
+};
/*
* Signal context structure
*
@@ -29,6 +33,7 @@ struct __sc_riscv_v_state {
*/
struct sigcontext {
struct user_regs_struct sc_regs;
+ struct __sc_riscv_cfi_state sc_cfi_state;
union {
union __riscv_fp_state sc_fpregs;
struct __riscv_extra_ext_header sc_extdesc;
--
2.43.0
From 463c2e77fb9a1cf3bb75f1e359c55f1792b88349 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Sat, 14 Jan 2023 17:45:54 -0800
Subject: [PATCH 23/28] riscv signal: Save and restore of shadow stack for
signal
Save shadow stack pointer in sigcontext structure while delivering signal.
Restore shadow stack pointer from sigcontext on sigreturn.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 18 ++++++++++++
arch/riscv/kernel/signal.c | 45 ++++++++++++++++++++++++++++++
arch/riscv/kernel/usercfi.c | 47 ++++++++++++++++++++++++++++++++
3 files changed, 110 insertions(+)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 4bd10dcd48aa..28c67866ff6f 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -33,6 +33,9 @@ bool is_shstk_enabled(struct task_struct *task);
bool is_shstk_locked(struct task_struct *task);
bool is_indir_lp_enabled(struct task_struct *task);
bool is_indir_lp_locked(struct task_struct *task);
+unsigned long get_active_shstk(struct task_struct *task);
+int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr);
+int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr);
#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
@@ -70,6 +73,16 @@ static inline bool is_shstk_locked(struct task_struct *task)
return false;
}
+static inline int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr)
+{
+ return -EINVAL;
+}
+
+static inline int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr)
+{
+ return -EINVAL;
+}
+
static inline bool is_indir_lp_enabled(struct task_struct *task)
{
return false;
@@ -81,6 +94,11 @@ static inline bool is_indir_lp_locked(struct task_struct *task)
return false;
}
+static inline unsigned long get_active_shstk(struct task_struct *task)
+{
+ return 0;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
index 88b6220b2608..d1092f0a6363 100644
--- a/arch/riscv/kernel/signal.c
+++ b/arch/riscv/kernel/signal.c
@@ -22,6 +22,7 @@
#include <asm/vector.h>
#include <asm/csr.h>
#include <asm/cacheflush.h>
+#include <asm/usercfi.h>
unsigned long signal_minsigstksz __ro_after_init;
@@ -229,6 +230,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct rt_sigframe __user *frame;
struct task_struct *task;
+ unsigned long ss_ptr = 0;
sigset_t set;
size_t frame_size = get_rt_frame_size(false);
@@ -251,6 +253,26 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;
+ /*
+	 * Restore the shadow stack pointer from a token stored on the shadow stack itself,
+	 * as a safe way to restore.
+	 * A token on the shadow stack gives the following properties:
+	 *	- Safe save and restore for shadow stack switching. Any save of the shadow
+	 *	  stack must have saved a token on the shadow stack, and any restore must
+	 *	  check the token before restoring. Since writing the address of the shadow
+	 *	  stack to the shadow stack itself is not easily allowed, a restore without
+	 *	  a prior save is quite difficult for an attacker to perform.
+	 *	- A natural break. A token provides a natural break in the shadow stack, so
+	 *	  a single linear range can be bucketed into different shadow stack segments.
+	 *	  sspopchk will detect the condition and fault to the kernel as a sw check exception.
+ */
+ if (__copy_from_user(&ss_ptr, &frame->uc.uc_mcontext.sc_cfi_state.ss_ptr,
+ sizeof(unsigned long)))
+ goto badframe;
+
+ if (is_shstk_enabled(current) && restore_user_shstk(current, ss_ptr))
+ goto badframe;
+
regs->cause = -1UL;
return regs->a0;
@@ -320,6 +342,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
struct rt_sigframe __user *frame;
long err = 0;
unsigned long __maybe_unused addr;
+ unsigned long ss_ptr = 0;
size_t frame_size = get_rt_frame_size(false);
frame = get_sigframe(ksig, regs, frame_size);
@@ -331,6 +354,23 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
/* Create the ucontext. */
err |= __put_user(0, &frame->uc.uc_flags);
err |= __put_user(NULL, &frame->uc.uc_link);
+ /*
+	 * Save a pointer to the shadow stack itself on the shadow stack as a form of token.
+	 * A token on the shadow stack gives the following properties:
+	 *	- Safe save and restore for shadow stack switching. Any save of the shadow
+	 *	  stack must have saved a token on the shadow stack, and any restore must
+	 *	  check the token before restoring. Since writing the address of the shadow
+	 *	  stack to the shadow stack itself is not easily allowed, a restore without
+	 *	  a prior save is quite difficult for an attacker to perform.
+	 *	- A natural break. A token provides a natural break in the shadow stack, so
+	 *	  a single linear range can be bucketed into different shadow stack segments.
+	 *	  Any sspopchk will detect the condition and fault to the kernel as a sw check exception.
+ */
+ if (is_shstk_enabled(current)) {
+ err |= save_user_shstk(current, &ss_ptr);
+ err |= __put_user(ss_ptr, &frame->uc.uc_mcontext.sc_cfi_state.ss_ptr);
+ }
+
err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
err |= setup_sigcontext(frame, regs);
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
@@ -341,6 +381,11 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
#ifdef CONFIG_MMU
regs->ra = (unsigned long)VDSO_SYMBOL(
current->mm->context.vdso, rt_sigreturn);
+
+	/* If bcfi is enabled, x1 (ra) and x5 (t0) must match; TODO: confirm whether this is needed */
+ if (is_shstk_enabled(current))
+ regs->t0 = regs->ra;
+
#else
/*
* For the nommu case we don't have a VDSO. Instead we push two
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index af8cc8f4616c..f5eb0124571b 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -52,6 +52,11 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
}
+unsigned long get_active_shstk(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.user_shdw_stk;
+}
+
void set_shstk_status(struct task_struct *task, bool enable)
{
task->thread_info.user_cfi_state.ubcfi_en = enable ? 1 : 0;
@@ -165,6 +170,48 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
return 0;
}
+/*
+ * Save the user shadow stack pointer on the shadow stack itself and return a pointer to
+ * the saved location. Returns -EFAULT if the operation was unsuccessful.
+ */
+int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr)
+{
+ unsigned long ss_ptr = 0;
+ unsigned long token_loc = 0;
+ int ret = 0;
+
+ if (saved_shstk_ptr == NULL)
+ return -EINVAL;
+
+ ss_ptr = get_active_shstk(tsk);
+ ret = create_rstor_token(ss_ptr, &token_loc);
+
+ *saved_shstk_ptr = token_loc;
+ return ret;
+}
+
+/*
+ * Restore the user shadow stack pointer from a token on the shadow stack for task `tsk`.
+ * Returns -EFAULT if the operation was unsuccessful.
+ */
+int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr)
+{
+ unsigned long token = 0;
+
+ token = amo_user_shstk((unsigned long __user *)shstk_ptr, 0);
+
+ if (token == -1)
+ return -EFAULT;
+
+ /* invalid token, return EINVAL */
+ if ((token - shstk_ptr) != SHSTK_ENTRY_SIZE)
+ return -EINVAL;
+
+ /* all checks passed, set active shstk and return success */
+ set_active_shstk(tsk, token);
+ return 0;
+}
+
static unsigned long allocate_shadow_stack(unsigned long addr, unsigned long size,
unsigned long token_offset,
bool set_tok)
--
2.43.0
From fe9611befe43cded39d708768e56ecaed14be37c Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 10:47:15 -0800
Subject: [PATCH 24/28] riscv: select config for shadow stack and landing pad
instr support
This patch adds the Kconfig plumbing for shadow stack support and landing pad
instruction support. Both are hidden behind `CONFIG_RISCV_USER_CFI`.
Selecting `CONFIG_RISCV_USER_CFI` wires up the path to enumerate CPU support,
and if CPU support exists, the kernel will support CPU assisted user mode cfi.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/Kconfig | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 9d386e9edc45..437b2f9abf3e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -163,6 +163,7 @@ config RISCV
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select TRACE_IRQFLAGS_SUPPORT
+ select RISCV_USER_CFI
select UACCESS_MEMCPY if !MMU
select ZONE_DMA32 if 64BIT
@@ -182,6 +183,20 @@ config HAVE_SHADOW_CALL_STACK
# https://github.com/riscv-non-isa/riscv-elf-psabi-doc/commit/a484e843e6eeb51…
depends on $(ld-option,--no-relax-gp)
+config RISCV_USER_CFI
+	bool "riscv userspace control flow integrity"
+	default y
+	help
+	  Provides CPU assisted control flow integrity to userspace tasks.
+	  Control flow integrity is provided by implementing a shadow stack for
+	  the backward edge and indirect branch tracking for the forward edge.
+	  Shadow stack protection is a hardware feature that detects function
+	  return address corruption. This helps mitigate ROP attacks.
+	  Indirect branch tracking enforces that all indirect branches must land
+	  on a landing pad instruction, else the CPU will fault. This mitigates
+	  JOP / COP attacks. Applications must be enabled to use it, and old
+	  userspace does not get protection "for free".
+
config ARCH_MMAP_RND_BITS_MIN
default 18 if 64BIT
default 8
--
2.43.0
From 1b49d1810f8447ee58fc09c0f9ca467dff51181d Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 17 Jan 2024 10:27:13 -0800
Subject: [PATCH 25/28] riscv/ptrace: riscv cfi status and state via ptrace and
in core files
Expose a new register type NT_RISCV_USER_CFI for risc-v cfi status and
state. Intentionally, both landing pad and shadow stack status and state
are rolled into the cfi state. Creating two different NT_RISCV_USER_XXX
types would not be useful and would waste a note type. Enabling or disabling
the feature is not allowed via the ptrace set interface. However, setting the
`elp` state or setting the shadow stack pointer is allowed via the ptrace set
interface. It is expected `gdb` might have a use for fixing up the `elp` state
or the `shadow stack` pointer.
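For illustration only (not part of this patch), a hedged debugger-side sketch for
reading the new regset of a stopped tracee; the note value and struct layout mirror
the uapi additions below, with the bitfields collapsed into a flat word for brevity
(the local copies are only for building against older headers):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #ifndef NT_RISCV_USER_CFI
  #define NT_RISCV_USER_CFI 0x902
  #endif

  struct riscv_cfi_regs {       /* mirrors uapi struct user_cfi_state */
      uint64_t status;          /* lp_en, lp_lock, elp_state, shstk_en, shstk_lock, rsvd */
      uint64_t shstk_ptr;
  };

  int read_cfi_state(pid_t pid)
  {
      struct riscv_cfi_regs cfi;
      struct iovec iov = { .iov_base = &cfi, .iov_len = sizeof(cfi) };

      /* The tracee must already be attached and stopped */
      if (ptrace(PTRACE_GETREGSET, pid, NT_RISCV_USER_CFI, &iov) == -1)
          return -1;

      printf("shadow stack pointer: 0x%llx\n", (unsigned long long)cfi.shstk_ptr);
      return 0;
  }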
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/uapi/asm/ptrace.h | 18 ++++++
arch/riscv/kernel/ptrace.c | 83 ++++++++++++++++++++++++++++
include/uapi/linux/elf.h | 1 +
3 files changed, 102 insertions(+)
diff --git a/arch/riscv/include/uapi/asm/ptrace.h b/arch/riscv/include/uapi/asm/ptrace.h
index a38268b19c3d..512be06a8661 100644
--- a/arch/riscv/include/uapi/asm/ptrace.h
+++ b/arch/riscv/include/uapi/asm/ptrace.h
@@ -127,6 +127,24 @@ struct __riscv_v_regset_state {
*/
#define RISCV_MAX_VLENB (8192)
+struct __cfi_status {
+ /* indirect branch tracking state */
+ __u64 lp_en : 1;
+ __u64 lp_lock : 1;
+ __u64 elp_state : 1;
+
+ /* shadow stack status */
+ __u64 shstk_en : 1;
+ __u64 shstk_lock : 1;
+
+	__u64 rsvd : ((sizeof(__u64) * 8) - 5);
+};
+
+struct user_cfi_state {
+ struct __cfi_status cfi_status;
+ __u64 shstk_ptr;
+};
+
#endif /* __ASSEMBLY__ */
#endif /* _UAPI_ASM_RISCV_PTRACE_H */
diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index 2afe460de16a..8ddd529bef0b 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -19,6 +19,7 @@
#include <linux/regset.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
+#include <asm/usercfi.h>
enum riscv_regset {
REGSET_X,
@@ -28,6 +29,9 @@ enum riscv_regset {
#ifdef CONFIG_RISCV_ISA_V
REGSET_V,
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+ REGSET_CFI,
+#endif
};
static int riscv_gpr_get(struct task_struct *target,
@@ -149,6 +153,75 @@ static int riscv_vr_set(struct task_struct *target,
}
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+static int riscv_cfi_get(struct task_struct *target,
+ const struct user_regset *regset,
+ struct membuf to)
+{
+	struct user_cfi_state user_cfi = { };	/* zero rsvd bits and padding before copying out */
+ struct pt_regs *regs;
+
+ regs = task_pt_regs(target);
+
+ user_cfi.cfi_status.lp_en = is_indir_lp_enabled(target);
+ user_cfi.cfi_status.lp_lock = is_indir_lp_locked(target);
+ user_cfi.cfi_status.elp_state = (regs->status & SR_ELP);
+
+ user_cfi.cfi_status.shstk_en = is_shstk_enabled(target);
+ user_cfi.cfi_status.shstk_lock = is_shstk_locked(target);
+ user_cfi.shstk_ptr = get_active_shstk(target);
+
+ return membuf_write(&to, &user_cfi, sizeof(user_cfi));
+}
+
+/*
+ * Does it make sense to allowing enable / disable of cfi via ptrace?
+ * Not allowing enable / disable / locking control via ptrace for now.
+ * Setting shadow stack pointer is allowed. GDB might use it to unwind or
+ * some other fixup. Similarly gdb might want to suppress elp and may want
+ * to reset elp state.
+ */
+static int riscv_cfi_set(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ int ret;
+ struct user_cfi_state user_cfi;
+ struct pt_regs *regs;
+
+ regs = task_pt_regs(target);
+
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_cfi, 0, -1);
+ if (ret)
+ return ret;
+
+ /*
+	 * Do not allow enabling or locking the shadow stack or landing pad; there is
+	 * no disabling of the shadow stack or landing pad via ptrace either. The rsvd
+	 * field must be zero so that those bits can be used in the future.
+	 */
+	if (user_cfi.cfi_status.lp_en || user_cfi.cfi_status.lp_lock ||
+		user_cfi.cfi_status.shstk_en || user_cfi.cfi_status.shstk_lock ||
+		user_cfi.cfi_status.rsvd)
+		return -EINVAL;
+
+ /* If lpad is enabled on target and ptrace requests to set / clear elp, do that */
+ if (is_indir_lp_enabled(target)) {
+ if (user_cfi.cfi_status.elp_state) /* set elp state */
+ regs->status |= SR_ELP;
+ else
+ regs->status &= ~SR_ELP; /* clear elp state */
+ }
+
+ /* If shadow stack enabled on target, set new shadow stack pointer */
+ if (is_shstk_enabled(target))
+ set_active_shstk(target, user_cfi.shstk_ptr);
+
+ return 0;
+}
+#endif
+
static const struct user_regset riscv_user_regset[] = {
[REGSET_X] = {
.core_note_type = NT_PRSTATUS,
@@ -179,6 +252,16 @@ static const struct user_regset riscv_user_regset[] = {
.set = riscv_vr_set,
},
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+ [REGSET_CFI] = {
+ .core_note_type = NT_RISCV_USER_CFI,
+ .align = sizeof(__u64),
+ .n = sizeof(struct user_cfi_state) / sizeof(__u64),
+ .size = sizeof(__u64),
+ .regset_get = riscv_cfi_get,
+ .set = riscv_cfi_set,
+	},
+#endif
};
static const struct user_regset_view riscv_user_native_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..f60b2de66b1c 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -447,6 +447,7 @@ typedef struct elf64_shdr {
#define NT_MIPS_MSA 0x802 /* MIPS SIMD registers */
#define NT_RISCV_CSR 0x900 /* RISC-V Control and Status Registers */
#define NT_RISCV_VECTOR 0x901 /* RISC-V vector registers */
+#define NT_RISCV_USER_CFI 0x902 /* RISC-V shadow stack state */
#define NT_LOONGARCH_CPUCFG 0xa00 /* LoongArch CPU config registers */
#define NT_LOONGARCH_CSR 0xa01 /* LoongArch control and status registers */
#define NT_LOONGARCH_LSX 0xa02 /* LoongArch Loongson SIMD Extension registers */
--
2.43.0
From 59a6070ad5808df72eb638b8886c29f4dd0e7290 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 12 Jan 2024 10:11:04 -0800
Subject: [PATCH 26/28] riscv: Documentation for landing pad / indirect branch
tracking
Adding documentation on landing pad aka indirect branch tracking on riscv
and kernel interfaces exposed so that user tasks can enable it.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
Documentation/arch/riscv/zicfilp.rst | 104 +++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
create mode 100644 Documentation/arch/riscv/zicfilp.rst
diff --git a/Documentation/arch/riscv/zicfilp.rst b/Documentation/arch/riscv/zicfilp.rst
new file mode 100644
index 000000000000..3007c81f0465
--- /dev/null
+++ b/Documentation/arch/riscv/zicfilp.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta <debug(a)rivosinc.com>
+:Date: 12 January 2024
+
+====================================================
+Tracking indirect control transfers on RISC-V Linux
+====================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable indirect branch tracking for user mode applications on RISC-V.
+
+1. Feature Overview
+--------------------
+
+Memory corruption issues usually result in crashes; however, in the hands of an
+adversary and when used creatively, they can result in a variety of security issues.
+
+One of those security issues is a code re-use attack on a program, where an adversary
+can use corrupt function pointers and chain them together to perform jump oriented
+programming (JOP) or call oriented programming (COP), thus compromising the control
+flow integrity (CFI) of the program.
+
+Function pointers live in read-write memory and thus are susceptible to corruption,
+allowing an adversary to reach any program counter (PC) in the address space. On
+RISC-V the zicfilp extension enforces a restriction on such indirect control transfers:
+
+ - indirect control transfers must land on a landing pad instruction `lpad`.
+ There are two exceptions to this rule:
+ - rs1 = x1 or rs1 = x5, i.e. a return from a function and returns are
+ protected using shadow stack (see zicfiss.rst)
+
+ - rs1 = x7. On RISC-V the compiler usually emits the sequence below to reach a
+ function which is beyond the offset reachable with a J-type instruction.
+
+ "auipc x7, <imm>"
+ "jalr (x7)"
+
+ Such a form of indirect control transfer is still immutable and doesn't rely
+ on memory and thus rs1=x7 is exempted from tracking and considered a software
+ guarded jump.
+
+The `lpad` instruction is a pseudo for `auipc rd, <imm_20bit>` and is a HINT nop. An
+`lpad` instruction must be aligned on a 4 byte boundary and compares its 20 bit
+immediate with x7. If `imm_20bit` == 0, the CPU doesn't perform any comparison with
+x7. If `imm_20bit` != 0, then `imm_20bit` must match x7, else the CPU raises a
+`software check exception` (cause=18) with `*tval = 2`.
+
+The compiler can generate a hash over function signatures and set it up (truncated
+to 20 bits) in x7 at call sites, while function prologs carry an `lpad` with the same
+function hash. This further reduces the number of program counters a call site can
+reach.
+
+2. ELF and psABI
+-----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_FCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+3. Linux enabling
+------------------
+
+A user space program can have multiple shared objects loaded in its address space
+and it is a difficult task to make sure all the dependencies have been compiled
+with support for indirect branch tracking. Thus it is left to the dynamic loader to
+enable indirect branch tracking for the program.
+
+4. prctl() enabling
+--------------------
+
+`PR_SET_INDIR_BR_LP_STATUS` / `PR_GET_INDIR_BR_LP_STATUS` /
+`PR_LOCK_INDIR_BR_LP_STATUS` are three prctls added to manage indirect branch
+tracking. The prctls are arch agnostic and return -EINVAL on other arches.
+
+`PR_SET_INDIR_BR_LP_STATUS`: If arg1 is `PR_INDIR_BR_LP_ENABLE` and the CPU supports
+`zicfilp`, then the kernel will enable indirect branch tracking for the task.
+The dynamic loader can issue this `prctl` once it has determined that all the objects
+loaded in the address space support indirect branch tracking. Additionally, if there is
+a `dlopen` of an object which wasn't compiled with `zicfilp`, the dynamic loader can
+issue this prctl with arg1 set to 0 (i.e. `PR_INDIR_BR_LP_ENABLE` being clear).
+
+`PR_GET_INDIR_BR_LP_STATUS`: Returns current status of indirect branch tracking.
+If enabled, it returns `PR_INDIR_BR_LP_ENABLE`.
+
+`PR_LOCK_INDIR_BR_LP_STATUS`: Locks the current status of indirect branch tracking on
+the task. User space may want to run with a strict security posture, not wanting
+objects without `zicfilp` support to be loaded, and thus would want to disallow
+disabling of indirect branch tracking. In that case user space can use this prctl
+to lock the current settings.
+
+5. violations related to indirect branch tracking
+--------------------------------------------------
+
+Pertaining to indirect branch tracking, the CPU raises a software check exception
+in the following conditions:
+ - missing `lpad` after indirect call / jmp
+ - `lpad` not on 4 byte boundary
+ - `imm_20bit` embedded in `lpad` instruction doesn't match with `x7`
+
+In all 3 cases, `*tval = 2` is captured and a software check exception is raised
+(cause=18).
+
+The Linux kernel will treat this as a `SIGSEGV` with code = `SEGV_CPERR` and follow
+the normal course of signal delivery.
--
2.43.0
From 31d400d5299f01b457fd082e32786c25ee4a9491 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 12 Jan 2024 10:14:29 -0800
Subject: [PATCH 27/28] riscv: Documentation for shadow stack on riscv
Adding documentation on shadow stack for user mode on riscv and kernel
interfaces exposed so that user tasks can enable it.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
Documentation/arch/riscv/zicfiss.rst | 169 +++++++++++++++++++++++++++
1 file changed, 169 insertions(+)
create mode 100644 Documentation/arch/riscv/zicfiss.rst
diff --git a/Documentation/arch/riscv/zicfiss.rst b/Documentation/arch/riscv/zicfiss.rst
new file mode 100644
index 000000000000..f133b6af9c15
--- /dev/null
+++ b/Documentation/arch/riscv/zicfiss.rst
@@ -0,0 +1,169 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta <debug(a)rivosinc.com>
+:Date: 12 January 2024
+
+=========================================================
+Shadow stack to protect function returns on RISC-V Linux
+=========================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable shadow stack for user mode applications on RISC-V.
+
+1. Feature Overview
+--------------------
+
+Memory corruption issues usually result in crashes; however, in the hands of an
+adversary and when used creatively, they can result in a variety of security issues.
+
+One of those security issues is a code re-use attack on a program, where an adversary
+can use corrupt return addresses present on the stack and chain them together to perform
+return oriented programming (ROP), thus compromising the control flow integrity (CFI)
+of the program.
+
+Return addresses live on the stack, and thus in read-write memory, and so are
+susceptible to corruption, allowing an adversary to reach any program counter
+(PC) in the address space. On RISC-V the `zicfiss` extension provides an alternate
+stack, the `shadow stack`, on which return addresses can be safely placed in the
+prolog of a function and retrieved in the epilog. `zicfiss` makes the following changes:
+
+ - PTE encodings for shadow stack virtual memory
+ An earlier reserved encoding in first stage translation i.e.
+ PTE.R=0, PTE.W=1, PTE.X=0 becomes PTE encoding for shadow stack pages.
+
+ - `sspush x1/x5` instruction pushes (stores) `x1/x5` to shadow stack.
+
+ - `sspopchk x1/x5` instruction pops (loads) from the shadow stack and compares
+ with `x1/x5`; if they are unequal, the CPU raises a `software check exception`
+ with `*tval = 3`.
+
+The compiler toolchain makes sure that function prologs have `sspush x1/x5` to save the
+return address on the shadow stack in addition to the regular stack. Similarly, function
+epilogs have `ld x5, offset(x2)`; `sspopchk x5` to ensure that the value popped from the
+regular stack matches the value popped from the shadow stack.
+
+2. Shadow stack protections and linux memory manager
+-----------------------------------------------------
+
+As mentioned earlier, shadow stacks get new page table encodings, and thus have some
+special properties assigned to them and to the instructions that operate on them, as
+described below:
+
+ - Regular stores to shadow stack memory raise store access faults.
+ This way shadow stack memory is protected from stray, inadvertent
+ writes.
+
+ - Regular loads from shadow stack memory are allowed.
+ This allows stack trace utilities or backtrace functions to read
+ the true call stack (not tampered with).
+
+ - Only shadow stack instructions can generate shadow stack load or
+ shadow stack store.
+
+ - A shadow stack load / shadow stack store on read-only memory raises an
+ AMO/store page fault. Thus both `sspush x1/x5` and `sspopchk x1/x5`
+ will raise an AMO/store page fault. This simplifies COW handling in the
+ kernel: during fork, the kernel can convert shadow stack pages into
+ read-only memory (as it does for regular read-write memory) and, as soon
+ as a subsequent `sspush` or `sspopchk` in userspace is encountered, the
+ kernel can perform COW.
+
+ - A shadow stack load / shadow stack store on read-write or read-write-
+ execute memory raises an access fault. This is a fatal condition,
+ because a shadow stack should never operate on read-write or read-
+ write-execute memory.
+
+3. ELF and psABI
+-----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_BCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+4. Linux enabling
+------------------
+
+A user space program can have multiple shared objects loaded in its address space,
+and it's a difficult task to make sure all the dependencies have been compiled
+with shadow stack support. Thus it's left to the dynamic loader to enable
+the shadow stack for the program.
+
+5. prctl() enabling
+--------------------
+
+`PR_SET_SHADOW_STACK_STATUS` / `PR_GET_SHADOW_STACK_STATUS` /
+`PR_LOCK_SHADOW_STACK_STATUS` are three prctls added to manage shadow stack
+enabling for tasks. These prctls are arch-agnostic and return -EINVAL on other arches.
+
+`PR_SET_SHADOW_STACK_STATUS`: If arg1 is `PR_SHADOW_STACK_ENABLE` and the CPU supports
+`zicfiss`, the kernel will enable the shadow stack for the task. The dynamic loader can
+issue this `prctl` once it has determined that all the objects loaded in the address
+space have support for the shadow stack. Additionally, if there is a `dlopen` of an
+object which wasn't compiled with `zicfiss`, the dynamic loader can issue this prctl
+with arg1 set to 0 (i.e. `PR_SHADOW_STACK_ENABLE` being clear).
+
+`PR_GET_SHADOW_STACK_STATUS`: Returns the current status of shadow stack enabling.
+If enabled, it reports `PR_SHADOW_STACK_ENABLE`.
+
+`PR_LOCK_SHADOW_STACK_STATUS`: Locks the current status of shadow stack enabling for
+the task. User space running with a strict security posture may not want objects
+without `zicfiss` support to be loaded, and thus may want to disallow disabling of
+the shadow stack on the current task. In that case user space can use this prctl to
+lock the current settings.
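
As with landing pads, here is a hedged sketch (not part of the patch) of how a loader
or runtime might use these prctls, assuming the PR_*_SHADOW_STACK_* constants are
exported through linux/prctl.h by this series and that the lock prctl takes no
arguments beyond the option:

#include <sys/prctl.h>	/* PR_*_SHADOW_STACK_* assumed to come from this series */

/* Enable the shadow stack, verify it took effect, then optionally lock it. */
static int enable_shadow_stack(int lock)
{
	unsigned long status = 0;

	if (prctl(PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0))
		return -1;	/* kernel or CPU lacks zicfiss support */

	if (prctl(PR_GET_SHADOW_STACK_STATUS, &status, 0, 0, 0) ||
	    !(status & PR_SHADOW_STACK_ENABLE))
		return -1;

	/* locking is only useful for the strict posture described above */
	return lock ? prctl(PR_LOCK_SHADOW_STACK_STATUS, 0, 0, 0, 0) : 0;
}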
+
+6. Violations related to returns with shadow stack enabled
+-----------------------------------------------------------
+
+Pertaining to the shadow stack, the CPU raises a software check exception in the
+following condition:
+
+ - On execution of `sspopchk x1/x5`, `x1/x5` didn't match the top of the shadow
+ stack. If a mismatch happens, the CPU sets `*tval = 3` and raises a software
+ check exception.
+
+The Linux kernel will treat this as `SIGSEGV` with code = `SEGV_CPERR` and follow
+the normal course of signal delivery.
+
+7. Shadow stack tokens
+-----------------------
+Regular stores to shadow stacks are not allowed, and thus they can't be tampered with
+via arbitrary stray writes due to bugs. However, pivoting / switching to another shadow
+stack is done simply by writing to the csr `CSR_SSP`, which changes the active shadow
+stack. This can be problematic because the value to be written to `CSR_SSP` will usually
+be loaded from writeable memory, which allows an adversary with a corruption bug in the
+software to pivot to any address in a shadow stack range. Shadow stack tokens can help
+mitigate this problem by making sure that:
+
+ - When software is switching away from a shadow stack, the shadow stack pointer should
+ be saved on the shadow stack itself; this saved value is called the `shadow stack token`.
+
+ - When software is switching to a shadow stack, it should read the `shadow stack token`
+ from the shadow stack pointer and verify that the `shadow stack token` is itself a
+ pointer to the shadow stack.
+
+ - Once the token verification is done, software can perform the write to `CSR_SSP` to
+ switch shadow stacks (a rough sketch follows at the end of this section).
+
+Here the software can be the user mode task runtime itself, managing various contexts
+as part of a single thread. The software can also be the kernel, when the kernel has to
+deliver a signal to a user task and must save the shadow stack pointer. The kernel can
+perform a similar procedure by saving a token on the user shadow stack itself. This way,
+whenever sigreturn happens, the kernel can read and verify the token and then switch to
+that shadow stack. Using this mechanism, the kernel helps the user task so that a
+corruption issue in the user task cannot be exploited by an adversary arbitrarily using
+`sigreturn`: the adversary would have to ensure that a valid `shadow stack token` is in
+place in addition to invoking `sigreturn`.
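
A loose sketch of the switch-to side of the procedure above (purely illustrative: the
struct shstk descriptor is made up for the example, the token-saving side needs a
shadow stack store and is omitted, and CSR_SSP / csr_write() are copied from the
selftest header added later in this series):

#include <stdbool.h>

/* CSR_SSP and csr_write() as in the selftest header added later in this series */
#define CSR_SSP 0x011
#define __ASM_STR(x) #x
#define csr_write(csr, val)					\
({								\
	unsigned long __v = (unsigned long)(val);		\
	__asm__ __volatile__ ("csrw " __ASM_STR(csr) ", %0"	\
			      : : "rK" (__v) : "memory");	\
})

/* hypothetical descriptor for one user-managed shadow stack */
struct shstk {
	unsigned long base;	/* lowest address of the shadow stack mapping */
	unsigned long size;	/* size of the mapping in bytes */
	unsigned long token_at;	/* where the token was stored when switching away */
};

static bool shstk_switch_to(const struct shstk *s)
{
	/* regular loads from shadow stack memory are permitted */
	unsigned long token = *(const unsigned long *)s->token_at;

	/* the token must itself point back into the target shadow stack */
	if (token < s->base || token >= s->base + s->size)
		return false;

	csr_write(CSR_SSP, token);	/* only now pivot the active shadow stack */
	return true;
}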
+
+8. Signal shadow stack
+-----------------------
+The following structure has been added to sigcontext for RISC-V. The `rsvd` field has
+been kept in case we need some extra information in the future for landing pads /
+indirect branch tracking; keeping it now allows backward compatibility later.
+
+.. code-block:: c
+
+   struct __sc_riscv_cfi_state {
+       unsigned long ss_ptr;
+       unsigned long rsvd;
+   };
+
+As part of signal delivery, a shadow stack token is saved on the current shadow stack
+itself, and the updated pointer is saved away in the `ss_ptr` field of
+`__sc_riscv_cfi_state` under `sigcontext`. The existing shadow stack allocation is used
+for signal delivery. During `sigreturn`, the kernel will obtain `ss_ptr` from
+`sigcontext`, verify the saved token on the shadow stack itself, and switch shadow stacks.
--
2.43.0
From 93a6c741f2881cddeeeba28f1bbd2f06ca070fcc Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 24 Jan 2024 08:50:40 -0800
Subject: [PATCH 28/28] kselftest/riscv: kselftest for user mode cfi
Adds a kselftest for the RISC-V control flow integrity implementation for
user mode. There is not a lot going on in the kernel for enabling landing
pads for user mode, thus the kselftest simply enables landing pads for the
binary and registers a signal handler for SIGSEGV. Any control flow
violation is reported as SIGSEGV with si_code = SEGV_CPERR, and the test
fails on receiving any SEGV_CPERR. The shadow stack part has more changes
in the kernel and thus there are separate tests for it:
- enable and disable
- Exercise `map_shadow_stack` syscall
- `fork` test to make sure COW works for shadow stack pages
- gup tests
As of today the kernel uses FOLL_FORCE when memory is accessed via
/proc/<pid>/mem. Make sure that keeps working for shadow stack
- signal test. Make sure signal delivery results in token creation on
shadow stack and consumes (and verifies) the token on sigreturn
- shadow stack protection test. Attempts to write using a regular store
instruction on shadow stack memory must result in access faults
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/Makefile | 10 +
.../testing/selftests/riscv/cfi/cfi_rv_test.h | 85 ++++
.../selftests/riscv/cfi/riscv_cfi_test.c | 91 +++++
.../testing/selftests/riscv/cfi/shadowstack.c | 376 ++++++++++++++++++
.../testing/selftests/riscv/cfi/shadowstack.h | 39 ++
6 files changed, 602 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/riscv/cfi/Makefile
create mode 100644 tools/testing/selftests/riscv/cfi/cfi_rv_test.h
create mode 100644 tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.c
create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.h
diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile
index 4a9ff515a3a0..867e5875b7ce 100644
--- a/tools/testing/selftests/riscv/Makefile
+++ b/tools/testing/selftests/riscv/Makefile
@@ -5,7 +5,7 @@
ARCH ?= $(shell uname -m 2>/dev/null || echo not)
ifneq (,$(filter $(ARCH),riscv))
-RISCV_SUBTARGETS ?= hwprobe vector mm
+RISCV_SUBTARGETS ?= hwprobe vector mm cfi
else
RISCV_SUBTARGETS :=
endif
diff --git a/tools/testing/selftests/riscv/cfi/Makefile b/tools/testing/selftests/riscv/cfi/Makefile
new file mode 100644
index 000000000000..77f12157fa29
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -I$(top_srcdir)/tools/include
+
+CFLAGS += -march=rv64gc_zicfilp_zicfiss
+
+TEST_GEN_PROGS := cfitests
+
+include ../../lib.mk
+
+$(OUTPUT)/cfitests: riscv_cfi_test.c shadowstack.c
+ $(CC) -static -o$@ $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/cfi/cfi_rv_test.h b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
new file mode 100644
index 000000000000..27267a2e1008
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTEST_RISCV_CFI_H
+#define SELFTEST_RISCV_CFI_H
+#include <stddef.h>
+#include <sys/types.h>
+#include "shadowstack.h"
+
+#define RISCV_CFI_SELFTEST_COUNT RISCV_SHADOW_STACK_TESTS
+
+#define CHILD_EXIT_CODE_SSWRITE 10
+#define CHILD_EXIT_CODE_SIG_TEST 11
+/* exit code for a NULL pointer dereference in the SIGSEGV handler; value assumed */
+#define CHILD_EXIT_CODE_NULL_PTR_DEREF 12
+
+#define BAD_POINTER (NULL)
+
+#define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) \
+({ \
+ register long _num __asm__ ("a7") = (num); \
+ register long _arg1 __asm__ ("a0") = (long)(arg1); \
+ register long _arg2 __asm__ ("a1") = (long)(arg2); \
+ register long _arg3 __asm__ ("a2") = (long)(arg3); \
+ register long _arg4 __asm__ ("a3") = (long)(arg4); \
+ register long _arg5 __asm__ ("a4") = (long)(arg5); \
+ \
+ __asm__ volatile ( \
+ "ecall\n" \
+ : "+r"(_arg1) \
+ : "r"(_arg2), "r"(_arg3), "r"(_arg4), "r"(_arg5), \
+ "r"(_num) \
+ : "memory", "cc" \
+ ); \
+ _arg1; \
+})
+
+#define my_syscall3(num, arg1, arg2, arg3) \
+({ \
+ register long _num __asm__ ("a7") = (num); \
+ register long _arg1 __asm__ ("a0") = (long)(arg1); \
+ register long _arg2 __asm__ ("a1") = (long)(arg2); \
+ register long _arg3 __asm__ ("a2") = (long)(arg3); \
+ \
+ __asm__ volatile ( \
+ "ecall\n" \
+ : "+r"(_arg1) \
+ : "r"(_arg2), "r"(_arg3), \
+ "r"(_num) \
+ : "memory", "cc" \
+ ); \
+ _arg1; \
+})
+
+#ifndef __NR_prctl
+#define __NR_prctl 167
+#endif
+
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+#define CSR_SSP 0x011
+
+#ifdef __ASSEMBLY__
+#define __ASM_STR(x) x
+#else
+#define __ASM_STR(x) #x
+#endif
+
+#define csr_read(csr) \
+({ \
+ register unsigned long __v; \
+ __asm__ __volatile__ ("csrr %0, " __ASM_STR(csr) \
+ : "=r" (__v) : \
+ : "memory"); \
+ __v; \
+})
+
+#define csr_write(csr, val) \
+({ \
+ unsigned long __v = (unsigned long) (val); \
+ __asm__ __volatile__ ("csrw " __ASM_STR(csr) ", %0" \
+ : : "rK" (__v) \
+ : "memory"); \
+})
+
+#endif
diff --git a/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c b/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
new file mode 100644
index 000000000000..c116ae4bb358
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../../kselftest.h"
+#include <signal.h>
+#include <asm/ucontext.h>
+#include <linux/prctl.h>
+#include "cfi_rv_test.h"
+
+/* do not optimize cfi related test functions */
+#pragma GCC push_options
+#pragma GCC optimize("O0")
+
+#define SEGV_CPERR 10 /* control protection fault */
+
+void sigsegv_handler(int signum, siginfo_t *si, void *uc)
+{
+ struct ucontext *ctx = (struct ucontext *) uc;
+
+ if (si->si_code == SEGV_CPERR) {
+ printf("Control flow violation happened somewhere\n");
+ printf("pc where violation happened %lx\n", ctx->uc_mcontext.gregs[0]);
+ exit(-1);
+ }
+
+ /* null pointer deref */
+ if (si->si_addr == BAD_POINTER)
+ exit(CHILD_EXIT_CODE_NULL_PTR_DEREF);
+
+ /* shadow stack write case */
+ exit(CHILD_EXIT_CODE_SSWRITE);
+}
+
+int lpad_enable(void)
+{
+ int ret = 0;
+
+ ret = my_syscall5(__NR_prctl, PR_SET_INDIR_BR_LP_STATUS, PR_INDIR_BR_LP_ENABLE, 0, 0, 0);
+
+ return ret;
+}
+
+bool register_signal_handler(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = sigsegv_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL)) {
+ printf("registering signal handler for landing pad violation failed\n");
+ return false;
+ }
+
+ return true;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+ unsigned long lpad_status = 0;
+
+ ksft_print_header();
+
+ ksft_set_plan(RISCV_CFI_SELFTEST_COUNT);
+
+ ksft_print_msg("starting risc-v tests\n");
+
+ /*
+ * Landing pad test. Not a lot of kernel changes to support landing
+	 * pad for user mode except lighting up a bit in senvcfg via a prctl.
+	 * Enable landing pads throughout the execution of the test binary.
+ */
+ ret = my_syscall5(__NR_prctl, PR_GET_INDIR_BR_LP_STATUS, &lpad_status, 0, 0, 0);
+ if (ret)
+ ksft_exit_skip("Get landing pad status failed with %d\n", ret);
+
+ ret = lpad_enable();
+
+ if (ret)
+ ksft_exit_skip("Enabling landing pad failed with %d\n", ret);
+
+ if (!register_signal_handler())
+ ksft_exit_skip("registering signal handler for SIGSEGV failed\n");
+
+ ksft_print_msg("landing pad enabled for binary\n");
+ ksft_print_msg("starting risc-v shadow stack tests\n");
+ execute_shadow_stack_tests();
+
+ ksft_finished();
+}
+
+#pragma GCC pop_options
diff --git a/tools/testing/selftests/riscv/cfi/shadowstack.c b/tools/testing/selftests/riscv/cfi/shadowstack.c
new file mode 100644
index 000000000000..126654801bed
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/shadowstack.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../../kselftest.h"
+#include <sys/wait.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include "shadowstack.h"
+#include "cfi_rv_test.h"
+
+/* do not optimize shadow stack related test functions */
+#pragma GCC push_options
+#pragma GCC optimize("O0")
+
+void zar(void)
+{
+	unsigned long ssp = 0;
+
+ ssp = csr_read(CSR_SSP);
+ printf("inside %s and shadow stack ptr is %lx\n", __func__, ssp);
+}
+
+void bar(void)
+{
+ printf("inside %s\n", __func__);
+ zar();
+}
+
+void foo(void)
+{
+ printf("inside %s\n", __func__);
+ bar();
+}
+
+void zar_child(void)
+{
+ unsigned long ssp = 0;
+
+ ssp = csr_read(CSR_SSP);
+ printf("inside %s and shadow stack ptr is %lx\n", __func__, ssp);
+}
+
+void bar_child(void)
+{
+ printf("inside %s\n", __func__);
+ zar_child();
+}
+
+void foo_child(void)
+{
+ printf("inside %s\n", __func__);
+ bar_child();
+}
+
+typedef void (call_func_ptr)(void);
+/*
+ * call couple of functions to test push pop.
+ */
+int shadow_stack_call_tests(call_func_ptr fn_ptr, bool parent)
+{
+ if (parent)
+ printf("call test for parent\n");
+ else
+ printf("call test for child\n");
+
+ (fn_ptr)();
+
+ return 0;
+}
+
+bool enable_disable_check(unsigned long test_num, void *ctx)
+{
+ int ret = 0;
+
+ if (!my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("Shadow stack was enabled\n");
+ shadow_stack_call_tests(&foo, true);
+
+ ret = my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0);
+ if (ret)
+ ksft_test_result_fail("shadow stack disable failed\n");
+ } else {
+ ksft_test_result_fail("shadow stack enable failed\n");
+ ret = -EINVAL;
+ }
+
+ return ret ? false : true;
+}
+
+/* fork a child process and ensure shadow stacks keep working across fork */
+bool shadow_stack_fork_test(unsigned long test_num, void *ctx)
+{
+ int pid = 0, child_status = 0, parent_pid = 0;
+
+ printf("exercising shadow stack fork test\n");
+
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("shadow stack enable prctl failed\n");
+ return false;
+ }
+
+ parent_pid = getpid();
+ pid = fork();
+
+ if (pid) {
+ printf("Parent pid %d and child pid %d\n", parent_pid, pid);
+ shadow_stack_call_tests(&foo, true);
+ } else
+ shadow_stack_call_tests(&foo_child, false);
+
+ if (pid) {
+ printf("waiting on child to finish\n");
+ wait(&child_status);
+ } else {
+ /* exit child gracefully */
+ exit(0);
+ }
+
+ if (pid && WIFSIGNALED(child_status)) {
+ printf("child faulted");
+ return false;
+ }
+
+ /* disable shadow stack again */
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0)) {
+ printf("shadow stack disable prctl failed\n");
+ return false;
+ }
+
+ return true;
+}
+
+/* exercise `map_shadow_stack`, pivot to it and call some functions to ensure it works */
+#define SHADOW_STACK_ALLOC_SIZE 4096
+bool shadow_stack_map_test(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr;
+ int ret = 0;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ ret = munmap((void *) shdw_addr, SHADOW_STACK_ALLOC_SIZE);
+
+ if (ret) {
+ printf("munmap failed with error code %d\n", ret);
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * shadow stack protection tests. map a shadow stack and
+ * validate all memory protections work on it
+ */
+bool shadow_stack_protection_test(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr;
+ unsigned long *write_addr = NULL;
+ int ret = 0, pid = 0, child_status = 0;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ write_addr = (unsigned long *) shdw_addr;
+ pid = fork();
+
+ /* no child was created, return false */
+ if (pid == -1)
+ return false;
+
+ /*
+ * try to perform a store from child on shadow stack memory
+ * it should result in SIGSEGV
+ */
+ if (!pid) {
+ /* below write must lead to SIGSEGV */
+ *write_addr = 0xdeadbeef;
+ } else {
+ wait(&child_status);
+ }
+
+ /* test fail, if 0xdeadbeef present on shadow stack address */
+ if (*write_addr == 0xdeadbeef) {
+		printf("write succeeded\n");
+ return false;
+ }
+
+ /* if child reached here, then fail */
+ if (!pid) {
+ printf("child reached unreachable state\n");
+ return false;
+ }
+
+ /* if child exited via signal handler but not for write on ss */
+ if (WIFEXITED(child_status) &&
+ WEXITSTATUS(child_status) != CHILD_EXIT_CODE_SSWRITE) {
+ printf("child wasn't signaled for write on shadow stack\n");
+ return false;
+ }
+
+ ret = munmap(write_addr, SHADOW_STACK_ALLOC_SIZE);
+ if (ret) {
+ printf("munmap failed with error code %d\n", ret);
+ return false;
+ }
+
+ return true;
+}
+
+#define SS_MAGIC_WRITE_VAL 0xbeefdead
+
+int gup_tests(int mem_fd, unsigned long *shdw_addr)
+{
+ unsigned long val = 0;
+
+ lseek(mem_fd, (unsigned long)shdw_addr, SEEK_SET);
+ if (read(mem_fd, &val, sizeof(val)) < 0) {
+ printf("reading shadow stack mem via gup failed\n");
+ return 1;
+ }
+
+ val = SS_MAGIC_WRITE_VAL;
+ lseek(mem_fd, (unsigned long)shdw_addr, SEEK_SET);
+ if (write(mem_fd, &val, sizeof(val)) < 0) {
+ printf("writing shadow stack mem via gup failed\n");
+ return 1;
+ }
+
+ if (*shdw_addr != SS_MAGIC_WRITE_VAL) {
+ printf("GUP write to shadow stack memory didn't happen\n");
+ return 1;
+ }
+
+ return 0;
+}
+
+bool shadow_stack_gup_tests(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr = 0;
+ unsigned long *write_addr = NULL;
+ int fd = 0;
+ bool ret = false;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ write_addr = (unsigned long *) shdw_addr;
+
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ return false;
+
+ if (gup_tests(fd, write_addr)) {
+ printf("gup tests failed\n");
+ goto out;
+ }
+
+ ret = true;
+out:
+ if (shdw_addr && munmap(write_addr, SHADOW_STACK_ALLOC_SIZE)) {
+		printf("munmap failed\n");
+ ret = false;
+ }
+
+ return ret;
+}
+
+volatile bool break_loop;
+
+void sigusr1_handler(int signo)
+{
+ printf("In sigusr1 handler\n");
+ break_loop = true;
+}
+
+bool sigusr1_signal_test(void)
+{
+ if (signal(SIGUSR1, sigusr1_handler) == SIG_ERR) {
+		printf("registering sigusr1 handler failed\n");
+ return false;
+ }
+
+ return true;
+}
+/*
+ * shadow stack signal test. shadow stack must be enabled.
+ * Register a signal handler, then fork a child which waits for
+ * the signal. Send a signal from parent to child and verify
+ * that the signal was received by the child. If not, the test fails.
+ */
+bool shadow_stack_signal_test(unsigned long test_num, void *ctx)
+{
+ int pid = 0, child_status = 0;
+ unsigned long ssp = 0;
+
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("shadow stack enable prctl failed\n");
+ return false;
+ }
+
+ pid = fork();
+
+ if (pid == -1) {
+ printf("signal test: fork failed\n");
+ goto out;
+ }
+
+ if (pid == 0) {
+		/* register the SIGUSR1 handler; bail out if that fails */
+ if (!sigusr1_signal_test()) {
+ printf("sigusr1_signal_test failed\n");
+ exit(-1);
+ }
+
+ while (!break_loop)
+ sleep(1);
+
+ exit(11);
+ /* child shouldn't go beyond here */
+ }
+ /* send SIGUSR1 to child */
+ kill(pid, SIGUSR1);
+ wait(&child_status);
+
+out:
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0)) {
+ printf("shadow stack disable prctl failed\n");
+ return false;
+ }
+
+ return (WIFEXITED(child_status) &&
+ WEXITSTATUS(child_status) == 11);
+}
+
+int execute_shadow_stack_tests(void)
+{
+ int ret = 0;
+ unsigned long test_count = 0;
+ unsigned long shstk_status = 0;
+
+ printf("Executing RISC-V shadow stack self tests\n");
+
+ ret = my_syscall5(__NR_prctl, PR_GET_SHADOW_STACK_STATUS, &shstk_status, 0, 0, 0);
+
+ if (ret != 0)
+ ksft_exit_skip("Get shadow stack status failed with %d\n", ret);
+
+ /*
+ * If we are here that means get shadow stack status succeeded and
+ * thus shadow stack support is baked in the kernel.
+ */
+ while (test_count < ARRAY_SIZE(shstk_tests)) {
+ ksft_test_result((*shstk_tests[test_count].t_func)(test_count, NULL),
+ shstk_tests[test_count].name);
+ test_count++;
+ }
+
+ return 0;
+}
+
+#pragma GCC pop_options
diff --git a/tools/testing/selftests/riscv/cfi/shadowstack.h b/tools/testing/selftests/riscv/cfi/shadowstack.h
new file mode 100644
index 000000000000..92cb0752238d
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/shadowstack.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTEST_SHADOWSTACK_TEST_H
+#define SELFTEST_SHADOWSTACK_TEST_H
+#include <stddef.h>
+#include <linux/prctl.h>
+
+/*
+ * A CFI test returns true for success or false for failure. It takes a
+ * test number (used to index into the test array) and a void context pointer.
+ */
+typedef bool (*shstk_test_func)(unsigned long test_num, void *);
+
+struct shadow_stack_tests {
+ char *name;
+ shstk_test_func t_func;
+};
+
+bool enable_disable_check(unsigned long test_num, void *ctx);
+bool shadow_stack_fork_test(unsigned long test_num, void *ctx);
+bool shadow_stack_map_test(unsigned long test_num, void *ctx);
+bool shadow_stack_protection_test(unsigned long test_num, void *ctx);
+bool shadow_stack_gup_tests(unsigned long test_num, void *ctx);
+bool shadow_stack_signal_test(unsigned long test_num, void *ctx);
+
+static struct shadow_stack_tests shstk_tests[] = {
+ { "enable disable\n", enable_disable_check },
+ { "shstk fork test\n", shadow_stack_fork_test },
+ { "map shadow stack syscall\n", shadow_stack_map_test },
+ { "shadow stack gup tests\n", shadow_stack_gup_tests },
+ { "shadow stack signal tests\n", shadow_stack_signal_test},
+ { "memory protections of shadow stack memory\n", shadow_stack_protection_test }
+};
+
+#define RISCV_SHADOW_STACK_TESTS ARRAY_SIZE(shstk_tests)
+
+int execute_shadow_stack_tests(void);
+
+#endif
--
2.43.0
This test is missing a whole bunch of checks for interface
renaming and one ifup. Presumably it was only used on a system
with renaming disabled and NetworkManager running.
Fixes: 91f430b2c49d ("selftests: net: add a test for UDP tunnel info infra")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
.../selftests/drivers/net/netdevsim/udp_tunnel_nic.sh | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 4855ef597a15..f98435c502f6 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -270,6 +270,7 @@ for port in 0 1; do
echo 1 > $NSIM_DEV_SYS/new_port
fi
NSIM_NETDEV=`get_netdev_name old_netdevs`
+ ifconfig $NSIM_NETDEV up
msg="new NIC device created"
exp0=( 0 0 0 0 )
@@ -431,6 +432,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
@@ -488,6 +490,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
@@ -544,6 +547,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "destroy NIC"
@@ -573,6 +577,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -633,6 +638,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
echo 110 > $NSIM_DEV_DFS/ports/$port/udp_ports_inject_error
@@ -688,6 +694,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -747,6 +754,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -877,6 +885,7 @@ msg="re-add a port"
echo 2 > $NSIM_DEV_SYS/del_port
echo 2 > $NSIM_DEV_SYS/new_port
+NSIM_NETDEV=`get_netdev_name old_netdevs`
check_tables
msg="replace VxLAN in overflow table"
--
2.43.0
If there are more than 32 CPUs the bitmask will start to contain
commas, leading to:
./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
Remove the commas; bash doesn't interpret leading zeroes as octal,
so that should be good enough. Switch to bash, since Simon reports that
not all shells support this type of substitution.
Fixes: c12e0d5f267d ("self-tests: introduce self-tests for RPS default mask")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v3:
- switch to bash
v2: https://lore.kernel.org/all/20240120210256.3864747-1-kuba@kernel.org/
- remove all commas
v1: https://lore.kernel.org/all/20240119151248.3476897-1-kuba@kernel.org/
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rps_default_mask.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/rps_default_mask.sh b/tools/testing/selftests/net/rps_default_mask.sh
index a26c5624429f..4287a8529890 100755
--- a/tools/testing/selftests/net/rps_default_mask.sh
+++ b/tools/testing/selftests/net/rps_default_mask.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
readonly ksft_skip=4
@@ -33,6 +33,10 @@ chk_rps() {
rps_mask=$($cmd /sys/class/net/$dev_name/queues/rx-0/rps_cpus)
printf "%-60s" "$msg"
+
+ # In case there is more than 32 CPUs we need to remove commas from masks
+ rps_mask=${rps_mask//,}
+ expected_rps_mask=${expected_rps_mask//,}
if [ $rps_mask -eq $expected_rps_mask ]; then
echo "[ ok ]"
else
--
2.43.0
By allowing the filter_glob parameter to be written to, it's possible to
tweak the testsuites that will be executed on new module loads. This
makes it easier to run specific tests without having to reload kunit and
provides a way to filter tests on real HW even if kunit is builtin.
Example for xe driver:
1) Run just 1 test
# echo -n xe_bo > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo
2) Run all tests
# echo \* > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo xe_dma_buf xe_migrate xe_mocs
For completeness and to cover other use cases, also change filter and
filter_action to rw.
Link: https://lore.kernel.org/intel-xe/dzacvbdditbneiu3e3fmstjmttcbne44yspumpkd6s…
Reviewed-by: Rae Moar <rmoar(a)google.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
Rae, I kept your r-b from v1 since the additions are just what we talked
about.
v2: also change filter_action and filter to rw, testing with the xe
module to see if filter=module=none filter_action=skip produces
the result expected by igt
lib/kunit/executor.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 1236b3cd2fbb..371ddcee7fb5 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -31,13 +31,13 @@ static char *filter_glob_param;
static char *filter_param;
static char *filter_action_param;
-module_param_named(filter_glob, filter_glob_param, charp, 0400);
+module_param_named(filter_glob, filter_glob_param, charp, 0600);
MODULE_PARM_DESC(filter_glob,
"Filter which KUnit test suites/tests run at boot-time, e.g. list* or list*.*del_test");
-module_param_named(filter, filter_param, charp, 0400);
+module_param_named(filter, filter_param, charp, 0600);
MODULE_PARM_DESC(filter,
"Filter which KUnit test suites/tests run at boot-time using attributes, e.g. speed>slow");
-module_param_named(filter_action, filter_action_param, charp, 0400);
+module_param_named(filter_action, filter_action_param, charp, 0600);
MODULE_PARM_DESC(filter_action,
"Changes behavior of filtered tests using attributes, valid values are:\n"
"<none>: do not run filtered tests as normal\n"
--
2.40.1
Patches 1 and 3 are fixes for tdc that were discovered when running it
using defconfig + tc-testing config and against the latest iproute2.
Patch 2 improves the taprio script that waits for scheduler changes.
Finally, Patch 4 enables all tdc tests.
Pedro Tammela (4):
selftests: tc-testing: add missing netfilter config
selftests: tc-testing: check if 'jq' is available in taprio script
selftests: tc-testing: adjust fq test to latest iproute2
selftests: tc-testing: enable all tdc tests
tools/testing/selftests/tc-testing/config | 1 +
.../selftests/tc-testing/scripts/taprio_wait_for_admin.sh | 5 +++++
tools/testing/selftests/tc-testing/tc-tests/qdiscs/fq.json | 2 +-
tools/testing/selftests/tc-testing/tdc.sh | 3 +--
4 files changed, 8 insertions(+), 3 deletions(-)
--
2.40.1
Hi,
Here are a few fixes for seccomp_bpf tests found when testing on
Android:
user_notification_sibling_pid_ns:
unshare(CLONE_NEWPID) can return EINVAL, so I have added a check for this.
KILL_THREAD:
This one is a bit more Android specific.
In Bionic, pthread_create calls prctl; this causes the test to
fail, as prctl is in the filter for this test and the thread is killed
when it is called. I've just changed prctl to getpid in this case.
user_notification_addfd:
This test can fail if there are existing file descriptors when the test
starts. It expects the next file descriptor to always increase
sequentially which is not always the case.
Added a get_next_fd function to return the next expected file descriptor.
Regards,
Terry
Terry Tritton (3):
selftests/seccomp: Handle EINVAL on unshare(CLONE_NEWPID)
selftests/seccomp: Change the syscall used in KILL_THREAD test
selftests/seccomp: user_notification_addfd check nextfd is available
tools/testing/selftests/seccomp/seccomp_bpf.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
--
2.43.0.429.g432eaa2c6b-goog
On Tue, 23 Jan 2024 10:55:09 +0100 Petr Machata wrote:
> > If you authored any net or drivers/net selftests, please look around
> > and see if they are passing. If not - send patches or LMK what I need
> > to do to make them pass on the runner.. Make sure to scroll down to
> > the "Not reporting to patchwork" section.
>
> A whole bunch of them fail because of no IPv6 support in the runner
> kernel. E.g. this from bridge-mdb.sh[0]:
Thanks a lot for investigating! I take it that you're looking at
forwarding? Please send a patch to add the missing configs to
tools/testing/selftests/net/forwarding/config
The runner uses that to configure the kernel on top of defconfig.
Unless I'm doing it wrong and the sub-directories are supposed to
inherit the parent directory's config? So net/forwarding/ should
be built with net/'s config? I could not find the info in docs,
does anyone know?
Arch maintainers, please ack/review patches.
This is a resend of a series from Frank last year[1]. I incorporated Rob's
review comments to unconditionally call unflatten_device_tree() and
fixup/audit calls to of_have_populated_dt() so that behavior doesn't
change.
I need this series so I can add DT based tests in the clk framework.
Either I can merge it through the clk tree once everyone is happy, or
Rob can merge it through the DT tree and provide some branch so I can
base clk patches on it.
Changes from Frank's series[1]:
* Add a DTB loaded kunit test
* Make of_have_populated_dt() return false if the DTB isn't from the
bootloader
* Architecture calls made unconditional so that a root node is always
made
Frank Rowand (2):
of: Create of_root if no dtb provided by firmware
of: unittest: treat missing of_root as error instead of fixing up
Stephen Boyd (4):
arm64: Unconditionally call unflatten_device_tree()
um: Unconditionally call unflatten_device_tree()
of: Always unflatten in unflatten_and_copy_device_tree()
of: Add KUnit test to confirm DTB is loaded
arch/arm64/kernel/setup.c | 3 +-
arch/um/kernel/dtb.c | 14 +++---
drivers/of/.kunitconfig | 3 ++
drivers/of/Kconfig | 16 ++++++-
drivers/of/Makefile | 4 +-
drivers/of/empty_root.dts | 6 +++
drivers/of/fdt.c | 57 +++++++++++++++++------
drivers/of/of_test.c | 98 +++++++++++++++++++++++++++++++++++++++
drivers/of/platform.c | 3 --
drivers/of/unittest.c | 16 ++-----
include/linux/of.h | 17 +++++--
11 files changed, 191 insertions(+), 46 deletions(-)
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/empty_root.dts
create mode 100644 drivers/of/of_test.c
Cc: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Frank Rowand <frowand.list(a)gmail.com>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Richard Weinberger <richard(a)nod.at>
Cc: Rob Herring <robh+dt(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
[1] https://lore.kernel.org/r/20230317053415.2254616-1-frowand.list@gmail.com
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the tracer process may deadlock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.
The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.
When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes. If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.
Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending. In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.
This means there is no API change, unlike the previous
version of this patch which was discussed here:
https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d…
See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.
Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.
Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
fs/exec.c | 69 ++++++++++++++++-------
fs/proc/base.c | 6 ++
include/linux/cred.h | 1 +
include/linux/sched/signal.h | 18 ++++++
kernel/cred.c | 28 +++++++--
kernel/ptrace.c | 32 +++++++++++
kernel/seccomp.c | 12 +++-
tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
8 files changed, 155 insertions(+), 34 deletions(-)
v10: Changes to previous version, make the PTRACE_ATTACH
retun -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessons learned to the description.
v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.
Note: I actually got one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:
>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@
417 struct linux_binprm *bprm = task->signal->exec_bprm;
418 const struct cred *old_cred;
419 struct mm_struct *old_mm;
420
421 retval = down_write_killable(&task->signal->exec_update_lock);
422 if (retval)
423 goto unlock_creds;
424 task_lock(task);
> 425 old_cred = task->real_cred;
v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.
- re-tested the patch with all linux versions from v5.11 to v6.6
v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.
The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.
Thanks
Bernd.
diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
{
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t = tsk;
+ bool unsafe_execve_in_progress = false;
if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
if (!thread_group_leader(tsk))
sig->notify_count--;
+ while_each_thread(tsk, t) {
+ if (unlikely(t->ptrace)
+ && (t != tsk->group_leader || !t->exit_state))
+ unsafe_execve_in_progress = true;
+ }
+
+ if (unlikely(unsafe_execve_in_progress)) {
+ spin_unlock_irq(lock);
+ sig->exec_bprm = bprm;
+ mutex_unlock(&sig->cred_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
sig->group_exec_task = NULL;
sig->notify_count = 0;
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
return 0;
killed:
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
+ /* If the binary is not readable then enforce mm->dumpable=0 */
+ would_dump(bprm, bprm->file);
+ if (bprm->have_execfd)
+ would_dump(bprm, bprm->executable);
+
+ /*
+ * Figure out dumpability. Note that this checking only of current
+ * is wrong, but userspace depends on it. This should be testing
+ * bprm->secureexec instead.
+ */
+ if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+ is_dumpability_changed(current_cred(), bprm->cred) ||
+ !(uid_eq(current_euid(), current_uid()) &&
+ gid_eq(current_egid(), current_gid())))
+ set_dumpable(bprm->mm, suid_dumpable);
+ else
+ set_dumpable(bprm->mm, SUID_DUMP_USER);
+
/*
* Ensure all future errors are fatal.
*/
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
/*
* Make this the only thread in the thread group.
*/
- retval = de_thread(me);
+ retval = de_thread(me, bprm);
if (retval)
goto out;
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out;
- /* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
- would_dump(bprm, bprm->executable);
-
/*
* Release all of the old mmap stuff
*/
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
me->sas_ss_sp = me->sas_ss_size = 0;
- /*
- * Figure out dumpability. Note that this checking only of current
- * is wrong, but userspace depends on it. This should be testing
- * bprm->secureexec instead.
- */
- if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
- !(uid_eq(current_euid(), current_uid()) &&
- gid_eq(current_egid(), current_gid())))
- set_dumpable(current->mm, suid_dumpable);
- else
- set_dumpable(current->mm, SUID_DUMP_USER);
-
perf_event_exec();
__set_task_comm(me, kbasename(bprm->filename), true);
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
if (rv < 0)
goto out_free;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ rv = -ERESTARTNOINTR;
+ goto out_free;
+ }
+
rv = security_setprocattr(PROC_I(inode)->op.lsm,
file->f_path.dentry->d_name.name, page,
count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
extern struct cred *cred_alloc_blank(void);
extern struct cred *prepare_creds(void);
extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
extern int commit_creds(struct cred *);
extern void abort_creds(struct cred *);
extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
+ struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
+ * against new credentials while
+ * de_thread is waiting for other
+ * traced threads to terminate.
+ * Set while de_thread is executing.
+ * The cred_guard_mutex is released
+ * after de_thread() has called
+ * zap_other_threads(), therefore
+ * a fatal signal is guaranteed to be
+ * already pending in the unlikely
+ * event, that
+ * current->signal->exec_bprm happens
+ * to be non-zero after the
+ * cred_guard_mutex was acquired.
+ */
+
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace)
+ * Held while execve runs, except when
+ * a sibling thread is being traced.
* Deprecated do not use in new code.
* Use exec_update_lock instead.
*/
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
return false;
}
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+ if (!uid_eq(old->euid, new->euid) ||
+ !gid_eq(old->egid, new->egid) ||
+ !uid_eq(old->fsuid, new->fsuid) ||
+ !gid_eq(old->fsgid, new->fsgid) ||
+ !cred_cap_issubset(old, new))
+ return true;
+
+ return false;
+}
+
/**
* commit_creds - Install new credentials upon the current task
* @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
get_cred(new); /* we will require a ref for the subj creds too */
/* dumpability changes */
- if (!uid_eq(old->euid, new->euid) ||
- !gid_eq(old->egid, new->egid) ||
- !uid_eq(old->fsuid, new->fsuid) ||
- !gid_eq(old->fsgid, new->fsgid) ||
- !cred_cap_issubset(old, new)) {
+ if (is_dumpability_changed(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/ptrace.h>
#include <linux/security.h>
+#include <linux/binfmts.h>
#include <linux/signal.h>
#include <linux/uio.h>
#include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
if (retval)
goto unlock_creds;
+ if (unlikely(task->in_execve)) {
+ struct linux_binprm *bprm = task->signal->exec_bprm;
+ const struct cred __rcu *old_cred;
+ struct mm_struct *old_mm;
+
+ retval = down_write_killable(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ task_lock(task);
+ old_cred = task->real_cred;
+ old_mm = task->mm;
+ rcu_assign_pointer(task->real_cred, bprm->cred);
+ task->mm = bprm->mm;
+ retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ rcu_assign_pointer(task->real_cred, old_cred);
+ task->mm = old_mm;
+ task_unlock(task);
+ up_write(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ }
+
write_lock_irq(&tasklist_lock);
retval = -EPERM;
if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;
+ if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
}
}
write_unlock_irq(&tasklist_lock);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
- if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(¤t->signal->cred_guard_mutex))
- goto out_put_fd;
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ if (mutex_lock_killable(¤t->signal->cred_guard_mutex))
+ goto out_put_fd;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ goto out_put_fd;
+ }
+ }
spin_lock_irq(¤t->sighand->siglock);
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
f = open(mm, O_RDONLY);
ASSERT_GE(f, 0);
close(f);
- f = kill(pid, SIGCONT);
- ASSERT_EQ(f, 0);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_NE(f, -1);
+ ASSERT_NE(f, 0);
+ ASSERT_NE(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, -1);
+ ASSERT_EQ(errno, ECHILD);
}
TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
- ASSERT_EQ(errno, EAGAIN);
- ASSERT_EQ(k, -1);
+ ASSERT_EQ(k, 0);
k = waitpid(-1, &s, WNOHANG);
ASSERT_NE(k, -1);
ASSERT_NE(k, 0);
ASSERT_NE(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 0);
- sleep(1);
- k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
- k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
--
2.39.2
Currently the seccomp benchmark selftest produces non-standard output:
while it makes a number of checks of the performance it observes, the
output has to be parsed by humans. This means that automated systems
running this suite of tests are almost certainly ignoring the results,
which isn't ideal for spotting problems. Let's rework things so that
each check that the program does is reported as a test result to the
framework.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Re-add signoff.
- Link to v2: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v2-0…
Changes in v2:
- Rebase onto v6.8-rc1.
- Link to v1: https://lore.kernel.org/r/20231219-b4-kselftest-seccomp-benchmark-ktap-v1-0…
---
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
.../testing/selftests/seccomp/seccomp_benchmark.c | 105 +++++++++++++--------
1 file changed, 65 insertions(+), 40 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20231219-b4-kselftest-seccomp-benchmark-ktap-357603823708
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This is part of an effort to improve detection of regressions impacting
device probe on all platforms. The recently merged DT kselftest [3]
detects probe issues for all devices described statically in the DT.
That leaves out devices discovered at run-time from discoverable buses.
This is where this test comes in. All of the devices that are connected
through discoverable buses (i.e. USB and PCI), and which are internal and
therefore always present, can be described based on their position in
the system topology in a per-platform YAML file so they can be checked
for. The test will check that the device has been instantiated and bound
to a driver.
Patch 1 introduces the test. Patch 2 and 3 add the device definitions
for the google,spherion machine (Acer Chromebook 514) and XPS 13 as
examples.
This is the output from the test running on Spherion:
TAP version 13
Using board file: boards/google,spherion.yaml
1..8
ok 1 /usb2-controller(a)11200000/1.4.1/camera.device
ok 2 /usb2-controller(a)11200000/1.4.1/camera.0.driver
ok 3 /usb2-controller(a)11200000/1.4.1/camera.1.driver
ok 4 /usb2-controller(a)11200000/1.4.2/bluetooth.device
ok 5 /usb2-controller(a)11200000/1.4.2/bluetooth.0.driver
ok 6 /usb2-controller(a)11200000/1.4.2/bluetooth.1.driver
ok 7 /pci-controller(a)11230000/0.0/0.0/wifi.device
ok 8 /pci-controller(a)11230000/0.0/0.0/wifi.driver
Totals: pass:8 fail:0 xfail:0 xpass:0 skip:0 error:0
[3] https://lore.kernel.org/all/20230828211424.2964562-1-nfraprado@collabora.co…
Changes in v4:
- Dropped RFC tag
- Fixed 'busses' misspelling
- Link to v3: https://lore.kernel.org/all/20231227123643.52348-1-nfraprado@collabora.com
Changes in v3:
- Reverted approach of encoding stable device reference in test file
from device match fields (from modalias) back to HW topology (from v1)
- Changed board file description to YAML
- Rewrote test script in python to handle YAML and support x86 platforms
- Link to v2: https://lore.kernel.org/all/20231127233558.868365-1-nfraprado@collabora.com
Changes in v2:
- Changed approach of encoding stable device reference in test file from
HW topology to device match fields (the ones from modalias)
- Better documented test format
- Link to v1: https://lore.kernel.org/all/20231024211818.365844-1-nfraprado@collabora.com
---
Nícolas F. R. A. Prado (3):
kselftest: Add test to verify probe of devices from discoverable buses
kselftest: devices: Add sample board file for google,spherion
kselftest: devices: Add sample board file for XPS 13 9300
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/devices/Makefile | 4 +
.../devices/boards/Dell Inc.,XPS 13 9300.yaml | 40 +++
.../selftests/devices/boards/google,spherion.yaml | 50 ++++
tools/testing/selftests/devices/ksft.py | 90 ++++++
.../selftests/devices/test_discoverable_devices.py | 318 +++++++++++++++++++++
6 files changed, 503 insertions(+)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240122-discoverable-devs-ksft-9d501e312688
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Add missing tests to run_vmtests.sh. The mm kselftests are run through
run_vmtests.sh. If a test isn't present in this script, it'll not run
with run_tests or `make -C tools/testing/selftests/mm run_tests`.
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Changes since v1:
- Copy the original scripts and their dependence script to install directory as well
---
tools/testing/selftests/mm/Makefile | 3 +++
tools/testing/selftests/mm/run_vmtests.sh | 3 +++
2 files changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 2453add65d12f..c9c8112a7262e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -114,6 +114,9 @@ TEST_PROGS := run_vmtests.sh
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
TEST_FILES += va_high_addr_switch.sh
+TEST_FILES += charge_reserved_hugetlb.sh
+TEST_FILES += write_hugetlb_memory.sh
+TEST_FILES += hugetlb_reparenting_test.sh
include ../lib.mk
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f28..12754af00b39c 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,9 @@ CATEGORY="hugetlb" run_test ./map_hugetlb
CATEGORY="hugetlb" run_test ./hugepage-mremap
CATEGORY="hugetlb" run_test ./hugepage-vmemmap
CATEGORY="hugetlb" run_test ./hugetlb-madvise
+CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh -cgroup-v2
+CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh -cgroup-v2
+CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
--
2.42.0
The busywait timeout value is in milliseconds, not seconds, so the
current setting of 2 is meaningless. Let's copy WAIT_TIMEOUT from
forwarding/lib.sh and derive a BUSYWAIT_TIMEOUT from it here.
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
I'm not sure whether the default WAIT_TIMEOUT of 20s is too large, but
since we usually don't need to wait that long, I think it's fine to keep
the same value as forwarding/lib.sh. Please tell me if you think a more
suitable value is needed.
BTW, this doesn't look like a fix, but it's not a feature either, so I'm
just posting it to the net tree.
---
tools/testing/selftests/net/lib.sh | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index dca549443801..f9fe182dfbd4 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -4,6 +4,9 @@
##############################################################################
# Defines
+WAIT_TIMEOUT=${WAIT_TIMEOUT:=20}
+BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
+
# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4
# namespace list created by setup_ns
@@ -48,7 +51,7 @@ cleanup_ns()
for ns in "$@"; do
ip netns delete "${ns}" &> /dev/null
- if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
+ if ! busywait $BUSYWAIT_TIMEOUT ip netns list \| grep -vq "^$ns$" &> /dev/null; then
echo "Warn: Failed to remove namespace $ns"
ret=1
fi
--
2.43.0
Add missing tests to run_vmtests.sh. The mm kselftests are run through
run_vmtests.sh. If a test isn't present in this script, it won't run
with run_tests or `make -C tools/testing/selftests/mm run_tests`.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..a5e6ba8d3579 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,9 @@ CATEGORY="hugetlb" run_test ./map_hugetlb
CATEGORY="hugetlb" run_test ./hugepage-mremap
CATEGORY="hugetlb" run_test ./hugepage-vmemmap
CATEGORY="hugetlb" run_test ./hugetlb-madvise
+CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh
+CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh
+CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
--
2.42.0
In an effort to separate intentional arithmetic wrap-around from
unexpected wrap-around, we need to refactor places that depend on this
kind of math. One of the most common code patterns of this is:
VAR + value < VAR
Notably, this is considered "undefined behavior" for signed and pointer
types, which the kernel works around by using the -fno-strict-overflow
option in the build[1] (which used to just be -fwrapv). Regardless, we
want to get the kernel source to the position where we can meaningfully
instrument arithmetic wrap-around conditions and catch them when they
are unexpected, regardless of whether they are signed[2], unsigned[3],
or pointer[4] types.
Refactor the open-coded wrap-around addition tests to use add_would_overflow().
This paves the way to enabling the wrap-around sanitizers in the future.
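For illustration only (not part of the patch), the two forms compare as
in the minimal userspace sketch below; it approximates the helper with
the GCC/Clang builtin rather than the kernel's actual definition of
add_would_overflow():
  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Hypothetical stand-in for the kernel helper: report whether a + b
   * would wrap, without relying on the wrapped result itself.
   */
  #define add_would_overflow(a, b) \
          ({ typeof(a) __tmp; __builtin_add_overflow(a, b, &__tmp); })

  int main(void)
  {
          unsigned long start = ~0UL - 10, len = 100;

          /* Old open-coded test: depends on the wrapped value of start + len. */
          bool wraps_open_coded = (start + len < start);

          /* New form: asks the question directly, no wrapped result needed. */
          bool wraps_helper = add_would_overflow(start, len);

          printf("open-coded: %d, helper: %d\n", wraps_open_coded, wraps_helper);
          return 0;
  }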
Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 [1]
Link: https://github.com/KSPP/linux/issues/26 [2]
Link: https://github.com/KSPP/linux/issues/27 [3]
Link: https://github.com/KSPP/linux/issues/344 [4]
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-mm(a)kvack.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
mm/memory.c | 4 ++--
mm/mmap.c | 2 +-
mm/mremap.c | 2 +-
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..d47acdff7af3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,7 +2559,7 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
unsigned long vm_len, pfn, pages;
/* Check that the physical memory area passed in looks valid */
- if (start + len < start)
+ if (add_would_overflow(start, len))
return -EINVAL;
/*
* You *really* shouldn't map things that aren't page-aligned,
@@ -2569,7 +2569,7 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
len += start & ~PAGE_MASK;
pfn = start >> PAGE_SHIFT;
pages = (len + ~PAGE_MASK) >> PAGE_SHIFT;
- if (pfn + pages < pfn)
+ if (add_would_overflow(pfn, pages))
return -EINVAL;
/* We start the mapping 'vm_pgoff' pages into the area */
diff --git a/mm/mmap.c b/mm/mmap.c
index b78e83d351d2..16501fcaf511 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3023,7 +3023,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
/* Does pgoff wrap? */
- if (pgoff + (size >> PAGE_SHIFT) < pgoff)
+ if (add_would_overflow(pgoff, (size >> PAGE_SHIFT)))
return ret;
if (mmap_write_lock_killable(mm))
diff --git a/mm/mremap.c b/mm/mremap.c
index 38d98465f3d8..efa27019a05d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -848,7 +848,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
/* Need to be careful about a growing mapping */
pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
- if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+ if (add_would_overflow(pgoff, (new_len >> PAGE_SHIFT)))
return ERR_PTR(-EINVAL);
if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
diff --git a/mm/nommu.c b/mm/nommu.c
index b6dc558d3144..299bcfe19eed 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -202,7 +202,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
{
/* Don't allow overflow */
- if ((unsigned long) addr + count < count)
+ if (add_would_overflow(count, (unsigned long)addr))
count = -(unsigned long) addr;
return copy_to_iter(addr, count, iter);
@@ -1705,7 +1705,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
{
struct mm_struct *mm;
- if (addr + len < addr)
+ if (add_would_overflow(addr, len))
return 0;
mm = get_task_mm(tsk);
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..e6beeb23b48b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -567,7 +567,7 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
{
- if (unlikely(offset + PAGE_ALIGN(len) < offset))
+ if (unlikely(add_would_overflow(offset, PAGE_ALIGN(len))))
return -EINVAL;
if (unlikely(offset_in_page(offset)))
return -EINVAL;
--
2.34.1
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 3f47c1ebe5ca9c5883e596c7888dec4bec0176d8 ]
The GCC 13.2.0 compiler issued the following warning:
mixer-test.c: In function ‘ctl_value_index_valid’:
mixer-test.c:322:79: warning: format ‘%lld’ expects argument of type ‘long long int’, \
but argument 5 has type ‘long int’ [-Wformat=]
322 | ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
| ~~~^
| |
| long long int
| %ld
323 | ctl->name, index, int64_val,
324 | snd_ctl_elem_info_get_max(ctl->info));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long int
Fixing the format specifier as suggested by the compiler removes the
warning.
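For reference, a reduced standalone reproduction of the mismatch and the
fix (hypothetical values, outside the selftest):
  #include <stdio.h>

  int main(void)
  {
          long long int64_val = 9876543210LL; /* value read from the control */
          long max = 1234567890L;             /* the *_get_max() result is a long */

          /* Before: the second %lld does not match 'max' (a long) -> -Wformat. */
          /* printf("value %lld more than maximum %lld\n", int64_val, max); */

          /* After: %ld matches the long argument and the warning goes away. */
          printf("value %lld more than maximum %ld\n", int64_val, max);
          return 0;
  }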
Fixes: 3f48b137d88e7 ("kselftest: alsa: Factor out check that values meet constraints")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-3-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index d59910658c8c..9ad39db32d14 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -358,7 +358,7 @@ static bool ctl_value_index_valid(struct ctl_data *ctl,
}
if (int64_val > snd_ctl_elem_info_get_max64(ctl->info)) {
- ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
+ ksft_print_msg("%s.%d value %lld more than maximum %ld\n",
ctl->name, index, int64_val,
snd_ctl_elem_info_get_max(ctl->info));
return false;
--
2.43.0
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 8c51c13dc63d46e754c44215eabc0890a8bd9bfb ]
Minor fix to the number of arguments passed to the error reporting
function in the test program, as reported by a GCC 13.2.0 warning.
mixer-test.c: In function ‘find_controls’:
mixer-test.c:169:44: warning: too many arguments for format [-Wformat-extra-args]
169 | ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of arguments in the call to ksft_exit_fail_msg() doesn't
correspond to the format specifiers, so adjust it to resemble the
sibling calls to the error function.
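For reference, a reduced standalone reproduction of this warning class
(hypothetical, outside the selftest):
  #include <stdio.h>

  int main(void)
  {
          int card = 0, err = -1;

          /*
           * Before: one conversion but two arguments -> -Wformat-extra-args,
           * and 'err' is silently dropped from the message.
           */
          /* printf("snd_ctl_poll_descriptors() failed for %d\n", card, err); */

          /* After: the format consumes both arguments, like the sibling calls. */
          printf("snd_ctl_poll_descriptors() failed for card %d: %d\n", card, err);
          return 0;
  }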
Fixes: b1446bda56456 ("kselftest: alsa: Check for event generation when we write to controls")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-2-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 37da902545a4..d59910658c8c 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -205,7 +205,7 @@ static void find_controls(void)
err = snd_ctl_poll_descriptors(card_data->handle,
&card_data->pollfd, 1);
if (err != 1) {
- ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
+ ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for card %d: %d\n",
card, err);
}
--
2.43.0